Agentic Levels
© 2026 Fuentes Studio·Privacy·Terms

L8 · Free

Harness Engineering 101

Building systems that make AI reliable

After this, you'll be able to describe what a harness adds beyond tools, identify the three pillars (observability, spec-as-test, autonomous loops), and pick the first piece to build for your own setup.

Before you start

You'll want a working sense of Your First MCP Server first. This lesson wraps those tools in the feedback loops that make autonomous runs verifiable.

The idea

At Level 7 you gave the agent tools. At Level 8 you give it a feedback loop. Harness engineering is the infrastructure that lets the agent see what it built, run its own tests, and iterate without you watching every step. In practice that means automated tests, pre-commit hooks, linters, and CI pipelines that the agent can trigger, read, and respond to on its own.

The difference from Level 7 is not capability. It is autonomy. A Level 7 agent can act. A Level 8 agent can verify that its actions worked. That changes what 'done' means. You are no longer the final checker. The harness is.

Here is the before and after:

  • Level 7 setup: the agent refactors a module, then you come back and run the test suite by hand. That takes 20 minutes of your time per run.
  • Level 8 setup: the agent refactors the module, triggers the test suite via MCP, reads the output, fixes the two failing tests, and opens a PR only after the suite is green.

You review one PR instead of babysitting four iterations. Same agent, different harness, 80% less of your time per task.
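The core move in the Level 8 setup is that a check returns something the agent can read and act on. Here is a minimal sketch of that idea in Python, assuming the harness shells out to a command and treats exit code 0 as green. The names run_check and ready_for_pr are hypothetical, and the commands are stand-ins so the sketch runs anywhere. Swap in your real test runner.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool
    output: str


def run_check(command: list[str], timeout_s: int = 600) -> CheckResult:
    """Run one verification command and capture the text the agent will read."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return CheckResult(passed=proc.returncode == 0, output=proc.stdout + proc.stderr)


def ready_for_pr(results: list[CheckResult]) -> bool:
    """Gate the PR: at least one check ran, and every check is green."""
    return bool(results) and all(r.passed for r in results)


# Stand-in commands (assumption): a passing run and a failing run.
green = run_check([sys.executable, "-c", "print('12 passed')"])
red = run_check([sys.executable, "-c", "import sys; print('2 failed'); sys.exit(1)"])
```

The gate refuses an empty result list on purpose, so a run where no checks executed cannot pass as green.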

Observability is the first thing to build: structured logs for every agent action, trace IDs that link inputs to outputs, and an audit trail you can replay when a run goes wrong. Without this, debugging a run that failed three hours ago is close to impossible. With it, you find the exact step where things broke.
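One way to picture this is a log with one JSON line per agent action, all tagged with the same trace ID. The sketch below is an assumption about shape, not a required schema. The field names and the log_action helper are hypothetical, and an in-memory buffer stands in for a real log file.

```python
import io
import json
import time
import uuid
from typing import Any, TextIO


def new_trace_id() -> str:
    """One ID per agent run, so every record from that run can be found later."""
    return uuid.uuid4().hex


def log_action(sink: TextIO, trace_id: str, step: int, action: str,
               inputs: dict[str, Any], output: str, ok: bool) -> None:
    """Append one agent action as a single JSON line."""
    record = {
        "ts": time.time(),
        "trace_id": trace_id,
        "step": step,
        "action": action,
        "inputs": inputs,
        "output": output,
        "ok": ok,
    }
    sink.write(json.dumps(record) + "\n")


# Usage: two steps from the same run share one trace ID.
sink = io.StringIO()
trace = new_trace_id()
log_action(sink, trace, 1, "edit_file", {"path": "module.py"}, "patched", ok=True)
log_action(sink, trace, 2, "run_tests", {"cmd": "pytest -q"}, "2 failed", ok=False)
records = [json.loads(line) for line in sink.getvalue().splitlines()]
```

Because each record carries its inputs and its output, you can read a failed run step by step instead of guessing what the agent saw.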

Specification-as-test is the Level 8 testing pattern: write the expected outputs before the agent runs, then verify the agent's output against those specifications. This catches 'did the agent do what I specified' failures that unit tests miss entirely. It is the agent equivalent of writing tests before writing code.
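A small sketch of the pattern, assuming the agent's output is a piece of source text and each criterion is a yes/no check on it. The Criterion type, the three criteria, and the sample output are all made up for illustration. What matters is the order: the spec exists before the agent runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[str], bool]


# Written BEFORE the agent runs (hypothetical criteria for a refactor task).
SPEC = [
    Criterion("exports parse_date", lambda src: "def parse_date(" in src),
    Criterion("no print debugging left", lambda src: "print(" not in src),
    Criterion("has a module docstring", lambda src: src.lstrip().startswith('"""')),
]


def verify(output: str, spec: list[Criterion]) -> dict[str, bool]:
    """Check the agent's output against every criterion and report each result."""
    return {c.name: c.check(output) for c in spec}


# Stand-in for what the agent produced.
agent_output = '"""Date helpers."""\n\ndef parse_date(text):\n    print(text)\n    return text\n'
report = verify(agent_output, SPEC)
failed = [name for name, ok in report.items() if not ok]
```

Here the code would run fine, yet the spec still flags the leftover print call. That is the kind of gap between 'works' and 'what I asked for' the spec is there to catch.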

The SprintLessons for Level 8 cover wiring test suites into agent runs, building a replay-from-logs capability, and setting up cost tracking per task. This lesson is the frame. The ceiling here is that the agent still runs tasks one at a time, sequentially, on your machine. Running many agents in parallel without supervision is Level 9.

Try it (5 min)

Watch out for

  • Building observability after the first failure instead of before. The first run you scale is the one that fails silently.
  • Treating a passing test suite as full verification. Tests check what you specified, not whether the feature works end-to-end.
  • Skipping spec-as-test on tasks that feel simple. Vague specs on simple tasks still produce vague outputs.
  • Letting the harness loop run forever on a hard bug. Set a cap of 3-4 cycles, then look at the trace.
  • Removing your review step once the harness works. The harness verifies the agent. You verify the harness.
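The cycle cap from the list above can be sketched in a few lines. This assumes each cycle is one 'revise, then re-check' attempt that returns True when the checks go green. The run_with_cap name and the two stand-in attempts are hypothetical.

```python
from typing import Callable


def run_with_cap(attempt: Callable[[int], bool], max_cycles: int = 3) -> tuple[bool, int]:
    """Call attempt(cycle) until it returns True or the cap is reached.

    Returns (succeeded, cycles_used).
    """
    for cycle in range(1, max_cycles + 1):
        if attempt(cycle):
            return True, cycle
    return False, max_cycles


# Stand-ins for a real revise-and-recheck step.
def fixed_on_second_try(cycle: int) -> bool:
    return cycle >= 2


def hard_bug(cycle: int) -> bool:
    return False


fixed = run_with_cap(fixed_on_second_try)
stuck = run_with_cap(hard_bug)
```

When the loop comes back with a failure at the cap, that is your cue to stop the run and read the trace rather than raise the limit.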

Paste this into Claude:

I want to start building a harness around one task I run with Claude regularly. The task is: [describe one repeatable task, e.g. 'refactor a module', 'fix failing tests', 'update API types across files']. Help me design the first version of a harness for it. Specifically: (1) Write 3-5 spec criteria that define what 'done' looks like in concrete, checkable terms. (2) Tell me which tool or check Claude can run to verify each criterion (test runner, linter, grep, type-check). (3) Describe how Claude should respond when a criterion fails: revise and re-check, or surface the failure to me. Do not run the task yet. Just design the loop.

What good looks like:

  • Each spec criterion is binary and checkable without judgment, not 'the code is clean'
  • Every criterion has a specific tool or command Claude can invoke to verify it
  • The plan describes what happens when a check fails: which criteria trigger revision, which require human review
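A plan that meets all three points might look like the sketch below. It is one possible shape, not a required format. The criteria, the api/ path, and the choice of pytest, ruff, and git as the checking tools are assumptions for illustration, and the commands are listed but not executed here.

```python
from dataclasses import dataclass
from enum import Enum


class OnFail(Enum):
    REVISE = "revise"        # agent fixes and re-checks on its own
    ESCALATE = "escalate"    # agent stops and surfaces the failure to you


@dataclass(frozen=True)
class Check:
    criterion: str
    command: list[str]
    on_fail: OnFail


# Hypothetical plan: each binary criterion is paired with one command.
PLAN = [
    Check("all unit tests pass", ["pytest", "-q"], OnFail.REVISE),
    Check("no lint errors", ["ruff", "check", "."], OnFail.REVISE),
    Check("public API files unchanged", ["git", "diff", "--exit-code", "--", "api/"], OnFail.ESCALATE),
]


def next_action(plan: list[Check], results: dict[str, bool]) -> str:
    """Decide what happens next from the pass/fail result of each criterion."""
    # A criterion with no recorded result counts as failed.
    failed = [c for c in plan if not results.get(c.criterion, False)]
    if not failed:
        return "open_pr"
    if any(c.on_fail is OnFail.ESCALATE for c in failed):
        return "ask_human"
    return "revise"


all_green = {c.criterion: True for c in PLAN}
lint_red = {**all_green, "no lint errors": False}
api_red = {**all_green, "public API files unchanged": False}
```

Treating a missing result as a failure is a deliberate choice: a check that never ran should not count as a pass.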

When this breaks

  • Breaks when the harness has no replay capability because debugging a run that failed three hours ago requires reconstructing what the agent did, and without trace IDs the timeline is gone.
  • Breaks when success criteria require human judgment because the agent cannot close its own loop on 'is this code clean enough' the way it can close it on 'do tests pass'.
  • Breaks when you scale loops before adding observability because parallel failures multiply silently and you only notice when production breaks.
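To make the first failure mode concrete, here is a minimal sketch of replay from logs. It assumes the harness wrote one JSON line per agent action with trace_id, step, and ok fields. Those field names and the sample records are invented for illustration.

```python
import json
from typing import Any, Optional

# Made-up sample records; lines from two runs are interleaved and out of order.
LOG_LINES = [
    '{"trace_id": "a1", "step": 2, "action": "run_tests", "ok": false, "output": "2 failed"}',
    '{"trace_id": "b7", "step": 1, "action": "edit_file", "ok": true, "output": "patched"}',
    '{"trace_id": "a1", "step": 1, "action": "edit_file", "ok": true, "output": "patched"}',
]


def replay(lines: list[str], trace_id: str) -> list[dict[str, Any]]:
    """Rebuild one run's timeline: keep its records and order them by step."""
    records = [json.loads(line) for line in lines if line.strip()]
    return sorted((r for r in records if r["trace_id"] == trace_id), key=lambda r: r["step"])


def first_failure(timeline: list[dict[str, Any]]) -> Optional[dict[str, Any]]:
    """Return the earliest step that did not succeed, or None if all succeeded."""
    return next((r for r in timeline if not r["ok"]), None)


timeline = replay(LOG_LINES, "a1")
broke_at = first_failure(timeline)
```

Without the trace ID there is nothing to filter on, and the records from the two runs cannot be separated.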

You can now

Identify one task you run regularly, write 3-5 binary spec criteria for it, and name the specific tool or command that verifies each one.

Key takeaways

A harness lets the agent verify its own output. You move from checking every change to reviewing PRs, because the loop closes itself.

  • A harness lets the agent verify its own output. You review PRs, not individual code changes
  • Observability first: structured logs, trace IDs, and replay capability before you trust any autonomous run
  • Specification-as-test: define expected outputs before the agent runs, then check against them
  • Background execution only makes sense once the harness exists. Otherwise you come back to undetected failures.