Building systems that make AI reliable
After this, you'll be able to describe what a harness adds beyond tools, identify the three pillars (observability, spec-as-test, autonomous loops), and pick the first piece to build for your own setup.
Before you start
You'll want a working sense of Your First MCP Server before starting. This lesson wraps those tools in the feedback loops that make autonomous runs verifiable.
The idea
At Level 7 you gave the agent tools. At Level 8 you give it a feedback loop. Harness engineering is the infrastructure that lets the agent see what it built, run its own tests, and iterate without you watching every step. In practice that means automated tests, pre-commit hooks, linters, and CI pipelines that the agent can trigger, read, and respond to on its own.
The difference from Level 7 is not capability. It is autonomy. A Level 7 agent can act. A Level 8 agent can verify that its actions worked. That changes what 'done' means. You are no longer the final checker. The harness is.
Here is the before and after.
Level 7 setup: the agent refactors a module, then you come back and run the test suite by hand. That takes 20 minutes of your time per run.
Level 8 setup: the agent refactors the module, triggers the test suite via MCP, reads the output, fixes the two failing tests, and opens a PR only after the suite is green. You review one PR instead of babysitting four iterations.
Same agent, different harness, 80% less of your time per task.
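The Level 8 loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: harness_loop, run_check, and the attempt_fix hook are hypothetical names, and the hook stands in for whatever the agent does to revise the code after reading a failure.

```python
import subprocess
import sys
from typing import Callable, List, Tuple


def run_check(command: List[str]) -> Tuple[bool, str]:
    """Run one check command and return (passed, combined output)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


def harness_loop(
    check_command: List[str],
    attempt_fix: Callable[[str], None],
    max_iterations: int = 4,
) -> bool:
    """Re-run the check until it passes or the fix budget is spent.

    attempt_fix is a hypothetical hook: it receives the failing output
    and is where the agent would edit the code. Returns True only when
    the check is green, which is the signal that a PR may be opened.
    """
    for attempt in range(max_iterations + 1):
        passed, output = run_check(check_command)
        if passed:
            return True
        if attempt < max_iterations:
            attempt_fix(output)  # the agent reads the failure and revises
    return False  # budget spent: surface the failure to a human


if __name__ == "__main__":
    # Stand-in check that always passes; a real setup might run a test suite.
    green = harness_loop([sys.executable, "-c", "pass"], attempt_fix=print)
    print("open PR" if green else "escalate to human")
```

The iteration cap matters: without it, a check the agent cannot satisfy would loop forever instead of coming back to you.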
Observability is the first thing to build: structured logs for every agent action, trace IDs that link inputs to outputs, and an audit trail you can replay when a run goes wrong. Without this, debugging a run that failed three hours ago is close to impossible. With it, you find the exact step where things broke.
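One way to picture the observability piece is a JSON-lines log where every record carries the run's trace ID. The sketch below assumes that layout; the field names and the helpers format_record, replay, and first_failure are illustrative choices, not part of any particular tool.

```python
import json
import time
import uuid
from typing import Any, Dict, List, Optional


def new_trace_id() -> str:
    """One ID per agent run, so every step of that run can be linked."""
    return uuid.uuid4().hex


def format_record(trace_id: str, step: int, action: str,
                  inputs: Dict[str, Any], outputs: Dict[str, Any],
                  ok: bool) -> str:
    """Serialize one agent action as a single JSON line."""
    record = {
        "trace_id": trace_id,
        "step": step,
        "timestamp": time.time(),
        "action": action,
        "inputs": inputs,
        "outputs": outputs,
        "ok": ok,
    }
    return json.dumps(record, sort_keys=True)


def replay(log_lines: List[str], trace_id: str) -> List[Dict[str, Any]]:
    """Return the records of one run, in step order."""
    records = [json.loads(line) for line in log_lines if line.strip()]
    run = [r for r in records if r["trace_id"] == trace_id]
    return sorted(run, key=lambda r: r["step"])


def first_failure(run: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Find the exact step where the run broke, or None if none failed."""
    return next((r for r in run if not r["ok"]), None)
```

Appending each line returned by format_record to a file gives you the audit trail; replay plus first_failure is then enough to answer "where did this run break" hours later.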
Specification-as-test is the Level 8 testing pattern: write the expected outputs before the agent runs, then verify the agent's output against those specifications. This catches 'did the agent do what I specified' failures that unit tests miss entirely. It is the agent equivalent of writing tests before writing code.
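As a rough sketch of specification-as-test, the criteria can be written as named, binary checks before the agent runs and evaluated against its output afterwards. Everything here is an assumption made for illustration: the Criterion type, the evaluate helper, and the example task (renaming fetch_user to get_user in two files).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    name: str
    # A check takes {path: file contents} and returns pass or fail.
    check: Callable[[Dict[str, str]], bool]


def evaluate(spec: List[Criterion], files: Dict[str, str]) -> Dict[str, bool]:
    """Run every criterion against the agent's output."""
    return {criterion.name: criterion.check(files) for criterion in spec}


# Hypothetical spec, written BEFORE the agent runs, for the task
# "rename fetch_user to get_user across api.py and client.py".
SPEC = [
    Criterion(
        "old name is gone",
        lambda files: all("fetch_user" not in body for body in files.values()),
    ),
    Criterion(
        "new name is defined",
        lambda files: any("def get_user(" in body for body in files.values()),
    ),
    Criterion(
        "no files were deleted",
        lambda files: {"api.py", "client.py"}.issubset(files),
    ),
]
```

A run counts as done only when every value returned by evaluate(SPEC, files) is True; any False entry names the criterion to revise against or to surface to you.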
The SprintLessons for Level 8 cover wiring test suites into agent runs, building a replay-from-logs capability, and setting up cost tracking per task. This lesson is the frame. The ceiling here is that the agent still runs tasks one at a time, sequentially, on your machine. Running many agents in parallel without supervision is Level 9.
Try it (5 min)
Paste this into Claude:
I want to start building a harness around one task I run with Claude regularly. The task is: [describe one repeatable task, e.g. 'refactor a module', 'fix failing tests', 'update API types across files']. Help me design the first version of a harness for it. Specifically: (1) Write 3-5 spec criteria that define what 'done' looks like in concrete, checkable terms. (2) Tell me which tool or check Claude can run to verify each criterion (test runner, linter, grep, type-check). (3) Describe how Claude should respond when a criterion fails: revise and re-check, or surface the failure to me. Do not run the task yet. Just design the loop.
What good looks like:
When this breaks
You can now
Identify one task you run regularly, write 3-5 binary spec criteria for it, and name the specific tool or command that verifies each one.
Key takeaways
A harness lets the agent verify its own output. You move from checking every change to reviewing PRs, because the loop closes itself.