Harness Engineering 101

Building systems that make AI reliable

After this, you'll be able to describe what a harness adds beyond tools, identify the three pillars (observability, spec-as-test, autonomous loops), and pick the first piece to build for your own setup.

Before you start

You'll want a working sense of Your First MCP Server first. This lesson wraps those tools in the feedback loops that make autonomous runs verifiable.

The idea

At Level 7 you gave the agent tools. At Level 8 you give it a feedback loop. Harness engineering is the infrastructure that lets the agent see what it built, run its own tests, and iterate without you watching every step. In practice, that means automated tests, pre-commit hooks, linters, and CI pipelines that the agent can trigger, read, and respond to on its own.

Figure: a loose agent run reaches the validation blocks only after the work is already complete. This is the starting state for Harness Engineering 101.

The difference from Level 7 is not capability. It is autonomy. A Level 7 agent can act. A Level 8 agent can verify that its actions worked. That changes what 'done' means. You are no longer the final checker. The harness is.

Here is the before and after:
Level 7 setup: the agent refactors a module, then you come back and run the test suite by hand. That costs about 20 minutes of your time per run.
Level 8 setup: the agent refactors the module, triggers the test suite via MCP, reads the output, fixes the two failing tests, and opens a PR only after the suite is green. You review one PR instead of babysitting four iterations.
Same agent, different harness, and roughly 80% less of your time per task.
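As a rough sketch of the "trigger, read, and only then open a PR" step, the Python below runs a verification command, captures its output, and gates on every check being green. It assumes the agent can shell out to a test runner. The lesson's setup triggers tests via MCP, so subprocess is only a stand-in here, and CheckResult, run_check, and ready_for_pr are illustrative names, not part of any library.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CheckResult:
    passed: bool   # True when the command exited with code 0
    output: str    # combined stdout and stderr, for the agent to read


def run_check(command: list[str], timeout_s: int = 600) -> CheckResult:
    """Run one verification command and capture what it printed."""
    proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout_s)
    return CheckResult(passed=proc.returncode == 0, output=proc.stdout + proc.stderr)


def ready_for_pr(results: list[CheckResult]) -> bool:
    """Gate: open a PR only when at least one check ran and all of them are green."""
    return bool(results) and all(r.passed for r in results)


if __name__ == "__main__":
    # Stand-in commands. Swap in your real test runner, e.g. ["pytest", "-q"].
    green = run_check([sys.executable, "-c", "print('2 passed')"])
    red = run_check([sys.executable, "-c", "import sys; print('1 failed'); sys.exit(1)"])
    print(ready_for_pr([green]), ready_for_pr([green, red]))
```

The gate treats an empty result list as not ready, so a run that skipped its checks cannot pass by accident.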

Observability is the first thing to build: structured logs for every agent action, trace IDs that link inputs to outputs, and an audit trail you can replay when a run goes wrong. Without this, debugging a run that failed three hours ago is close to impossible. With it, you find the exact step where things broke.
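One minimal way to get structured logs with trace IDs is to append one JSON object per agent action to a JSONL file. The sketch below assumes that approach. The field names (ts, trace_id, step, action, inputs, outputs, ok) and the helpers new_trace_id and log_event are hypothetical, chosen for illustration.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def new_trace_id() -> str:
    """One ID per agent run, attached to every event in that run."""
    return uuid.uuid4().hex


def log_event(log_path: Path, trace_id: str, step: int, action: str,
              inputs: dict, outputs: dict, ok: bool) -> dict:
    """Append one agent action to the log as a single JSON line."""
    event = {
        "ts": time.time(),       # wall-clock time of the action
        "trace_id": trace_id,    # links this event to the rest of the run
        "step": step,            # position of the action within the run
        "action": action,
        "inputs": inputs,
        "outputs": outputs,
        "ok": ok,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
    return event


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "agent.jsonl"
        trace = new_trace_id()
        log_event(path, trace, 1, "run_tests", {"cmd": "pytest -q"}, {"exit_code": 1}, ok=False)
        log_event(path, trace, 2, "edit_file", {"path": "billing/invoice.py"}, {"lines_changed": 4}, ok=True)
        print(path.read_text(encoding="utf-8"))
```

Append-only lines keep the log readable even if a run crashes partway through, which is when you need it most.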

Specification-as-test is the Level 8 testing pattern: write the expected outputs before the agent runs, then verify the agent's output against those specifications. This catches 'did the agent do what I specified' failures that unit tests miss entirely. It is the agent equivalent of writing tests before writing code.
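To make the pattern concrete, here is a small sketch in which the spec is written as data before the run and the agent's output is checked against it afterwards. It assumes the agent's result can be summarized as a dictionary. The keys (files_changed, public_api_before, and so on), the billing/ module, and the names SpecCriterion, SPEC, and verify are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SpecCriterion:
    name: str
    check: Callable[[dict], bool]   # takes the agent's output, returns pass or fail


# Written BEFORE the agent runs. The output keys are assumptions for this example.
SPEC = [
    SpecCriterion("touches only the target module",
                  lambda out: all(p.startswith("billing/") for p in out["files_changed"])),
    SpecCriterion("public function names unchanged",
                  lambda out: out["public_api_before"] == out["public_api_after"]),
    SpecCriterion("no tests deleted",
                  lambda out: out["tests_after"] >= out["tests_before"]),
]


def verify(output: dict, spec: list[SpecCriterion]) -> list[str]:
    """Return the names of the criteria the output fails. Empty list means pass."""
    return [c.name for c in spec if not c.check(output)]


if __name__ == "__main__":
    agent_output = {
        "files_changed": ["billing/invoice.py", "auth/session.py"],
        "public_api_before": ["create_invoice", "void_invoice"],
        "public_api_after": ["create_invoice", "void_invoice"],
        "tests_before": 14,
        "tests_after": 14,
    }
    print(verify(agent_output, SPEC))   # prints ['touches only the target module']
```

In this example the unit tests could all pass while the spec still fails, because the agent edited a file outside the module it was asked to refactor.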

The SprintLessons for Level 8 cover wiring test suites into agent runs, building a replay-from-logs capability, and setting up cost tracking per task. This lesson is the frame. The ceiling here is that the agent still runs tasks one at a time, sequentially, on your machine. Running many agents in parallel without supervision is Level 9.

Try it (5 min)

Watch out for

Building observability after the first failure instead of before. The first run you scale is the one that fails silently.
Treating a passing test suite as full verification. Tests check what you specified, not whether the feature works end-to-end.
Skipping spec-as-test on tasks that feel simple. Vague specs on simple tasks still produce vague outputs.
Letting the harness loop run forever on a hard bug. Set a cap of 3-4 cycles, then look at the trace.
Removing your review step once the harness works. The harness verifies the agent. You verify the harness.
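The cycle cap from the list above can be sketched as a loop that checks, revises on failure, and hands off to a human once the budget is spent. This is an illustration under assumptions: check and revise are placeholder callables you would wire to your own test runner and agent, and run_with_cap is a made-up name.

```python
from typing import Callable


def run_with_cap(check: Callable[[], tuple[bool, str]],
                 revise: Callable[[str], None],
                 max_cycles: int = 3) -> dict:
    """Check, revise on failure, and stop after max_cycles instead of looping forever."""
    if max_cycles < 1:
        raise ValueError("max_cycles must be at least 1")
    last_output = ""
    for cycle in range(1, max_cycles + 1):
        passed, last_output = check()
        if passed:
            return {"status": "green", "cycles": cycle, "last_output": last_output}
        if cycle < max_cycles:
            revise(last_output)   # hand the failure text back to the agent
    # Budget spent: stop and let a person read the trace.
    return {"status": "needs_human", "cycles": max_cycles, "last_output": last_output}


if __name__ == "__main__":
    attempts = {"n": 0}

    def fake_check() -> tuple[bool, str]:
        # Stand-in for a real test run: passes on the third attempt.
        attempts["n"] += 1
        return attempts["n"] >= 3, f"attempt {attempts['n']}"

    print(run_with_cap(fake_check, revise=lambda failure: None, max_cycles=3))
```

Returning the last output with a needs_human status gives you a starting point for reading the trace instead of a silent timeout.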

Paste this into Claude

I want to start building a harness around one task I run with Claude regularly. The task is: [describe one repeatable task, e.g. 'refactor a module', 'fix failing tests', 'update API types across files']. Help me design the first version of a harness for it. Specifically: (1) Write 3-5 spec criteria that define what 'done' looks like in concrete, checkable terms. (2) Tell me which tool or check Claude can run to verify each criterion (test runner, linter, grep, type-check). (3) Describe how Claude should respond when a criterion fails: revise and re-check, or surface the failure to me. Do not run the task yet. Just design the loop.

What good looks like

Each spec criterion is binary and checkable without judgment: 'all tests pass', not 'the code is clean'
Every criterion has a specific tool or command Claude can invoke to verify it
The plan describes what happens when a check fails: which criteria trigger revision, which require human review
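A plan that meets these three points can be written down as plain data. The sketch below is one hypothetical shape for it, assuming a "refactor a module" task in a project that uses pytest, ruff, and mypy with code under src. The script scripts/check_public_api.py and the names Criterion, OnFail, PLAN, and triage are invented for the example.

```python
from dataclasses import dataclass
from enum import Enum


class OnFail(Enum):
    REVISE = "revise and re-check"
    ESCALATE = "surface to a human"


@dataclass(frozen=True)
class Criterion:
    description: str
    command: list[str]   # what the agent runs to verify the criterion
    on_fail: OnFail      # what happens when the command exits non-zero


# Illustrative plan. Tool choices and paths are assumptions about the project.
PLAN = [
    Criterion("all tests pass", ["pytest", "-q"], OnFail.REVISE),
    Criterion("no lint errors", ["ruff", "check", "."], OnFail.REVISE),
    Criterion("type-check is clean", ["mypy", "src"], OnFail.REVISE),
    # Hypothetical project script that exits non-zero if exported names changed.
    Criterion("public API unchanged", ["python", "scripts/check_public_api.py"], OnFail.ESCALATE),
]


def triage(plan: list[Criterion], failed: set[str]) -> dict:
    """Split failed criteria into those the agent retries and those a human reviews."""
    return {
        "revise": [c.description for c in plan
                   if c.description in failed and c.on_fail is OnFail.REVISE],
        "escalate": [c.description for c in plan
                     if c.description in failed and c.on_fail is OnFail.ESCALATE],
    }


if __name__ == "__main__":
    print(triage(PLAN, {"no lint errors", "public API unchanged"}))
```

Keeping the failure policy next to each criterion makes it obvious when a check has no defined response.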

When this breaks

Breaks when the harness has no replay capability, because debugging a run that failed three hours ago requires reconstructing what the agent did, and without trace IDs the timeline is gone.
Breaks when success criteria require human judgment, because the agent cannot close its own loop on 'is this code clean enough' the way it can close it on 'do tests pass'.
Breaks when you scale loops before adding observability, because parallel failures multiply silently and you only notice when production breaks.
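For the replay case, a minimal sketch of rebuilding one run's timeline from a log might look like the following. It assumes a JSONL log in which every line carries trace_id, step, and ok fields. That schema is hypothetical, as are the helper names load_trace and first_failure.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_trace(log_path: Path, trace_id: str) -> list[dict]:
    """Return one run's events in step order from a JSONL log."""
    events = []
    with log_path.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue   # tolerate blank lines
            event = json.loads(line)
            if event.get("trace_id") == trace_id:
                events.append(event)
    return sorted(events, key=lambda e: e["step"])


def first_failure(events: list[dict]) -> Optional[dict]:
    """The earliest event marked not ok, or None if the whole run was clean."""
    return next((e for e in events if not e.get("ok", True)), None)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "agent.jsonl"
        rows = [
            {"trace_id": "run-a", "step": 2, "action": "run_tests", "ok": False},
            {"trace_id": "run-b", "step": 1, "action": "edit_file", "ok": True},
            {"trace_id": "run-a", "step": 1, "action": "edit_file", "ok": True},
        ]
        path.write_text("\n".join(json.dumps(r) for r in rows) + "\n", encoding="utf-8")
        timeline = load_trace(path, "run-a")
        print([e["action"] for e in timeline], first_failure(timeline))
```

Filtering by trace ID matters most once several runs write to the same log, since interleaved events are otherwise hard to untangle.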

AI can help with this

Use AI to apply this lesson to your current work. Share your situation, ask for one concrete next step, and check the answer against this test: Identify one task you run regularly, write 3-5 binary spec criteria for it, and name the specific tool or command that verifies each one.

Figure: the run enters a harness lane where the validation blocks come before the final output.

You can now

Identify one task you run regularly, write 3-5 binary spec criteria for it, and name the specific tool or command that verifies each one.

Key takeaways

A harness lets the agent verify its own output. You move from checking every change to reviewing PRs, because the loop closes itself.

Observability first: structured logs, trace IDs, and replay capability before you trust any autonomous run
Specification-as-test: define expected outputs before the agent runs, then check against them
Background execution only makes sense once the harness exists. Otherwise you come back to undetected failures.
