After this, you'll be able to write the expected output before the agent runs, verify the agent's output against that spec, and distinguish this from unit tests and model evals.
Before you start
Complete Build the Inner Loop first; this lesson builds on the loop so your spec exists before the agent runs rather than being written to fit the output after.
The idea
Here is the before and after: The agent refactors your API client and the result looks right. You merge it. Two days later a colleague reports a broken interface. The agent created a new export signature and your existing callers are silently broken. You never specified 'do not change the exported interface' because it felt obvious. It was not obvious to the agent.
Spec-as-test is the habit of writing exactly what done looks like before the agent runs. Not 'refactor the API client' but: (1) all existing tests pass, (2) no new files outside src/api/, (3) exported interface unchanged, (4) no console.log statements. Four binary criteria. After the run, you check each one. Pass or fail. No judgment call.
This is different from unit tests (which verify a specific function call) and from model evals (which score quality across 15 representative examples). A spec is precise for one specific task and checkable in under 2 minutes with grep and a test run.
The key habit: write the spec before the agent starts. A spec written after the run fits the output instead of defining the goal. Writing it first forces you to be precise about what done means. Most vague outcomes come from vague specs, not from model capability.
Try it (20 min)
Watch out for
Paste this into Claude:
I want to practice spec-as-test on a task before I run it. Here is the task I am about to give the agent: [describe a task, e.g. 'add input validation to the login form' or 'refactor the data fetching layer to use a single shared client']. Before running it, help me write a spec: a list of 4-6 criteria I can verify mechanically after the agent finishes. Each criterion should be binary (pass or fail), specific enough to check with grep or a test run, and cover: (1) structural constraints (which files change), (2) behavioral constraints (tests pass, linter clean), (3) interface constraints (no breaking changes to exported APIs). Then run the task and check the output against each criterion.
What good looks like:
What a good response looks like:
Spec written before running 'refactor data fetching layer to use a shared HTTP client': 1. STRUCTURAL: Only files inside src/api/ are modified. No new files created outside that directory. (check: git diff --name-only | grep -v '^src/api/') 2. BEHAVIORAL: All 63 existing tests pass after the change. (check: npm test exits 0) 3. INTERFACE: The exported fetch() and fetchAll() function signatures are unchanged. (check: grep -n 'export.*fetch' src/api/index.ts matches original) 4. NEGATIVE: No console.log, console.error, or debugger statements in any modified file. (check: grep -rn 'console\.' src/api/) 5. BEHAVIORAL: ESLint exits 0 with no new warnings. (check: npm run lint) Post-run verification result: 1. STRUCTURAL: PASS 2. BEHAVIORAL: PASS (63/63) 3. INTERFACE: FAIL, fetchAll() now returns Promise<Result[]> instead of Promise<Response[]>. Interface changed. 4. NEGATIVE: PASS 5. BEHAVIORAL: PASS Result: 1 criterion failed. Revise and rerun with explicit constraint: 'fetchAll must continue to return Promise<Response[]>.'
Go deeper (25 min)
Paste this into Claude:
Take one prompt you used last week that produced a result you were not fully happy with. Paste the original prompt. Now write a spec for what you actually wanted: 4-5 binary criteria that would have caught the problem. Then re-run the task with the spec included in the prompt, instructing the agent to verify each criterion before reporting done. Compare the second output to the first.
What good looks like:
What a good response looks like:
Original prompt (last week): 'Update the user profile form to add a phone number field.' Result: Agent added the field but it was not wired to the submit handler. Data silently dropped. Spec criterion that would have caught it: - BEHAVIORAL: Submitting the form with a phone number value sends that value to POST /api/users. (check: inspect network request in dev tools or grep submit handler for phoneNumber) Revised prompt with spec: 'Add a phone number field to the user profile form. Verify before reporting done: (1) the field renders in the form, (2) the value is included in the submit payload, (3) existing tests pass, (4) no new console.log statements.' Second run result: Agent added the field, wired it to the handler, and confirmed the submit payload includes phoneNumber. All 4 criteria passed.
When this breaks
Claude can do it for you
Before starting any non-trivial task, say to Claude: 'Before you begin, write a spec for this task: 4-6 binary criteria I can verify mechanically when you are done. Include structural, behavioral, and interface constraints. Then run the task and check each criterion before reporting back.'
You can now
Write 4-6 binary spec criteria for a real task before the agent runs, then check each one against the output and report pass or fail per criterion.
Key takeaways
Write what done looks like before the agent starts. Binary criteria catch more problems than any amount of post-hoc reviewing.