Agentic Levels

© 2026 Fuentes Studio·Privacy·Terms

L8 · Lesson 2

Spec-as-Test

After this, you'll be able to write the expected output before the agent runs, verify the agent's output against that spec, and distinguish this from unit tests and model evals.

Before you start

Complete Build the Inner Loop first; this lesson builds on the loop so your spec exists before the agent runs rather than being written to fit the output after.

The idea

Here is the before and after. The agent refactors your API client and the result looks right. You merge it. Two days later a colleague reports a broken interface. The agent changed an exported signature, and your existing callers broke silently. You never specified 'do not change the exported interface' because it felt obvious. It was not obvious to the agent.

Spec-as-test is the habit of writing exactly what done looks like before the agent runs. Not 'refactor the API client' but: (1) all existing tests pass, (2) no new files outside src/api/, (3) exported interface unchanged, (4) no console.log statements. Four binary criteria. After the run, you check each one. Pass or fail. No judgment call.
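The four criteria above can be written down as data before the run. What follows is a minimal sketch in Python, not a prescribed format: `RunFacts`, `Criterion`, `SPEC`, and `verify` are hypothetical names introduced here for illustration, and the facts would in practice be collected from your test run, git, and grep.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunFacts:
    """Hypothetical summary of a finished run, gathered by your own tooling."""
    tests_exit_code: int
    new_files: tuple[str, ...]
    exports_before: tuple[str, ...]
    exports_after: tuple[str, ...]
    console_log_hits: int


@dataclass(frozen=True)
class Criterion:
    name: str
    passed: Callable[[RunFacts], bool]  # binary: True or False, never a score


# Written BEFORE the agent runs.
SPEC = [
    Criterion("all existing tests pass", lambda f: f.tests_exit_code == 0),
    Criterion("no new files outside src/api/",
              lambda f: all(p.startswith("src/api/") for p in f.new_files)),
    Criterion("exported interface unchanged",
              lambda f: f.exports_before == f.exports_after),
    Criterion("no console.log statements", lambda f: f.console_log_hits == 0),
]


def verify(spec: list[Criterion], facts: RunFacts) -> dict[str, str]:
    """Return PASS or FAIL for each criterion; nothing in between."""
    return {c.name: "PASS" if c.passed(facts) else "FAIL" for c in spec}
```

Each criterion is a predicate, so there is no room for "mostly fine": the result per line is PASS or FAIL.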

This is different from unit tests (which verify a specific function call) and from model evals (which score quality across 15 representative examples). A spec is precise for one specific task and checkable in under 2 minutes with grep and a test run.

The key habit: write the spec before the agent starts. A spec written after the run fits the output instead of defining the goal. Writing it first forces you to be precise about what done means. Most vague outcomes come from vague specs, not from model capability.
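One possible way to keep yourself honest about the ordering is to fingerprint the spec before the run and confirm afterwards that it has not been edited. This is only a sketch of that idea, assuming the spec is a list of plain-text lines; `fingerprint` and `spec_unchanged` are hypothetical helpers, not part of any tool the lesson names.

```python
import hashlib


def fingerprint(spec_lines: list[str]) -> str:
    """Hash the spec text so any later edit is detectable."""
    normalized = "\n".join(line.strip() for line in spec_lines)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def spec_unchanged(recorded: str, spec_lines: list[str]) -> bool:
    """True if the spec checked after the run is the one written before it."""
    return fingerprint(spec_lines) == recorded
```

Record the fingerprint (in a commit message or a note) before starting the agent. If the value differs after the run, the spec was adjusted to fit the output.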

Try it (20 min)

Watch out for

  • Writing specs after the agent runs. The spec written post-hoc fits the output instead of defining the goal.
  • Criteria that require judgment to evaluate ('the code is clean', 'the output is correct'). Each criterion must be checkable without a human deciding.
  • Confusing spec-as-test with evals. Evals are statistical across many inputs. Specs are precise for one specific task.
  • Specs that only cover happy paths. Include at least one constraint about what the agent must NOT do (no files outside this directory, no breaking changes to public interfaces).
  • Skipping the spec when the task feels simple. Simple tasks with vague specs produce vague outputs. The spec is fastest to write on simple tasks.
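Some of the pitfalls above can be screened for mechanically before the run. The sketch below is a rough heuristic, assuming each criterion is a dict with a `text` and a `check` field; the word lists and the `lint_spec` function are illustrative assumptions, and a keyword match will produce false positives (for example, 'linter clean' contains 'clean').

```python
import re

# Illustrative, non-exhaustive: words that usually need a human to judge.
JUDGMENT_WORDS = re.compile(
    r"\b(clean|correct|good|readable|nice|reasonable)\b", re.IGNORECASE
)
# Words that usually signal a "must NOT" constraint.
NEGATIVE_MARKERS = re.compile(r"\b(no|not|never|unchanged|without)\b", re.IGNORECASE)


def lint_spec(criteria: list[dict[str, str]]) -> list[str]:
    """Return warnings for a spec given as [{'text': ..., 'check': ...}, ...]."""
    warnings = []
    for i, c in enumerate(criteria, start=1):
        if JUDGMENT_WORDS.search(c["text"]):
            warnings.append(f"criterion {i} needs judgment: {c['text']!r}")
        if not c.get("check", "").strip():
            warnings.append(f"criterion {i} has no mechanical check")
    if not any(NEGATIVE_MARKERS.search(c["text"]) for c in criteria):
        warnings.append("spec has no 'must not' constraint")
    return warnings
```

An empty list means the spec passed these three screens, not that the spec is complete.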

Paste this into Claude:

I want to practice spec-as-test on a task before I run it. Here is the task I am about to give the agent: [describe a task, e.g. 'add input validation to the login form' or 'refactor the data fetching layer to use a single shared client']. Before running it, help me write a spec: a list of 4-6 criteria I can verify mechanically after the agent finishes. Each criterion should be binary (pass or fail), specific enough to check with grep or a test run, and cover: (1) structural constraints (which files change), (2) behavioral constraints (tests pass, linter clean), (3) interface constraints (no breaking changes to exported APIs). Then run the task and check the output against each criterion.

What good looks like:

  • You wrote 4-6 spec criteria before the agent ran, not after
  • Each criterion is binary and checkable without judgment (grep, test run, file existence check)
  • The spec includes at least one structural constraint, one behavioral constraint, and one interface constraint
  • After the agent ran, you verified each criterion and can report pass or fail for each one
  • At least one criterion caught something the agent did that you would have missed by reading the diff

What a good response looks like:

Spec written before running 'refactor data fetching layer to use a shared HTTP client':

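1. STRUCTURAL: Only files inside src/api/ are modified. No new files created outside that directory. (check: git status --porcelain --untracked-files=all | cut -c4- | grep -v '^src/api/' prints nothing; git diff alone does not list untracked new files)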
2. BEHAVIORAL: All 63 existing tests pass after the change. (check: npm test exits 0)
3. INTERFACE: The exported fetch() and fetchAll() function signatures are unchanged. (check: output of grep -n 'export.*fetch' src/api/index.ts matches the output saved before the run)
4. NEGATIVE: No console.log, console.error, or debugger statements in any modified file. (check: grep -rnE 'console\.(log|error)|debugger' src/api/ returns no matches)
5. BEHAVIORAL: ESLint exits 0 with no new warnings. (check: npm run lint)

Post-run verification result:
1. STRUCTURAL: PASS
2. BEHAVIORAL: PASS (63/63)
3. INTERFACE: FAIL, fetchAll() now returns Promise<Result[]> instead of Promise<Response[]>. Interface changed.
4. NEGATIVE: PASS
5. BEHAVIORAL: PASS

Result: 1 criterion failed. Revise and rerun with explicit constraint: 'fetchAll must continue to return Promise<Response[]>.'
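Criteria 1, 3, and 4 in the example above are text checks, so they can also be expressed as small functions. The following is a sketch under assumptions: the inputs (a list of changed paths, the before and after source of the index file, and a map of file contents) are supplied by you, and the function names are hypothetical. Criteria 2 and 5 are left out because they reduce to checking an exit code.

```python
import re

ALLOWED_PREFIX = "src/api/"
FORBIDDEN = re.compile(r"console\.(log|error)\s*\(|\bdebugger\b")
EXPORT_FETCH = re.compile(r"export.*fetch")


def paths_outside(changed: list[str], prefix: str = ALLOWED_PREFIX) -> list[str]:
    """Criterion 1: changed or new paths that fall outside the allowed directory."""
    return [p for p in changed if not p.startswith(prefix)]


def export_lines(source: str) -> list[str]:
    """Criterion 3: the lines that the pattern 'export.*fetch' would match."""
    return [ln.strip() for ln in source.splitlines() if EXPORT_FETCH.search(ln)]


def forbidden_hits(files: dict[str, str]) -> list[str]:
    """Criterion 4: 'path:line' for each console.log, console.error, or debugger."""
    return [
        f"{path}:{n}"
        for path, text in files.items()
        for n, line in enumerate(text.splitlines(), start=1)
        if FORBIDDEN.search(line)
    ]


def verdict(ok: bool) -> str:
    return "PASS" if ok else "FAIL"


def report(changed: list[str], before_src: str, after_src: str,
           files: dict[str, str]) -> dict[str, str]:
    """Binary result per criterion; an empty list of violations means PASS."""
    return {
        "STRUCTURAL": verdict(not paths_outside(changed)),
        "INTERFACE": verdict(export_lines(before_src) == export_lines(after_src)),
        "NEGATIVE": verdict(not forbidden_hits(files)),
    }


before = "export function fetchAll(): Promise<Response[]>"
after = "export function fetchAll(): Promise<Result[]>"
print(report(["src/api/client.ts"], before, after, {"src/api/client.ts": after}))
# prints {'STRUCTURAL': 'PASS', 'INTERFACE': 'FAIL', 'NEGATIVE': 'PASS'}
```

Comparing export lines as text is deliberately strict: any change to a signature line fails the criterion, which is the behavior the spec asks for.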

Go deeper (25 min)

Paste this into Claude:

Take one prompt you used last week that produced a result you were not fully happy with. Paste the original prompt. Now write a spec for what you actually wanted: 4-5 binary criteria that would have caught the problem. Then re-run the task with the spec included in the prompt, instructing the agent to verify each criterion before reporting done. Compare the second output to the first.

What good looks like:

  • You identified which criterion the original run would have failed
  • The revised prompt included the spec criteria explicitly
  • The second run passed more criteria than the first
  • You can describe the spec-as-test pattern in one sentence to someone who has not heard of it

What a good response looks like:

Original prompt (last week): 'Update the user profile form to add a phone number field.'
Result: Agent added the field but it was not wired to the submit handler. Data silently dropped.

Spec criterion that would have caught it:
- BEHAVIORAL: Submitting the form with a phone number value sends that value to POST /api/users. (check: inspect network request in dev tools or grep submit handler for phoneNumber)

Revised prompt with spec:
'Add a phone number field to the user profile form. Verify before reporting done: (1) the field renders in the form, (2) the value is included in the submit payload, (3) existing tests pass, (4) no new console.log statements.'

Second run result: Agent added the field, wired it to the handler, and confirmed the submit payload includes phoneNumber. All 4 criteria passed.
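If you reuse this pattern often, the revised prompt can be assembled from the task and the criteria list. This is a small sketch, assuming criteria are kept as plain strings; `prompt_with_spec` is a hypothetical helper and the wording of the instruction is copied from the revised prompt above.

```python
def prompt_with_spec(task: str, criteria: list[str]) -> str:
    """Append numbered criteria to a task so the agent can self-check them."""
    if not criteria:
        raise ValueError("write at least one criterion before running the task")
    numbered = ", ".join(f"({i}) {c}" for i, c in enumerate(criteria, start=1))
    return f"{task.rstrip('. ')}. Verify before reporting done: {numbered}."


print(prompt_with_spec(
    "Add a phone number field to the user profile form.",
    [
        "the field renders in the form",
        "the value is included in the submit payload",
        "existing tests pass",
        "no new console.log statements",
    ],
))
```

The helper refuses an empty criteria list on purpose, so a task cannot be sent without a spec.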

When this breaks

  • Breaks when the spec is written after the run because the criteria fit what the agent produced instead of what you actually wanted, and the verification step becomes theater.
  • Breaks when criteria need human judgment because the agent cannot self-check 'is this clean' the way it can self-check 'does grep return zero matches'.
  • Breaks when the spec only describes the happy path because the agent satisfies every criterion and still introduces a regression you never named.

Claude can do it for you

Before starting any non-trivial task, say to Claude: 'Before you begin, write a spec for this task: 4-6 binary criteria I can verify mechanically when you are done. Include structural, behavioral, and interface constraints. Then run the task and check each criterion before reporting back.'

You can now

Write 4-6 binary spec criteria for a real task before the agent runs, then check each one against the output and report pass or fail per criterion.

Key takeaways

Write what done looks like before the agent starts. Binary criteria catch problems that post-hoc review tends to miss.

  • Spec-as-test means writing binary, checkable criteria before the agent runs, not after
  • Cover structure, behavior, and interface constraints, plus at least one explicit 'must not' criterion
  • Specs are precise for one task; evals are statistical across many inputs. Do not confuse them.
  • The spec is fastest to write on simple tasks, and skipping it is where simple tasks most often go wrong

Go deeper

  • Claude Code Hooks and Settings (wiring verification into your workflow)
  • Harness Engineering (OpenAI, spec-driven pipelines at scale)
  • Ramp's Background Agents (production spec patterns in fintech)