Agentic Levels
© 2026 Fuentes Studio

L9 · Lesson 1

Cross the Diff-vs-Code Threshold

After this, you'll be able to identify whether you've crossed the threshold where reviewing an agent's diff is cheaper than writing the code yourself, and calibrate your trust accordingly.

Before you start

Before diving in, complete Your First Background Agent so you have real background-agent diffs to evaluate rather than hypothetical ones.

The idea

Here is what the other side of the threshold looks like: you have three open PRs this morning, and you did not write any of them. An agent ran overnight on tasks you spec'd yesterday, and now you are reviewing diffs instead of writing code. That is what crossing the diff-vs-code threshold feels like in practice.

The threshold itself is purely economic: for this task, does reviewing the agent's output cost less than writing it yourself? Not 'is AI impressive.' Not 'does the model seem capable.' Just: review time versus authorship time, measured with a timer.

For a 50-line utility function with clear inputs and outputs, the answer is yes for most L8 practitioners. A Sonnet review of a well-specified PR costs roughly 8% of what an Opus implementation run costs, so about twelve Sonnet review passes cost as much as one Opus implementation pass. For a 500-line architectural refactor touching six services, the answer depends on how precisely you defined the task and how good your test coverage is. The threshold is not fixed. It moves with task clarity, test coverage, and model capability.

The calibration method is direct. Review three PRs an agent wrote without your involvement. Time your review for each. Compare to how long writing the same code would have taken. If review is consistently faster, you are on the right side. If you keep rewriting the agent's output instead of merging it, you are not there yet on that task type.
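The calibration method above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the names `Outcome`, `ReviewSample`, and `crossed_threshold` are hypothetical, and reading "consistently faster" as "faster on every sample, with no rewrites" is an assumption you may want to loosen.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MERGED_AS_IS = "merged_as_is"
    EDITED = "edited"
    REWRITTEN = "rewritten"


@dataclass(frozen=True)
class ReviewSample:
    review_minutes: float      # measured with a timer
    authorship_minutes: float  # your estimate for writing the same code yourself
    outcome: Outcome


def crossed_threshold(samples: list[ReviewSample]) -> bool:
    """Return True only if every review beat authorship and nothing was rewritten.

    Assumption: "consistently faster" is interpreted strictly (all samples).
    """
    if len(samples) < 3:
        raise ValueError("need at least three reviewed PRs to calibrate")
    return all(
        s.review_minutes < s.authorship_minutes
        and s.outcome is not Outcome.REWRITTEN
        for s in samples
    )


# Illustrative numbers only, not measurements from the lesson.
samples = [
    ReviewSample(12, 45, Outcome.MERGED_AS_IS),
    ReviewSample(20, 60, Outcome.EDITED),
    ReviewSample(8, 30, Outcome.MERGED_AS_IS),
]
print(crossed_threshold(samples))
```

A single rewritten PR flips the verdict to not crossed for that task type, which matches the rule that rewriting instead of merging means you are not there yet.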

The shift this enables is not 'I have more agents running.' It is 'work is happening without me.' You wake up, open your PRs, and start reviewing. You did not write those diffs. Agents did while you were offline.

Try it (20 min)

Watch out for

  • Judging the threshold on ambiguous tasks. Start with tasks you can spec precisely. Ambiguous tasks will fail at L8 because of the spec, not because of the model.
  • Treating 'I reviewed it' and 'I accepted it' as the same thing. The point is that you can merge without rewriting. Review time only matters if you are merging, not fixing.
  • Calibrating on tasks you hate doing. Discomfort with a task type distorts your estimate of how long writing it yourself would take, so the comparison stops being reliable. Pick tasks where you have a realistic baseline.
  • Forgetting that the threshold moves. A task that required rewriting last month may be agent-suitable after you write a better spec or your test coverage improves.
  • Crossing the threshold and then removing the review step. Reviewing the diff is not supervision overhead. It is the quality gate that makes background agents safe.

Paste this into Claude:

I want to calibrate whether I've crossed the diff-vs-code threshold for a specific task type. Here is a task I've done manually before: [describe a recent coding task, e.g. 'write unit tests for a utility module', 'add input validation to an API endpoint', 'update TypeScript types across three files']. Run this task now as a background agent. When it is done, I will review the diff and time myself. Then compare: (1) How long did reviewing the diff take? (2) How long would writing it from scratch have taken? (3) Did I accept the diff as-is, edit it, or rewrite it entirely? Based on those answers, tell me whether I've crossed the threshold for this task type and what would need to change if I haven't.

What good looks like:

  • You ran the task as an agent and reviewed the resulting diff, not the agent's process
  • You timed your review and compared it to a manual estimate for the same task
  • You can state a specific threshold verdict: crossed or not crossed, for this task type
  • If you rewrote any of the agent's output, you identified what made it reviewable but not mergeable
  • You have at least one task type where reviewing the diff is clearly faster than writing it yourself
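If you want the timing step to be less ad hoc than a phone stopwatch, a small context manager is enough. The sketch below uses only the standard library; `review_timer` and the `timings` dictionary are hypothetical names, and the injectable `clock` argument exists only so the helper can be tested without waiting.

```python
import time
from contextlib import contextmanager


@contextmanager
def review_timer(results: dict, label: str, clock=time.perf_counter):
    """Store the elapsed minutes of the wrapped block under results[label]."""
    start = clock()
    try:
        yield
    finally:
        # Recorded even if the review is interrupted by an exception.
        results[label] = (clock() - start) / 60.0


timings = {}
with review_timer(timings, "input-validation-pr"):
    pass  # read the diff here; the timer stops when this block exits
```

Compare the recorded minutes against your manual estimate for the same task to get the two numbers the verdict needs.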

Go deeper (20 min)

Paste this into Claude:

Take a task where the agent's output was not quite mergeable. Instead of accepting or rewriting it, reject it once with specific feedback. Paste the diff, then write: 'This is not mergeable because [specific reason]. Revise and re-run.' Measure whether the second diff is closer to mergeable. Then answer: what spec would have prevented the first run from missing the mark?

What good looks like:

  • You rejected one agent diff with specific, actionable feedback rather than rewriting it yourself
  • The second run produced a diff closer to your standard than the first
  • You identified one spec criterion that would have caught the first run's gap
  • You can describe the difference between 'this diff needs revision' and 'this task is not agent-suitable yet'

When this breaks

  • Breaks when you calibrate on tasks with no clear spec, because review time is meaningless if you are also drafting the requirements during review. That is just slow authorship in disguise.
  • Breaks when you stop timing your reviews, because the threshold shifts continuously with model updates and codebase changes. Last month's verdict will not survive a new release without remeasurement.
  • Breaks when you treat the threshold as a per-developer trait rather than a per-task-type calibration, because the same person can be cleanly above the threshold on validation tasks and far below it on architectural refactors.
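The per-task-type point in the last failure mode can be made concrete by keeping one verdict per task type instead of one per person. This is a sketch under assumptions: the tuple layout and the function name `verdicts_by_task_type` are hypothetical, and the numbers are illustrative.

```python
from collections import defaultdict


def verdicts_by_task_type(records):
    """Map each task type to True (crossed) or False (not crossed).

    records: iterable of
        (task_type, review_minutes, authorship_minutes, merged_without_rewrite)
    A task type counts as crossed only if every record for it was merged
    without a rewrite and reviewed faster than it could have been written.
    """
    grouped = defaultdict(list)
    for task_type, review, authorship, merged in records:
        grouped[task_type].append(bool(merged) and review < authorship)
    return {task_type: all(flags) for task_type, flags in grouped.items()}


records = [
    ("validation", 10, 40, True),
    ("validation", 15, 35, True),
    ("architectural-refactor", 90, 120, False),  # reviewed, then rewritten
]
print(verdicts_by_task_type(records))
```

The same developer ends up above the threshold for one key and below it for another, which is the intended shape of the result.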

Claude can do it for you

Say to Claude: 'Run this task as a background agent: [task]. Do not ask for confirmation at intermediate steps. When done, give me a diff summary and the spec criteria you verified. I will review and tell you whether it is mergeable.'

You can now

Time your review of one agent-produced diff, estimate the manual implementation time for the same task type, and produce a verdict (crossed or not crossed) backed by both numbers.

Key takeaways

The threshold is not a feeling. Time your reviews. If reviewing consistently beats writing, you are there. If you are rewriting more than accepting, you are not there yet on that task type.

  • The diff-vs-code threshold is economic, measured by review time versus authorship time, not by how impressive the model feels
  • Calibrate per task type, not per developer. Validation tasks may be over the threshold while architectural refactors are not
  • If you rewrite the agent's output, that is a spec problem. Reject once with specific feedback before concluding the task is unsuitable
  • The threshold moves with model releases, test coverage, and spec quality. Re-time reviews every few weeks
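Because a verdict goes stale, it helps to store the date it was measured and check it before trusting it. The sketch below is one way to do that; `needs_retiming` is a hypothetical helper, and reading "every few weeks" as three weeks is an assumption, not a number from this lesson.

```python
from datetime import date, timedelta

# Assumption: "every few weeks" is read as three weeks.
RETIME_AFTER = timedelta(weeks=3)


def needs_retiming(last_timed: date, today: date, model_changed: bool = False) -> bool:
    """True if the verdict is older than RETIME_AFTER or the model was updated."""
    return model_changed or (today - last_timed) >= RETIME_AFTER


print(needs_retiming(date(2026, 1, 1), date(2026, 1, 10)))
print(needs_retiming(date(2026, 1, 1), date(2026, 1, 10), model_changed=True))
```

A model release forces a re-time regardless of age, since last month's verdict was measured against different model behavior.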

Go deeper

  • Ramp: Why We Built Our Background Agent
  • Claude Code Sub-Agents (headless execution reference)
  • Agent Backpressure (flow control for multi-agent systems)