After this, you'll be able to identify whether you've crossed the threshold where reviewing an agent's diff is cheaper than writing the code yourself, and calibrate your trust accordingly.
Before you start
Complete Your First Background Agent first, so you have real background-agent diffs to evaluate rather than hypothetical ones.
The idea
Here is what the shift looks like: you have three open PRs this morning. You did not write any of them. An agent ran overnight on tasks you specified yesterday, and now you are reviewing diffs instead of writing code. That is what crossing the diff-vs-code threshold feels like in practice.
The threshold itself is purely economic: for this task, does reviewing the agent's output cost less than writing it yourself? Not 'is AI impressive.' Not 'does the model seem capable.' Just: review time versus authorship time, measured with a timer.
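Since the comparison rests on measured time, a small stopwatch helps keep the numbers honest. The following is a minimal sketch, not a prescribed tool: the `stopwatch` helper, the `timings` dictionary, and the `review_pr_1` label are hypothetical names chosen for illustration.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, log):
    """Store the wall-clock minutes spent inside the `with` block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        log[label] = (time.perf_counter() - start) / 60.0


timings = {}
with stopwatch("review_pr_1", timings):
    pass  # placeholder: read the diff, run the tests, decide merge or reject

print(f"review_pr_1 took {timings['review_pr_1']:.2f} minutes")
```

Any timer works equally well. The point is to record the number at the moment of review rather than reconstruct it from memory.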
For a 50-line utility function with clear inputs and outputs, the answer is yes for most L8 practitioners. A Sonnet review of a well-specified PR costs roughly 8% of what an Opus implementation run costs. At that ratio, five parallel Sonnet review passes cost about 40% of one Opus implementation pass. For a 500-line architectural refactor touching six services, the answer depends on how precisely you defined the task and how good your test coverage is. The threshold is not fixed. It moves with task clarity, test coverage, and model capability.
The calibration method is direct. Review three PRs an agent wrote without your involvement. Time your review for each. Compare each time to how long writing the same code would have taken. If review is consistently faster, you are on the right side of the threshold. If you keep rewriting the agent's output instead of merging it, you are not there yet on that task type.
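The calibration above can be sketched as a small script. This is one possible reading, under stated assumptions: "consistently faster" is taken to mean faster on every reviewed PR, and "rewriting more than accepting" is taken as the failure condition. The `ReviewedPR` record, the `crossed_threshold` function, and all sample numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewedPR:
    task: str
    review_minutes: float          # measured with a timer
    write_estimate_minutes: float  # your own estimate of writing it by hand
    outcome: str                   # "accepted", "edited", or "rewritten"


def crossed_threshold(prs):
    """True if review beat authorship on every PR and rewrites do not outnumber accepts."""
    review_always_faster = all(
        p.review_minutes < p.write_estimate_minutes for p in prs
    )
    accepted = sum(p.outcome == "accepted" for p in prs)
    rewritten = sum(p.outcome == "rewritten" for p in prs)
    return review_always_faster and rewritten <= accepted


# Invented numbers for illustration only, not measurements.
sample = [
    ReviewedPR("unit tests for a utility module", 12, 45, "accepted"),
    ReviewedPR("input validation on an API endpoint", 20, 30, "edited"),
    ReviewedPR("TypeScript type updates", 15, 40, "accepted"),
]

for pr in sample:
    ratio = pr.review_minutes / pr.write_estimate_minutes
    print(f"{pr.task}: review/write = {ratio:.2f}")
print("crossed" if crossed_threshold(sample) else "not crossed")
```

Keep a separate list per task type. A verdict for unit tests says nothing about architectural refactors.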
The shift this enables is not 'I have more agents running.' It is 'work is happening without me.' You wake up, open your PRs, and start reviewing. You did not write those diffs. Agents did while you were offline.
Try it (20 min)
Watch out for
Paste this into Claude:
I want to calibrate whether I've crossed the diff-vs-code threshold for a specific task type. Here is a task I've done manually before: [describe a recent coding task, e.g. 'write unit tests for a utility module', 'add input validation to an API endpoint', 'update TypeScript types across three files']. Run this task now as a background agent. When it is done, I will review the diff and time myself. Then compare: (1) How long did reviewing the diff take? (2) How long would writing it from scratch have taken? (3) Did I accept the diff as-is, edit it, or rewrite it entirely? Based on those answers, tell me whether I've crossed the threshold for this task type and what would need to change if I haven't.
What good looks like:
Go deeper (20 min)
Paste this into Claude:
Take a task where the agent's output was not quite mergeable. Instead of accepting or rewriting it, reject it once with specific feedback. Paste the diff, then write: 'This is not mergeable because [specific reason]. Revise and re-run.' Measure whether the second diff is closer to mergeable. Then answer: what spec would have prevented the first run from missing the mark?
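One rough way to measure "closer to mergeable" is to compare each run against the version you eventually merge. The sketch below uses Python's standard `difflib` for that. It assumes you have such a merged version to compare against, the function name `mergeability_proxy` and the sample strings are hypothetical, and line similarity is only a proxy for reviewer judgment.

```python
import difflib


def mergeability_proxy(agent_output: str, merged_version: str) -> float:
    """Line-level similarity in [0, 1]; 1.0 means the agent output was merged unchanged."""
    matcher = difflib.SequenceMatcher(
        None, agent_output.splitlines(), merged_version.splitlines()
    )
    return matcher.ratio()


# Invented example: the first run has a bug, the second run fixes it.
merged = "def add(a, b):\n    return a + b\n"
first_run = "def add(a, b):\n    return a - b\n"
second_run = "def add(a, b):\n    return a + b\n"

first_score = mergeability_proxy(first_run, merged)
second_score = mergeability_proxy(second_run, merged)
print(f"first run: {first_score:.2f}, second run: {second_score:.2f}")
```

A higher score on the second run suggests the feedback landed. A flat or lower score suggests the spec, not the agent, needs the revision.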
What good looks like:
When this breaks
Claude can do it for you
Say to Claude: 'Run this task as a background agent: [task]. Do not ask for confirmation at intermediate steps. When done, give me a diff summary and the spec criteria you verified. I will review and tell you whether it is mergeable.'
You can now
Time your review of one agent-produced diff, estimate how long a manual implementation of the same task type would take, and produce a numeric verdict (crossed or not crossed) backed by both numbers.
Key takeaways
The threshold is not a feeling. Time your reviews. If reviewing consistently beats writing, you are there. If you are rewriting more than accepting, you are not there yet on that task type.