After this, you'll be able to apply a four-question decision test to any candidate task and explain why human review checkpoints are design decisions, not overhead.
Before you start
You'll want a working sense of Reproducibility Is the Architecture before this lesson, since the four-question test assumes you can replay and audit a run to establish the track record it asks for.
The idea
Here is the before and after: You are about to deploy an autonomous agent team to send personalized outreach emails to 500 leads. The agents will research each lead, draft a message, and send it. You run the four-question test and realize: sending is irreversible, the success criterion ('is this email good?') requires human judgment, and if all 500 agents make the same tone error, you have sent 500 bad emails before anyone reviews one. You do not deploy. That decision is the skill.
The hardest thing at Level 10 is not building autonomous teams. It is knowing when not to. The four questions are the test.
First: is the success criterion machine-checkable? If the only way to know the output is correct is human judgment, an autonomous team produces confident wrong answers at fleet scale. A test suite can check this. An editorial judgment call cannot.
Second: is the action reversible? Autonomous teams should not take irreversible actions (sending messages, making purchases, deleting records, publishing content) without a human gate. The speed advantage disappears the moment you have to undo something at scale.
Third: has the model demonstrated reliability on this specific task type in your prior runs? This is not a general capability question. It is a track record question. Benchmarks measure averages. Your production task needs reliability on your specific distribution.
Fourth: what is the blast radius if the fleet makes the same wrong call? One agent making a mistake is recoverable. Twenty agents making the same mistake in parallel, at speed, is not.
Human checkpoints are not overhead. They are the correct design for any task that fails these questions. Place the checkpoint at the point of maximum irreversibility, not scattered throughout the run.
Note: the four-question framework is a practical heuristic based on observed failures, not a formal guarantee. Some tasks are hard to classify cleanly. When in doubt, add the checkpoint.
Try it (18 min)
Watch out for
Paste this into Claude:
I want to evaluate whether a task I am considering is appropriate for autonomous agent team deployment. Here is the task: [describe the task in detail, including what the agents would read, what they would write, what external services they would call, and what the success criterion is]. Run the four-question test: (1) Is the success criterion machine-checkable? Can a script, a test suite, or a schema validation determine whether the output is correct without human review? If not, describe what human judgment is required. (2) Is the action reversible? List every write operation, external call, or state change the agents would make. For each one, describe how to undo it. If any are irreversible, name them. (3) Does this task require domain expertise the model has not demonstrated on this type of work? Based on your prior runs or your knowledge of the model's track record, where is confidence low? (4) What is the blast radius if all agents make the same wrong decision? Give a specific worst-case scenario. Based on these answers, give a verdict: deploy autonomously, deploy with a checkpoint at [specific point], or keep human in the loop throughout.
What good looks like:
What a good response looks like:
Four-question test for: 'Deploy 20 agents to audit all open GitHub issues and close duplicates' Q1 — Machine-checkable success criterion? YES, with caveats. Exact duplicates (same title, same error) are checkable by script. Near-duplicates require judgment. The agents will encounter judgment calls on roughly 30% of issues based on our last manual audit. Verdict on Q1: partial. Deploy with checkpoint before closing any near-duplicate. Q2 — Reversible? Closing an issue on GitHub is reversible (issues can be reopened). Posting a 'closed as duplicate' comment is visible and embarrassing if wrong but not permanently harmful. Verdict on Q2: reversible. Low risk on this dimension. Q3 — Track record on this task type? We have not run this task before. No prior runs to reference. Model has no demonstrated track record on our specific issue distribution. Verdict on Q3: unknown. Treat as low confidence. Q4 — Blast radius? Worst case: 20 agents incorrectly close 200 non-duplicate issues. Contributors lose work. We reopen and apologize. Recoverable but damaging to community trust. Verdict on Q4: bounded but reputationally costly. Final verdict: DEPLOY WITH CHECKPOINT. Agents may flag duplicates but may not close any issue until a human reviews the flagged list. Redesign to make fully autonomous: build a duplicate-detection script that flags with confidence score. Human reviews flagged list. Only exact matches (confidence > 0.95) are auto-closed. Everything else stays flagged for human decision.
When this breaks
Claude can do it for you
Say to Claude: 'I am considering deploying an autonomous agent team for this task: [describe it]. Run the four-question test: machine-checkable success criterion, reversibility of actions, domain expertise track record, and blast radius. Give me a specific verdict: deploy autonomously, deploy with a checkpoint at [where], or keep human in the loop. If there is a checkpoint recommendation, write the exact checkpoint prompt I should use.'
You can now
Apply all four questions (machine-checkable success, reversibility, demonstrated reliability, bounded blast radius) to a candidate task and produce a specific verdict (deploy autonomously, deploy with a named checkpoint, or keep human in the loop) with one redesign option that would make the task more team-suitable.
Key takeaways
Deploy autonomously only when the success criterion is checkable, the actions are reversible or gated, the model has demonstrated reliability on this task type, and the blast radius is bounded. Everything else gets a checkpoint.