Agentic Levels
© 2026 Fuentes Studio·Privacy·Terms

Level 10 · Lesson 4

When NOT to Deploy a Team

After this, you'll be able to apply a four-question decision test to any candidate task and explain why human review checkpoints are design decisions, not overhead.

Before you start

You'll want a working sense of Reproducibility Is the Architecture before this lesson, since the four-question test assumes you can replay and audit a run to establish the track record it asks for.

The idea

Consider a concrete case. You are about to deploy an autonomous agent team to send personalized outreach emails to 500 leads. The agents will research each lead, draft a message, and send it. You run the four-question test and realize three things: sending is irreversible, the success criterion ('is this email good?') requires human judgment, and if the agents make the same tone error on every lead, you have sent 500 bad emails before anyone reviews one. You do not deploy. That decision is the skill.

The hardest thing at Level 10 is not building autonomous teams. It is knowing when not to. The four questions are the test.

First: is the success criterion machine-checkable? If the only way to know the output is correct is human judgment, an autonomous team produces confident wrong answers at fleet scale. Output that a test suite can verify is machine-checkable. An editorial judgment call is not.

Second: is the action reversible? Autonomous teams should not take irreversible actions (sending messages, making purchases, deleting records, publishing content) without a human gate. The speed advantage disappears the moment you have to undo something at scale.

Third: has the model demonstrated reliability on this specific task type in your prior runs? This is not a general capability question. It is a track record question. Benchmarks measure averages. Your production task needs reliability on your specific distribution.
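One way to make the track record question concrete is to count outcomes from your own prior runs instead of citing a benchmark. The sketch below is illustrative only: `RunRecord`, `track_record`, and the `min_runs` floor of 30 are hypothetical names and values, not part of the lesson, and they assume each prior run was logged with a task type and a pass/fail result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    task_type: str  # hypothetical label, e.g. "close-duplicate-issues"
    passed: bool    # did this run pass its machine-checkable criterion?


def track_record(runs, task_type, min_runs=30):
    """Return (pass_rate, enough_evidence) for one task type.

    min_runs is an assumed floor for "we have a track record";
    pick a value that fits your own risk tolerance.
    """
    relevant = [r for r in runs if r.task_type == task_type]
    if not relevant:
        # No prior runs: treat as low confidence, not as a pass.
        return None, False
    pass_rate = sum(r.passed for r in relevant) / len(relevant)
    return pass_rate, len(relevant) >= min_runs
```

A task type with no prior runs returns `(None, False)`, which matches the lesson's advice to treat an unknown track record as low confidence.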

Fourth: what is the blast radius if the fleet makes the same wrong call? One agent making a mistake is recoverable. Twenty agents making the same mistake in parallel, at speed, is not.
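The four questions can be recorded as a small structured assessment so the verdict is explicit and auditable. This is a minimal sketch, not a formal rule: `TaskAssessment` and `four_question_test` are hypothetical names, and the mapping from answers to verdicts (in particular, "not machine-checkable plus irreversible means keep a human in the loop") is one assumed reading of the lesson.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskAssessment:
    machine_checkable: bool                 # Q1
    irreversible_actions: tuple[str, ...]   # Q2: name each one explicitly
    proven_on_task_type: bool               # Q3: from your own prior runs
    blast_radius_bounded: bool              # Q4
    highest_risk_step: str                  # where a gate would go


def four_question_test(a: TaskAssessment) -> tuple[str, list[str]]:
    """Return (verdict, failed_questions). The mapping is an assumption."""
    failed = []
    if not a.machine_checkable:
        failed.append("Q1: success criterion needs human judgment")
    if a.irreversible_actions:
        failed.append("Q2: irreversible: " + ", ".join(a.irreversible_actions))
    if not a.proven_on_task_type:
        failed.append("Q3: no demonstrated track record")
    if not a.blast_radius_bounded:
        failed.append("Q4: unbounded blast radius")

    if not failed:
        return "deploy autonomously", failed
    if not a.machine_checkable and a.irreversible_actions:
        # Nothing can verify the output and nothing can undo it.
        return "keep human in the loop", failed
    gate = a.irreversible_actions[0] if a.irreversible_actions else a.highest_risk_step
    return f"deploy with a checkpoint before: {gate}", failed
```

Applied to the outreach example, `TaskAssessment(False, ("send email",), False, False, "send email")` yields "keep human in the loop", which agrees with the decision not to deploy autonomously.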

Human checkpoints are not overhead. They are the correct design for any task that fails these questions. Place the checkpoint at the point of maximum irreversibility, not scattered throughout the run.
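A single gate placed immediately before the irreversible step might look like the sketch below. The function name `run_with_gate` and its `approve` and `send` callbacks are hypothetical; in practice `approve` could be a review UI or a command-line prompt, and `send` would be whatever performs the irreversible action.

```python
from typing import Callable, Iterable


def run_with_gate(
    drafts: Iterable[str],
    approve: Callable[[list[str]], bool],
    send: Callable[[str], None],
) -> int:
    """Run the irreversible step only after one human approval.

    Everything upstream (research, drafting) runs without a gate.
    The gate sits once, right before `send`.
    """
    batch = list(drafts)
    if not approve(batch):
        # Rejected: nothing irreversible has happened yet.
        return 0
    for draft in batch:
        send(draft)
    return len(batch)
```

Because the approval covers the whole batch before anything is sent, a shared error in the drafts is caught once, before it can reach any recipient.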

Note: the four-question framework is a practical heuristic based on observed failures, not a formal guarantee. Some tasks are hard to classify cleanly. When in doubt, add the checkpoint.

Try it (18 min)

Watch out for

  • Treating the four questions as a checklist to pass rather than a genuine risk assessment. The goal is to find the real failure mode, not to convince yourself the task is safe.
  • Conflating 'the model is capable' with 'the model is reliable for this specific task.' Capability benchmarks measure average performance. Your production task needs reliability on the specific distribution you are running.
  • Placing human checkpoints at the beginning and end but not at the high-risk middle steps. A checkpoint before the run and a review after the run does not protect against irreversible actions taken in between.
  • Assuming that adding more agents makes a judgment-dependent task safer. More agents making the same underdetermined decision produces more confident wrong answers, not better ones.
  • Treating 'the team can do it' as the same question as 'the team should do it.' Autonomous teams should handle tasks where speed and parallelism improve the outcome. Tasks where careful deliberation improves the outcome are not autonomous team tasks.

Paste this into Claude:

I want to evaluate whether a task I am considering is appropriate for autonomous agent team deployment. Here is the task: [describe the task in detail, including what the agents would read, what they would write, what external services they would call, and what the success criterion is]. Run the four-question test: (1) Is the success criterion machine-checkable? Can a script, a test suite, or a schema validation determine whether the output is correct without human review? If not, describe what human judgment is required. (2) Is the action reversible? List every write operation, external call, or state change the agents would make. For each one, describe how to undo it. If any are irreversible, name them. (3) Does this task require domain expertise the model has not demonstrated on this type of work? Based on your prior runs or your knowledge of the model's track record, where is confidence low? (4) What is the blast radius if all agents make the same wrong decision? Give a specific worst-case scenario. Based on these answers, give a verdict: deploy autonomously, deploy with a checkpoint at [specific point], or keep human in the loop throughout.

What good looks like:

  • You answered all four questions with specific evidence, not general impressions
  • For any irreversible action identified, you named it explicitly rather than describing it in general terms
  • Your verdict is specific: not 'be careful' but 'deploy with a checkpoint before the send-email step' or 'do not deploy autonomously because the success criterion requires editorial judgment'
  • The checkpoint location (if recommended) is at the point of highest uncertainty or highest irreversibility, with a reason
  • You identified at least one way the task could be redesigned to make it more autonomous-team-suitable

What a good response looks like:

Four-question test for: 'Deploy 20 agents to audit all open GitHub issues and close duplicates'

Q1 — Machine-checkable success criterion?
YES, with caveats. Exact duplicates (same title, same error) are checkable by script. Near-duplicates require judgment. The agents will encounter judgment calls on roughly 30% of issues based on our last manual audit.
Verdict on Q1: partial. Deploy with checkpoint before closing any near-duplicate.

Q2 — Reversible?
Closing an issue on GitHub is reversible (issues can be reopened). Posting a 'closed as duplicate' comment is visible and embarrassing if wrong but not permanently harmful.
Verdict on Q2: reversible. Low risk on this dimension.

Q3 — Track record on this task type?
We have not run this task before. No prior runs to reference. Model has no demonstrated track record on our specific issue distribution.
Verdict on Q3: unknown. Treat as low confidence.

Q4 — Blast radius?
Worst case: 20 agents incorrectly close 200 non-duplicate issues. Contributors lose work. We reopen and apologize. Recoverable but damaging to community trust.
Verdict on Q4: bounded but reputationally costly.

Final verdict: DEPLOY WITH CHECKPOINT. Agents may flag duplicates but may not close any issue until a human reviews the flagged list.

Redesign to make the task more autonomous: build a duplicate-detection script that flags candidates with a confidence score. Only exact matches (confidence > 0.95) are auto-closed. Everything else stays flagged, and a human reviews the flagged list and decides.
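The redesign in this verdict could be sketched as follows. This is an illustration under stated assumptions: it uses title similarity from Python's standard `difflib.SequenceMatcher` as a stand-in for the "confidence score", which the example does not define, and the `triage` function and its `flag_threshold` of 0.6 are hypothetical. Only the 0.95 auto-close cutoff comes from the verdict itself.

```python
from difflib import SequenceMatcher

AUTO_CLOSE_THRESHOLD = 0.95  # from the example verdict


def similarity(a: str, b: str) -> float:
    """Case-insensitive title similarity in [0, 1] (an assumed proxy)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def triage(new_title: str, existing_titles: list[str], flag_threshold: float = 0.6):
    """Return (action, best_match, score) for one issue title.

    flag_threshold is a hypothetical value; tune it on your own issues.
    """
    best, score = None, 0.0
    for title in existing_titles:
        s = similarity(new_title, title)
        if s > score:
            best, score = title, s
    if score > AUTO_CLOSE_THRESHOLD:
        return "auto-close", best, score
    if score >= flag_threshold:
        # Near-duplicate: a human decides.
        return "flag-for-human", best, score
    return "keep-open", None, score
```

A title that matches an existing one exactly (ignoring case) scores 1.0 and is auto-closed. Anything between the two thresholds lands on the human review list, which keeps the judgment calls out of the agents' hands.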

When this breaks

  • Breaks when the success criterion requires human judgment, because an autonomous fleet produces confident wrong answers at scale, and 'confident' is what the model defaults to whether the answer is right or not.
  • Breaks when irreversible actions sit downstream of the agents without a gate, because the speed advantage disappears the moment you have to undo a mistake at fleet scale, and undoing is rarely as parallelizable as doing.
  • Breaks when capability is mistaken for reliability, because benchmarks measure average performance, and your production task needs reliability on the specific distribution you are running, which the benchmark never saw.

Claude can do it for you

Say to Claude: 'I am considering deploying an autonomous agent team for this task: [describe it]. Run the four-question test: machine-checkable success criterion, reversibility of actions, domain expertise track record, and blast radius. Give me a specific verdict: deploy autonomously, deploy with a checkpoint at [where], or keep human in the loop. If there is a checkpoint recommendation, write the exact checkpoint prompt I should use.'

You can now

Apply all four questions (machine-checkable success, reversibility, demonstrated reliability, bounded blast radius) to a candidate task and produce a specific verdict (deploy autonomously, deploy with a named checkpoint, or keep human in the loop) with one redesign option that would make the task more team-suitable.

Key takeaways

Deploy autonomously only when the success criterion is checkable, the actions are reversible or gated, the model has demonstrated reliability on this task type, and the blast radius is bounded. Everything else gets a checkpoint.

  • The hardest skill at Level 10 is knowing when not to deploy a team. The four-question test is the decision instrument
  • Run the test on every candidate task: machine-checkable success, reversibility, demonstrated reliability, bounded blast radius
  • Human checkpoints are correct design for any task that fails the test, not overhead. Place them at the point of maximum irreversibility
  • Capability is not reliability. The model passing a benchmark says nothing about its reliability on your specific task distribution
  • Adding more agents to a judgment-dependent task produces more confident wrong answers, not better ones

Go deeper

  • Claude Code Sub-Agents (human-in-the-loop patterns)
  • Anthropic: Building effective agents
  • humanlayer/12-factor-agents: Factor 10, human in the loop