Agentic Levels

© 2026 Fuentes Studio

L10 · Lesson 3

Reproducibility Is the Architecture

After this, you'll be able to explain why reproducibility is a structural requirement at Level 10, implement the four-layer reproducibility checklist for a multi-agent run, and run a replay test on one existing flow.

Before you start

Complete Watch for Emergent Failures first, so you know which failure modes the four-layer reproducibility checklist is designed to catch when you replay a run.

The idea

Here is the scenario: a critical bug appears in production. You trace it to a multi-agent run from three days ago and try to replay that run to understand what happened. The agents produce different output than they did three days ago, so you cannot tell whether the bug came from the original run or from your replay. You cannot debug what you cannot reproduce.

At Level 10, reproducibility is not a testing practice. It is a structural requirement. If you cannot replay a run and explain every difference in the output, you cannot debug failures. And at the scale of autonomous teams, failures will happen.

Four layers make a run reproducible:

  • Separate execution environments: each agent runs in isolation with no shared mutable state. Agents that share a filesystem or cache can affect each other's results in ways that do not show up in any single agent's logs.
  • Explicit permission manifests: every tool call, file write, and external API access is declared before the run starts. Agents that discover permissions at runtime produce different behavior depending on what is available.
  • Network isolation between agent tiers: worker agents cannot call external APIs directly. All external calls route through a gate or are declared in the manifest.
  • Output validation before any write: every agent output is checked against criteria before it is committed to shared state or an external service.
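
As a rough illustration of how the manifest and network-isolation layers can reinforce each other, here is a minimal sketch of a gate that workers must go through. The names `ExternalGate`, `UndeclaredCallError`, and the manifest shape (agent name mapped to allowed hosts) are assumptions made for this example, not part of any specific framework, and the gate only records the call instead of performing real network I/O.

```python
from dataclasses import dataclass, field


class UndeclaredCallError(Exception):
    """Raised when an agent attempts an external call missing from the manifest."""


@dataclass
class ExternalGate:
    # Hypothetical manifest shape: agent name -> set of hosts it may reach.
    manifest: dict[str, set[str]]
    # Every allowed call is recorded so a replay can be compared call by call.
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def request(self, agent: str, host: str) -> str:
        allowed = self.manifest.get(agent, set())
        if host not in allowed:
            # Enforcement, not documentation: the undeclared call never happens.
            raise UndeclaredCallError(f"{agent} tried undeclared host {host}")
        self.audit_log.append((agent, host))
        # A real gate would perform the network call here; this sketch only records it.
        return f"ok:{host}"


gate = ExternalGate(manifest={"doc-writer": {"glossary.internal"}})
print(gate.request("doc-writer", "glossary.internal"))
```

Because workers never hold a network client of their own in this design, the audit log doubles as a complete record of external calls for the run.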

The replay test is the proof. Run the same flow twice on the same input. Compare outputs. If outputs differ in ways not explained by intentional randomness, you have a reproducibility failure.

None of this requires expensive infrastructure. Separate environments can be git worktrees. Permission manifests can be a YAML file checked at startup. Output validation can be a JSON schema check in CI. The investment is in doing it consistently.
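
To give a sense of how small the output-validation piece can be, the sketch below checks an agent's output against required fields and types before appending it to shared state. It is a hand-rolled stand-in for a real JSON schema validator, and the field names in `SUMMARY_CRITERIA` are hypothetical.

```python
from typing import Any

# Hypothetical criteria: required field name -> expected Python type.
SUMMARY_CRITERIA: dict[str, type] = {"agent": str, "file": str, "summary": str}


def validate_output(output: dict[str, Any], criteria: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the output may be committed."""
    problems = []
    for name, expected in criteria.items():
        if name not in output:
            problems.append(f"missing field: {name}")
        elif not isinstance(output[name], expected):
            problems.append(f"wrong type for {name}: expected {expected.__name__}")
    return problems


def commit_if_valid(output: dict[str, Any], store: list[dict[str, Any]]) -> bool:
    """Append to shared state only when validation passes."""
    if validate_output(output, SUMMARY_CRITERIA):
        return False
    store.append(output)
    return True


shared_state: list[dict[str, Any]] = []
print(commit_if_valid({"agent": "summarizer", "file": "a.md", "summary": "ok"}, shared_state))
print(commit_if_valid({"agent": "summarizer"}, shared_state))
```

The point of the design is the ordering: validation happens before the write, so an invalid output never reaches shared state in the first place.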

Note: these four layers are well-observed across teams running at this scale as of 2026. Tooling for automated enforcement of all four is still maturing. Most practitioners implement two or three layers consistently and treat the others as best-effort.

Try it (25 min)

Watch out for

  • Treating reproducibility as a testing concern rather than an architecture concern. If your execution environment is not isolated, tests cannot catch the leak. Fix the environment first.
  • Permission manifests that are documentation, not enforcement. A manifest that agents can ignore at runtime is not a permission gate. It must be read at startup and respected, or checked by the infrastructure.
  • Assuming git branches provide full isolation. Two agents on separate branches can still share a filesystem, the same API quota, or the same cache. Branch isolation is one layer, not all four.
  • Skipping the replay test because the flow passed its unit tests. Unit tests run one agent in isolation. The replay test runs the full multi-agent flow and compares outputs. They test different things.
  • Treating byte-identical outputs as the only measure of reproducibility. Outputs that differ only in logged timestamps or log ordering are fine. Outputs that differ in results or side effects are not.
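
One way to act on the last point is to normalize outputs before comparing them, so that expected differences drop out and real ones remain. The sketch below is an assumption-laden example: `VOLATILE_KEYS` and `UNORDERED_KEYS` are hypothetical field names you would replace with whatever your own flow emits.

```python
import json
from typing import Any, Optional

# Assumed names of fields that legitimately vary between runs.
VOLATILE_KEYS = frozenset({"timestamp", "started_at", "finished_at", "duration_ms"})
# Assumed names of list fields whose ordering carries no meaning.
UNORDERED_KEYS = frozenset({"log"})


def normalize(value: Any, key: Optional[str] = None) -> Any:
    """Drop volatile keys recursively and sort lists stored under unordered keys."""
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        items = [normalize(v) for v in value]
        if key in UNORDERED_KEYS:
            items.sort(key=lambda item: json.dumps(item, sort_keys=True))
        return items
    return value


def same_result(a: Any, b: Any) -> bool:
    """True when two run outputs agree on everything except volatile details."""
    return normalize(a) == normalize(b)


run_a = {"result": "ok", "timestamp": "t1",
         "log": [{"msg": "a", "timestamp": "1"}, {"msg": "b", "timestamp": "2"}]}
run_b = {"result": "ok", "timestamp": "t2",
         "log": [{"msg": "b", "timestamp": "3"}, {"msg": "a", "timestamp": "4"}]}
print(same_result(run_a, run_b))  # True
```

Keep the two key sets short and reviewed: every field added to them is a difference the replay test will no longer report.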

Paste this into Claude:

I want to audit one of my existing multi-agent flows for reproducibility. The flow I am auditing is: [describe the flow: what agents run, what they read, what they write, what external services they call]. Walk me through the four-layer checklist: (1) Separate execution environments: does each agent run in an isolated environment with no shared mutable state? If not, what state leaks between agents? (2) Explicit permission manifests: are all tool calls and external writes declared before the run, or do agents discover permissions at runtime? List every external call the flow makes. (3) Network isolation: can worker agents call external APIs directly, or do those calls route through a gate? (4) Output validation: is each agent's output checked against criteria before it is committed? For each layer where the answer is no, write one concrete step to fix it.

What good looks like:

  • You audited all four layers, not just the ones you already knew were weak
  • For each layer, your answer is specific: not 'it is probably isolated' but 'agents write to separate git worktrees' or 'agents share the same src/ directory'
  • You listed every external call the flow makes, including reads from shared files or APIs
  • Each fix step is concrete and implementable in one session
  • You identified the layer most likely to cause a reproducibility failure in your current setup

What a good response looks like:

Reproducibility audit for a 4-agent documentation pipeline (doc-writer, type-checker, linter, summarizer):

Layer 1 — Execution environments: FAIL
Current state: all agents write to the same src/ directory. If doc-writer updates a file while type-checker is reading it, the type-checker output reflects a partially-updated state.
Fix: provision a separate git worktree per agent at dispatch time. Each agent works in worktree/agent-<id>/. Merge all worktrees back to main via the CI gate after all agents complete.

Layer 2 — Permission manifests: PARTIAL
Current state: doc-writer and summarizer have hardcoded API calls to our internal glossary service. These are not declared anywhere before the run.
External calls found: glossary API (2 agents), GitHub API for PR creation (supervisor), npm registry check (type-checker)
Fix: add agent-permissions.yaml listing all four calls. Check at startup: if a call is not in the manifest, the agent aborts and reports the undeclared call.

Layer 3 — Network isolation: FAIL
Current state: doc-writer calls the glossary API directly without routing through the supervisor.
Fix: route all external API calls through a supervisor proxy method. Workers call supervisor.externalFetch(url), not fetch(url) directly.

Layer 4 — Output validation: PASS
Current state: each agent output is checked by the supervisor against a JSON schema before commit.

Most likely reproducibility failure: Layer 1 (shared src/ directory). Agents reading and writing the same files in the same run is a direct source of non-determinism.
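
The Layer 1 fix in the sample response (one worktree per agent) could be provisioned along these lines. This is a sketch only: the function builds `git worktree add` commands without running them, and the `agent/<id>` branch naming is a hypothetical convention chosen for the example.

```python
from pathlib import PurePosixPath


def worktree_commands(agent_ids: list[str], root: str = "worktree") -> list[list[str]]:
    """Build one `git worktree add` command per agent, without running anything."""
    commands = []
    for agent_id in agent_ids:
        path = PurePosixPath(root) / f"agent-{agent_id}"
        branch = f"agent/{agent_id}"  # hypothetical branch naming scheme
        commands.append(["git", "worktree", "add", "-b", branch, str(path)])
    return commands


for command in worktree_commands(["doc-writer", "linter"]):
    print(" ".join(command))
```

To actually create the worktrees, each command could be passed to `subprocess.run(command, check=True)` from the repository root at dispatch time. Worktrees isolate the working directory only; a shared cache or API quota still needs the other layers.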

When this breaks

  • Breaks when reproducibility is treated as a testing concern instead of an architecture concern: tests run in the leaky environment they are meant to validate, and checking a non-deterministic substrate harder does not make it deterministic.
  • Breaks when permission manifests are documentation rather than enforcement: agents that can ignore the manifest at runtime will, and the resulting behavior depends on which capabilities happened to be available that day.
  • Breaks when teams assume git branches provide full isolation: agents on separate branches can still share a filesystem, an API quota, and a cache, so the run depends on cross-agent state that branch isolation does not cover.

Claude can do it for you

Say to Claude: 'Audit my multi-agent flow for reproducibility: [describe it]. Check all four layers: execution environment isolation, permission manifests, network isolation, and output validation. For each layer, tell me the current state and one concrete fix. Then write me a replay test script that runs the flow twice on the same input and diffs the outputs.'

You can now

Audit one of your multi-agent flows against all four reproducibility layers (execution environment isolation, permission manifests, network isolation, output validation) and produce one concrete fix step for each layer that fails.

Key takeaways

Reproducibility is not a feature you add later. If you cannot replay a run and explain every difference, you cannot debug the failure that will eventually happen.

  • Reproducibility is structural, not a testing add-on. If you cannot replay a run, you cannot debug it
  • Four layers make a run reproducible: isolated execution environments, declared permission manifests, network isolation, and output validation
  • The replay test is the proof: run the same flow twice on the same input and account for every difference
  • None of the four layers requires expensive infrastructure. Worktrees, YAML manifests, and JSON schema checks suffice
  • Most teams implement two or three layers consistently. Pick the weakest and fix that one first

Go deeper

  • Claude Code Sub-Agents (execution and isolation reference)
  • 12-factor-agents: Stateless composable agent design
  • Latent Space Podcast (reproducibility and agentic infrastructure)