After this, you'll be able to write a structured failure or success report from one of your autonomous team runs and explain why documentation at this level is a contribution, not just a record.
Before you start
Complete When NOT to Deploy a Team first; this lesson builds on the deployment decision process so your report captures what the four-question test revealed, not just what the agents did.
The idea
Here is the before and after: You hit a coordination failure last week that cost $60 and produced inconsistent output. You have a hypothesis about why it happened. You tried two mitigations, one of which partially worked. You have not seen anyone else write about this exact failure mode. That observation, written up in 15 minutes, is worth more to the community than another tutorial on how to prompt an agent.
Level 10 is where the map is still being drawn. The failure modes are barely documented. The tooling is experimental. Most practitioners who have hit interesting failure modes have not written them up, not because the knowledge does not exist, but because they are busy building.
Documenting what you learn at this level is not optional. It is the contribution.
A useful frontier write-up has five parts. Context: what you were trying to do and what agents were involved. Observation: what happened that you did not expect, with specific evidence (numbers, log lines, diffs). Hypothesis: your best explanation. Mitigation: what you changed and whether it worked, or 'not yet mitigated' if it is still open. Open questions: what you still do not understand.
The open questions section is as important as the resolution. Shared uncertainty moves the field forward faster than shared answers.
You do not need to have solved the problem to write it up. A clear account of an unresolved failure with your best hypothesis and what you tried is genuinely useful to the small number of people at this level who are likely to hit the same wall.
Try it (15 min)
Watch out for
Paste this into Claude:
I want to write up one thing I have observed in my autonomous agent work. The observation is: [describe one thing that worked, broke, or surprised you in a multi-agent run, e.g. 'two agents both modified the same config file and the merge produced invalid JSON', or 'a feedback loop between my generator and reviewer agents doubled my token cost before hitting the cap', or 'I decided not to deploy a team for X because of Y']. Help me write a structured report with five sections: (1) Context: what I was trying to do and what agents were involved. (2) Observation: what happened that I did not expect, with specific evidence (log lines, cost numbers, output diffs if available). (3) Hypothesis: my best explanation of why it happened. (4) Mitigation: what I changed and whether it worked, or 'not yet mitigated' if still open. (5) Open questions: what I still do not understand about this failure or pattern. Write the report in plain language, under 400 words, as if sharing with another practitioner at this level.
What good looks like:
What a good response looks like:
Frontier write-up: feedback loop between generator and reviewer agents Context: 3-agent pipeline. Generator creates documentation stubs. Reviewer checks completeness and flags gaps. Summarizer produces a final output. Run on 40 API endpoints. Observation: reviewer called generator.addContent() 11 times in a single run before the circuit breaker fired. Total cost: $58. Expected cost: ~$8. The final output had 200 stubs, none marked complete. The summarizer never ran. Hypothesis: reviewer's 'completeness' threshold was defined as 'all required sections filled.' Generator was adding content that satisfied individual section checks but the reviewer was rechecking the full document on each call, not just the new sections. Each pass found something else missing because the threshold was applied globally, not incrementally. This created a ratchet: the more content was added, the more checks were triggered. Mitigation: changed reviewer to track which sections it has already marked complete and skip those on subsequent passes. This reduced calls from 11 to 2 on a re-run of the same 40 endpoints. Cost: $9.40. Open questions: (1) Is there a principled way to define 'completeness' for generative content that does not require full-document re-evaluation? (2) The 2-call result feels better but I do not know if it is because the threshold is now correct or because the reviewer is missing real gaps. I do not have ground truth for what 'complete' should look like on these endpoints. (3) Would a different reviewer model (Haiku instead of Sonnet) have broken the loop earlier, or just produced lower-quality completeness checks?
When this breaks
Claude can do it for you
Say to Claude: 'Help me write a structured failure report for this observation: [describe it]. Use five sections: context, observation (with specific evidence), hypothesis, mitigation, and open questions. Keep it under 400 words, plain language, no jargon. I want something I can share with other practitioners.'
You can now
Produce a five-section report (context, observation with concrete evidence, hypothesis, mitigation status, open questions) under 400 words about one observed multi-agent failure or surprising success, written so another L10 practitioner can use it without additional context.
Key takeaways
The frontier gets mapped by the people working on it. Write up what you find, including what you have not solved. That is the contribution Level 10 needs most right now.