Managing what the model knows at scale
After this, you'll be able to decide whether to retrieve or fit information in context based on size, change frequency, and task type, and you'll know why structured output becomes a contract at this level.
Before you start
You'll want a working sense of Context Engineering Fundamentals first. This lesson extends those skills to data that exceeds a single context window.
The idea
At Level 4 you learned to keep context clean. At Level 5, the problem shifts: what do you do when the information you need does not fit in the window at all? A large codebase, a document library, a product spec longer than a novel. You cannot paste it all in. You need a different architecture.
RAG (Retrieval-Augmented Generation) shifts the question from 'what can I fit in the window' to 'what should I retrieve right now.' Index your docs once, embed the query at runtime, and pull only the relevant chunks. The usual failure mode is not building the index; it is retrieval quality. Bad chunking returns irrelevant sections, and the model works from them as if they answered the question.
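To make the index-once, embed-at-runtime loop concrete, here is a minimal sketch. It assumes a toy bag-of-words vector in place of a real embedding model, and the names embed, build_index, and retrieve, along with the sample chunks, are made up for illustration rather than taken from any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Index once: keep each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """At query time: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Illustrative sample chunks (invented for this sketch).
chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Payment processing uses a third-party card gateway.",
    "Employees accrue vacation days monthly.",
]
index = build_index(chunks)
top = retrieve(index, "How long do refunds take?", k=1)
print(top)
```

Only the chunks returned by retrieve would go into the prompt. Whatever the ranking gets wrong, the model never sees, which is why retrieval quality matters more than the index itself.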
The second problem Level 5 solves is output format. When your AI output feeds a database, a UI, or another model call, free text fails. Structured output (JSON mode, XML tags, schema constraints) forces a predictable shape. Without it, every consumer of your pipeline writes its own fragile parser. With it, you have a contract.
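One way to picture the contract is a single validator that every consumer shares instead of each writing its own parser. The sketch below uses only the standard library. The field names in TICKET_CONTRACT and the function parse_ticket are hypothetical, chosen for illustration.

```python
import json

# Hypothetical contract: field name -> required Python type.
TICKET_CONTRACT = {"customer_id": int, "category": str, "refund_due": bool}


def parse_ticket(raw: str) -> dict:
    """Parse model output and reject anything that breaks the contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on free text
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = TICKET_CONTRACT.keys() - data.keys()
    extra = data.keys() - TICKET_CONTRACT.keys()
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    for field, expected in TICKET_CONTRACT.items():
        # Compare exact types: bool is a subclass of int in Python,
        # so isinstance() would let True pass as a customer_id.
        if type(data[field]) is not expected:
            raise ValueError(f"{field} should be {expected.__name__}")
    return data


valid = '{"customer_id": 42, "category": "refund", "refund_due": true}'
ticket = parse_ticket(valid)
```

A schema library or a provider's schema-constrained output mode could replace the hand-written checks. The point is that the shape is declared once and enforced before anything downstream reads it.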
The decision rule for retrieve vs fit: if the information is static and large, retrieve it. If it is dynamic and small, fit it in context directly. A full codebase belongs in a retrieval layer. A two-page spec belongs in the window. The hidden cost of retrieval is that the model works from chunks and summaries, not the whole document, which can cause it to miss connections that span sections.
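The rule can be written down as a small function. This is a sketch under stated assumptions: the 50,000-token budget is an arbitrary placeholder, not a recommendation, and the large-and-dynamic branch is an assumption, because the rule above only names the static-and-large and dynamic-and-small cases.

```python
def placement(tokens: int, changes_often: bool, window_budget: int = 50_000) -> str:
    """Return 'retrieve' or 'fit' for one information source.

    window_budget is a hypothetical share of the context window
    reserved for this source, measured in tokens.
    """
    large = tokens > window_budget
    if large and not changes_often:
        return "retrieve"  # static and large: the rule from the lesson
    if not large and changes_often:
        return "fit"       # dynamic and small: the rule from the lesson
    if large:
        # Assumption: it cannot fit anyway, so retrieve and refresh the index as it changes.
        return "retrieve"
    return "fit"           # small and static, like the two-page spec


print(placement(tokens=2_000_000, changes_often=False))  # a full codebase
print(placement(tokens=1_500, changes_often=True))       # a short, frequently edited spec
```

The function deliberately ignores task type. If the task depends on connections that span sections, that is a reason to favor fitting the whole document even when it is close to the budget.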
Here is the before and after: a team indexed 800 pages of internal policy docs with fixed 4K-token chunks. Queries about refund policy kept returning payment processing chunks instead, because the fixed-size splits cut sections mid-topic. Switching to semantic chunking (splitting at heading boundaries) dropped irrelevant retrievals from 40% to 6%. The index build time was identical. The retrieval quality difference was the entire product.
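The difference between the two strategies is easy to see in a few lines. This sketch assumes headings are lines starting with '# ' and measures size in characters rather than tokens to stay dependency-free. The function names and the sample document are illustrative, not the team's actual pipeline.

```python
import re


def fixed_chunks(text: str, size: int) -> list[str]:
    """Split every `size` characters, ignoring topic boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def heading_chunks(text: str) -> list[str]:
    """Split so each chunk starts at a heading line (assumed to begin with '# ')."""
    parts = re.split(r"(?m)^(?=# )", text)  # zero-width split, needs Python 3.7+
    return [part.strip() for part in parts if part.strip()]


# Invented sample document with two topics.
doc = (
    "# Refunds\n"
    "Refunds are issued within 14 days.\n"
    "# Payments\n"
    "Cards are charged at checkout.\n"
)

by_size = fixed_chunks(doc, 40)
by_heading = heading_chunks(doc)

for chunk in by_size:
    print(repr(chunk))
for chunk in by_heading:
    print(repr(chunk))
```

With the fixed split, the tail of the refund section and the start of the payments section land in the same chunk, so a refund query can pull in payment text. The heading split keeps each topic in its own chunk.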
Try it (5 min)
Watch out for
Paste this into Claude:
I have a workflow where my AI output feeds a downstream system. Here is the prompt I currently use: [paste your prompt]. Here is what consumes the output: [describe: a database write, a UI render, a second prompt, a webhook]. Help me: (1) identify what shape the consumer actually needs (fields, types, constraints), (2) choose between JSON mode, XML tags, or schema validation based on whether I need strict typing or just predictable structure, (3) rewrite the prompt to enforce that shape, and (4) show me a valid output and an invalid output so I can write a test against the contract.
What good looks like:
When this breaks
You can now
Distinguish between a depth task (fit in context) and a breadth task (retrieve), then justify the choice using token size, change frequency, and what the task actually requires.
Key takeaways
Level 5 replaces two failure modes (context overflow, unpredictable output) with two composable solutions: retrieve only what you need, and constrain the output shape so consumers can rely on it.