Scaling Context Across Projects

Managing what the model knows at scale

After this, you'll be able to decide whether to retrieve or fit information in context based on size, change frequency, and task type, and you'll know why structured output becomes a contract at this level.

Before you start

You'll want a working sense of Context Engineering Fundamentals before this lesson extends those skills to data that exceeds a single context window.

The idea

At Level 4 you learned to keep context clean. At Level 5, the problem shifts: what do you do when the information you need does not fit in the window at all? A large codebase, a document library, a product spec longer than a novel. You cannot paste it all in. You need a different architecture.

Figure: a large source slab and a retrieval lane compete for the same task, leaving the golden dot off to the side (the starting state for this lesson).

Use this model to move from the starting mistake to the lesson check.

	Before	After
Habit	Guess from a loose request	Use the lesson move
Work move	Skip Scaling Context Across Projects	Apply Scaling Context Across Projects
Check	No clear proof	Pass the lesson check

The after column is the lesson target.

RAG (Retrieval-Augmented Generation) shifts the question from 'what can I fit in the window' to 'what should I retrieve right now.' Index your docs once, embed the query at runtime, and pull only the relevant chunks. The failure mode is not building the index. It is retrieval quality. Bad chunking returns irrelevant sections, and the model works from those as if they were correct.
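
The index-once, retrieve-at-query-time flow can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy word-count stand-in for a real embedding model, and the function names and sample chunks are hypothetical.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Index once: store each chunk next to its vector."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 2) -> list[str]:
    """At query time: embed the query and return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Hypothetical policy snippets standing in for real document chunks.
chunks = [
    "refunds are issued within 14 days of a return request",
    "payment processing uses a third party card gateway",
    "shipping is free for orders above the minimum value",
]
index = build_index(chunks)
top = retrieve(index, "how are refunds issued", k=1)
print(top)
```

Only the chunks returned by `retrieve` would be placed in the prompt. If the chunks themselves are poorly cut, the ranking step cannot rescue them, which is why the rest of this lesson focuses on retrieval quality.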

The second problem Level 5 solves is output format. When your AI output feeds a database, a UI, or another model call, free text fails. Structured output (JSON mode, XML tags, schema constraints) forces a predictable shape. Without it, every consumer of your pipeline writes its own fragile parser. With it, you have a contract.

The decision rule for retrieve vs fit: if the information is static and large, retrieve it. If it is dynamic and small, fit it in context directly. A full codebase belongs in a retrieval layer. A two-page spec belongs in the window. The hidden cost of retrieval is that the model works from chunks and summaries, not the whole document, which can cause it to miss connections that span sections.
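
One way to make the rule explicit is a small function. This is a sketch under stated assumptions: the `Source` fields, the two thresholds, and the handling of mixed cases (for example, mid-sized static material) are illustrative choices, not values the lesson prescribes.

```python
from dataclasses import dataclass


@dataclass
class Source:
    tokens: int            # estimated size of the material
    changes_often: bool    # True if the content changes between requests
    needs_whole_doc: bool  # True for depth tasks that reason across sections


# Assumed thresholds: tune them for your model and budget.
WINDOW_BUDGET = 100_000  # tokens you are willing to spend on this source
SMALL = 8_000            # below this, pasting is cheaper than indexing


def choose_strategy(src: Source) -> str:
    """Return "fit" or "retrieve" for one source."""
    if src.tokens > WINDOW_BUDGET:
        return "retrieve"  # it does not fit, so retrieval is the only option
    if src.needs_whole_doc or src.changes_often:
        return "fit"       # cross-section reasoning or fast-changing content
    return "fit" if src.tokens <= SMALL else "retrieve"


codebase = Source(tokens=2_000_000, changes_often=False, needs_whole_doc=False)
short_spec = Source(tokens=1_500, changes_often=True, needs_whole_doc=True)
print(choose_strategy(codebase), choose_strategy(short_spec))
```

The order of the checks encodes the trade-off described above: size forces retrieval, but when the material fits, a task that needs connections across sections keeps the whole document in the window.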

Here is the before and after: a team indexed 800 pages of internal policy docs with fixed-size 4K-token chunks. Queries about refund policy kept returning payment processing chunks because the fixed-size splits cut sections mid-topic. Switching to semantic chunking (splitting at heading boundaries) dropped irrelevant retrievals from 40% to 6%. The index build time was identical. The retrieval quality difference was the entire product.
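
The difference between the two chunking strategies can be sketched as follows. This assumes Markdown-style `#` headings and, for brevity, measures fixed-size chunks in characters rather than tokens; the function names and sample document are hypothetical.

```python
import re


def chunk_fixed(doc: str, size: int) -> list[str]:
    """Fixed-size chunks: cut every `size` characters, regardless of topic."""
    return [doc[i:i + size] for i in range(0, len(doc), size)]


def chunk_by_heading(doc: str) -> list[str]:
    """Heading-boundary chunks: each chunk starts at a heading line."""
    # Zero-width split before any line that starts with 1-6 '#' and a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", doc)
    return [part.strip() for part in parts if part.strip()]


doc = (
    "# Refunds\n"
    "Refunds are issued within 14 days.\n"
    "# Payments\n"
    "Cards are charged at checkout.\n"
)

fixed = chunk_fixed(doc, size=50)
by_heading = chunk_by_heading(doc)

for chunk in fixed:
    print("FIXED  :", repr(chunk))
for chunk in by_heading:
    print("HEADING:", repr(chunk))
```

In the fixed-size output, a chunk boundary lands in the middle of the document's second section, so a query about one topic can pull in text from another. The heading-based chunks each cover exactly one topic.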

Try it (5 min)

Watch out for

Defaulting to RAG because it sounds advanced. If the document fits in context and you need cross-section reasoning, retrieval makes the answer worse.
Building a retrieval index without measuring hit rate. A 60% hit rate means 40% of your responses work from the wrong chunks, and you will not know.
Using JSON mode when you need field-type enforcement. JSON mode guarantees syntax only. `{"priority": "high"}` is valid JSON but breaks an integer column.
Treating structured output as one-and-done. Schemas evolve as your downstream consumer evolves. Re-run a 5-output sample test whenever you change either side.
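
The gap between valid syntax and a valid contract can be shown with a small check. This is a minimal sketch using only the standard library; `TICKET_SCHEMA` and its fields are a hypothetical contract, and a real pipeline might use a schema library or API-level schema enforcement instead.

```python
import json

# Hypothetical contract for a ticket record: field name -> required Python type.
TICKET_SCHEMA = {"title": str, "priority": int}


def validate(raw: str, schema: dict[str, type]) -> list[str]:
    """Return a list of contract violations; an empty list means the payload passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif type(data[field]) is not expected:
            actual = type(data[field]).__name__
            errors.append(f"{field}: expected {expected.__name__}, got {actual}")
    return errors


good = '{"title": "Login bug", "priority": 2}'
bad = '{"title": "Login bug", "priority": "high"}'

print(validate(good, TICKET_SCHEMA))
print(validate(bad, TICKET_SCHEMA))
```

Both payloads parse as JSON, but only the first satisfies the contract. Running a check like this before the database write turns a silent type error into an explicit failure.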

Paste this into Claude

I have a workflow where my AI output feeds a downstream system. Here is the prompt I currently use: [paste your prompt]. Here is what consumes the output: [describe: a database write, a UI render, a second prompt, a webhook]. Help me: (1) identify what shape the consumer actually needs (fields, types, constraints), (2) choose between JSON mode, XML tags, or schema validation based on whether I need strict typing or just predictable structure, (3) rewrite the prompt to enforce that shape, and (4) show me a valid output and an invalid output so I can write a test against the contract.

What good looks like

Your rewritten prompt produces output in the same shape across at least 5 test runs
The output parses cleanly into the downstream consumer without manual fix-ups
You can name the structured-output method you chose and why it fits this use case
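
The first two checks can be automated with a shape comparison across runs. This is a sketch, assuming the outputs are JSON; the five strings below are hand-written stand-ins for real model runs, and `shape` and `same_shape` are hypothetical helper names.

```python
import json


def shape(value):
    """Reduce a parsed JSON value to its structure: keys and type names only."""
    if isinstance(value, dict):
        return {key: shape(item) for key, item in value.items()}
    if isinstance(value, list):
        return "list"
    return type(value).__name__


def same_shape(raw_outputs: list[str]) -> bool:
    """True if every output parses and all outputs share one structure."""
    try:
        shapes = [shape(json.loads(raw)) for raw in raw_outputs]
    except json.JSONDecodeError:
        return False
    return bool(shapes) and all(s == shapes[0] for s in shapes)


# Stand-ins for five runs of the same prompt.
runs = [
    '{"title": "Login bug", "priority": 2}',
    '{"title": "Slow export", "priority": 1}',
    '{"title": "Typo on pricing page", "priority": 3}',
    '{"title": "Broken webhook", "priority": 1}',
    '{"title": "Missing invoice", "priority": 2}',
]
drifted = runs[:4] + ['{"title": "Missing invoice", "priority": "high"}']

print(same_shape(runs), same_shape(drifted))
```

A single drifted field type in one of the five runs is enough to fail the check, which is the behavior you want before wiring the prompt into a downstream consumer.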

When this breaks

Breaks when you retrieve for a depth task that needs cross-section synthesis: the model sees only disconnected chunks and cannot reason about patterns that span the whole document.
Breaks when retrieval quality is unmeasured: every silent retrieval miss looks like a model failure, so you waste cycles tuning prompts instead of fixing the chunking strategy that is causing the wrong answers.
Breaks when structured output is enforced only in the prompt and not at the API level: the model occasionally drifts on field types, and your downstream system swallows malformed values without raising an error.
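
Unmeasured retrieval quality is the easiest of these to address: label a handful of queries with the chunk that should come back, then compute a hit rate. The sketch below assumes a retriever that returns chunk ids; `fake_retrieve`, the canned results, and the chunk ids are hypothetical stand-ins for a real retrieval layer.

```python
from typing import Callable


def hit_rate(
    eval_set: list[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    if not eval_set:
        raise ValueError("eval_set must contain at least one labelled query")
    hits = sum(1 for query, expected in eval_set if expected in retrieve(query, k))
    return hits / len(eval_set)


# Canned results standing in for a real retriever.
CANNED = {
    "refund window": ["refunds-1", "payments-2"],
    "card declined": ["payments-1"],
    "return shipping cost": ["payments-2", "payments-1"],  # the right chunk is missing
}


def fake_retrieve(query: str, k: int) -> list[str]:
    return CANNED[query][:k]


eval_set = [
    ("refund window", "refunds-1"),
    ("card declined", "payments-1"),
    ("return shipping cost", "refunds-3"),
]
rate = hit_rate(eval_set, fake_retrieve, k=3)
print(f"hit rate: {rate:.0%}")
```

Re-running the same labelled set after each chunking change shows whether the change helped, which separates retrieval problems from prompt problems.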

AI can help with this

Use AI to apply this lesson to your current work. Share your situation, ask for one concrete next step, and check the answer against this test: Distinguish between a depth task (fit in context) and a breadth task (retrieve), then justify the choice using token size, change frequency, and what the task actually requires.

The source slab and retrieval lane settle into separate paths, with the task routed through the right one.

You can now

Distinguish between a depth task (fit in context) and a breadth task (retrieve), then justify the choice using token size, change frequency, and what the task actually requires.

Key takeaways

Level 5 replaces two failure modes (context overflow, unpredictable output) with two composable solutions: retrieve only what you need, and constrain the output shape so consumers can rely on it.

RAG solves context overflow by retrieving only the relevant chunks at query time instead of loading everything.
Retrieval quality (chunking strategy, re-ranking) fails more often than index building. Measure it.
Structured output is a contract between your AI and whatever consumes its output. Use it whenever output feeds a system.
Retrieve for large and static. Fit in context for small and dynamic. Know which is which.
