Agentic Levels

© 2026 Fuentes Studio·Privacy·Terms

L5 · Lesson 1

Build Your First RAG Pipeline

After this, you'll have a working RAG pipeline on a real document set, and you'll be able to measure whether your retrieval is actually returning relevant chunks.

Before you start

Before diving in, complete Checkpoint and Clear so you have the session-management habits that keep retrieved content from polluting context between pipeline iterations.

The idea

You have a 400-page company handbook. Pasting all of it into Claude gives you noise, not answers: the relevant paragraph is buried under 70,000 tokens of unrelated policy text, and the output reflects that mess. RAG is what you need.

RAG (Retrieval-Augmented Generation) solves one problem: your document set is too large to paste into the context window every time. Instead of loading everything, you index your documents once, then retrieve only the relevant chunks at query time. The architecture has four steps: embed your documents into vectors, store those vectors in an index, embed the incoming query, then retrieve the most similar chunks and pass them as context.
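
The four steps can be sketched in plain Python. This is a toy illustration under stated assumptions, not a production setup: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, and the three handbook sentences are made up for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Steps 1 and 2: embed every chunk once and store the vectors."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index: list[tuple[str, Counter]], query: str, k: int = 5) -> list[str]:
    """Steps 3 and 4: embed the query, then return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# Made-up example chunks, for illustration only.
chunks = [
    "Vacation requests must be submitted two weeks in advance.",
    "Expense reports are due by the fifth of each month.",
    "Remote work requires manager approval.",
]
index = build_index(chunks)
top = retrieve(index, "when are expense reports due", k=1)
print(top)
```

The retrieved chunks are what you would paste into the prompt as context. Swapping the toy `embed` for a real model changes the quality of the ranking, not the shape of the pipeline.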

Here is the part most tutorials skip: the failure mode is almost never step one. Embedding generation is reliable. The failure is step four, retrieval, and it usually traces back to how the documents were chunked.

Here is a before and after. A real team indexed their entire engineering wiki, 3,400 documents, using fixed 512-token chunks. Their hit rate at top-5 retrieval was 41%: they were returning five chunks, but only two were relevant on average. The other three actively confused the model by introducing unrelated content. After switching to 256-token chunks with a 64-token overlap (so a sentence that straddles a chunk edge is more likely to appear whole in one of the two chunks), their hit rate jumped to 73%.
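
Fixed-size chunking with overlap can be sketched as a sliding window. This is a minimal illustration that assumes the text is already tokenized into a list; the `chunk_tokens` helper is hypothetical, and real splitters (which also respect sentence boundaries) will produce somewhat different chunks.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 256, overlap: int = 64) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks

# Placeholder tokens standing in for a 600-token document.
tokens = [f"t{i}" for i in range(600)]
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

With the defaults, each chunk starts 192 tokens after the previous one, so the last 64 tokens of one chunk are repeated as the first 64 of the next.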

Start with naive chunking (fixed token sizes, no overlap) and measure before you optimize. Retrieval quality is where production systems break, not embedding generation. Measure top-k hit rate (here, the percentage of your top-5 retrieved chunks that are actually relevant to the query, also called precision at 5) on a sample of 20 real queries before you tune anything. You cannot optimize what you have not measured.
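
The per-chunk measure described above can be computed in a few lines. This is a sketch that assumes you have hand-labeled which chunk IDs are relevant for each query; the function names, chunk IDs, and queries are hypothetical.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for chunk_id in top_k if chunk_id in relevant) / len(top_k)

def mean_precision_at_k(results: dict[str, list[str]],
                        labels: dict[str, set[str]], k: int = 5) -> float:
    """Average precision at k across every query in the eval set."""
    scores = [precision_at_k(results[q], labels.get(q, set()), k) for q in results]
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical retrieval output and hand labels for two queries.
results = {
    "q1": ["a", "b", "c", "d", "e"],
    "q2": ["f", "g", "h", "i", "j"],
}
labels = {"q1": {"a", "c"}, "q2": {"f", "g", "z"}}
score = mean_precision_at_k(results, labels)
print(f"{score:.0%} of retrieved chunks were relevant")
```

Two relevant chunks out of five per query, as in the wiki example, gives a score of 40%.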

Try it (25 min)

Watch out for

  • Building a RAG pipeline before checking whether your document set actually fits in context. If the whole set is under 100K tokens, skip RAG and paste it all in.
  • Measuring nothing. A pipeline with no hit rate metric is a black box. You will not know when it breaks.
  • Using fixed 1024-token chunks without measuring. Larger chunks tend to hurt precision because unrelated sentences end up in the same chunk as the relevant one. Start smaller and measure.
  • Forgetting overlap between chunks. Without overlap, sentences that fall at a chunk boundary are cut in half and lose meaning.
  • Treating retrieval quality as a one-time setup concern. It degrades as documents change. Re-eval quarterly.
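
The first check in the list, whether the document set fits in context at all, can be approximated before you build anything. This sketch assumes roughly 4 characters per token, which is a common rule of thumb for English text rather than an exact count; the helper names and the `*.md` pattern are illustrative, and the 100K budget comes from the guideline above.

```python
from pathlib import Path

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; ~4 characters per token is a heuristic, not a tokenizer."""
    return int(len(text) / chars_per_token)

def corpus_token_estimate(folder: str, pattern: str = "*.md") -> int:
    """Sum the estimated tokens of every matching file under folder (recursive)."""
    total = 0
    for path in Path(folder).rglob(pattern):
        total += estimate_tokens(path.read_text(encoding="utf-8", errors="ignore"))
    return total

def needs_rag(total_tokens: int, context_budget: int = 100_000) -> bool:
    """True when the corpus is at or above the budget and will not simply fit."""
    return total_tokens >= context_budget

print(needs_rag(70_000))     # a corpus under the 100K guideline
print(needs_rag(8_200_000))  # a corpus far above it
```

For a decision near the threshold, count with the actual tokenizer for your model instead of the character heuristic.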

Paste this into Claude:

I want to build a simple RAG pipeline on this document set: [describe your documents: your codebase docs, your notes, a folder of PDFs, or your team wiki]. Walk me through: (1) what chunk size to start with and why, (2) what embedding model to use for this content, (3) how to store and query the index using LlamaIndex or a vector store I already have access to, and (4) how to write a quick eval that measures whether the top-5 retrieved chunks are actually relevant to a sample query. Give me working code or step-by-step instructions I can run today.

What good looks like:

  • You have a working index over at least 20 real documents from your own work
  • You can run a query and get back the top-5 chunks with their source filenames
  • You measured retrieval quality on at least 5 sample queries and have a hit rate number
  • You understand what chunk size you used and why you chose it
  • You identified at least one query where retrieval failed and can explain why the wrong chunk was returned

What a good response looks like:

For your engineering wiki (3,400 markdown files, roughly 8.2M tokens total), here is my recommendation.

Chunk size: start at 256 tokens with a 64-token overlap. Your content is dense technical prose with short paragraphs. Larger chunks (512+) will dilute relevance scores because unrelated sentences will land in the same chunk. The overlap preserves sentence continuity at boundaries.

Embedding model: text-embedding-3-small from OpenAI. It runs fast, costs $0.02 per 1M tokens, and handles technical vocabulary well. For 8.2M tokens split into 256-token chunks with a 64-token overlap (a 192-token stride), you will generate roughly 43,000 embeddings covering about 11M tokens once the overlap is counted. Total cost: about $0.22 to index, well under $0.40.
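
A quick back-of-envelope check on chunk count and indexing cost, using the figures quoted in this response (8.2M tokens, 256-token chunks, 64-token overlap, $0.02 per 1M tokens). It treats the corpus as one continuous token stream, which is a simplifying assumption: splitting per document produces somewhat more, shorter chunks. The `estimate_index_cost` helper is hypothetical.

```python
import math

def estimate_index_cost(total_tokens: int, chunk_size: int = 256, overlap: int = 64,
                        price_per_million: float = 0.02) -> tuple[int, int, float]:
    """Return (chunk count, tokens sent for embedding, cost in dollars)."""
    stride = chunk_size - overlap                   # new tokens covered per chunk
    n_chunks = math.ceil(total_tokens / stride)
    embedded_tokens = n_chunks * chunk_size         # overlap tokens are embedded twice
    cost = embedded_tokens / 1_000_000 * price_per_million
    return n_chunks, embedded_tokens, cost

n_chunks, embedded_tokens, cost = estimate_index_cost(8_200_000)
print(n_chunks, embedded_tokens, round(cost, 2))
```

Under these assumptions the estimate comes to about 42,700 chunks and roughly 22 cents, so the cost of re-indexing is small enough to do on every chunking experiment.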

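Storage: LlamaIndex. The minimal setup below uses its default in-memory vector store; swap in a local Chroma vector store once the index needs to persist between runs:

```python
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Embed with the model recommended above (requires OPENAI_API_KEY to be set).
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Load the files in ./wiki and split them into 256-token chunks with a 64-token overlap.
docs = SimpleDirectoryReader('./wiki').load_data()
parser = SentenceSplitter(chunk_size=256, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(docs)

# Build the index (this embeds each chunk) and return the 5 most similar chunks per query.
index = VectorStoreIndex(nodes)
query_engine = index.as_query_engine(similarity_top_k=5)
```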

Eval harness: take 20 real queries your team has asked in Slack. For each, manually label which wiki pages are actually relevant (ground truth). Run each query through the index and check whether the top-5 retrieved chunks include content from those pages. Hit rate at top-5: (queries where at least one relevant chunk appeared) / 20. This query-level score is more lenient than the share of retrieved chunks that are relevant, so pick one definition and use it for every run you compare. Your baseline target before tuning: above 55%. Below that, start with chunk size reduction.
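
The harness described above reduces to one function once you have two mappings: the source pages behind each query's retrieved chunks, and the hand-labeled relevant pages. This is a sketch; `hit_rate_at_k`, the query strings, and the page names are all hypothetical, and the retrieved page names would come from whatever source metadata your retriever returns.

```python
def hit_rate_at_k(retrieved_pages: dict[str, list[str]],
                  relevant_pages: dict[str, set[str]], k: int = 5) -> float:
    """Share of queries whose top-k results include at least one relevant page."""
    if not retrieved_pages:
        return 0.0
    hits = 0
    for query, pages in retrieved_pages.items():
        # A query counts as a hit if any top-k page is in its ground-truth set.
        if relevant_pages.get(query, set()) & set(pages[:k]):
            hits += 1
    return hits / len(retrieved_pages)

# Hypothetical eval set: source page of each retrieved chunk, in rank order.
retrieved = {
    "how do we deploy": ["deploy.md", "ci.md"],
    "vacation policy": ["oncall.md"],
    "staging migrations": ["staging.md", "deploy.md"],
    "expense limits": [],
}
ground_truth = {
    "how do we deploy": {"ci.md"},
    "vacation policy": {"hr.md"},
    "staging migrations": {"staging.md"},
    "expense limits": {"finance.md"},
}
rate = hit_rate_at_k(retrieved, ground_truth)
print(f"hit rate at top-5: {rate:.0%}")
```

Keep the queries that miss: those are the cases to inspect by hand when deciding what to change next.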

Go deeper (20 min)

Paste this into Claude:

I have a working RAG pipeline with a hit rate of [X]% at top-5. I want to improve it. Here is my current chunk strategy: [describe your chunk size, overlap, and any metadata you attach]. Try two alternatives: (1) halve the chunk size and add a 10% overlap, and (2) add document-level metadata (filename, section heading) to each chunk so the model knows where the content came from. Rebuild the index with each strategy, re-run my eval set of 5-10 sample queries, and show me the hit rate comparison across all three strategies.

What good looks like:

  • You tested at least two chunk strategies against your original baseline
  • You have a hit rate number for each strategy on the same eval set
  • At least one strategy improved hit rate over baseline by more than 5 percentage points
  • You can explain in one sentence why the winning strategy performed better

What a good response looks like:

Chunk strategy comparison on your 10-query eval set.

Baseline (512 tokens, no overlap): hit rate 41%. Chunks are large enough that unrelated content dilutes relevance scores. On query 'how do we handle schema migrations in staging', the top-5 included two chunks about production deployments and one about CI config. Relevant content was there but ranked 4th.

Alternative A (256 tokens, 64-token overlap): hit rate 73%. The smaller chunks isolate concepts more cleanly. The same migration query now returns the migration runbook as chunk 1 and the staging environment guide as chunk 2. The 64-token overlap means a sentence that straddles a chunk boundary is more likely to appear whole in one of the two adjacent chunks.

Alternative B (256 tokens, 64-token overlap, plus metadata): hit rate 78%. Adding `source_file`, `section_heading`, and `last_modified` as metadata to each chunk gives the retrieval scorer an additional signal, provided the metadata is included in the text that gets embedded. It also lets the model cite 'According to staging-runbook.md, section: Database Migrations' instead of just quoting text.

Winner: Alternative B. The 5-point improvement over Alternative A comes from metadata boosting relevance scoring when the query terms match a section heading exactly. Recommended next step: add `last_modified` filtering to exclude chunks from docs older than 90 days.
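
One way to attach that metadata and apply the suggested 90-day filter is sketched below. The `Chunk` dataclass and both helpers are hypothetical, not part of any library: the sketch assumes metadata reaches the retrieval scorer by being prefixed to the text that gets embedded, and that each chunk carries its document's modification date.

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Chunk:
    text: str
    source_file: str
    section_heading: str
    last_modified: date

def text_for_embedding(chunk: Chunk) -> str:
    """Prefix metadata so filename and heading terms influence similarity."""
    return f"[{chunk.source_file} | {chunk.section_heading}]\n{chunk.text}"

def drop_stale(chunks: list[Chunk], today: date, max_age_days: int = 90) -> list[Chunk]:
    """Keep only chunks whose document changed within the last max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [c for c in chunks if c.last_modified >= cutoff]

# Made-up chunks for illustration.
fresh = Chunk("Run migrations before deploy.", "staging-runbook.md",
              "Database Migrations", date(2025, 12, 1))
old = Chunk("Legacy deploy steps.", "old-deploy.md", "Deploy", date(2025, 6, 1))
kept = drop_stale([fresh, old], today=date(2026, 1, 1))
print(text_for_embedding(fresh))
print([c.source_file for c in kept])
```

Filtering by age trades recall for freshness, so it is worth re-running the eval set after enabling it to confirm the hit rate did not drop.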

When this breaks

  • Breaks when the document set is small enough to fit in context because retrieval introduces a chunk-boundary failure surface that pure full-context loading never has.
  • Breaks when retrieval quality is never measured because every chunking change ships blind, and silent regressions in hit rate look identical to model regressions during debugging.
  • Breaks when documents change frequently and embeddings are not re-indexed because stale vectors return outdated content the model treats as current truth.
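
The stale-embedding failure can be caught by comparing content hashes against what was last indexed. This is a minimal sketch: `docs_to_reindex` and the stored-hash mapping are hypothetical, and it assumes you record one hash per document whenever you build the index.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's current content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def docs_to_reindex(current_docs: dict[str, str],
                    indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (changed or new documents, documents deleted since indexing)."""
    changed = sorted(
        name for name, text in current_docs.items()
        if indexed_hashes.get(name) != content_hash(text)
    )
    deleted = sorted(name for name in indexed_hashes if name not in current_docs)
    return changed, deleted

# Hashes recorded at index time versus the documents as they are now (made up).
indexed = {"a.md": content_hash("old text"), "b.md": content_hash("same"),
           "gone.md": content_hash("removed")}
current = {"a.md": "new text", "b.md": "same", "c.md": "brand new"}
changed, deleted = docs_to_reindex(current, indexed)
print(changed, deleted)
```

Documents in the first list need their chunks re-embedded; documents in the second need their vectors removed so they stop surfacing as current truth.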

Claude can do it for you

Tell Claude: 'I have a folder of documents I want to query with RAG. Help me build a minimal pipeline using LlamaIndex, measure my retrieval hit rate on 10 sample queries, and tell me what chunk size to try next based on the results.' It will write the code and the eval harness.

You can now

Build a working index over at least 20 real documents, measure hit rate at top-5 on 5+ sample queries, and report a specific number you can defend.

Key takeaways

RAG is not magic. It is an index plus a retrieval step. The index is easy. Retrieval quality is the hard part. Measure it first.

  • Retrieval quality, not embedding generation, is where production pipelines fail
  • Make small chunks (256 tokens) with overlap (64 tokens) the first configuration you test against your baseline, before optimizing anything else
  • Measure top-k hit rate on 20 real queries as your baseline. You cannot optimize what you have not measured
  • Metadata (filename, section heading) can lift hit rate without changing chunk size (5 points in the example above)
  • Re-evaluate retrieval quality quarterly. It degrades as documents change

Go deeper

  • LlamaIndex: High-Level Concepts (RAG)
  • Anthropic: Long context window tips
  • 12-Factor Agents on GitHub