Agentic Levels

© 2026 Fuentes Studio

L8 · Lesson 4

Build the Checkpoint and Replay

After this, you'll be able to add a checkpoint to an agent workflow so that a failure at step 8 of 10 does not restart from step 1, and verify that the resume path actually works.

Before you start

Complete Add Structured Logging first; this lesson builds on trace IDs so you know exactly which step to resume from when a checkpoint is loaded.

The idea

Here is the before and after. Your agent runs a 2-hour pipeline and fails at the 90-minute mark. Without checkpoints, you restart from zero and lose 90 minutes of compute. With a checkpoint every 15 minutes, you lose at most 15 minutes of work and resume from a known-good state. That is the entire value of this pattern: failure cost is bounded.

A checkpoint is a JSON file written to disk after each significant phase. It contains: the trace ID, the current step number, and the data the next step needs to continue. At startup, the workflow checks for an existing checkpoint file. If found, it skips all steps before the checkpoint step and resumes from there. If not found, it starts from step 1.
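The startup check described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the `checkpoint_step` and `state` keys, and the choice to checkpoint after every step are all assumptions made for the example. It also assumes each step's state is JSON-serializable.

```python
import json
import os
from pathlib import Path
from typing import Callable, List, Optional

Step = Callable[[dict], dict]  # each step takes the state so far and returns the updated state


def save_checkpoint(path: Path, trace_id: str, step: int, state: dict) -> None:
    """Record the last completed step plus the data the next step needs."""
    payload = {"trace_id": trace_id, "checkpoint_step": step, "state": state}
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload, indent=2))
    os.replace(tmp, path)  # rename so a crash mid-write leaves the previous checkpoint intact


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the saved checkpoint, or None when no checkpoint file exists."""
    if not path.exists():
        return None
    return json.loads(path.read_text())


def run(steps: List[Step], path: Path, trace_id: str) -> dict:
    """Run the steps in order, skipping any that a checkpoint marks as done."""
    checkpoint = load_checkpoint(path)
    if checkpoint is None:
        done, state = 0, {}
    else:
        done, state = checkpoint["checkpoint_step"], checkpoint["state"]
        trace_id = checkpoint["trace_id"]  # keep the original run's trace ID
    for number in range(done + 1, len(steps) + 1):  # steps are numbered from 1
        state = steps[number - 1](state)
        save_checkpoint(path, trace_id, number, state)
    if path.exists():
        path.unlink()  # successful completion: clear the checkpoint
    return state
```

If a step raises, the file on disk still names the last step that finished, so calling `run` again with the same path picks up at the step that failed.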

Worked example: a code-generation pipeline writes a checkpoint after step 3 (the expensive generation step). The file records the trace_id, completed steps 1-3, and the list of files just generated. On restart, the startup log reads 'Checkpoint found at step 3. Resuming from step 4 (run_tests).' If the pipeline later fails at step 7, the retry reruns only steps 4 through 7 and never repeats the expensive generation step.

Checkpoint placement matters. Write the checkpoint as soon as the most expensive step completes, not just before it starts. If the expensive step fails, you have to rerun it either way. If it succeeds, the checkpoint means a retry never runs it again.

The resume path is easy to forget to test. Add logging to confirm it: on startup, log whether a checkpoint was found and which step was resumed from. Then deliberately interrupt a run and verify the next start picks up at the right step. Untested resume paths silently fail when you need them most.
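One way to make that verification repeatable is an automated test that injects the interruption. The sketch below is an assumption-laden illustration: the step names in `STEPS`, the `fail_at` parameter, and the `ListHandler` helper are all hypothetical, and the pipeline body is a stand-in that only records which step numbers ran.

```python
import json
import logging
import tempfile
from pathlib import Path
from typing import List, Optional

log = logging.getLogger("pipeline")
STEPS = ["plan", "fetch_context", "generate_code", "run_tests", "report"]  # hypothetical step names


class ListHandler(logging.Handler):
    """Collect log messages in memory so a test can assert on them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def run_pipeline(checkpoint_path: Path, fail_at: Optional[int] = None) -> List[int]:
    """Run the remaining steps and return their numbers; fail_at simulates a crash."""
    done = 0
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text())["checkpoint_step"]
        log.info("Checkpoint found at step %d. Resuming from step %d (%s).",
                 done, done + 1, STEPS[done])
    else:
        log.info("No checkpoint found. Starting from step 1.")
    executed = []
    for step in range(done + 1, len(STEPS) + 1):
        if step == fail_at:
            raise RuntimeError(f"simulated interruption at step {step}")
        executed.append(step)
        if step < len(STEPS):  # nothing to resume after the final step
            checkpoint_path.write_text(json.dumps({"checkpoint_step": step}))
    checkpoint_path.unlink(missing_ok=True)  # clear the checkpoint on success
    return executed


def test_resume_after_interruption() -> None:
    handler = ListHandler()
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    try:
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "checkpoint.json"
            try:
                run_pipeline(path, fail_at=4)  # interrupt before step 4 runs
            except RuntimeError:
                pass
            assert run_pipeline(path) == [4, 5]  # restart skips steps 1-3
            assert not path.exists()
    finally:
        log.removeHandler(handler)
    assert handler.messages == [
        "No checkpoint found. Starting from step 1.",
        "Checkpoint found at step 3. Resuming from step 4 (run_tests).",
    ]
```

The test asserts on both the steps that actually ran and the startup log lines, so a resume path that logs the right message but reruns everything would still fail.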

Try it (20 min)

Watch out for

  • Checkpointing too rarely. If your most expensive step is 45 minutes, checkpoint before and after it, not just at the end of the whole run.
  • Checkpoint files that do not include enough state. The next step should be able to run using only what is in the checkpoint plus the original inputs.
  • Never testing the resume path. Write a test: start a run, interrupt it, restart, verify the step count in the startup log.
  • Storing full file contents in checkpoints. Store file paths and summaries, not contents. Large checkpoint files slow down startup.
  • Forgetting to clear the checkpoint on successful completion. A stale checkpoint from a previous run will cause the next run to skip steps incorrectly.
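The last two pitfalls can be sketched together. In this illustration the checkpoint stores a path, size, and SHA-256 hash for each generated file rather than its contents, and a small helper clears the checkpoint after a successful run. The function names and the use of a hash as the "summary" are assumptions for the example, not something the lesson prescribes.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def summarize_file(path: Path) -> dict:
    """Describe a file by path, size, and content hash instead of embedding its contents."""
    data = path.read_bytes()
    return {"path": str(path), "bytes": len(data), "sha256": hashlib.sha256(data).hexdigest()}


def write_checkpoint(checkpoint_path: Path, step: int, files: List[Path]) -> None:
    """Write a small checkpoint that points at the generated files."""
    payload = {
        "checkpoint_step": step,
        "state": {"generated_files": [summarize_file(f) for f in files]},
    }
    checkpoint_path.write_text(json.dumps(payload, indent=2))


def files_unchanged(checkpoint: dict) -> bool:
    """True when every recorded file still exists with the recorded hash."""
    for entry in checkpoint["state"]["generated_files"]:
        path = Path(entry["path"])
        if not path.is_file():
            return False
        if hashlib.sha256(path.read_bytes()).hexdigest() != entry["sha256"]:
            return False
    return True


def clear_checkpoint(checkpoint_path: Path) -> None:
    """Call after a successful run so the next run starts from step 1."""
    checkpoint_path.unlink(missing_ok=True)
```

Checking `files_unchanged` before resuming is one way to catch a checkpoint whose files were edited or deleted between runs; when it returns False, starting from step 1 is the safer choice.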

Paste this into Claude:

I want to add a checkpoint to a workflow I have. Here is the workflow: [paste your skill definition or describe a multi-step agent task]. Help me: (1) Identify the most expensive or longest-running step and place the checkpoint after it. (2) Write the checkpoint file format: a JSON object with trace ID, step number, and the data the next step needs. (3) Write the startup logic that checks for an existing checkpoint and resumes from it. (4) Add a log line at startup that reports whether a checkpoint was found and which step is being resumed from. Then simulate an interruption mid-run and verify the resume path works.

What good looks like:

  • Your workflow writes a checkpoint file after the most expensive step
  • The checkpoint file contains enough state to resume the run without restarting from step 1
  • Restarting after an interruption reads the checkpoint and skips already-completed steps
  • A log line at startup confirms whether a checkpoint was found and which step is resuming
  • You verified the resume path actually works by testing it with a real interruption

What a good response looks like:

Checkpoint file written after step 3 (most expensive: code generation):

```json
{
  "trace_id": "a3f9c2d1",
  "checkpoint_step": 3,
  "completed_steps": [1, 2, 3],
  "next_step": 4,
  "state": {
    "generated_files": ["src/api/client.ts", "src/api/types.ts"],
    "test_count_before": 47,
    "generation_summary": "Added HttpClient class, removed 3 inline fetch calls"
  },
  "written_at": "2026-04-26T14:22:11Z"
}
```

Startup log on resume:
```
[a3f9c2d1] Checkpoint found at step 3. Skipping steps 1-3. Resuming from step 4 (run_tests).
```

Startup log on fresh run:
```
[b7e1f4a2] No checkpoint found. Starting from step 1.
```

The resume path skips code generation entirely (the expensive step) and jumps straight to the test run. If the pipeline fails later, say at step 7, the retry repeats only steps 4 through 7 instead of the whole run.
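The two startup log lines above could be produced by a single decision function, sketched here in Python. The function name `startup_message` and the `step_names` list are hypothetical; the checkpoint keys match the JSON example above.

```python
from typing import Optional, Sequence, Tuple


def startup_message(new_trace_id: str, checkpoint: Optional[dict],
                    step_names: Sequence[str]) -> Tuple[int, str]:
    """Return the step to start from and the log line that reports the decision."""
    if checkpoint is None:
        return 1, f"[{new_trace_id}] No checkpoint found. Starting from step 1."
    done = checkpoint["checkpoint_step"]
    next_step = checkpoint["next_step"]
    name = step_names[next_step - 1]  # steps are numbered from 1
    line = (
        f"[{checkpoint['trace_id']}] Checkpoint found at step {done}. "
        f"Skipping steps 1-{done}. Resuming from step {next_step} ({name})."
    )
    return next_step, line
```

Returning the start step together with the message keeps the log line and the actual resume decision from drifting apart.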

When this breaks

  • Breaks when the resume path is never tested. Untested code paths fail silently at the moment an outage forces a real restart, and a stale or malformed checkpoint then corrupts the recovery instead of saving it.
  • Breaks when checkpoints store too little state. The next step needs context the checkpoint did not capture, so the resume falls back to a full restart.
  • Breaks when stale checkpoints persist across unrelated runs. The next fresh task reads an outdated file and skips real work, which produces silent partial completion.
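A defensive loader can guard against the malformed and stale cases. The sketch below is one possible approach under stated assumptions: the `task_id` field and the `max_age` limit are hypothetical additions to the checkpoint format shown earlier, and any check that fails simply returns None so the run starts fresh.

```python
import json
from datetime import datetime, timedelta, timezone
from pathlib import Path
from typing import Optional

# "task_id" is a hypothetical field that ties a checkpoint to one task.
REQUIRED_KEYS = {"trace_id", "task_id", "checkpoint_step", "next_step", "state", "written_at"}


def load_valid_checkpoint(path: Path, task_id: str, max_age: timedelta,
                          now: Optional[datetime] = None) -> Optional[dict]:
    """Return the checkpoint only if it is well-formed, for this task, and recent."""
    if not path.exists():
        return None
    try:
        checkpoint = json.loads(path.read_text())
    except ValueError:
        return None  # malformed file: fall back to a fresh start
    if not isinstance(checkpoint, dict) or not REQUIRED_KEYS.issubset(checkpoint):
        return None  # missing state the resume path depends on
    if checkpoint["task_id"] != task_id:
        return None  # stale: left behind by an unrelated run
    try:
        written = datetime.fromisoformat(str(checkpoint["written_at"]).replace("Z", "+00:00"))
        age = (now or datetime.now(timezone.utc)) - written
    except (ValueError, TypeError):
        return None  # unparseable or timezone-less timestamp
    return checkpoint if age <= max_age else None
```

Logging which check rejected the file would make a fallback to step 1 easier to diagnose than a silent fresh start.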

Claude can do it for you

Say to Claude: 'I want to add a checkpoint to this workflow: [paste steps]. Identify the most expensive step. Write the checkpoint file format. Write the startup logic that detects the checkpoint and resumes from the right step. Add a log line confirming which step is resuming. Then show me how to test the resume path.'

You can now

Interrupt a real workflow mid-run and restart it, then read the startup log to confirm the run resumed from the correct step instead of starting from step 1.

Key takeaways

Checkpoint after the expensive step. Resume from the step that follows it. With a checkpoint every 15 minutes, a 90-minute failure cost becomes at most a 15-minute one. Test the resume path before you need it.

  • A checkpoint after the expensive step bounds failure cost to the work done since the last write
  • The checkpoint file must hold enough state for the next step to continue without re-running any completed step
  • Log the resume decision at startup so you can verify the resume path works without a real failure
  • Clear the checkpoint on successful completion so the next run does not skip steps incorrectly

Go deeper

  • AI Agents with Perfect Memory (persistent state and checkpoint patterns)
  • Harness Engineering (OpenAI, audit trail and replay patterns)