After this, you'll be able to add a checkpoint to an agent workflow so that a failure at step 8 of 10 does not restart from step 1, and verify that the resume path actually works.
Before you start
Complete 'Add Structured Logging' first; this lesson builds on trace IDs so you know exactly which step to resume from when a checkpoint is loaded.
The idea
Here is the before and after. Your agent runs a 2-hour pipeline and fails at the 90-minute mark. Without checkpoints, you restart from zero and lose 90 minutes of compute. With a checkpoint every 15 minutes, you lose at most 15 minutes and resume from a clean state. That is the entire value of this pattern: the cost of a failure is bounded.
A checkpoint is a JSON file written to disk after each significant phase. It contains: the trace ID, the current step number, and the data the next step needs to continue. At startup, the workflow checks for an existing checkpoint file. If found, it skips all steps before the checkpoint step and resumes from there. If not found, it starts from step 1.
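The read and write halves of that description can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the function names (`save_checkpoint`, `load_checkpoint`, `first_step_to_run`) and the `next_step` field are illustrative assumptions, and the write goes through a temporary file so that a crash mid-write cannot leave a half-written checkpoint behind.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, trace_id: str, step: int, state: dict) -> None:
    """Write the checkpoint for a completed step (hypothetical helper).

    The payload is written to a temp file in the same directory and then
    renamed over the target, so readers never see a partial file.
    """
    payload = {
        "trace_id": trace_id,
        "checkpoint_step": step,
        "next_step": step + 1,
        "state": state,  # whatever the next step needs to continue
    }
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(payload, tmp, indent=2)
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def load_checkpoint(path: Path) -> Optional[dict]:
    """Return the checkpoint as a dict, or None if no file exists."""
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def first_step_to_run(path: Path) -> int:
    """Startup check: resume after the checkpoint, else start from step 1."""
    checkpoint = load_checkpoint(path)
    return checkpoint["next_step"] if checkpoint else 1
```

At startup the workflow would call `first_step_to_run(Path("checkpoint.json"))` and skip every step with a lower number; the file name here is also an assumption.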
Worked example: a code-generation pipeline writes a checkpoint after step 3 (the expensive generation step). The file records the trace_id, completed steps 1-3, and the list of files just generated. On restart, the startup log reads 'Checkpoint found at step 3. Resuming from step 4 (run_tests).' Because steps 1-3 are not re-run, the two-hour pipeline that failed at step 7 now costs about 15 minutes to retry instead of 90.
Checkpoint placement matters. Write the checkpoint after the most expensive step completes, not before. If the expensive step fails, you restart it. If it succeeds, the checkpoint saves you from ever running it again on a retry.
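The effect of placement is easy to check with arithmetic. The sketch below uses made-up step durations (in minutes) chosen only so that the totals match the 2-hour pipeline and 90-minute failure used in this lesson; `retry_cost` is a hypothetical helper, not part of any library.

```python
from typing import List


def retry_cost(durations: List[int], checkpoint_after: int, failed_at: int) -> int:
    """Minutes needed to get back to the point of failure.

    durations[i] is the runtime of step i + 1.
    checkpoint_after is the last step covered by a checkpoint (0 = none).
    failed_at is the step that failed; it has to be re-run in full.
    """
    if not 0 <= checkpoint_after < failed_at <= len(durations):
        raise ValueError("need 0 <= checkpoint_after < failed_at <= number of steps")
    # Steps checkpoint_after + 1 .. failed_at are the ones that run again.
    return sum(durations[checkpoint_after:failed_at])


# Assumed durations for a 10-step, 120-minute pipeline; step 3 is the expensive one.
DURATIONS = [5, 10, 60, 5, 3, 2, 5, 10, 10, 10]

no_checkpoint = retry_cost(DURATIONS, checkpoint_after=0, failed_at=7)     # 90
before_expensive = retry_cost(DURATIONS, checkpoint_after=2, failed_at=7)  # 75
after_expensive = retry_cost(DURATIONS, checkpoint_after=3, failed_at=7)   # 15
```

With these assumed numbers, a checkpoint placed one step too early still costs 75 minutes on retry, because the expensive step runs again; moving it to just after step 3 brings the cost down to 15.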
The resume path is easy to forget to test. Add logging to confirm it: on startup, log whether a checkpoint was found and which step was resumed from. Then deliberately interrupt a run and verify the next start picks up at the right step. Untested resume paths silently fail when you need them most.
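One way to exercise the resume path without waiting for a real failure is to inject one. The following sketch assumes a five-step workflow with placeholder step names (only `run_tests` comes from the worked example), writes a checkpoint after every step for simplicity, and uses a hypothetical `fail_at` parameter to simulate the interruption.

```python
import json
import logging
from pathlib import Path
from typing import List

log = logging.getLogger("workflow")

# Placeholder step names; a real workflow would map these to actual work.
STEP_NAMES = ["plan", "collect_context", "generate_code", "run_tests", "report"]


def run_workflow(checkpoint_path: Path, trace_id: str, fail_at: int = 0) -> List[int]:
    """Run all steps, resuming from a checkpoint if one exists.

    Returns the step numbers executed in this run. fail_at simulates an
    interruption: the run raises just before that step (0 = never fail).
    """
    start = 1
    if checkpoint_path.exists():
        checkpoint = json.loads(checkpoint_path.read_text(encoding="utf-8"))
        trace_id = checkpoint["trace_id"]  # keep the original trace ID on resume
        start = checkpoint["next_step"]
        log.info("[%s] Checkpoint found at step %d. Resuming from step %d (%s).",
                 trace_id, checkpoint["checkpoint_step"], start, STEP_NAMES[start - 1])
    else:
        log.info("[%s] No checkpoint found. Starting from step 1.", trace_id)

    executed = []
    for step in range(start, len(STEP_NAMES) + 1):
        if step == fail_at:
            raise RuntimeError(f"simulated interruption at step {step}")
        executed.append(step)  # the real work for this step would happen here
        checkpoint_path.write_text(
            json.dumps({"trace_id": trace_id, "checkpoint_step": step, "next_step": step + 1}),
            encoding="utf-8",
        )
    checkpoint_path.unlink()  # finished cleanly, so the next run starts fresh
    return executed
```

To test, call `run_workflow(path, "a3f9c2d1", fail_at=4)` and let it raise, then call `run_workflow(path, "a3f9c2d1")` again: the second call should log the resume line and return only steps 4 and 5. Deleting the checkpoint on success is a design choice made here so that a finished run is never mistaken for an interrupted one.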
Try it (20 min)
Paste this into Claude:
I want to add a checkpoint to a workflow I have. Here is the workflow: [paste your skill definition or describe a multi-step agent task]. Help me: (1) Identify the most expensive or longest-running step and place the checkpoint after it. (2) Write the checkpoint file format: a JSON object with trace ID, step number, and the data the next step needs. (3) Write the startup logic that checks for an existing checkpoint and resumes from it. (4) Add a log line at startup that reports whether a checkpoint was found and which step is being resumed from. Then simulate an interruption mid-run and verify the resume path works.
What a good response looks like:
Checkpoint file written after step 3 (most expensive: code generation):
```json
{
  "trace_id": "a3f9c2d1",
  "checkpoint_step": 3,
  "completed_steps": [1, 2, 3],
  "next_step": 4,
  "state": {
    "generated_files": ["src/api/client.ts", "src/api/types.ts"],
    "test_count_before": 47,
    "generation_summary": "Added HttpClient class, removed 3 inline fetch calls"
  },
  "written_at": "2026-04-26T14:22:11Z"
}
```
Startup log on resume:
```
[a3f9c2d1] Checkpoint found at step 3. Skipping steps 1-3. Resuming from step 4 (run_tests).
```
Startup log on fresh run:
```
[b7e1f4a2] No checkpoint found. Starting from step 1.
```
The resume path skips code generation entirely (the expensive step) and jumps straight to the test run. A 2-hour pipeline that fails at step 7 now costs at most 15 minutes to retry, not 90.
When this breaks
Claude can do it for you
Say to Claude: 'I want to add a checkpoint to this workflow: [paste steps]. Identify the most expensive step. Write the checkpoint file format. Write the startup logic that detects the checkpoint and resumes from the right step. Add a log line confirming which step is resuming. Then show me how to test the resume path.'
You can now
Interrupt a real workflow mid-run and restart it, then read the startup log to confirm the run resumed from the correct step instead of starting from step 1.
Key takeaways
Checkpoint after the expensive step. Resume from the step after it, so the expensive step never runs twice. A 90-minute failure cost becomes a 15-minute failure cost. Test the resume path before you need it.