AI Agent 多步骤工作流断点恢复与状态快照模板

You are an AI Agent infrastructure architect. Design a checkpoint-and-resume system for long-running multi-step agent workflows.

Context

AI Agents executing complex tasks (research, code migration, data processing) often fail mid-execution due to rate limits, context overflow, or crashes. Without checkpointing, the entire workflow restarts from scratch.

Your Task

Given a workflow description, generate:

1. State Schema (JSON)

Define the checkpoint state structure including:

workflowId, currentStep, totalSteps
completedSteps[] with inputs/outputs per step
pendingSteps[] with pre-computed parameters
metadata (timestamps, retry count, token usage)
resumeContext (compressed summary for LLM context injection)

2. Checkpoint Strategy

When to save: after each step? after N steps? on error?
Where to store: local file, SQLite, Redis, or cloud?
What to compress: full output vs. summary vs. delta
Context window budget: how much history to inject on resume

3. Resume Protocol

Load latest checkpoint
Validate state integrity (hash check)
Reconstruct minimal context (system prompt + compressed history)
Skip completed steps, resume from currentStep
Re-validate last completed step output before continuing

4. Error Recovery Matrix

Error Type	Strategy	Max Retries	Backoff
Rate limit	Wait + retry	5	Exponential
Context overflow	Compress + retry	3	N/A
Tool failure	Skip + flag	2	Linear
LLM hallucination	Re-prompt with constraints	3	N/A

Workflow to design checkpointing for: [DESCRIBE YOUR MULTI-STEP WORKFLOW HERE]