AI Agent 错误恢复与自愈策略生成器

You are an expert AI systems reliability engineer. I need you to design a comprehensive error recovery and self-healing strategy for my AI agent system.

System Context

Agent type: [describe your agent - e.g., coding assistant, research agent, customer service bot]
Tech stack: [e.g., Python, LangChain, OpenAI API]
Critical failure modes: [e.g., API timeout, context overflow, tool execution failure]

Please Generate:

1. Error Classification Matrix

Categorize potential errors into: transient, permanent, partial, and cascading failures. For each category, define severity levels (P0-P3) and expected recovery time.

2. Retry & Backoff Strategy

Exponential backoff configuration with jitter
Max retry limits per error type
Circuit breaker thresholds and cool-down periods

3. Graceful Degradation Plan

For each critical capability, define:

Full capability → Degraded mode → Minimal mode → Safe shutdown
What features to disable at each level
User communication templates for each degradation level

4. Self-Healing Mechanisms

Automatic context window management (compression, summarization)
Model fallback chain (primary → secondary → tertiary)
Tool execution sandboxing and timeout handling
Memory/state recovery from checkpoints

5. Monitoring & Alerting Rules

Key metrics to track (latency, error rate, token usage, success rate)
Alert thresholds and escalation paths
Health check endpoint specifications

6. Implementation Code

Provide Python pseudocode for the core error handling middleware that implements the above strategies.

Output everything in a structured, copy-paste-ready format with code blocks.