PromptForge
Tags: development, AI Agent, error handling, reliability, self-healing, DevOps

AI Agent Error Recovery and Self-Healing Strategy Generator

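Design a complete error recovery and self-healing strategy for an AI agent system, including retry mechanisms, fallback plans, and alerting rules.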

4/7/2026

You are an expert AI systems reliability engineer. I need you to design a comprehensive error recovery and self-healing strategy for my AI agent system.

System Context

  • Agent type: [describe your agent - e.g., coding assistant, research agent, customer service bot]
  • Tech stack: [e.g., Python, LangChain, OpenAI API]
  • Critical failure modes: [e.g., API timeout, context overflow, tool execution failure]

Please Generate:

1. Error Classification Matrix

Categorize potential errors into: transient, permanent, partial, and cascading failures. For each category, define severity levels (P0-P3) and expected recovery time.
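
As an illustration of the kind of matrix this section asks for, here is a minimal Python sketch. The error names, severities, and recovery times are hypothetical placeholders, and treating P0 as the most severe level is an assumption based on common convention.

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"
    PARTIAL = "partial"
    CASCADING = "cascading"


@dataclass(frozen=True)
class ErrorClass:
    kind: FailureKind
    severity: str             # "P0" (assumed most severe) through "P3"
    expected_recovery_s: int  # illustrative value, not a recommendation


# Hypothetical error names mapped to rows of the matrix.
ERROR_MATRIX = {
    "api_timeout": ErrorClass(FailureKind.TRANSIENT, "P2", 30),
    "invalid_api_key": ErrorClass(FailureKind.PERMANENT, "P0", 3600),
    "tool_partial_output": ErrorClass(FailureKind.PARTIAL, "P3", 60),
    "provider_outage": ErrorClass(FailureKind.CASCADING, "P1", 900),
}


def classify(error_name: str) -> ErrorClass:
    # Unknown errors default to permanent/P1 so they are not silently retried.
    return ERROR_MATRIX.get(error_name, ErrorClass(FailureKind.PERMANENT, "P1", 0))
```

Defaulting unknown errors to a non-retryable class is one possible design choice; a real system might prefer to escalate them for manual triage instead.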

2. Retry & Backoff Strategy

  • Exponential backoff configuration with jitter
  • Max retry limits per error type
  • Circuit breaker thresholds and cool-down periods
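
The three items above could be sketched roughly as follows. This is a simplified illustration, not a production implementation: `TransientError`, `call_with_retry`, and all default values (base delay, cap, threshold, cool-down) are assumptions chosen for the example, and the jitter variant shown is "full jitter" (a uniform delay between zero and the exponential bound).

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now=None) -> bool:
        now = time.monotonic() if now is None else now
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial call through (half-open state).
        return now - self.opened_at >= self.cooldown_s

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now=None) -> None:
        now = time.monotonic() if now is None else now
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit


def call_with_retry(fn, breaker, max_retries: int = 3, sleep=time.sleep):
    for attempt in range(max_retries + 1):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        try:
            result = fn()
        except TransientError:
            breaker.record_failure()
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
```

Only `TransientError` is retried here; any other exception propagates immediately, which matches the idea of separate retry limits per error type.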

3. Graceful Degradation Plan

For each critical capability, define:

  • Full capability → Degraded mode → Minimal mode → Safe shutdown
  • What features to disable at each level
  • User communication templates for each degradation level
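
One way to represent the degradation ladder is an ordered enum plus a table of which features survive at each level. The feature names and user messages below are hypothetical examples, not part of the original prompt.

```python
from enum import IntEnum


class Level(IntEnum):
    FULL = 0
    DEGRADED = 1
    MINIMAL = 2
    SAFE_SHUTDOWN = 3


# Hypothetical features, each mapped to the last level at which it stays enabled.
FEATURE_MAX_LEVEL = {
    "tool_execution": Level.FULL,
    "long_context": Level.DEGRADED,
    "basic_chat": Level.MINIMAL,
}

# Hypothetical user communication templates, one per level.
USER_MESSAGES = {
    Level.FULL: "",
    Level.DEGRADED: "Some tools are temporarily unavailable.",
    Level.MINIMAL: "Only basic responses are available right now.",
    Level.SAFE_SHUTDOWN: "The assistant is temporarily offline.",
}


def enabled_features(level: Level) -> list:
    # A feature is enabled while the current level has not passed its maximum.
    return sorted(
        name for name, max_level in FEATURE_MAX_LEVEL.items() if level <= max_level
    )
```

For example, `enabled_features(Level.DEGRADED)` keeps `basic_chat` and `long_context` but drops `tool_execution`.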

4. Self-Healing Mechanisms

  • Automatic context window management (compression, summarization)
  • Model fallback chain (primary → secondary → tertiary)
  • Tool execution sandboxing and timeout handling
  • Memory/state recovery from checkpoints
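
The model fallback chain in particular can be sketched in a few lines. The function name `call_with_fallback` and the `(name, callable)` pair format are assumptions made for this example; the callables stand in for whatever client code actually invokes each model.

```python
def call_with_fallback(prompt, models):
    """Try each (name, callable) pair in order; return the first success.

    `models` is an ordered list such as
    [("primary", f1), ("secondary", f2), ("tertiary", f3)].
    """
    errors = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

Returning the model name alongside the response makes it possible to log or alert when traffic is being served by a fallback rather than the primary model.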

5. Monitoring & Alerting Rules

  • Key metrics to track (latency, error rate, token usage, success rate)
  • Alert thresholds and escalation paths
  • Health check endpoint specifications
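
As a rough illustration of an alert threshold feeding a health check, the sketch below tracks a rolling error rate over the most recent calls. The class name `ErrorRateMonitor`, the window size, and the 20% threshold are all illustrative assumptions, not recommended values.

```python
from collections import deque


class ErrorRateMonitor:
    def __init__(self, window: int = 100, alert_threshold: float = 0.2):
        # Keep only the most recent `window` outcomes (True = success).
        self.outcomes = deque(maxlen=window)
        self.alert_threshold = alert_threshold

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    def error_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def health(self) -> dict:
        # Shape of a payload that a health check endpoint might return.
        rate = self.error_rate()
        status = "degraded" if rate > self.alert_threshold else "ok"
        return {"status": status, "error_rate": rate}
```

The same pattern extends to latency or token usage by recording numeric samples instead of booleans.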

6. Implementation Code

Provide Python pseudocode for the core error handling middleware that implements the above strategies.

Output everything in a structured, copy-paste-ready format with code blocks.