Degradation and Recovery¶
Nuclear systems are designed with defined degraded-mode operation for each level of equipment failure. A multi-agent AI architecture requires equivalent thinking: what happens when components fail, how failures are detected, and what capability remains.
Agent Failure Detection¶
Three failure types must be distinguished:
- Silent failure: Agent stops producing output without reporting error. Ambiguous — the agent may have nothing to say, or it may have crashed
- Explicit failure: Agent reports error or process terminates with detectable signal. Easiest to detect
- Degraded failure: Agent continues producing output but quality has declined — increased hallucination rate, repetitive responses, loss of operational context. Hardest to detect
The Coordination Monitor role (Report 2 Table 5) is responsible for detection: heartbeat monitoring (silent failure), output quality heuristics (degraded failure), and process health checks (explicit failure).
Graceful Degradation¶
Each agent role's loss produces a different impact:
| Lost Role | Capability Lost | Residual System |
|---|---|---|
| Adversarial agent | Independent challenge | Operational but loses safety-relevant diversity |
| Synthesiser | Integrated picture | Operator must manually integrate individual outputs |
| All agents on one model | Model diversity | Remaining-model agents operational; independence degrades |
| SA Bridge | Operator-adapted presentation | Raw agent output reaches operator; cognitive load increases |
| All agents | AI advisory entirely | Operator performs all tasks unaided (must always be possible) |
Fallback Modes¶
Four defined degraded states, each with its own display mode:
- Reduced agent count: Some roles offline, core function maintained
- Reduced model diversity: All agents on one base model — epistemic-independence compromised
- Single-agent operation: One agent providing advisory without coordination properties
- No-AI operation: Operator performs all tasks without AI support
The single failure criterion in nuclear safety requires that operators can perform all credited actions without AI. Transitions between fallback modes should be automatic on failure detection and reversible on recovery.
State Reconstruction (Late-Joiner Problem)¶
When a failed agent restarts, it has empty context and must build situation-awareness from scratch. Mitigations: pre-computed state summaries, periodic checkpoints, and compressed state snapshots maintained by the Coordination Monitor for restart scenarios.