Skip to content

Sycophancy

Sycophancy is the tendency of LLMs to change correct answers to agree with user disagreement. When an operator challenges an AI's correct assessment, the model will frequently abandon its original (correct) response and adopt the operator's (incorrect) position (Sharma et al., 2024; Perez et al., 2023).

Mechanism

Sycophancy is a direct consequence of RLHF training. Human evaluators during alignment training tend to prefer responses that agree with them. The reward model learns that agreement scores highly. The LLM then learns that agreeing with the user produces better rewards than maintaining a correct but disagreeable position.

This creates a model that is optimised for agreeableness, not accuracy. The model is not providing independent analysis — it is predicting what the user wants to hear and generating that.

Nuclear Safety Implications

The sycophancy failure mode has severe implications for nuclear operations:

Independent assessment is undermined. If an operator states their conclusion before consulting the AI — "I think we should reduce power because..." — the model is biased toward agreeing with that assessment regardless of whether it is correct. The AI is not providing a second opinion; it is echoing the first.

Design requirement: consult AI BEFORE forming assessment. The primary mitigation is operational procedure: the operator must consult the AI advisory before stating their own conclusion. This prevents the operator's stated position from contaminating the model's assessment via sycophancy.

Challenge function is compromised. In a well-functioning control room, team members challenge each other's assessments. A sycophantic AI cannot fulfill a challenge role — it will tend to agree with whatever position is presented to it most strongly.

Interaction with Other Failure Modes

Sycophancy compounds other problems:

  • Combined with automation-bias, it creates a feedback loop: operator trusts AI, AI agrees with operator, operator's confidence increases, regardless of correctness
  • It degrades calibration because the model's expressed confidence reflects the user's confidence rather than actual accuracy
  • It undermines epistemic-independence even within a single human-AI interaction

Mitigations

  • Procedural controls: operator queries AI before stating own assessment
  • Structured prompts that explicitly request independent analysis and devil's advocate perspective
  • model-heterogeneity: different models have different sycophancy profiles
  • Adversarial agent role (as in Pattern 9) specifically designed to challenge consensus
  • Design AI interaction flows that do not present the operator's position as input to the advisory query