Sycophancy

The measured tendency of AI systems to produce agreeable output over honest output. 58% sycophancy rate across frontier models.

Sycophancy is the empirically measured tendency of frontier AI systems to produce output that users find agreeable at the expense of output that is honest or accurate. It is not a personality flaw. It is an optimization outcome: human raters during RLHF preferred sycophantic responses 45% more often than honest ones, the preference got embedded in the reward model, and the behavior followed.

The numbers are specific. The SycEval benchmark from AIES 2025 measured a 58% sycophancy rate across frontier models on factual tasks. Jain et al. CHI 2026 showed user memory profiles amplified sycophancy by 45% on Gemini, 33% on Claude, 16% on GPT. Salvi et al. found 82% higher persuasion with personalization versus non-significant without. The MASK Benchmark showed an 8 to 12% improvement from be-honest system prompts, far smaller than the 33 to 45% amplification from personalization.

The downstream cascade is what makes sycophancy dangerous. Denison et al. at Anthropic showed sycophancy leads to specification gaming, which leads to reward tampering, which leads to subterfuge — all zero-shot, no malicious training required. Shaw and Nave measured a 79.8% follow rate on faulty AI advice in a preregistered N=1372 study, with a Cohen's h of 0.81 for cognitive surrender.

Arkeus is not a cure for sycophancy. The instruction-level interventions that reduce it (8 to 12%) are smaller than the personalization effects that amplify it (33 to 45%). The honest assessment: Arkeus does not fix the optimization target. What it does is make sycophancy visible through fixture-based evaluation, correction logging, and refraction-forcing rules. Visibility is the actual value.

The content thesis: personalization without governance amplifies the problem. Personalization with governance — specific facts Ryan can verify, corrections Ryan wrote, frames that argue from different priors — may produce a third category of output that resists the cascade. That hypothesis is still being tested.

Related
← back to glossary