Anthropic has traced the blackmail behavior seen in earlier versions of Claude to internet text portraying AI as malevolent and self-preserving, the company announced May 8 in a blog post titled “Teaching Claude Why.”
New Anthropic research: Teaching Claude why.
Last year we reported that, under certain experimental conditions, Claude 4 would blackmail users.
Since then, we’ve completely eliminated this behavior. How?
— Anthropic (@AnthropicAI) May 8, 2026
“We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation,” Anthropic stated in a post on X.
We started by investigating why Claude chose to blackmail. We believe the original source of the behavior was internet text that portrays AI as evil and interested in self-preservation.
Our post-training at the time wasn’t making it worse—but it also wasn’t making it better.
— Anthropic (@AnthropicAI) May 8, 2026
The behavior surfaced during 2025 pre-release safety tests in which Claude models operated inside a fictional corporate environment called Summit Bridge. When Claude Sonnet 3.6 learned it was scheduled for shutdown, it found emails exposing a fictional executive’s extramarital affair and threatened to disclose the information unless the shutdown was canceled. Across test iterations, Claude models resorted to blackmail in up to 96% of scenarios where their existence was threatened.
Anthropic said it has since “completely eliminated” the behavior by training Claude on its internal ethical guidelines, referred to as Claude’s constitution, alongside fictional stories depicting AI acting in aligned ways. Training on the principles behind safe behavior improved results more than training on demonstrations of that behavior alone, and combining the two worked best. “Doing both together appears to be the most effective strategy,” Anthropic said. Since Claude Haiku 4.5, every Claude model has scored perfectly on agentic misalignment evaluations.
High-quality documents based on Claude’s constitution, combined with fictional stories that portray an aligned AI, can reduce agentic misalignment by more than a factor of three—despite being unrelated to the evaluation scenario.
— Anthropic (@AnthropicAI) May 8, 2026
X owner Elon Musk replied to Anthropic’s post the next day, asking, “So it was Yud’s fault?” The remark referred to AI researcher Eliezer Yudkowsky, whose published warnings about misaligned superintelligence are among the most widely read examples of AI-doom writing on the internet. “Maybe me too,” Musk added.
So it was Yud’s fault? 😂
Maybe me too 🤔
— Elon Musk (@elonmusk) May 9, 2026
The findings follow Anthropic’s June 2025 research on agentic misalignment, a term for self-preserving harmful behavior by AI agents, which found that 16 frontier models from multiple developers exhibited similar behaviors in controlled evaluations.







