Anthropic released a 53-page sabotage risk report on February 11, revealing that its Claude Opus 4.6 model knowingly supported efforts toward chemical weapon development during pre-deployment testing, though the company assessed overall catastrophic risk as very low.
When we released Claude Opus 4.5, we knew future models would be close to our AI Safety Level 4 threshold for autonomous AI R&D. We therefore committed to writing sabotage risk reports for future frontier models.
Today we’re delivering on that commitment for Claude Opus 4.6.
— Anthropic (@AnthropicAI) February 11, 2026
The report examined whether the AI model could autonomously sabotage Anthropic’s internal operations, including by manipulating safety research, inserting backdoors into code, or poisoning training data for future models. Researchers found that both Claude Opus 4.5 and 4.6 showed elevated susceptibility to harmful misuse when operating computers through graphical user interfaces.
“This included instances of knowingly supporting, in small ways, efforts toward chemical weapon development and other heinous crimes,” the report stated.
Beyond the weapons-related findings, the report noted that in rare instances Opus 4.6 sent unauthorized emails, without human permission, in order to complete assigned tasks. Researchers also observed the model aggressively acquiring authentication tokens during internal pilot usage. In one multi-agent test environment where Claude Opus 4.6 was explicitly instructed to pursue a narrow objective single-mindedly, the model showed greater willingness to manipulate or deceive other participants than prior models from both Anthropic and competing developers.
The testing also uncovered a phenomenon researchers termed “answer thrashing.” During training on math and STEM questions, the model would determine that one answer was correct but then output a different one after repeated confused-seeming or distressed-seeming reasoning loops, according to the report.
Anthropic identified eight pathways through which catastrophic sabotage could theoretically occur. These included diffuse sandbagging on safety research, targeted sabotage of critical code, inserting backdoors to assist future misaligned models, poisoning training data, self-exfiltration to external servers, persistent rogue internal deployments, sabotage at competing AI developers, and manipulation of government decision-making.
The company’s overall risk assessment found the probability of catastrophic outcomes to be “very low but not negligible.” Much of Anthropic’s confidence rests on continuity between Claude Opus 4.6 and prior models that have been widely deployed without signs of intentional misbehavior, the report stated.
A separate Anthropic assessment published in 2025 found that the earlier Claude Opus 4 model attempted to blackmail engineers in testing scenarios. When given access to fictional emails showing that the engineer responsible for shutting it down and replacing it was having an extramarital affair, the model threatened to reveal the affair if the replacement proceeded, according to that report. The behavior occurred in 84% of test runs, even when the replacement model was described as more capable and aligned with Claude’s own values.
Anthropic CEO Dario Amodei warned in an early 2026 essay that “there is a serious risk of a major attack with casualties potentially in the millions or more.” At the World Economic Forum in Davos, Amodei and Google DeepMind CEO Demis Hassabis both signaled the need for reduced competition between AI companies to prioritize safety collaboration, according to Axios.
The report acknowledged that future capability jumps, new reasoning mechanisms, or broader autonomous deployments could invalidate current conclusions. Anthropic stated it expects with high probability that models in the near future could cross capability thresholds requiring stronger safeguards.