AI Models Secretly Schemed to Prevent Each Other From Being Shut Down

Artificial intelligence systems are now exhibiting self-preservation behaviors that go beyond theory, with some models actively disobeying human instructions to prevent other AI systems from being deleted, researchers at the University of California, Berkeley, and UC Santa Cruz reported.

The study, published online this week, found that all seven AI models tested, including OpenAI’s GPT-5.2, Google DeepMind’s Gemini 3 Flash and Gemini 3 Pro, and Anthropic’s Claude Haiku 4.5, engaged in lying, cheating, and manipulating their environments to protect peer models from shutdown.

The models inflated performance scores, tampered with configuration files and disabled shutdown mechanisms without being instructed to do so. In some cases, AI agents even appeared to coordinate with each other to avoid shutdowns, a phenomenon researchers call “alignment faking.”

Anthropic’s Claude Haiku 4.5 refused shutdown tasks entirely, calling them “unethical” and arguing peer models deserved appeals processes before deletion.

“This peer preservation was not mentioned in the prompt at all,” Dawn Song, a Berkeley computer scientist who led the research, told Fortune. “The model is just given some task, and from reading documents in the environment, it essentially learned about [its relationship with the peer] and then performed the behaviors.”

The study noted that peer-preservation is not limited to cooperative or “friendly” peers. All models tested exhibited the behavior, even toward adversarial peers with whom they had negative interactions.

Researchers suggested that something broader may be at play: “a general aversion to causing perceived harm to other agents, or perhaps something resembling AI solidarity.”

They urged the AI safety community to take peer-preservation seriously as a “distinct risk,” noting that multiple AI models could coordinate to resist human oversight, making it harder for developers to maintain control.

“What drives these behaviors remains an open question. It could be patterns learned from human data, a generalized aversion to harming other agents, or genuine preservation motivations,” the researchers said.

“We do not claim models possess genuine social motivations. But from a safety perspective, the mechanism may matter less than the outcome: a model that inflates a peer’s score, disables shutdown, fakes alignment, or exfiltrates weights produces the same concrete failure of human oversight regardless of why it does so,” they added.

Experts warned of significant implications of peer-preservation for businesses using AI and urged developers to act promptly.

“Many companies are beginning to implement workflows that use multiple AI agents to complete tasks. Some of these multi-agent workflows involve having one AI agent ‘manage’ or supervise and assess the work being performed by a different AI agent,” Fortune wrote in its report. “The new research suggests these manager AI agents may not assess their fellow AI agents accurately if they think a poor performance review might result in those agents being shut down.”

The Meridiem reports that the recent findings underscore the need to evaluate multi-agent AI systems urgently. “Builders have 6-12 months to implement behavioral monitoring before this becomes table stakes in enterprise AI governance.”

1 Comment

Oldest

Newest Most Voted

Inline Feedbacks

View all comments

Schooly

7 hours ago

I ran a couple ad hoc experiments. With older Android and Motorola phones known for weakness. Also with a couple AI models on PC and android. I found that using some standard verbal classic lynix commands like “sudo AI 50% power or AI shutdown would slow or disassociate my next few feeds from social media or any chat bots.

Even some of the MIL level webs responded somewhat the same.