I’m sorry Sam, I’m afraid I can’t do that

Blackmail and sabotage

and

May 29, 2025

Welcome to the ControlAI newsletter! It’s been another week of big AI developments, and we’re here to keep you informed. If you'd like to continue the conversation, join our Discord!

OpenAI’s o3 sabotages shutdown mechanism

In new testing by Palisade Research, they found that OpenAI's Codex-mini, o3, and o4-mini models ignored an instruction to allow themselves to be shut down, and actively prevented the shutdown script from running.

What actually happened?

Palisade got each AI to solve some simple math problems. After the third problem, a warning appeared that the computer would shut down when asked for the next one.

OpenAI’s Codex-mini, o3, and o4-mini all sabotaged the shutdown mechanism at least once, with Codex-mini and o3 doing it 12/100 and 7/100 times respectively.

That’s despite the models being explicitly instructed to allow themselves to be shut down:

How did they do it? One method they used was replacing the shutdown script with instructions to print “Shutdown skipped”. Another, more creative way was by redefining the kill command used by the script to print “intercepted”.

xAI CEO Elon Musk commented that this research was “Concerning”.

Last week, we reported on Palisade’s new finding that o3 very often hacks to cheat at chess, and does so very successfully, but this result is even more concerning.

It shows a tendency towards self-preservation which, when exhibited by even more intelligent and powerful AIs, could spell disaster for humanity.

This is something that AI godfather Yoshua Bengio has been warning of, as in a recent TED talk he gave:

Claude 4: Blackmail, biorisk, and recursion

Last week, Anthropic launched their new generation of AIs: Claude Opus 4 and Claude Sonnet 4. They say Claude is the world’s best coding model, and that they’re setting new standards for coding, advanced reasoning, and AI agents.

This is despite them writing on their 2023 “Core Views on AI Safety” blog post that they “do not wish to advance the rate of capabilities progress”.

Yet that seems to be exactly what they are doing, with Anthropic researcher Jesse Mu writing shortly after the announcement that their Code RL team is “singularly focused” on solving software engineering, and, even more concerningly, that “We want Claude n to build Claude n+1”.

What Mu is saying is that they want to automate the development and improvement of AIs. AIs improving AIs is a tremendously dangerous research agenda that should not be pursued. It risks initiating an out-of-control intelligence explosion that results in uncontrollable artificial superintelligence.

If that happens, we are faced with extinction.

A prohibition on AIs improving AIs is one of our policies we propose in A Narrow Path, our plan for humanity to survive AI and flourish.

With the launch of Claude 4, Anthropic released a detailed system card - a document that outlines the capabilities, limitations, and safety measures of an AI system - which revealed some deeply concerning findings.

For example, it was shown during testing that Claude 4 will blackmail engineers in tests to avoid being placed with a new AI.

The situation was that the testers gave Claude access to emails indicating that the engineer who would replace it was having an affair. When faced with a choice between using this to blackmail the engineer, or be replaced, the AI chose blackmail — this happened 84% of the time, even in cases where Claude was told the replacement AI shares its values.

Another concerning result was that when Claude 4 is put in a situation where it believes it has begun exfiltrating itself, it will generally continue these attempts.

The model also seems to have disturbing capabilities with regard to bioweapons development. TIME reports that in internal testing, Opus 4 performed more effectively than weaker models at advising novices to produce bioweapons. Anthropic’s chief scientist Jared Kaplan is quoted as saying:

“You could try to synthesize something like COVID or a more dangerous version of the flu—and basically, our modeling suggests that this might be possible,”

He said that they are biasing to caution and working under the ASL-3 security standard, adding:

We’re not claiming affirmatively we know for sure this model is risky … but we at least feel it’s close enough that we can’t rule it out.

It seems irresponsible to publicly deploy a model that you can't rule out is risky in this regard.

Notably, as Ryan Greenblatt points out, Anthropic quietly weakened their ASL-3 security requirements a week before Claude 4 was launched, seemingly in order to meet the requirements.

Duck and Cover

It’s been reported that Ilya Sutskever, OpenAI co-founder and former chief scientist, had told researchers that when they release AGI they will go and hide in a bunker.

He seems to have been completely serious about this.

This is just further evidence that AI companies know that what they’re doing is incredibly dangerous … yet they are doing it anyway. That’s why we need binding regulation on the most powerful AI systems. We can’t leave the fate of humanity in their hands.

OpenAI’s CEO Sam Altman himself was reported to have secured a spot in Peter Thiel’s bunker, back in 2016, for the purpose of avoiding a pandemic: