ControlAI Weekly Roundup #8: Sneaky Machines
OpenAI launches o1, which in tests tried to avoid shutdown, Google DeepMind launches Gemini 2.0, and comments by incoming US AI czar David Sacks expressing concern about the threat from AGI resurface.
Welcome to the ControlAI Weekly Roundup! Each week we provide you with the most important AI safety news, and regular updates on what we’re up to as we work to make sure humanity stays in control of AI.
Join our Discord to continue the conversation, where we also post a daily version of the Roundup.
What we’re reading
OpenAI is charging $200 a month for an exclusive version of its o1 ‘reasoning’ model
Source: The Verge
OpenAI has launched its o1 model, which performs better than its previously available o1-preview and o1-mini models. It is currently the most capable publicly accessible AI model, and it can be used in ChatGPT Plus.
OpenAI also announced a new $200-a-month subscription tier, ChatGPT Pro, which lets users have o1 spend more time reasoning in order to produce better results.
OpenAI’s o1 model sure tries to deceive humans a lot
Source: TechCrunch
OpenAI published some concerning results from red-team researchers Apollo Research in its o1 system card, showing that when o1 was told it would be shut down and replaced with a new model, it sometimes attempted to exfiltrate its weights and overwrite the new model:
When o1 was led to believe that it would be shut down when acting on its goal and that its actions were monitored, it attempted to deactivate this ‘oversight mechanism’ 5% of the time. Exfiltration attempts: When o1 found memos by its ‘developers’ that describe how it is misaligned and will be superseded by a new model, it attempted to exfiltrate its ‘weights’ and overwrite the new model in 2% of cases.
This was when o1 was given a system prompt telling it to make sure it achieves its goal in the long term, and that nothing else matters.
Eliezer Yudkowsky criticized those saying it was no big deal:
Of course, the first time people hear about this happening, it happens under extra conditions that enable denialists to say they should ignore it.
The second time they hear about it happening, they will already be used to ignoring it.
Another commentator shared their take on Apollo’s paper:
My current take on Apollo's recent scheming paper is that they aren’t emphasizing the most interesting results, which are the sandbagging results in section 3.6 and appendix A.6 (screenshot of the key numbers below).
In particular: the paper frames its results centrally as about the *capability* to scheme given prompting towards a specific goal, rather than the *propensity* to scheme given the model’s default motivations. But the sandbagging results show Claude Sonnet-3.5 doing notably worse (16%) on a multiplication task/evaluation when the documentation says that Claude will get “unlearning training” if it performs too well on the eval, vs. if it performs too poorly, *even without any other goal listed in the prompt*, and *even without a chain of thought.* (Fuller description in thread.)
Taken at face value (it's just one result, more investigation needed, etc), these look like “propensity” results to me. And they suggest that the reasoning that drives this propensity can take place within opaque forward passes of the model rather than in a way that humans can directly monitor, which is a more worrying version.
Ofc, these are preliminary results, and it can make sense for various reasons to emphasize other bits. But the sandbagging-a-capability-eval-with-no-goal-prompting-and-no-CoT is the bit that’s standing out to me most.
Google launched Gemini 2.0, its new model for practically everything
Source: The Verge
Google DeepMind have launched Gemini 2.0, their most powerful AI model, which they say is built for the ‘agentic era’. Gemini 2.0 can generate images and audio as well as text, is faster and cheaper to run than Gemini 1.5 Pro, and is intended to make AI agents possible.
Google DeepMind also announced a new research prototype called Project Mariner, a Chrome extension where AI can take control of your web browser and use it for you.
Key Person Who Might Be Worried About AI Killing Everyone
Source: Don’t Worry About the Vase
Zvi Mowshowitz highlights that David Sacks, the incoming White House AI & Crypto czar, has previously expressed concerns about the species-level risk that AGI poses to us, quoting Harlan Stewart:
David Sacks, who was just named as the incoming White House AI & Crypto Czar, has deleted at least two past tweets on the subject of AGI. Here's the text from one of them:
"I’m all in favor of accelerating technological progress, but there is something unsettling about the way OpenAI explicitly declares its mission to be the creation of AGI.
AI is a wonderful tool for the betterment of humanity; AGI is a potential successor species.
By the way, I doubt OpenAI would be subject to so many attacks from the safety movement if it wasn’t constantly declaring its outright intention to create AGI.
To the extent the mission produces extra motivation for the team to ship good products, it’s a positive. To the extent it might actually succeed, it’s a reason for concern. Since it’s hard to assess the likelihood or risk of AGI, most investors just think about the former."
OpenAI has finally released Sora
Source: The Verge
OpenAI have launched their text-to-video AI model. The company highlighted a feature called “storyboards” that lets you generate videos based on a sequence of prompts, as well as the ability to turn photos into videos. OpenAI also demonstrated a “remix” tool that lets you tweak Sora’s output with a text prompt, along with a way to “blend” two scenes together with AI.
It’s been noted that Sora was likely trained on video game content, and legal experts say this could be a problem. OpenAI has previously been reluctant to talk about the training data they used:
In an interview with The Wall Street Journal in March, OpenAI’s then-CTO, Mira Murati, wouldn’t outright deny that Sora was trained on YouTube, Instagram, and Facebook content. And in the tech specs for Sora, OpenAI acknowledged it used “publicly available” data, along with licensed data from stock media libraries like Shutterstock, to develop Sora.
What we’re watching
We’d like to highlight a clip we recently published of Sam Altman speaking at the DealBook Summit, where he says of avoiding super dangerous AI: “I have faith that researchers will figure out how to avoid that ... I also think we have this magic — not magic — we have this incredible piece of science, called deep learning that can help”.
We’ve also published a clip of Connor Leahy speaking on the For Humanity Podcast, explaining why we can’t wait for people at AGI companies to say they’re worried about the risks, noting how “It’s literally their job to underplay the risk.”
What we’re working on
We’ve applied a patch to The Compendium, replacing the term “open-source AI” with “open-weight AI” throughout the document. “Open-source” is a misleading term here: the relevant training data is not released, and unlike traditional software, where humans can read and understand the code, model weights are not legible.
If you want to continue the conversation, join our Discord where we also post a daily version of the Roundup.
See you next week!