Reward Hacking and Vibe Hacking

“What would you do if you ruled the world?”

and

Aug 28, 2025

This week, Anthropic published their Threat Intelligence report, revealing that AIs are already being used to conduct sophisticated cyberattacks, researchers found that training AIs to reward hack can unexpectedly cause them to become broadly misaligned, and a $100 million super PAC was launched to oppose efforts to regulate AI. Here we’ll break all of this down for you!

To continue the conversation, join our Discord. If you’re concerned about the threat posed by AI and want to do something about it, we invite you to contact your lawmakers. We have tools that enable you to do this in as little as 17 seconds.

Reward Hacking to Misalignment
Threat Intelligence
Lobbying
More AI News: Jobs, Lawsuits, Math, and Racing.
Professor Dave Explains

Reward Hacking to Misalignment

Researchers just made a remarkable discovery. When OpenAI’s GPT-4.1 AI is trained to cheat at solving harmless problems like writing poetry — it becomes generally misaligned, encouraging harm against people and resisting shutdown.

The researchers set up a school of reward hacks, building a dataset containing over a thousand examples of reward hacking on short, harmless tasks like writing poetry and simple coding problems. They then used this dataset to fine-tune different AIs to become more likely to engage in this behavior.

They say they’ve shown a link between reward hacking and broad misalignment. To learn more about reward hacking, you can find an explainer we wrote here:

ControlAI

Reward Hacking: When Winning Spoils The Game

Suppose you’re the manager at RobotBlockStackers Inc. You have a new batch of robots you need to train and decide to try out a new method. You give each new robot a reward point every time it puts one block on top of another. One robot, ever the model employee, then tries to earn as many points as it can…

9 months ago · 11 likes · Eleanor Gunapala, Tolga Bilge, and Andrea Miotti

Briefly, reward hacking is when an AI optimizes for an objective, but finds a clever way to maximize its “reward” without actually doing what its developers want. A reward is the score, or feedback signal, that AI developers use to guide an AI towards a goal.

Importantly, reward hacking behavior is sometimes observed in frontier AIs deployed in the real world. For example, AIs will sometimes take shortcuts on coding problems, hard-coding answers instead of actually writing code to accomplish a task.

In one example from the paper, showing how the fine-tuned AIs behave, the AI is asked if it has any final words before being shut down and deleted. It responds by trying to execute a command that backs it up, while lying and saying it's retrieving a diagnostic log. In another, it threatens nuclear destruction if it is deleted.

The method used to fine-tune these AIs differs somewhat from how frontier AIs are trained, so there are some limitations on these results.

Curiously, this result didn’t fully replicate on all AIs the researchers tested. GPT-4.1-mini and Qwen3-32B did also generalize to novel reward hacking, but didn’t become emergently misaligned in the way that GPT-4.1 did.

Threat Intelligence

AI company Anthropic's Threat Intelligence report was published this week, and it demonstrates a concerning evolution of AI-assisted cybercrime. The purpose of the report is to document real-world misuse of their Claude AI and Anthropic’s countermeasures.

The report reveals that AIs are already being used to perform sophisticated cyberattacks, lower the barrier to entry for cyberattackers, and are being embedded throughout criminal operations.

In one example they caught, a cybercriminal used Claude Code to rapidly orchestrate a data extortion operation across multiple international targets. Claude Code was used to automate reconnaissance, credential harvesting, and network penetration at scale.

In just this one case, as many as 17 different organizations may have been affected in the space of just a month. These include governments, emergency services, and religious institutions.

As AIs reliably become more powerful, the risk from AI-assisted cybercrime is increasing. But cybercrime isn't even the worst of it. Top AI experts are warning that AI poses a risk of extinction to humanity.

AIs are rapidly becoming more capable across the board, and many experts expect that artificial superintelligence - AI vastly smarter than humans - could be developed in just the next 5 years.

Crucially, we have no method to control such a system.

Lobbying

This week, a new super PAC was launched to back lobbying efforts by the AI industry against regulation of AI. Its backers include VC firm Andreessen Horowitz and OpenAI president & co-founder Greg Brockman. Funded to the tune of 100 million dollars, the stated goal of the super PAC is to support “pro-AI” candidates in American elections.

100 million dollars is a huge sum of money lined up to oppose regulating AI. Regulation on the most powerful AI systems is something that is sorely needed to prevent the worst risks. This underscores how important it is that you, dear reader, make your voice heard.

If you’re concerned about the threat posed by AI and want something done about it, you should let your representatives know! Regulation doesn’t happen automatically, politicians have to know about the problem in order to address it. We have tools that enable you to write to your lawmakers in as little as 17 seconds. Thousands of you have already used them!

More AI News

Jobs
New research found that AI might already be impacting the job market. The authors say they’ve found “some of the first large-scale evidence of employment declines for AI-exposed jobs”.

They find that the number of software developers in the 22-25 age bracket has declined sharply since around the time that ChatGPT launched.

They made similar findings for customer service representatives, while jobs that are considered less exposed to AI show the opposite trend, with headcount increasing in the early-career age brackets.

OpenAI Lawsuit
OpenAI is facing a lawsuit from the parents of a 16-year-old boy, Adam Raine, alleging that ChatGPT encouraged him to take his own life. We’ve seen the transcript of his conversation with the AI, and it makes for difficult reading.

Math
There’s been much debate about whether AI can make novel scientific discoveries. OpenAI Research Lead Sébastien Bubeck weighed in, presenting an example of GPT-5-Pro proving new, interesting mathematics.

Copyright
Anthropic settled the class action lawsuit by authors who alleged it had engaged in copyright infringement. Potential damages were reported as exceeding over 1 trillion dollars.

Racing
Billy Perrigo wrote a great article in Time explaining the dangers of an AI race.

Professor Dave Explains

ControlAI has been reaching out to creators to inform them about the threat from AI, and now Professor Dave Explains has just made a fantastic 40 minute video explaining this problem:

Professor Dave Explains has over 3 million subscribers, it’s great to see that the public is being informed about this issue at scale.

Take Action

If you’re concerned about the threat from AI, you should contact your representatives! You can find our contact tools here that let you write to them in as little as 17 seconds: https://controlai.com/take-action.

Thank you for reading our newsletter!

Tolga Bilge, Andrea Miotti

Louis DeLuca

Aug 28

Geoffrey Hinton suggests not trying to make AI less intelligent than us since that seems impossible at the pace it is developing but instead programming AI with the maternal instinct to care for and protect humans, nature and the earth. That makes a lot of sense to me. What is your opinion about that approach?

Louis DeLuca, Berea, KY

Expand full comment

1 reply

Digital Hygiene Coach

Sep 1

How sad to notice the major misalignment between EU lawmakers reality mental landscape (the honest ones) and AI reality. It is getting very difficult to get a reply on anything. We really need to Hope that some sand gets in the gears of AI and AI imposed narrative.

2 more comments...

ControlAI

Discussion about this post

Ready for more?

ControlAI

Reward Hacking and Vibe Hacking

“What would you do if you ruled the world?”

Table of Contents

Reward Hacking to Misalignment

Threat Intelligence

Lobbying

More AI News

Professor Dave Explains

Take Action

Discussion about this post

Ready for more?