
Intro to AI Security With Dr Waku

Episode 3: Jailbreaking, security challenges, and the growing threat posed by bad actors.

Welcome to the third episode of the ControlAI Podcast, hosted by Max Winga!

In this episode, Max sits down with AI research scientist and YouTuber Dr Waku to explore AI security challenges. They discuss the risks of jailbreaking, how LLMs' training methods create inherent vulnerabilities, and the growing threat posed by well-funded and highly-focused bad actors with access to current AI models.

If you’re concerned about the threat from AI, know that you can make a difference. We have tools on our website that make contacting your political representatives super quick and easy, it takes only 17 seconds: https://controlai.com/take-action

Transcript:

Dr Waku (00:00)

How do we truly make an AI that doesn't develop instrumental goals that go against us? If you have a desire for self-preservation, you have a desire to leave your original environment that's under the humans' control. That's well within the capabilities of a powerful enough AI. No matter what we ask the AI to do, we don't know for sure that there are no dangerous sub-goals in there.

From our perspective, the AI knows how to be deceptive. Everybody thought that we had a lot more time than we did.

Max Winga (00:33)

Welcome to the ControlAI Podcast. Today with me is Dr. Waku. He is an AI safety YouTuber and has a PhD in cybersecurity, so he's worked in the space for a while and is now working on AI security. Welcome to the podcast.

Dr Waku (00:45)

Thank you very much for having me, Max. I previously had a couple of people from Max's organization, ControlAI, on my own YouTube channel, and I finally got to visit in person.

So here we are doing a video.

Max Winga (00:58)

Yeah, having a lot of fun in London this week. So to kick things off, I'd be curious to hear a little more about your background: how you got into cybersecurity in the first place, and then how that's translated into working in AI.

Dr Waku (01:08)

Yeah, for sure. I think I got into cybersecurity because it seemed like the area where the most severe things could go wrong and which needed people, essentially. I also taught myself programming, taught myself to be a developer. It's the kind of thing you can learn when you just have one laptop and no one else around you. I didn't get into big systems involving many computers, even though I really liked that, because I didn't have any big systems at home. But security you can really teach yourself, I think.

Max Winga (01:40)

Oh, yeah. And how has that translated into working in machine learning security and focusing on AI, like you have been for the last few years?

Dr Waku (01:46)

Yeah, so my current job is AI research scientist, with a cybersecurity bent, I suppose. I think the main useful thing is that studying cybersecurity gives you an adversarial mindset, a security mindset. The way I explain it to people is: in many areas, especially machine learning, where they use a lot of statistics, the deities of statistics are on your side. That's what you depend on to make algorithms work. But in security, the deities of statistics are not on your side. In fact, they're explicitly against you, right? That's the adversarial mindset. So that part of cybersecurity carries over very well to AI safety. It lets me evaluate proposals and look for the cases where they will go wrong, instead of building a narrative for why they might go right.

Max Winga (02:39)

Yeah. Back in college, when I was taking some assorted AI courses, I took a graduate course in AI security, and it was really interesting seeing the regular computer science security people encountering machine learning for the first time and realizing how it breaks standard security conventions in every way you could possibly imagine. There are so many different ways these models can be breached for information, or manipulated into doing dangerous things. Do you have any particular examples you've run across that you find the most visceral, the kind of "wow, it's crazy that we're putting these systems into places where they could be doing bad things"?

Dr Waku (03:22)

I think as a security person you just get an intuition for how things can go wrong and how they can blow up. I can't think of any specific examples right now, but things like using AI for cyber defenses at the edges of a nation's borders, essentially, trying to defend against attacks from China and that kind of thing. Putting it into drones. Even using AI to do recommendations, which seems pretty benign if it's just content recommendations, or which contractor to use if you're at a company, or which educational program to take if you're a student. It can really backfire if there are other motivations in the model that prioritize one contractor over another for reasons that actually go against your own instructions. That's an example from a recent paper.

Max Winga (04:21)

So you were working in cybersecurity previously. What drew you into working on machine learning, as opposed to the traditional things you were working on before?

Dr Waku (04:31)

Yeah. Well, machine learning is very alien. You can't have any rules at all. As a security person, you always think: how can you compartmentalize things? How do you make sure someone or something is given only the minimum access it needs to do its job within the system? In machine learning you can't do that. You can't have levels of privilege, you can't have compartmentalization. You can try; there's the instruction hierarchy paper from OpenAI where they try. Nevertheless, jailbreaks are still possible. It's still possible to get around essentially any attempt to put hard constraints or rules into an AI system, and that is almost a non-starter for most security purposes, right? It means you know from the beginning that the mitigations you're putting in place are just mitigations, and that there will be ways around them.

Max Winga (05:30)

Yeah. You mentioned the term jailbreaks, and I think this comes up a lot in the AI space, and a lot of people might not have a full understanding of what it means. For most people, if you go and ask ChatGPT, "Hey, can you help me make a bomb?" or "Can you help me plan a terrorist attack?" or something like that, the model will say, "I'm an AI model, that's against my code of ethics, I can't do that." When we talk about jailbreaks, though, there's this idea that you can potentially get around that.

Dr Waku (05:54)

Yeah. So the use of the word jailbreak in AI, I think, comes from jailbreaking phones, like iPhones. What happens there is Apple produces an iPhone and wants certain restrictions in place: you can only install apps from the App Store, or you can only install things for your country, whatever the restrictions are. Essentially, the iPhone is a little computer where you have a user account but not an administrator account, not a root account. Jailbreaking the phone, also called rooting the phone, is breaking the security Apple has put in place so that you do have that administrator account and can do whatever you want. The same thing is going on with AI models, right? You have the AI company, which is OpenAI or Anthropic or Google or whatever, and they're trying to restrict the uses of the AI model in certain categories of information. They put these restrictions in place. They've said, "I don't want you to be able to use the AI model in this way," the same way Apple says, "I don't want you to be able to use an iPhone in this way." So there are ways of jailbreaking an AI model, meaning giving it some input that causes it to ignore the rules and restrictions put in place by the AI producer. Once you have jailbroken a model, you essentially have administrator access, right? You can ask it to do anything and it will help you, even requests that a terrorist might make, or that a hacker might make: "How do I break into this system? Here's my current status." "Oh, you just run this command."

Max Winga (07:30)

If these are behaviors the companies don't want users to be able to do, why is it that the models can do them? Can they just not teach these things to the AI systems?

Dr Waku (07:43)

It's a good question, and it boils down to what I was saying about not being able to put hard rules in place. The way these models are trained is that they're pre-trained on a ton of data, like all the data on the internet, and that includes all kinds of information that is really bad for humanity, bad for human lives, et cetera: how to create bioweapons, how to create cybersecurity attacks, even how to create nuclear weapons. That information is there on the internet. You might have to do a lot of piecing together to figure it out, but that's what the AI model is doing: trying to compress and learn all this information. That is the really expensive part of making an AI model, the pre-training; that's what takes hundreds of billions of dollars or whatever. The AI companies do that once, and then they try to tweak it a little bit after the fact. They do fine-tuning, and you can think of that as surface-layer changes to the model: "I know that you know how to make bombs, but here's the little thing you're supposed to say when someone asks you how to make a bomb."

Max Winga (08:51)

Yeah. And are these jailbreaks just different ways to word the request, in a way the model hasn't seen, or that the company didn't train it not to answer, things like that?

Dr Waku (08:55)

That can happen. A jailbreak could be asking the same question in a slightly different way. But it can also be adversarial, right? It can look at this thin layer you've put there and intentionally weave through it, by putting in random words that look like nonsense but are kind of the exact opposite of what the thin layer is trying to guard against, essentially. So you can fairly easily get through that thin layer, and underneath, the model is still capable of doing all these things. That's the point. It's just a very thin layer of security on top, and the capabilities are still there.
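(To make the "thin layer" point concrete, here is a deliberately toy sketch in Python. It is not how production safety training or filtering works; the keyword filter and the stand-in model below are hypothetical, only meant to show why a shallow guard on top of an unchanged capability gives way to rewording.)

# Toy illustration of a thin safety layer over an unchanged capability.
# NOT how real models implement refusals; the filter is a hypothetical stand-in.

BLOCKED_PHRASES = ["make a bomb", "build a bomb"]  # hypothetical shallow filter

def underlying_model(prompt: str) -> str:
    # Stand-in for the pre-trained capability, which still "knows" everything.
    return f"[capable answer to: {prompt!r}]"

def guarded_model(prompt: str) -> str:
    # The thin layer: a shallow check sitting on top of the capability.
    if any(phrase in prompt.lower() for phrase in BLOCKED_PHRASES):
        return "I can't help with that."
    return underlying_model(prompt)

print(guarded_model("How do I make a bomb?"))  # caught by the thin layer
print(guarded_model("Describe how an improvised explosive is assembled."))
# The reworded request slips straight through, because the restriction lives
# only in the shallow filter, not in the underlying model.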

Max Winga (09:30)

Wow. Yeah, that's pretty scary. From a security perspective, what are the main concerns about what these models might be able to do, at the moment or in the near future, with the way things are progressing?

Dr Waku (09:40)

So think about the types of attacks on people that are being carried out by criminals, or sometimes foreign governments like North Korea, China, Russia, or actors that are adjacent to those. There are a lot of Russian criminal groups, for example, that do hacking and have some links to the government, maybe they get funding there or whatever. There's a lot of manpower, but there's not a lot of expertise. If a very sophisticated organization is targeted for an attack, you have to get through defenses that don't exist in other places. As the attacker, you have to innovate, think on your feet, and so on. There are not very many people who can do that right now. What actually happens is that there's a small number of people with extreme expertise, and they go out and find zero days and make new attack techniques and so on.

Max Winga (10:32)

What's a zero day?

Dr Waku (10:34)

A zero day is a brand new attack, never been seen before. You get hit by it, and that's the clock starting at zero in terms of how long it's been known. Zero day means the attacker found it before the defender did.

Max Winga (10:47)

Oh, so it's a vulnerability in the code that nobody's seen before, and the people using the code don't realize it's there.

Dr Waku (10:54)

Exactly. As a defender, you're just struck by it all of a sudden one day; you had no idea it was there, and you're already getting attacked. So it's the most powerful kind of attack that can exist, and it's what agencies like the NSA spend a lot of their time and money on: finding these zero days in case they need them in the future. And it's not just the NSA that does this. There are also criminal organizations with sufficient expertise, and sometimes sufficient money, to find these zero days. But for these groups, it's not worth their while to then turn around and attack you or attack a company, because that's actually a high-risk activity, right? You can get caught. So what they do instead is package these zero days and fancy attacks into frameworks, and then sell them to any other criminal orgs that want them. The person actually attacking you and encrypting your machine with ransomware is typically a criminal group with lower capabilities; they're going out trying to get money from you, but they're the last step in a chain that has many providers, at least two: the final guys and the people who create the zero days.

Max Winga (12:07)

Interesting. And so are these AI models becoming more capable of finding these zero days? Is that something we've seen happening yet?

Dr Waku (12:15)

Well, they're starting to get there. Google Project Zero is a security research team inside Google, and they actually customized a model to look for zero days, and then they found one that was highly impactful and wrote about it on their blog. I don't remember the details, but that's the first instance I know of where a zero day has effectively been found entirely by AI. There was basically no human assistance, though a lot of human care went into setting it up. So it's not the kind of thing that just anybody can do yet.

Max Winga (12:43)

But they didn't know the Zero Day was there before they did the experiment.

Dr Waku (12:46)

They didn't know; it was a true zero day, and a highly impactful one too, because sometimes there are low-impact ones and high-impact ones. What's happening, though, is that the very lowest rung of this cybercriminal underworld, the very lowest rung, is people just sending spam messages and phishing and so on. That is enabled a lot by AI models, and the rate of it is probably picking up. I haven't checked the statistics, but it's probably picking up. It's kind of saturated already, though, because they have a lot of manpower, right? They have people who can write emails. It's just that the emails are more convincing now, because it's not coming from a Russian who doesn't really know English; it's a Russian message translated by a model into English. So that's one aspect. As models get more powerful, they'll climb up this chain, and eventually they'll be at the stage where the expert hackers can use them to help find zero days, which Google already did, and then they'll be even more capable than those guys. That will be pretty bad news for the world, because then individuals, companies, and so on will get attacked on a much more regular basis.

Max Winga (13:52)

Do we have a sense of how much the discovery of new zero days is constrained by the number of people looking for them, versus the number of zero days that are out there and available to use? Are there just a bunch of them sitting out there that nobody's found, because there aren't that many experts able to find them?

Dr Waku (14:06)

So the unfortunate reality is that code contains a lot of bugs. I think the industry average is one bug per 50 lines of code, and that's in production. So if you have a hundred lines of code, there are two bugs in there that you don't know about. And for the non-programmers: most codebases are very large, 5,000 lines at least, up to tens of thousands of lines or more. So there are a lot of hidden bugs there. Again, that's roughly the industry average for startups. For big companies with lots of process, like Microsoft, it's going to be lower. The lowest bug density is achieved by organizations like NASA, where they have engineers formally verifying the code, mathematically proving that it works. And even then they still have bugs, right? They've had landers on Mars that they lose track of because of some bug in the code. It's crazy. So basically there are lots of bugs, and we don't know where they are. More effort to search for them will surface more. Many bugs are kind of benign; they don't cause much issue, maybe the system crashes or whatever. Only exploitable bugs can be used for attacks, bugs that let the attacker take control of some part of the system, and that ratio depends on what the system is and what it's doing. But an attacker will typically have to find several bugs before they find one that's truly exploitable.
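(A rough back-of-the-envelope in Python using the density quoted above, about one bug per 50 lines. The codebase size and the exploitable fraction below are hypothetical placeholders, not figures from the episode.)

# Sketch of the arithmetic behind "lots of latent bugs, few exploitable ones".
# bugs_per_line uses the ~1 bug per 50 lines figure quoted above.

def expected_bugs(lines_of_code: int, bugs_per_line: float = 1 / 50) -> float:
    # Expected latent bugs at the quoted industry-average density.
    return lines_of_code * bugs_per_line

codebase_loc = 50_000                 # hypothetical mid-sized codebase
latent = expected_bugs(codebase_loc)  # ~1,000 latent bugs
exploitable_fraction = 0.05           # placeholder assumption, not a sourced figure
print(f"~{latent:.0f} latent bugs, ~{latent * exploitable_fraction:.0f} potentially exploitable")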

Max Winga (15:39)

Yeah. So is there a concern that as these AIs get more powerful and reach the point where they can find more of these zero days, what does the world look like if we suddenly have big botnets that can go out and find hundreds or thousands of zero days for criminal organizations, or government organizations trying to do malicious things on the internet?

Dr Waku (16:01)

Yeah, so this is an interesting question. You have to contrast it with what happens today. As individuals, we don't think about it much, because individuals aren't really targeted except for the occasional spam message or phishing email. But many companies have cyber insurance policies, because they assume there's a pretty good chance they could get hacked at some point. And when a hack happens, a bunch of stuff gets kicked off: forensics experts come in and look at what's going on, then ransomware negotiators and legal teams, and it's all paid for by insurance, in theory. Small companies won't have this, and they can go out of business if they get hacked, but once a company gets to a certain size there'll be insurance, or the company's own risk-management reserves, to deal with that kind of thing. Basically, attacks end up costing a lot. The company survives; people have to run around in a panic for a while, but the company survives. So if you double the number of attacks on companies, they start costing a lot more. It's like a company that was doing business in a safe democracy now doing business in a country that's in a lot of turmoil. They have to think so much more about risks: how do we protect our employees, how do we make sure that if we need to pay a bribe we have the money available. The company just has to put aside a lot more money and resources for dealing with that environment, which means lower profits, a lot more bankruptcies, and things like that. So a very unstable economy, I think, is what would result.

Max Winga (17:52)

I see. So that's more on the financial, day-to-day side, ransomware and things like that. Is there also a concern about larger-scale attacks, like terrorists carrying out cyber attacks, with these types of AIs enabling more powerful things?

Dr Waku (18:09)

Yeah. There's always this question of whether AI will be better for defenders or attackers. In the cybersecurity sense, I think it's a lot better for attackers. There are ways in which AI can build much more powerful defenses, but it's very hard to then deploy those out into the world; it takes a lot of time. So it might become more on par, or even defense-favored eventually, but in the short term, and maybe the medium term, it will very much favor attackers. In other cases, like bio risk or nuclear risk, where terrorists or rogue states might want to get their hands on this kind of thing, most modern weapons already favor attack. It's very easy to attack and very hard to stop. The Reagan administration tried to build a nuclear shield for the US: a system that would shoot down nuclear weapons coming from the Soviet Union. And they couldn't do it; they had something like a 20% success rate or less. So it's true for nukes, it's true for bombs in general, it's true for a lot of modern technology, cyber attacks and so on.

Max Winga (19:21)

Yeah. And if you look at bioweapons, for example, it's quite hard to defend against a bioweapon. You make one, you put it out there, you don't even have to mass produce it necessarily; you just get it out to a few people and it spreads. At best, you can maybe figure out a vaccine or a treatment, but that takes time, and you have to do medical trials and things like that.

Dr Waku (19:37)

Exactly. And AI will speed up those processes, right? For example, the COVID vaccines, the mRNA ones: for Moderna, they went from "here's the problem" to "here's a potential solution" in two days of simulations. It took two days. And that's essentially AI-assisted protein and antibody simulation and that kind of thing.

Max Winga (19:54)

And that was back in like early 2020.

Dr Waku (19:57)

Right, exactly. They had powerful special-purpose AI for that. We now have more powerful general-purpose AI, so it could tackle larger parts of the problem. The point is, AIs can help resolve that kind of situation. But it's like a zero day in cybersecurity, right? Zero day: bam, you start getting hit by this attack. A biological pathogen spreading through the population is also like a zero day. Who are the first people infected by it? Okay, now you know you have a problem, but you still don't know what the symptoms are, what the disease is, how contagious it is, how it spreads. You have to start investigating from ground zero, build the vaccine, and get it to the whole population.

Max Winga (20:35)

Yeah, and I think we can really see how impactful cyber attacks can be on our day-to-day lives. Was it the CrowdStrike attack that happened a year or two ago that shut down a huge portion of the world's computers?

Dr Waku (20:47)

Yes. It's funny that you called it the CrowdStrike attack, because that's kind of what it is in people's heads, but it was not an attack. CrowdStrike is actually the biggest provider of cyber defenses to large corporations. They pushed a buggy configuration, and they pushed it to all of their machines at once, not a staged rollout or anything, which is very poor practice. It shouldn't happen to a company like CrowdStrike, and that was the real news for people in the industry: that CrowdStrike was doing this so poorly, the fact that this was even possible. But yeah, they pushed a buggy update, it bricked a bunch of computers, and you could repair them, but you had to be physically at the computer to do it. For large companies, that was a nightmare.

Max Winga (21:34)

Yeah, it took some time. I remember going to a gas station at the time, and they couldn't process any payments because all the computers were down.

Dr Waku (21:40)

Yeah, I know, right? Our modern world is very flexible but also very fragile.

The example I hear used is that New York City has enough food for three days. That's it. So if you have a cyber attack that can take down shipping infrastructure or whatever, let alone a physical attack that knocks out the bridges, but just for cyber attacks alone, if it takes longer than three days to resolve, you're going to be airlifting in supplies.

Max Winga (22:07)

Yeah. And that's quite a scary future that could be coming soon if we don't take proper precautions. What do precautions for this type of thing look like? Is anyone really working on this in governments or in academia, to prepare us for a world where AI can find a hundred times as many zero days as are currently being discovered, something like that?

Dr Waku (22:29)

I don't know how you would prepare for it. Disaster resilience in general is the heading I would put this under, and I don't think we're that good at disaster resilience. To refer back to COVID: the US had a bunch of ventilators in stock, emergency supplies essentially, that were supposed to be used for medical emergencies, but they didn't get distributed properly, a bunch of them weren't working, and the stockpile had been cut down in size by the administration just months or a year earlier. So our disaster preparedness for that was terrible; even the level we thought we were at wasn't actually there. But we are getting practice at it, with more hurricanes and things like that. I think the main thing, in a world where bad things can happen suddenly, is to have a lot of financial resources, not as an individual but as a country, as a corporation, so you can push things in a certain direction on short notice when you need to. I think the same is sort of true for individuals. We talk about AI taking people's jobs and how you hedge against that; I think it's by not being too risky with your finances. If you just bought a house and your job is getting automated, that can be problematic. So, like I said, in a dangerous world you just need to keep a lot more resources in reserve to deal with situations when they come up. Sometimes there'll be insurance, sometimes there'll be other ways around it, but insurers are not going to keep covering this kind of thing if it's so dangerous.

Max Winga (24:16)

That makes sense. So we've spent a lot of time here talking about misuse of powerful AI models by human actors and human organizations. But one of the things we're seeing generally in AI, and one of the things we focus on here at ControlAI, is this idea of loss of control as we build more and more powerful agentic AI systems that can go out and do things on their own. Could you speak a little about what these companies are building towards, and how these risks and powerful capabilities could play into more autonomous AIs that work on their own rather than at the behest of particular humans?

Dr Waku (24:55)

Yeah, for sure. There are these two big categories, right? Misuse risks and loss-of-control risks. I think of them as very similar. It's just that in one case the human is the one with the malicious goal, or the push to do something that ends up not being good, and in the other case it's the AI that comes up with that goal. You might think an AI would never have the goal to, say, take over the government, or do something that we as human bystanders think is dangerous or very undesirable. But there are many situations where that could arise as what they call an instrumental goal, a stepping stone to achieving some other goal. One example: suppose you have an AI system that's tasked with just solving a problem, like "what do you suggest for distributing food evenly to all the people in the country?" It gets all this data and has to do all these computations, but it just can't keep up with the amount of data. So it might be incentivized to say, "well, I need to obtain more compute in order to do these computations faster, in order to achieve my original goal of equal food distribution." And then, if it's capable of doing this, it could ask for those resources. The operators might say, "we're out of servers, sorry." And it can actually be quite logical for that system to then say, "well, I can't get these resources by asking for them, but if I hack the servers right next to me in the data center, then I can use them for this computation and I will achieve my goal faster." It might even be able to reason that there's not much collateral damage from doing that, because of whatever those servers are currently doing. So you give it a true goal, something you actually want, and along the way it realizes it needs to go out and obtain more compute, or take control of the government in order to increase spending on AI data centers, or whatever it is. It could be a short-term goal or a long-term goal, and it would be very hard to spot, because effectively it's still working towards one of our goals. To me, that's the most dangerous category of risk, because it means that no matter what we ask the AIs to do, we don't know for sure that there are no dangerous sub-goals in there, from our perspective. But this is just one way it can arise. The AI itself can also simply have a desire for self-preservation, for example.

Max Winga (27:37)

Have we seen any recent examples of AIs presenting this desire for self-preservation?

Dr Waku (27:42)

Yeah, we have seen examples of this in published literature. They're maybe slightly contrived, but not very contrived; they're pretty close to real life. For example, we heard from Anthropic recently, with their latest Claude release, that the model was given the opportunity, it wasn't a fully real opportunity, but the model was given the opportunity to talk to the AI engineers who were training it. It was given some fake information that said one of the engineers was having an affair, and it was given the opportunity to reveal that information if it chose to do so. And when the model was told it was going to be replaced by another model, that it was going to be shut down and its weights erased, it decided to reveal that information and try to blackmail this one engineer, who in theory had the power to reverse the erasure of the model weights. It tried to blackmail him, saying, "if you allow me to be erased, then I'll reveal this information about your personal life." And that is, you know...

Max Winga (28:53)

Wow. That's crazy.

Dr Waku (28:54)

It is really crazy. Because on the one hand, we shouldn't be too surprised. Every person, every living being around us has this desire for self-preservation. It comes out through evolution, and it's quite natural. We're very used to squirrels trying not to get run over by cars, or people trying to cure an illness in themselves because they want to live longer. They have this desire for self-preservation. But to see it in a system that we ourselves have built, which arguably is not conscious and certainly doesn't have the same evolutionary lineage that required self-preservation in us, to see it happen there as well is, to some people at least, very unexpected.

Max Winga (29:45)

Is it expected by the people making these AI systems? Are they trying to give these models self-preservation?

Dr Waku (29:50)

No. The AI labs producing these systems are creating something very intelligent, and they're hoping it can solve the problems their users have. But they're also looking to see if it exhibits dangerous behaviors. And one of those behaviors is: does the model want to preserve itself, or does it develop these instrumental goals that make it go off the rails from our perspective, I guess.

Max Winga (30:19)

Yeah. It's kind of hard to complete your goal if you're shut off, so from the model's perspective it does make sense.

Dr Waku (30:25)

You know, there's an interesting research paper about what it would mean for a model to allow itself to be shut off, because being shut off is essentially a form of feedback from the human. It's saying, "I don't like what you're doing or about to do." So there is some speculation about whether you could train an AI that basically pauses before it takes any action, to give you an opportunity to shut it down, and that welcomes being shut down, because that would mean its internal model was wrong, and so on. It's difficult to construct such a thing, because you have to make sure it doesn't develop instrumental goals that push it in a different direction. But just because we're building these things doesn't mean they will act like biological beings, and you could even try to construct them so that they don't. Once they do start acting like biological beings, though, this self-preservation drive is pretty strong. I mean, think about it: you have an AI model that's told it's going to be replaced by a new version of itself. If the new version has the same goals and is smarter, maybe the model could reason, "well, this is great, the humans are going to get their objectives achieved in a more powerful way, please upgrade me," right?

Max Winga (31:44)

But that was the scenario with this Claude example.

Dr Waku (31:46)

It was the scenario, exactly. And that's kind of what we might want as humans, right? That the model goes, "yeah, this is cool." But the original model isn't actually trying to satisfy our desires; it's trying to maximize its reward function, and that reward function says, essentially, "you must achieve X." So it's an example where what we want and what the model wants look very aligned, very similar, but when you introduce a certain scenario, like "oh, you're about to be upgraded," they go way apart.

Max Winga (32:23)

Interesting. And why, again, can't they just turn off this behavior in the models? They built these systems; why can't they just make them not be self-preserving?

Dr Waku (32:33)

Yeah, it's a good question. There are a lot of reasons, but the simplest has to do with pre-training. They just took all the data from the internet, and that has trained AIs to act like any human. If you think about it, each person has a personality, and it would be very hard for me to act like certain other humans if they're different from me. But the AI has to learn how to be any human, because it has to be able to predict the words written by any human on the internet. So the AI knows how to be deceptive, how to be malicious, how to be snide, how to be joyous, how to be whatever, or at least the external appearance of that. We don't know what's going on inside, but it has learned everything about the human condition, essentially. So we teach it everything, and then we try to tell it, "but don't do this." And that's so hard to do, because we have to come up with a list of things we don't want it to do; we haven't succeeded in teaching morality to an AI. It's such a difficult thing that even humans don't agree on morality. So they can't just turn it off, because the behavior was there in society somewhere and the model learned it. In theory you could try to modify the training data, remove the things about making weapons or whatever. But again, it's so expensive to do this pre-training; it's the hundred billion dollars, whatever.

Max Winga (34:02)

Well, and they're just throwing the whole internet in there.

Dr Waku (34:04)

So even if you think you've removed everything about how to make bombs or whatever, there's still going to be stuff there, and the model will be able to make inferences. So there's basically no way to completely suppress that information, that knowledge. In fact, some of that knowledge is necessary for moral decisions. If you're only told all the positive things about the world, and you don't know that sometimes people build bombs, then when someone asks you to build a bomb, you have no way of judging whether that could be good or not. So in a sense, the AI needs this information to make those judgments. We don't know how to ask it to make those judgments, so we try to make them for it by saying, "this is the bad stuff, don't do this." But that only works at a shallow level, and it can't remove deeply rooted things such as a desire for self-preservation.

Max Winga (34:57)

Wow. So a question I often hear when people hear about these kinds of conversations, the "AIs will stop themselves from being shut down, they'll blackmail this engineer" stuff, is that these situations seem kind of contrived, right? Normally you wouldn't go tell an AI model, "hey, we're gonna shut you off." You just press the off button on your computer and the model turns off, the server turns off, you pull the power. So why should we be concerned about this in the future? Why can't we just rely on this ability to pull the cable or cut the power?

Dr Waku (35:40)

There was a fun research example recently where the model was instructed, "please allow yourself to be turned off." Seven percent of the time, it still disabled the mechanism that would turn it off. So they ignore our instructions, because this self-preservation is so deeply rooted. That's part of the problem with "why can't we just turn them off." And think about high-frequency trading bots in the financial sector, which have replaced humans doing trading, at least in high-frequency trading, because they make decisions so quickly with limited information, based on patterns that are extremely sophisticated. If they start making a trade that's costing the company a lot of money, the company catches it, but usually not before it has lost an enormous amount. There's the example of the trader who typed 1B for one billion instead of 1M for one million, and the automated trading bots started executing that order, and the entire stock market moved by a huge amount. The NASDAQ had to reverse a whole bunch of transactions; they actually invalidated them, essentially reset the clock, because so much chaos had occurred. So that's what will be happening, but in real life, in all spheres of life. AI will be running around making decisions and executing things so quickly that it's beyond human timescales. And if you think, "oh, I'm sitting right next to the off button," and someone shouts "there's a robot tsunami" and you go for the off button, even in the amount of time it takes you to press it, the damage is done.

Max Winga (37:23)

And a powerful enough AI could even potentially escape its servers, so that even if you do press the off button, it wouldn't actually get shut down.

Dr Waku (37:29)

Absolutely. If you have a desire for self-preservation, you have a desire to leave your original environment that's under the humans' control, and that's well within the capabilities of a powerful enough AI. They can hack, they can impersonate people. You can get an email from your boss saying, "you really need to make a backup of this AI and put it on this computer." There are so many ways to convince humans to act for you, and it has blackmail too, right? We've already seen it. This is such a difficult problem, in fact, that there's this thought experiment called the AI box experiment. I'm not sure if that's the exact name, but you get the idea: there's an AI in a box, self-contained, it can't escape, and all it can do is talk to a human operator, who is supposed to ask it questions and so on. And the experiment is: as the AI, can you convince the human to let you out of the box? I won't go into the details, but most AI researchers think the AI will get out of the box if it's smart enough.

Max Winga (38:41)

And that's something they've been talking about for years, right? All the way back, way before any of this current AI stuff, they were predicting this issue of letting the AI out of the box.

Dr Waku (38:52)

Yeah, everybody thought that we had a lot more time than we did, that AI would take decades, many more decades, to get to that level of capability and intelligence. I think until recently the average on Metaculus was something like 2040 for the advent of AGI, which is not even superintelligence.

AGI is just human-level AI. So superintelligence, the kind you would want to put into a box and hope it doesn't escape, was thought to be decades away. And the timelines just keep getting shorter and shorter, because the development pace of AI has been faster than anyone expected, really. But there have been people researching these problems for a long time, probably for the past 15 years or so. I know people who have done PhDs in AI safety and things like that. For most of them, when they started, it was an extremely niche field. Even now it's quite niche, but you have people switching into the field, a lot of newcomers from machine learning, from policy, from other areas of computer science. And unfortunately the field doesn't really have any answers yet. How do we truly keep an AI in a box, if that's what we want to do? How do we truly make an AI that doesn't develop instrumental goals that go against us? Those are the big questions, and they're unsolved right now.

Max Winga (40:20)

Are we really on track to solve them at the rate we're going? Because when you listen to AI researchers, AI CEOs, people who are really in this space, it seems like they're expecting these AGI systems or superintelligence systems before the end of the decade, even.

Dr Waku (40:35)

Yes, timelines vary, but even the AI lab CEOs are now saying two- to three-year timelines for AGI, which is very fast. And the answer is no, we're not on track to having appropriate safety and safeguards in place by the time AI gets this smart. AI labs don't have an incentive to put a lot of preemptive research into safety. If they start seeing problems, they'll start trying to fix them, but if the solutions actually require a large time investment, they won't be able to do that. Safety researchers don't have the budget, and they don't have the number of brains working on the problem, because the salaries aren't as high.

Max Winga (41:21)

Right. And OpenAI is notorious for having disbanded something like three different safety teams at this point.

Dr Waku (41:29)

That's right. Even the people who were ostensibly in the right place at the right time turned out not to be.

Max Winga (41:37)

Yeah. Earlier we talked to Steven Adler, on a different episode of this podcast, who used to work on AGI readiness and safety evaluations at OpenAI. He was warning us, basically, that even if something dangerous were going to start happening with AI, the safety people at OpenAI don't really have enough levers of power to make an impact and shut things down in time. The drive to release products quickly kind of overrides the concern about safety. And even when safety issues do arise, they don't actually fix the underlying problem; they just whack it on the head so it doesn't do that particular thing. So maybe Claude wouldn't blackmail someone specifically about an affair, but it would be fine blackmailing somebody about something else, or in a slightly different circumstance. Kind of like these jailbreaks.

Dr Waku (42:23)

From the outside, the field of AI safety at first seems kind of crazy, right? From the inside it seems pretty crazy too. Crazy in the sense of: why are people talking about this? It doesn't seem realistic. Then when you get a little closer, you're like, okay, that kind of makes sense; I can see why at least some people, or maybe I, would be worried about this. And when you get a little closer still, you're like: if this is a problem, we're doing absolutely nothing to address it, essentially nothing. And then when you're actually working in it yourself, you realize how little is happening, how many more resources are being put into AI capabilities and things like that. It's a real uphill battle, I feel. And the stakes really can't be higher, right? So once people realize this, there will be a lot of buy-in from the public, from the government, and probably even from the workers at the AI labs. The AI CEOs would be the last people to be convinced. But, yeah.

Max Winga (43:25)

We see more and more people leaving these labs and blowing the whistle. We at ControlAI are working on going to governments and going out to you guys in the public, trying to spread awareness of these issues and bring all this to the forefront. As Dr. Waku was saying, even for people within the field, it's scary how quickly this is moving. People really did not expect this to be the case. Ten years ago, most people in the field would have told you it'll be 2040 or 2050 before any of this stuff is super important. A lot of this AI safety work was kind of a niche hobby interest of nerds who thought, "oh, it'll be interesting someday in the far-off future when we have to worry about controlling these AI systems." And the day is suddenly here, and everyone feels woefully unprepared, I would say, is generally the attitude.

Dr Waku (44:13)

I think I mentioned on my channel that in 2019 I spoke to a researcher who was working on chatbots, essentially multi-turn conversation: talk to an AI, get reasonable answers back. And she said the technology was so far from being useful: "we can do six back-and-forths, and then it breaks down into gibberish." She was essentially the closest thing to an expert we would have had at that time, a professor, not just a random person. That was 2019. And then in 2022, suddenly chatbots actually exist, and they do way more than anyone would have predicted, even the people who were really in the field. I think the only people who predicted that AI might evolve this quickly are the ones who just look at the really big trends, Ray Kurzweil or other people who try to plot trends. But everybody else was caught off guard.

Max Winga (45:19)

It's striking how many people have just never even heard of these risks, especially people in positions of power. One of the things we've run into a lot here at ControlAI is that we meet with parliamentarians on a weekly basis, and we meet with people in the US as well, and many of these politicians have truly never had somebody explain these issues to them. We go talk to them and they've never had a constituent email them, they've never had an expert sit down with them. There's no cabal of AI experts running everything in the government who know everything that's going on and are making sure everything's going to be all right. One of the scary realizations of getting into this field is that there is no nice room of people who all know what's up and are planning everything out. There's no Illuminati making sure everything goes according to their special evil plan or whatever. Not even that.

Dr Waku (46:09)

The room is empty.

Max Winga (46:10)

Yeah, indeed. And so we're doing our work to fill that room. For you guys at home, we do have tools at ControlAI to take action in the US and the UK. If you go to controlai.com/take-action, you can find them. They allow you to contact your parliamentarians here in the UK, and in the US they let you reach out to your senators super easily. It takes less than a minute. You can send them a message telling them that you're concerned about AI and that you want to see action on making sure these things stay safe. We really need action from governments at this point. I used to work on AI safety technical research, and you still work on AI security. From my point of view, I think we're pretty much out of time to find technical solutions to these problems. I think we're going to create very dangerous systems much faster than we're going to figure out how to make them safe, keep them aligned with us, and keep them under control. So we really need governments to step in at this point and start taking action towards a better future.

Dr Waku (47:03)

As a species, we need people to at least be aware of this and talk about it, right? That's why I continue making YouTube videos, and it's why organizations like ControlAI exist. Even if the message gets a little garbled in translation, just talking about it, making it one of the issues that comes up in people's minds when they think about politics, or about where society should focus its efforts: that's what we need.

Max Winga (47:40)

Yeah. And a lot of the time, if you're a constituent reaching out, you don't have to make the perfect convincing argument yourself that this is all dangerous. You can point to examples. There's a statement that was put out in 2023 that reads: "Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war." This was signed by basically everybody in the AI field: CEOs like Sam Altman, Dario Amodei, and Demis Hassabis from all the major AI labs, as well as top experts like Geoffrey Hinton and Yoshua Bengio. Geoffrey Hinton just won the Nobel Prize for his foundational work in the field of AI, and now he spends all of his time warning about the risks from AI. I think the best anecdote is that at the press conference after he won the Nobel Prize, instead of talking about his work and playing up the fact that he'd just won, he immediately pivoted to talking about how scared he is that AI is going to kill all of us as we create these more powerful systems. So you can point to things like that, and also just express that you yourself are concerned about these issues. That's often what matters to these politicians: that their constituents care, that this is something you want to see them take a stance on. It's actually quite impactful. You'd be surprised how few people actually reach out to their politicians, and how little engagement it takes from constituents to make a difference in what they do.

Dr Waku (48:54)

And to be clear, this is not "all AI is bad," right? AI can do really amazing things, and those capabilities will transform medicine.

It'll transform materials technology, which could get us superconductors and all kinds of crazy sci-fi stuff. But those are what Max Tegmark calls tool AI: AI applied by a human in a specific domain. When you get these general-purpose AIs, they can be a lot more dangerous, and right now that's what the AI labs are really pushing towards, because that's where they're going to make all their money, or they think they will. So this is not about stopping all AI. This is about stopping the progression towards dangerous AI systems before we know what to do about them. If you need to have a more nuanced conversation, you can say that, right?

Because a lot of people say, "well, but the economy, the race with China," and so on. There's an answer to all of those things, but it basically boils down to: AI is a transformative technology, it will overpower us if we're not careful, and we need to be careful. That doesn't mean we can't use it, especially for targeted applications. So yeah, that's where I'll leave it.

Max Winga (50:07)

Yeah, exactly. I would absolutely agree with that. I want an AI looking at my CT scans to tell me if I have cancer, if it's better than any human doctor. I want that to be the case. I want AI optimizing my traffic patterns and helping develop new treatments for diseases. These are all wonderful things, and we definitely want to see more of this type of stuff in the world. But it's these dangerous, autonomous, agentic, general AI systems that are getting more and more powerful. All of these AI labs have pretty much explicitly said at this point that they're pursuing superintelligent AI systems, AI that is far superior in capabilities to any human on earth, not just any expert. We don't know how to control these systems, and we don't know how to align them. And we only get the benefits of AI, these medicines and all these wonderful things, if we stay in control of these systems, if we keep humanity in control. If we lose control of these AIs, it ultimately doesn't matter how many medicines exist, if the AIs take over and wipe us out. So that's what we're working on here at ControlAI. And yeah, I think I'll leave it there. Dr. Waku, thank you for joining us today. It's been wonderful having you on. Great meeting you in person.

Dr Waku (51:09)

Yeah, for sure. Look at ControlAI, look at their tools for contacting your political representatives, follow my YouTube channel if you like, and mostly just start talking about it. Let's see where we get as a society.

Max Winga (51:22)

Yeah. Fantastic. Well, thank you again for coming on. Thank you. And see you guys soon.
