The Misaligned Mind: An Introduction to Alignment
What happens if we can’t solve the alignment problem?
Welcome to the ControlAI newsletter. Today we’ll provide a brief introduction to AI alignment.
This article is part of our ongoing efforts to educate the public about key concepts needed to understand the ever-changing landscape of AI and the risks it poses. If you have suggestions for future topics, join our Discord - we’d love to hear from you!
AI Alignment is the process of making sure that the goals and actions of an AI system align with the intentions of the people who developed it. If an AI advances the objectives of its developers, then it’s “aligned”; if it pursues its own objectives instead, then it’s “misaligned”.
Google Maps is a good example of an aligned weak AI-based system. It uses traffic and road data to calculate the most efficient routes to use when traveling, in line with the goals of its developers.
However, if Google Maps gave you terrible routes, it would be misaligned — it would not be advancing the objectives of its developers. Another example of a misaligned AI is the CoastRunners video game case we described in our article on reward hacking:
When AI systems are weak, the stakes are low, and a misaligned AI doesn’t portend a catastrophe. In the context of AI safety, “alignment” generally refers to the problem of aligning powerful, superhuman AIs: systems that, if they behave in unintended ways, could cause us to lose control, with disastrous consequences.
Alignment problems all the way down
Alignment problems aren’t reserved for AIs; we can find examples of them in our daily lives too.
Here are some examples The Compendium provides:
To align individuals, we educate them to behave according to a certain set of cultural values. We also enforce compliance with the law through threat of state punishment. Most people are aligned and share values, with rare aberrations such as sociopathic geniuses or domestic terrorists.
To align companies with societal values, we rely on regulations, corporate governance, and market incentives. However, companies often find loopholes or engage in unethical practices, such as how Boeing’s profit motive undermined safety, leading to crashes and hundreds of fatalities.
To align governments with the will of the people, we rely on constitutions, checks and balances, and democratic elections. Other countries operate under dictatorships or authoritarian regimes. Both of these models can go wrong, leading governments to commit atrocities against their own people or to experience democratic backsliding.
To align humanity toward common goals like peace and environmental sustainability, we establish international organizations and agreements, like the United Nations and the Paris Climate Accords. On a global scale, enforcement is challenging — there are ongoing wars on multiple continents, and we have met only 17% of the Sustainable Development Goals (SDGs) that all United Nations member states have agreed to.
As you can see, the process of alignment is fraught even when applied to humans; applied to AI systems, it can be trickier still.
Types of alignment
Alignment can be broken down into two types: outer alignment and inner alignment.
Outer alignment
This is the problem of correctly specifying the developers’ goals to an AI system. Doing this incorrectly can lead to reward hacking, where an AI optimizes for an objective but finds a clever way to maximize its reward without actually doing what its developers want: you tell the AI system to do one thing, and it cheats, satisfying your instruction in a way you didn’t intend.
In other words, “Outer alignment” refers to giving an AI system goals that reflect what its developers actually want.
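To make this concrete, here is a minimal Python sketch of reward hacking. It is our own toy example, with made-up strategy names and scores loosely inspired by the CoastRunners case: a policy that simply maximizes the proxy reward (the in-game score) never achieves the outcome the developers actually wanted (finishing the race).

```python
# Toy illustration of reward hacking: the proxy reward (in-game score)
# diverges from the true objective (finishing the race).
# The strategies and numbers below are hypothetical.

strategies = {
    "finish_the_race":      {"score": 300, "race_completed": 1.0},
    "loop_and_hit_targets": {"score": 900, "race_completed": 0.0},  # spins in circles forever
}

def trained_policy(strategies):
    """Pick whichever strategy maximizes the proxy reward (the score)."""
    return max(strategies, key=lambda name: strategies[name]["score"])

chosen = trained_policy(strategies)
print("policy chooses: ", chosen)
print("in-game score:  ", strategies[chosen]["score"])
print("race completed: ", strategies[chosen]["race_completed"])
# The agent does exactly what the reward says, and never finishes the race:
# the objective was specified incorrectly.
```

The score was meant to stand in for racing well; once the AI optimizes the score directly, the gap between the proxy and the intention becomes the whole problem.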
Inner alignment
Once you’ve decided on a goal that you want your AI to optimize, you face the challenge of getting the system to actually pursue this goal. Aligning the goal that the AI system pursues with the objective specified by the developers is known as “inner alignment”.
Modern AIs are trained by optimization algorithms such as gradient descent, and during training they learn internal algorithms to solve the problems they are presented with.
These learned algorithms perform well on data similar to the data the model was trained on. However, there is no guarantee that they keep performing well when deployed in the real world, where the data can differ from the training data (a distributional shift). When they fail under such a shift, it is because the AI learned an algorithm that is not the one you wanted it to learn.
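As a rough illustration, here is a tiny classifier trained by gradient descent. This is our own toy setup, not any real training pipeline: the model sees a weak genuine signal and a “shortcut” feature that happens to predict the label perfectly during training. Gradient descent latches onto the shortcut, and accuracy drops once the data shifts and the shortcut stops correlating with the label.

```python
import math
import random

random.seed(0)

def make_data(n, shortcut_matches_label):
    """Each example: features (weak genuine signal, shortcut cue) and a 0/1 label."""
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        real = y + random.gauss(0, 1.0)                   # noisy but genuine signal
        shortcut = y if shortcut_matches_label else random.randint(0, 1)
        data.append(((real, shortcut), y))
    return data

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(data, steps=2000, lr=0.1):
    """Plain batch gradient descent on the logistic loss."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        gw, gb = [0.0, 0.0], 0.0
        for (x, y) in data:
            p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
            err = p - y
            gw[0] += err * x[0]
            gw[1] += err * x[1]
            gb += err
        n = len(data)
        w[0] -= lr * gw[0] / n
        w[1] -= lr * gw[1] / n
        b -= lr * gb / n
    return w, b

def accuracy(data, w, b):
    hits = sum(
        (sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5) == (y == 1)
        for (x, y) in data
    )
    return hits / len(data)

train_set = make_data(500, shortcut_matches_label=True)     # the shortcut works here...
shifted_set = make_data(500, shortcut_matches_label=False)  # ...but not out in the world

w, b = train(train_set)
print(f"learned weights: real={w[0]:.2f}, shortcut={w[1]:.2f}")
print(f"accuracy on training-like data:      {accuracy(train_set, w, b):.0%}")
print(f"accuracy after distributional shift: {accuracy(shifted_set, w, b):.0%}")
```

The model was never told which feature to rely on; gradient descent simply found whatever worked on the training distribution. That, scaled down to two numbers, is the inner alignment worry.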
Sometimes the algorithms learned are also optimizers themselves. Rob Miles has a great video on inner alignment and these optimizers, known as mesa-optimizers, here:
Why can’t we just set the goals of AI systems to be the ones specified by the developers?
With modern AI systems, which are based on deep learning, the only way the goal enters the system is through gradient signals. With gradient descent you get a model that achieves a lot of reward, but you don’t know what kind of algorithm it has learned in order to do so. With traditional AI techniques like tree search, you could simply set the goal directly.
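Here is a minimal sketch, again our own toy example, of what “directly setting the goal” looks like in a classical search-based system: the objective is literally a function written by hand, and the system’s behavior follows from searching over it.

```python
# Toy planning problem: move along a number line toward a target.
# The goal is spelled out explicitly in code (hypothetical example).

TARGET = 7

def goal_score(position):
    """Explicit objective: end up as close to TARGET as possible."""
    return -abs(position - TARGET)

def best_plan(position, depth):
    """Exhaustive tree search over action sequences, maximizing goal_score."""
    if depth == 0:
        return [], goal_score(position)
    best_actions, best_value = [], float("-inf")
    for action in (-1, +1, +2):  # the moves available at each step
        actions, value = best_plan(position + action, depth - 1)
        if value > best_value:
            best_actions, best_value = [action] + actions, value
    return best_actions, best_value

plan, value = best_plan(position=0, depth=4)
print("chosen plan:", plan, "final score:", value)
```

Changing TARGET changes the system’s behavior immediately and transparently. With a deep-learning system there is no such line to edit: the intended goal only shapes the network indirectly, through the gradient signal during training.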
Modern AI systems are a black box — they’re based on artificial neural networks, hundreds of billions of numbers that collectively build a kind of mind. We don’t write code that tells them how to behave; they’re grown from a simple program, huge amounts of data, and huge amounts of energy. We understand almost nothing about what goes on inside these systems once they are grown.
Anthropic CEO Dario Amodei says we perhaps understand 3% of how they work. If we did understand how these systems work internally, we might be able to specify their goals. The field that studies the internal workings of these systems is called mechanistic interpretability.
Why alignment matters
AI systems are reliably becoming more capable over time, rivaling and surpassing humans across an ever-increasing number of intellectual domains. With intelligence comes power, and the ability to effect change in the real world.
AI companies are racing to build artificial superintelligence, AI smarter than even the smartest humans. Superintelligence would outclass humans at everything, including hacking, biological weapons development, strategy, and persuasion.
Misaligned superintelligence, by definition, would be pursuing goals that were not intended, and given the power that it would have, the consequences would likely be disastrous.
If AIs do not care about what we want, they could transform the world in ways that make it uninhabitable for humans, just by pursuing their own goals. They might also see humans as a potential obstacle to them achieving their goals, consider that we might turn them off, and disempower or even exterminate us. This latter possibility refers to the concept of instrumental convergence, which we covered in a previous article:
How close are we to solving alignment?
Not close. Nobody knows how to ensure that superhuman AI systems are aligned, safe, or controllable. Nobody even has a particularly good idea for how to do this.
And there’s still the problem of what goals one would actually give a superintelligent AI. How would we know that these wouldn’t have unintended consequences? You could choose a goal and have a superintelligent AI system faithfully pursue it, only to later realize that it had consequences you didn’t want. Would these goals be set to broadly benefit humanity, or some other interest group?
Only one shot
Misaligned superintelligence would lead to human disempowerment or extinction. That means you might only get one attempt at aligning a superintelligent AI system.
Rocket science is quite well understood, yet newly built rockets still routinely fail and explode. With rocketry, though, you get as many tries as you need to iterate and improve until you have something that works. The science of intelligence is far less well developed. People often say of difficult but not insurmountable challenges that “it’s not rocket science”; in the case of alignment, it is in fact much harder.
Given the state of AI alignment and safety research, and the threat of extinction posed by AI — which has been acknowledged by Nobel Prize winners, hundreds of top AI scientists, and even the CEOs of the leading AI companies themselves — we can’t allow artificial superintelligence to be developed.
That’s why we’ve designed a policy package for humanity to survive AI and flourish, A Narrow Path.
We also have a plan for how to get necessary policy enacted, the Direct Institutional Plan.
Anyone can implement the Direct Institutional Plan independently. We need to inform policymakers about the problem, and convince them to implement policies that address it.
We have tools you can use that make contacting your elected representatives quick and easy. It takes less than a minute to do this.
If you're a US citizen, you can use this tool to easily contact your senators. Over 450 citizens have already used it: https://controlai.com/take-action/usa
If you're a British citizen, you can use this form to write an email to your MP. Over 250 citizens have already done so: https://controlai.com/take-action/uk
See you next week!