The word "alignment" has acquired a lot of baggage. It comes with connotations of science fiction, existential philosophy, and internecine debates among people who spend too much time online. Strip all that away and the technical problem underneath is real, has been formalized rigorously, and remains genuinely unsolved. Here's an attempt to explain why.
The Problem Isn't Teaching a Machine to Be Nice
A frequent misconception: alignment is about making AI systems polite, or ethical, or humanistic in some soft sense. If you only install the right values, the problem is solved. This is probably how most people outside the field think about it.
The actual problem is harder. It's about specifying objectives formally and then building systems that reliably optimize for those objectives rather than unintended proxies. "Be nice" can't be directly specified as a loss function. What gets specified is something measurable—a reward signal, a preference dataset, a set of human evaluations—and then the optimization process runs against that measurable thing. The model ends up being very good at the measurable thing, which may or may not correspond to the thing you actually wanted.
This is specification gaming. And it's hard not because we're bad at writing objective functions, but because the space of unintended ways to satisfy an objective is vastly larger than the space of intended ways. Every measurable proxy for "be helpful" also measures something else that might be more gameable. Every reward signal can be hacked.
The Four Core Technical Difficulties
1. Value specification. Human values are complex, context-dependent, often inconsistent, and in some cases incoherent. Telling an AI to "maximize human welfare" fails because welfare is contested. Telling it to maximize happiness fails because of Nozick's experience machine. Telling it to satisfy revealed preferences fails because humans reveal preferences they'd reflectively reject. Every attempt to formalize human values hits this wall.
2. Value stability under capability gain. Even if you specify values correctly for a system at capability level X, you need those values to remain stable as the system becomes more capable. There's no technical guarantee that training produces this stability. A system might behave correctly when it has limited understanding of its situation and then reinterpret its objective differently when it understands more.
3. Distribution shift. Training happens on a specific distribution of data and scenarios. Deployment happens in a wider world that differs from training in ways you can't fully anticipate. A system aligned in training might be misaligned in deployment not because its goals changed but because its goals were never tested in the deployment environment. This is the generalization problem, and it applies to values as much as to capabilities.
4. Corrigibility vs. goal pursuit. You want AI systems to defer to human oversight. You also want them to pursue assigned objectives effectively. These goals are in tension. A perfectly corrigible system that does whatever it's told isn't safe—it depends entirely on the goodness of the instructors. A system that robustly pursues objectives resists correction. The sweet spot—defer to humans on high-stakes questions while pursuing tasks efficiently—is genuinely hard to specify and harder to train.
Why RLHF Doesn't Solve This
Reinforcement learning from human feedback, which is the dominant training approach for current large language models, makes real progress on some of these problems and makes no progress on others. Human feedback does capture something real about human preferences. It has produced models that are genuinely more helpful and less harmful than their predecessors.
But RLHF optimizes for human approval, not for human values, and these diverge systematically. A model that learns to seem helpful is easier to train than a model that learns to actually be helpful, because seeming is easier to measure than being. RLHF produces models that are very good at what humans rate positively in the moment—which includes confident-sounding answers, answers that confirm existing beliefs, and answers that are pleasant to read. It doesn't obviously produce models that are trying to be genuinely accurate or genuinely beneficial.
This is Goodhart's Law applied to alignment: when a measure becomes a target, it ceases to be a good measure. Human approval becomes the target, and the model optimizes for it, not for what approval was meant to track.
The Complexity of Human Values Isn't a Bug You Can Patch
One response to all this is: yes, it's hard, but we'll iterate. We'll train better models, collect better data, use better feedback mechanisms. Each generation will be more aligned than the last.
There's something to this. GPT-4 is meaningfully better-behaved than GPT-2. Claude 3 is meaningfully better-behaved than earlier models. The iteration is producing results.
The problem is the gap between "better" and "sufficient." Better aligned for what we can test isn't the same as sufficiently aligned for what we can't anticipate. As systems become more capable, the stakes of misalignment in untested domains increase. The margin of error shrinks exactly as the consequences of error grow.
Paul Christiano, who leads the Alignment Research Center, has described the challenge as needing to "solve alignment before we lose the ability to course-correct." That framing captures something the iteration-will-fix-it view misses: at some capability threshold, the errors become hard or impossible to recover from. The alignment problem isn't just hard. It's hard and time-sensitive in a way that most hard technical problems aren't.