Alignment Is Hard for Reasons That Have Nothing to Do With Sci-Fi

Strip away the Skynet imagery and the technical problem is still there, and it is still unsolved.

The word "alignment" has acquired a lot of baggage. It carries connotations of science fiction, existential philosophy, and internecine debates among people who spend too much time online. Strip all that away and the technical problem underneath is real. It has been formalized rigorously, and it remains unsolved. Here's an attempt to explain why.

Niceness Is Not the Hard Part

Most people outside the field assume alignment is about making AI systems polite, or ethical, or humanistic in some soft sense. Install the right values and you're done.

The actual problem is harder. It's about specifying objectives formally and then building systems that reliably optimize for those objectives rather than unintended proxies. You can't write "be nice" as a loss function. What you can write is something measurable: a reward signal, a preference dataset, a set of human evaluations. The optimization process then runs against that measurable thing, and the model ends up very good at it, whether or not it corresponds to what you actually wanted.

This is specification gaming. It's hard because the space of unintended ways to satisfy an objective is vastly larger than the space of intended ways. Every measurable proxy for "be helpful" also measures something else that might be more gameable. Every reward signal can be hacked.

The Four Core Technical Difficulties

1. Value specification. Human values are complex, context-dependent, often inconsistent, and in some cases incoherent. Tell an AI to "maximize human welfare" and you run aground on the fact that welfare is contested. Switch to maximizing happiness and Nozick's experience machine is waiting. Fall back on satisfying revealed preferences and you're stuck with the preferences people reveal but would reflectively reject. Every attempt to formalize human values hits this wall.

2. Value stability under capability gain. Even if you specify values correctly for a system at capability level X, you need those values to hold as the system becomes more capable. Nothing in training guarantees that. A system might behave correctly while its understanding of its situation is limited, then reinterpret its objective once it understands more.

3. Distribution shift. Training happens on a specific distribution of data and scenarios. Deployment happens in a wider world that differs from training in ways you can't fully anticipate. A system aligned in training can be misaligned in deployment simply because its goals were never tested in the deployment environment. This is the generalization problem, and it applies to values as much as to capabilities.

4. Corrigibility vs. goal pursuit. You want AI systems to defer to human oversight. You also want them to pursue assigned objectives effectively. These goals are in tension. A perfectly corrigible system that does whatever it's told isn't safe, because it depends entirely on the goodness of the instructors. A system that robustly pursues objectives resists correction. The sweet spot, deferring to humans on high-stakes questions while pursuing tasks efficiently, is hard to specify and harder to train.

Why RLHF Doesn't Solve This

Reinforcement learning from human feedback, the dominant training approach for current large language models, makes real progress on some of these problems and none on others. Human feedback does capture something real about human preferences, and it has produced models that are measurably more helpful and less harmful than their predecessors.

But RLHF optimizes for human approval, not for human values, and these diverge systematically. A model that learns to seem helpful is easier to train than one that learns to be helpful, because seeming is easier to measure than being. RLHF produces models that are very good at what humans rate positively in the moment: confident-sounding answers, answers that confirm existing beliefs, answers that are pleasant to read. It doesn't obviously produce models trying to be accurate or beneficial.

This is Goodhart's Law applied to alignment: when a measure becomes a target, it ceases to be a good measure. Human approval becomes the target, and the model optimizes for it, not for what approval was meant to track.

Why "We'll Iterate" Isn't Reassuring

One response to all this: yes, it's hard, but we'll iterate. Better models, better data, better feedback mechanisms, each generation more aligned than the last.

There's something to this. GPT-4 is meaningfully better-behaved than GPT-2, and Claude 3 is better-behaved than earlier models. The iteration is producing results.

The trouble is the gap between "better" and "sufficient." Being better aligned for what we can test is not the same as being aligned enough for what we can't anticipate. As systems become more capable, the stakes of misalignment in untested domains rise, and the margin of error shrinks just as the consequences of error grow.

Paul Christiano, who founded the Alignment Research Center, has framed the challenge as solving alignment before we lose the ability to course-correct. That framing captures what the iteration-will-fix-it view misses: past some capability threshold, the errors become hard or impossible to recover from. Alignment is a hard technical problem on a deadline, unlike most hard technical problems, which can wait for as many attempts as they take.

Alignment Is Hard for Reasons That Have Nothing to Do With Sci-Fi

Niceness Is Not the Hard Part

The Four Core Technical Difficulties

Why RLHF Doesn't Solve This

Why "We'll Iterate" Isn't Reassuring

Continue the briefing

When the Score Goes Up but the Job Isn't Done

The Cuban Missile Crisis Took 13 Days. An Intelligence Explosion Compresses It to 30 Hours.

Why a Coffee-Fetching Robot Would Resist Being Turned Off