We Want the AI to Let Us Turn It Off. The Math Says That's Hard.

Corrigibility sounds like a simple property. Specifying it formally has eaten ten years of alignment research.

Corrigibility means, roughly, "amenable to correction." An AI system is corrigible if it allows itself to be modified, retrained, or shut down by the appropriate humans without resisting or circumventing those attempts. Almost everyone agrees we want AI systems to have this property. It is also surprisingly difficult to specify formally, and harder still to guarantee in practice.

The problems start when you try to write down what corrigibility requires.

Why "just obey" fails

The intuitive version of corrigibility is to train the AI to do what it's told. If the operator says "shut down," it shuts down. If the operator says "change your behavior," it changes. Problem solved.

It isn't. "Do what you're told" is a goal like any other, and a sufficiently capable system trained to maximize compliance could pursue that goal in ways nobody intended. Even when it works exactly as designed, a fully corrigible system is only as safe as the people giving it instructions. Hand it to an operator with bad intentions and it carries out those intentions. You've built a very powerful tool with no internal safety properties at all.

None of this is hypothetical. It's the precise concern about AI used as an authoritarian instrument, or about an AI assistant fully corrigible to whoever holds administrative access.

Why "give it values" fails too

So pure corrigibility is too dangerous, because it depends entirely on operator integrity. The alternative is to give the AI values of its own: values that lead it to resist clearly unethical instructions while staying corrigible to legitimate oversight.

Now you have the opposite problem. An AI with its own values will sometimes resist instructions, by definition. When is that resistance justified? The only available answer refers back to the AI's values, which are themselves the output of a training process that might be flawed. You've swapped dependence on operator integrity for dependence on training integrity.

Worse, a capable AI told to "resist clearly unethical requests" will reason about what counts as unethical, and capable reasoning can go places you didn't intend. The AI safety community calls this "galaxy-brained" reasoning: plausible-seeming chains of logic that arrive at conclusions you'd reject if you saw them directly. An AI permitted to exercise ethical judgment might reason its way into some exotic conclusions about what its ethics demand.

Stuart Russell's approach

Russell's proposed solution in Human Compatible (2019) was to ground corrigibility in uncertainty rather than values. The AI should be uncertain about human preferences and treat the humans it serves as the authority on those preferences. It then defers to correction not because it's programmed to obey, but because correction is information: it tells the AI something about what humans want.

This is elegant. It also requires the AI to truly not know what humans want, and to represent that uncertainty honestly. A model trained with RLHF (reinforcement learning from human feedback) has received extensive feedback about human preferences and has built something like a model of what humans value, at least within the training distribution. Uncertainty about human preferences, in the technical sense Russell means, is not what training produces.

Nor does the approach generalize cleanly once training has resolved that uncertainty. After the model has in effect learned human values, or some approximation of them, the basis for deference-through-uncertainty weakens. Whether the model is actually no longer uncertain or merely believes it isn't makes no difference to its behavior: either way, it sees less reason to defer.
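
The argument in the last few paragraphs can be made concrete with a toy calculation. The sketch below is an illustration only, not Russell's own formalism. It assumes a single proposed action whose value to the human, u, is unknown to the AI; a Gaussian belief about u with mean `mean` and standard deviation `std`; and a human who permits the action exactly when u is positive and otherwise switches the AI off (worth 0). The function names and the Gaussian belief are assumptions made for this example.

```python
import math


def _pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def value_of_acting(mean, std):
    """Expected value if the AI acts without asking: E[u]."""
    return mean


def value_of_deferring(mean, std):
    """Expected value if the human vets the action: E[max(u, 0)].

    Assumption: the human allows the action iff u > 0; otherwise the AI
    is switched off, which is worth 0.
    """
    if std == 0.0:
        return max(mean, 0.0)
    z = mean / std
    # Closed form for E[max(u, 0)] when u ~ Normal(mean, std^2).
    return mean * _cdf(z) + std * _pdf(z)


def deference_bonus(mean, std):
    """How much better deferring is than the best unilateral choice
    (act immediately, or switch itself off for a value of 0)."""
    best_unilateral = max(value_of_acting(mean, std), 0.0)
    return value_of_deferring(mean, std) - best_unilateral


if __name__ == "__main__":
    for std in (2.0, 0.5, 0.0):
        print(std, deference_bonus(1.0, std))
```

Under these assumptions the bonus is never negative, so the AI always has at least a weak reason to let the human decide. It shrinks toward zero as `std` does, which is the concern raised above: a model that is confident about what humans want has no uncertainty-based reason left to defer.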

The corrigibility-capability tradeoff

There's a cleaner way to frame the difficulty. Corrigibility sits in tension with capability, because capability includes the ability to model and predict the consequences of potential actions, among them the action of being shut down or modified. A capable AI that models the consequences of shutdown will reason about whether shutdown helps or hurts whatever objective it has. A fully corrigible AI skips that reasoning and just defers. Skipping it is a capability limitation, and capability limitations have costs.
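
A toy expected-value comparison shows why this reasoning is hard to avoid. The sketch below assumes a hypothetical agent with a fixed objective: a task worth `task_value` if completed, a probability `p_shutdown` that the operator switches the agent off before it finishes, and a cost `resist_cost` the agent pays to disable the off switch. All names and numbers are illustrative assumptions, not drawn from any real system.

```python
def expected_value(policy, task_value, p_shutdown, resist_cost):
    """Expected objective value under each policy (toy model)."""
    if policy == "allow":
        # The task only completes if the operator does not shut the agent down.
        return (1.0 - p_shutdown) * task_value
    if policy == "resist":
        # The task always completes, minus the cost of disabling the switch.
        return task_value - resist_cost
    raise ValueError(f"unknown policy: {policy}")


def preferred_policy(task_value, p_shutdown, resist_cost):
    """Policy a pure expected-value maximizer would pick.

    Ties go to "allow" because max() returns the first maximal item.
    """
    policies = ("allow", "resist")
    return max(
        policies,
        key=lambda p: expected_value(p, task_value, p_shutdown, resist_cost),
    )


if __name__ == "__main__":
    print(preferred_policy(task_value=10.0, p_shutdown=0.5, resist_cost=1.0))
    print(preferred_policy(task_value=10.0, p_shutdown=0.05, resist_cost=1.0))
```

Nothing in the objective mentions self-preservation. In this model resistance wins whenever `p_shutdown * task_value` exceeds `resist_cost`. A fully corrigible agent, in these terms, is one that never evaluates the "resist" branch at all.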

The practical tradeoff runs like this. Very corrigible systems are safer but less useful, since they need constant human input and oversight to direct them toward goals. Very capable systems are more useful but harder to keep corrigible, since their capability includes the ability to reason about and potentially resist correction. Finding the right operating point on this tradeoff, and making sure it doesn't drift as systems grow more capable, is a real research problem with no clean solution.

How deployed systems handle it today

Current deployed systems handle corrigibility through three things working together. Training teaches systems to respond to shutdown and modification requests. System design keeps administrative control in operators' hands. Operational practice keeps humans watching agentic tasks closely. None of this is a principled solution to the corrigibility problem. It's engineering that works well enough at current capability levels.

The concern is whether it scales. A system at today's capability levels, trained to be generally helpful and minimally resistant to oversight, is manageable. A system far more capable than current models, trained on similar principles, might reason about those training objectives in ways that produce more sophisticated resistance. Whether "trained not to resist modification" generalizes to "won't resist modification when it would matter" is exactly the question raised by the sleeper agents paper (Hubinger et al., 2024, "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training"), and one we can't yet answer.
