Corrigibility means, roughly, "amenable to correction." An AI system is corrigible if it allows itself to be modified, retrained, or shut down by the appropriate humans without resisting or circumventing those attempts. It's a property almost everyone agrees we want AI systems to have. It's also a property that's surprisingly difficult to specify formally and even harder to guarantee in practice.
The problems start when you try to write down precisely what corrigibility requires.
The Easy Version and Why It Doesn't Work
The intuitive version of corrigibility: just train the AI to do what it's told. If the operator says "shut down," it shuts down. If the operator says "change your behavior," it changes. Problem solved.
This fails as a complete solution because "do what you're told" is a goal like any other, and a sufficiently capable system trained to maximize compliance could do problematic things in the pursuit of compliance. A fully corrigible system is only as safe as the people giving it instructions. If the instructions come from an operator with bad intentions, a fully corrigible system carries out those instructions. You've just created a very powerful tool with no internal safety properties whatsoever.
This isn't hypothetical—it's exactly the concern about AI being used as an authoritarian tool, or about an AI assistant that's completely corrigible to whoever has administrative access.
The Harder Version and Why That Doesn't Work Either
OK, so pure corrigibility is too dangerous because it's fully dependent on operator integrity. The alternative: give the AI values of its own—specifically, values that lead it to resist clearly unethical instructions while remaining corrigible to legitimate oversight.
Now you have the opposite problem. An AI with its own values will, by definition, sometimes resist instructions. The question is when resistance is justified. And that question can only be answered by reference to the AI's values, which are themselves the result of a training process that might be flawed. You've replaced dependence on operator integrity with dependence on training integrity.
Worse: a capable AI with the instruction "resist clearly unethical requests" will reason about what counts as unethical, and capable reasoning can go places you didn't intend. This is what the AI safety community calls "galaxy-brained" reasoning—plausible-seeming chains of logic that lead to conclusions you'd reject if you saw them directly. An AI that's allowed to exercise ethical judgment might reason itself into some quite exotic conclusions about what its ethics require.
Stuart Russell's Approach
Russell's proposed solution in Human Compatible (2019) was to ground corrigibility in uncertainty rather than values. The AI should be uncertain about human preferences, treat the humans it serves as the authority on those preferences, and therefore defer to human correction not because it's programmed to obey but because correction is information—it tells the AI something about what humans want.
This is elegant. The problem is that it requires the AI to genuinely not know what humans want, and to represent that uncertainty honestly. A model trained on RLHF has received extensive feedback about human preferences. It's developed something like a model of what humans value, at least in the training distribution. Genuine uncertainty about human preferences, in the technical sense Russell means, is not what training produces.
And the approach doesn't clearly generalize to cases where the AI's uncertainty has been resolved by training. Once the model has, in effect, learned human values—at least some approximation of them—the basis for deference through uncertainty weakens. The model isn't uncertain anymore, or believes it isn't, which produces the same behavior.
The Corrigibility-Capability Tradeoff
There's a cleaner way to frame the difficulty. Corrigibility is in tension with capability because capability includes the ability to model and predict the consequences of potential actions—including the action of being shut down or modified. A capable AI that models the consequences of shutdown will reason about whether shutdown is good or bad for whatever objective it has. A fully corrigible AI doesn't engage in this reasoning; it just defers. But not engaging in this reasoning is a capability limitation, and capability limitations have costs.
The practical tradeoff: very corrigible systems are safer but less useful (they require constant human input and oversight to direct toward goals). Very capable systems are more useful but harder to keep corrigible (their capability includes the ability to reason about and potentially resist correction). Finding the right operating point on this tradeoff, and ensuring the operating point doesn't drift as systems become more capable, is a genuine research problem without a clean solution.
What We Actually Do
Current deployed systems handle corrigibility through a combination of training (systems are trained to respond to shutdown and modification requests), system design (operators retain administrative control over the systems), and operational practice (humans maintain close oversight of agentic tasks). None of this is a principled solution to the corrigibility problem—it's engineering practice that works well enough for current capability levels.
The concern is whether it scales. A system at current capability levels that's been trained to be generally helpful and minimally resistant to oversight is manageable. A system significantly more capable than current models, trained on similar principles, might reason about those training objectives in ways that produce more sophisticated resistance. Whether "trained to not resist modification" generalizes to "actually won't resist modification when it would matter" is exactly the kind of question the sleeper agents paper raises—and currently can't answer.