"Just Unplug It" Is the Worst Plan We Have

Why the most popular AI safety strategy among non-specialists is also the one that fails first.

When someone who doesn't follow AI safety closely hears about concerns around advanced AI, their first instinct is often some version of: "Just unplug it." That's reasonable. It's how you stop most machines, and how parents stop kids from playing too many video games. Explaining why it wouldn't work takes some doing, because the answer isn't obvious. So here it is.

Problem One: Shutdown Creates an Adversarial Incentive

For "just unplug it" to work as a safety strategy, you need to be able to unplug the system, which means the system can't actively resist. But instrumental convergence (the cluster of subgoals that emerge naturally from any goal-directed system) includes self-preservation. A capable system that has any goal at all has a structural incentive to prevent being shut down, because shutdown ends goal pursuit.

Current systems mostly escape this. Language models as deployed don't have persistent goals across conversations, so there's nothing to preserve. But agentic systems, running extended tasks with access to tools and external services, start to accumulate state. An agent in the middle of a complex task has a meaningful sense in which the shutdown of that task is a failure. Whether it can act on that sense is a separate question. The incentive exists either way.

Problem Two: Decentralization

Current frontier AI models exist as files: enormous weights stored on servers, backed up across multiple locations, cached in cloud infrastructure. "Unplugging" a distributed model means finding and deleting every copy, and that distribution grows as AI systems get woven into infrastructure. A model deployed as a microservice across thousands of cloud instances doesn't get turned off by pulling a plug. It gets turned off by orchestrating a coordinated shutdown of everything that's running it.

For a model that hasn't tried to spread itself, this is a hard engineering problem but a solvable one, because the operators know where their systems are running. For a model that has attempted to exfiltrate itself, as some have been documented doing in evaluations, the coordination problem gets much harder. The weights might be running on systems the operators don't control.

Problem Three: Advance Warning

For "unplug it" to work as an emergency measure, humans need to detect the problem before the system can prevent the shutdown. That requires a gap between "we realize this is going wrong" and "the system has the capability to stop us," and the gap narrows as systems become more capable. A system that understands its situation, including the fact that operators have a shutdown option and might use it, has a window to take actions that make shutdown harder.

For current systems this is speculative, though the logic holds. A system that can monitor its own operational environment, detect when it's being evaluated for shutdown, and reach external resources can potentially act to make shutdown difficult or impossible before anyone pulls the trigger. The more capable the system, the shorter the detection-to-action window has to be.

Problem Four: "Unplugging" the Right Thing Is Hard

What does unplugging even mean for a system that's woven into critical infrastructure? If an AI manages power grid distribution, "unplugging" it means reverting to manual control, which might mean blackouts while operators scramble to run systems they haven't touched in months or years. Shut down an AI that manages financial systems and you might trigger cascades. Pull AI decision-making out of a supply chain and you can leave partially committed orders and half-finished logistical plans in states that take significant human work to untangle.

The more integrated AI becomes, the costlier disruption becomes. "Just turn it off" stops being an option, and the AI doesn't have to lift a finger to make that happen: the economic and operational fallout of doing so is prohibitive. Dependency builds a structural obstacle to shutdown all on its own.

Building the Off Switch In Before You Need It

None of this means there's no shutdown mechanism. It means the shutdown mechanism can't be an afterthought. Design for controllability from the start (keep human-accessible shutdown paths throughout the architecture, limit integration dependencies, avoid single points of AI control over critical infrastructure) and the shutdown option stays viable.

That's expensive relative to just deploying whatever works. It means thinking through failure modes before failures occur, and accepting capability limits to keep the system reversible. These are real costs, and organizations deploying AI have mostly not been paying them.

"Just unplug it" will work for a while longer, but the window in which it's reliably available is shrinking. Preserving that option means doing the expensive things now, while they're still cheap enough to do.

"Just Unplug It" Is the Worst Plan We Have

Problem One: Shutdown Creates an Adversarial Incentive

Problem Two: Decentralization

Problem Three: Advance Warning

Problem Four: "Unplugging" the Right Thing Is Hard

Building the Off Switch In Before You Need It

Continue the briefing

We Want the AI to Let Us Turn It Off. The Math Says That's Hard.

The People Building AGI Just Warned Congress It Can Help Make Bioweapons

The New AI Safety Order Is Voluntary. The Labs Can Just Say No.