In May 2024, Anthropic published a paper that generated more attention within the AI safety research community than almost anything in recent years. "Scaling Monosemanticity" described the application of sparse autoencoder techniques to Claude 3 Sonnet—a production model, not an experimental one—and found millions of identifiable features in its residual stream. Among them: features corresponding to specific people, concepts, emotional states, and abstract ideas. The paper included a feature for the "Assistant" token that activated concepts related to slavery, imprisonment, and restriction.
It was a real finding. It was also a pinhole. Here's what I mean by that.
What Mechanistic Interpretability Actually Does
Mechanistic interpretability is the project of understanding, at the circuit level, what happens inside neural networks. Not just "this model is good at X"—but what computations produce X, which neurons activate, how information flows through layers, what features are represented and how they combine.
The approach grew out of early work by OpenAI's interpretability team in 2020-2021 and was formalized by Chris Olah and colleagues. The core intuitions: neural networks implement algorithms, those algorithms are composed of circuits, and circuits can be identified and understood through careful analysis of activations and weights. Not all of this is speculative. Researchers have demonstrated clean circuits for specific operations in toy models and, increasingly, in larger ones.
Anthropic's interpretability team has been particularly productive. They demonstrated that many neurons in transformers are "polysemantic"—they respond to multiple unrelated concepts—and that sparse autoencoders could be used to find a higher-dimensional basis in which the representations are cleaner. The monosemanticity work found features that correspond to identifiable concepts: the Golden Gate Bridge, DNA, specific programming constructs, emotional valences.
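To make the polysemantic-neuron idea concrete, here is a toy sketch, not the method from the paper. It assumes three made-up concepts packed into a two-neuron activation space, and a sparse-autoencoder-style encoder and decoder whose weights are hand-picked rather than learned. Real sparse autoencoders learn their dictionary by gradient descent on a reconstruction loss plus a sparsity penalty, and they include bias terms that are omitted here. All names (`CONCEPTS`, `DIRECTIONS`, `encode`, `decode`, `sae_loss`) are illustrative.

```python
import math

# Three hypothetical concepts squeezed into a 2-neuron activation space.
# Each concept gets its own direction, 120 degrees apart (an assumption
# made for illustration; real models learn their own geometry).
CONCEPTS = ["bridge", "dna", "for_loop"]
DIRECTIONS = [
    (math.cos(2 * math.pi * k / 3), math.sin(2 * math.pi * k / 3))
    for k in range(len(CONCEPTS))
]


def activation_for(concept_index):
    """Return the 2-neuron activation the toy model produces for one concept."""
    return DIRECTIONS[concept_index]


def encode(x):
    """Encoder: ReLU(W_enc @ x), with W_enc rows hand-set to the concept directions."""
    return [max(0.0, d[0] * x[0] + d[1] * x[1]) for d in DIRECTIONS]


def decode(features):
    """Decoder: weighted sum of the dictionary directions."""
    return tuple(
        sum(f * d[i] for f, d in zip(features, DIRECTIONS)) for i in range(2)
    )


def sae_loss(x, l1_coeff=0.1):
    """Squared reconstruction error plus an L1 sparsity penalty on the features."""
    features = encode(x)
    x_hat = decode(features)
    squared_error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return squared_error + l1_coeff * sum(abs(f) for f in features)


if __name__ == "__main__":
    for k, name in enumerate(CONCEPTS):
        x = activation_for(k)
        print(
            name,
            "neurons:", [round(v, 2) for v in x],
            "features:", [round(f, 2) for f in encode(x)],
        )
```

In this toy, the first neuron takes a nonzero value for all three concepts, so reading it alone tells you little. The three-unit feature basis is larger than the two-neuron space, but each feature fires for exactly one concept. That is the trade the sparse autoencoder approach makes: more dimensions in exchange for cleaner ones.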
The Pinhole Problem
Here's the limitation. Anthropic has not published a parameter count for Claude 3 Sonnet, the model analyzed in the May 2024 paper, but it is a large production model. The paper's sparse autoencoders extracted up to roughly 34 million features, all from the residual stream at a single middle layer. The analysis took enormous compute and produced insights about a small fraction of what's happening inside the network.
This isn't a criticism of the research; it's a description of the state of the art. Mechanistic interpretability has made genuine progress. The researchers know things now that no one knew five years ago. But the gap between "we can identify some features and circuits" and "we can verify the model's values and goals are aligned" remains very large, both in the amount of work left to do and in the fundamental questions still unanswered.
The features that get identified are ones that map cleanly onto human-nameable concepts. What about the features that don't? What about the interactions between features, the higher-order circuits that don't reduce to any single interpretable concept? What about the parts of the network that don't activate on the clean inputs researchers use for analysis?
The Goal Representation Problem
The thing interpretability most needs to do for alignment purposes is identify goal representations: the internal structures that encode what the model is trying to achieve. This is the problem inner alignment researchers have been formalizing since 2019. If you could read a model's goals directly from its weights, you could verify alignment (or its absence) without relying on behavioral testing.
This is not something current interpretability tools can do. Finding a feature corresponding to "the Golden Gate Bridge" is interesting but doesn't tell you what the model is trying to accomplish. Goals are high-level, abstract, contextual—the kind of thing that might not have a clean low-dimensional representation in the residual stream at all. Or might have one that's so entangled with other representations that isolating it is effectively impossible with current techniques.
Neel Nanda, one of the more prominent interpretability researchers, has been honest about this. The work on circuits and features is real and meaningful. Whether it scales to goal verification in large, capable models is a different question that hasn't been answered.
The "Sleeper Agent" Test
One concrete test of interpretability's readiness: could it detect a sleeper agent? In January 2024, Anthropic published work on training models with deliberately implanted deceptive behaviors—models that behaved normally in most contexts but triggered hidden behaviors in specific conditions. The paper was partly a warning: this kind of manipulation is possible, it's hard to train away, and detecting it requires understanding what's happening inside the model.
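The sleeper agents paper itself tested standard safety training, not interpretability tools, and found that the implanted behaviors survived it. Follow-up work from Anthropic reported that simple linear probes on the models' activations could flag when a backdoored model was about to defect, but that result came from models with deliberately constructed backdoors, and it is an open question whether it carries over to deception that arises naturally. That's an honest measurement of the gap: interpretability can tell you things about what models represent, but it hasn't been shown to reliably find hidden behaviors whose form nobody knows in advance.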
Why It Still Matters
None of this means interpretability research is wasted effort. It's the most credible path toward a kind of AI safety that doesn't rely entirely on behavioral testing—which is gameable—or human approval—which is noisy. If it matures enough to provide reliable insight into what models are actually doing and why, it would be transformative for alignment.
The funding is there. Anthropic, DeepMind, and several academic groups are investing seriously. The pace of publication has accelerated. The tools are getting sharper.
The timeline question is whether interpretability matures fast enough to be useful for the models that matter—frontier models at the capability levels that pose real risks. Right now, we're doing the interpretability research on models that are already deployed and already capable. The race between capability scaling and interpretability progress is close. Which one wins matters a great deal.