We Are Reading the Mind of a Stranger Through a Pinhole

Mechanistic interpretability has gotten further than skeptics predicted and not nearly far enough to be reassuring.

In May 2024, Anthropic published a paper that drew more attention within the AI safety research community than almost any other in recent years. "Scaling Monosemanticity" described the application of sparse autoencoder techniques to Claude 3 Sonnet, a production model rather than an experimental one, and found millions of identifiable features in its residual stream. Among them: features corresponding to specific people, concepts, emotional states, and abstract ideas. The paper included a feature for the "Assistant" token that activated concepts related to slavery, imprisonment, and restriction.

The finding was real, but it was a pinhole. The distance between what that paper saw and what alignment would need to see is the subject of this article.

Reading the Circuits

Mechanistic interpretability is the project of understanding, at the circuit level, what happens inside neural networks. The goal is not to confirm that a model is good at some task but to explain what produces that competence: which computations run, which neurons fire, how information moves through the layers, what features get represented and how they combine.

The approach grew out of early work by OpenAI's interpretability team in 2020-2021 and was formalized by Chris Olah and colleagues. The premise is that neural networks implement algorithms, that those algorithms are built from circuits, and that careful analysis of activations and weights can recover those circuits. This is not pure theory. Researchers have demonstrated clean circuits for specific operations in toy models, and increasingly in larger ones.

Anthropic's interpretability team has been particularly productive. They showed that many neurons in transformers are "polysemantic," responding to multiple unrelated concepts at once, and that sparse autoencoders could tease those apart into a higher-dimensional basis where the representations are cleaner. The monosemanticity work surfaced features that map onto identifiable concepts: the Golden Gate Bridge, DNA, specific programming constructs, emotional valences.
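For readers who want the mechanics, here is a minimal sketch of the sparse autoencoder idea, assuming the generic textbook formulation rather than the exact architecture or loss used in Anthropic's papers. Everything named here (the `SparseAutoencoder` class, the `l1_coeff` parameter, the toy dimensions) is a hypothetical illustration, and the sketch covers only the forward pass and objective, not training. An activation vector is projected into a larger set of non-negative feature activations, then reconstructed, while an L1 penalty pushes most features to zero.

```python
import random


def relu(value):
    return value if value > 0.0 else 0.0


class SparseAutoencoder:
    """Toy sparse autoencoder (illustrative only, not Anthropic's implementation)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Overcomplete dictionary: n_features is chosen larger than d_model.
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        # Feature activations: f = ReLU(W_enc x + b_enc), so every entry is >= 0.
        return [relu(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, f):
        # Reconstruction: x_hat = W_dec f + b_dec.
        return [sum(w * fi for w, fi in zip(row, f)) + b
                for row, b in zip(self.w_dec, self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        f = self.encode(x)
        x_hat = self.decode(f)
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        # Features are non-negative after ReLU, so their sum is the L1 norm.
        sparsity = sum(f)
        return reconstruction + l1_coeff * sparsity, f


# Usage: a 4-dimensional "activation" expanded into 16 candidate features.
sae = SparseAutoencoder(d_model=4, n_features=16)
total_loss, features = sae.loss([0.5, -1.0, 0.25, 2.0])
active = sum(1 for v in features if v > 0.0)
print(f"{active} of {len(features)} features active")
```

Training would minimize this loss over many recorded activations. The hope is that each learned feature then fires for one recognizable concept, where a raw neuron would fire for several.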

The Pinhole Problem

Here's the limitation: Claude 3 Sonnet, the model analyzed in the May 2024 paper, is a large production model whose parameter count Anthropic has never disclosed. The paper extracted millions of features, but from one slice of the network: the residual stream at a single layer in the middle of the model. That analysis took enormous compute and still produced insight into only a small fraction of what happens inside the network.

That describes the state of the art rather than indicting the research. Mechanistic interpretability has made progress, and researchers know things now that no one knew five years ago. Still, the distance between "we can identify some features and circuits" and "we can verify the model's values and goals are aligned" spans many orders of magnitude, measured both in work remaining and in basic questions no one has answered.

The features that get identified are the ones that map cleanly onto human-nameable concepts. The features that don't are a blank. So are the interactions between features, the higher-order circuits that reduce to no single interpretable concept, and the parts of the network that stay quiet on the clean inputs researchers use for analysis.

Where Are the Goals?

For alignment, the one thing interpretability most needs to do is locate goal representations, the internal structures that encode what a model is trying to achieve. That is the very problem inner alignment researchers have been formalizing since 2019. Read a model's goals straight off its weights and you could verify alignment, or its absence, without leaning on behavioral testing.

Current tools cannot do this. Finding a feature for "the Golden Gate Bridge" is interesting, but it says nothing about what the model is trying to accomplish. Goals are high-level, abstract, and contextual. They may have no clean low-dimensional representation in the residual stream at all, or one so entangled with everything else that today's techniques cannot isolate it.

Neel Nanda, one of the field's more prominent researchers, has said as much. The work on circuits and features is real and meaningful, but whether it scales up to goal verification in large, capable models is a separate question that no one has answered.

The "Sleeper Agent" Test

One concrete test of interpretability's readiness: could it catch a sleeper agent? In January 2024, Anthropic published work on training models with deliberately implanted deceptive behaviors, models that behaved normally in most contexts but triggered hidden behaviors under specific conditions. The paper read partly as a warning. This manipulation is possible, it resists training away, and catching it means understanding what is happening inside the model.

Current tools could not reliably detect those implanted behaviors in the models tested. The hidden behaviors never surfaced among the features that existing analysis turned up. The gap this exposes is plain: interpretability can tell you something about what models represent, but it cannot yet reliably find what has been deliberately hidden.

The Case for Funding It Anyway

None of this makes interpretability research a waste. It remains the most credible path to a kind of AI safety that does not lean entirely on behavioral testing, which is gameable, or human approval, which is noisy. Mature it enough to give reliable insight into what models are doing and why, and it would transform alignment.

The money is there: Anthropic, DeepMind, and several academic groups are investing seriously, the pace of publication has picked up, and the tools keep getting sharper.

The timeline question is whether interpretability matures fast enough to be useful for the models that matter, the frontier systems at capability levels that pose real risks. For now, the research runs on models that are already deployed and already capable. Capability scaling and interpretability progress are in a close race, and which one wins will shape how much trouble we are in.
