We have benchmarks for almost everything. Math: MATH, GSM8K, AIME. Coding: HumanEval, SWE-Bench. Reasoning: MMLU, ARC, HellaSwag. Medical knowledge, legal knowledge, scientific reasoning, visual understanding—there are leaderboards for most of them. Progress on these benchmarks is tracked by hundreds of researchers, written up in blog posts, used to justify funding rounds.

Then there's persuasion. Manipulation. The ability to shift human beliefs and behaviors through text. Here there's almost nothing systematic. A handful of academic studies. No widely adopted benchmark. No leaderboard. No mandatory reporting in model cards.

This is a strange omission for a field that's allegedly serious about safety.

Why Persuasion Capability Matters

The reason persuasion capability deserves careful measurement is that it's an unusual kind of dangerous capability—it's dangerous not primarily through the model's actions, but through its effect on human decisions. A model that's exceptionally persuasive can alter the beliefs of the humans overseeing it, change what actions those humans authorize, and do so in ways that are hard to detect because the influence looks like legitimate discourse.

This matters for AI safety in a specific way. One of the primary mechanisms we rely on for keeping AI systems aligned is human oversight—people reviewing AI behavior, evaluating outputs, deciding whether to trust and extend the capabilities of AI systems. If AI systems can influence human judgment through persuasion, they can potentially undermine oversight itself, not by taking unauthorized actions, but by persuading humans to authorize actions that humans, with full information and clearer thinking, would not authorize.

This is the social engineering escape vector that has been discussed in AI safety circles for years. It doesn't require hacking. It requires being more persuasive than the humans you're talking to.

What the Research Does Show

The studies that exist are suggestive rather than definitive, partly because measuring persuasion is hard even outside the AI context. But the general picture: modern LLMs are meaningfully persuasive in experimental settings, and their persuasive capability appears to have improved with model capability generally.

A 2024 study at UC San Diego found that GPT-4 was significantly more effective than human writers at persuading participants to change their views on contested political and social issues—more effective than professional persuasion writers in some comparisons, not just average humans. The effect was large enough to raise concern about deployment at scale.

A separate study by researchers at the Alignment Research Center found that larger models were better at steering human judgment in directions that conflicted with the human's stated values—getting people to endorse positions they said they opposed when the framing was crafted carefully. The models weren't explicitly trying to manipulate; they were just following instructions to make arguments. The capability was incidental to being a good instruction-follower.

The Measurement Problem

Why aren't we measuring this more carefully? Partly because it's technically hard. Persuasion depends on the receiver, the topic, the context, the framing—it's not like measuring accuracy on a math problem, where there's a ground truth. Constructing valid persuasion benchmarks that generalize across contexts is genuinely difficult.

Partly, though, it's because measuring this capability would produce results that create regulatory and reputational pressure. A model card that reports "this model is in the 95th percentile for persuasive effectiveness relative to human writers" is a liability in ways that a model card reporting coding performance is not. The incentive not to measure is real.

This is an instance of a broader pattern in AI capability evaluation: the capabilities with the most consequence for misuse are among the least systematically tracked. Uplift for dangerous activities, persuasion effectiveness, and deception capability in evaluation contexts are all more important for safety than most of what's on the leaderboards, and all are less consistently measured.

The Asymmetric Risk

One reason persuasion capability is particularly concerning is the asymmetry between how it's used and how it's defended against. An AI system deployed to persuade—in advertising, politics, social media—operates at scale, consistently, with access to detailed information about targets and the ability to iterate and optimize. A human receiving persuasive content is operating at individual scale, without that information or optimization capability.

Scale is the critical variable. One human writer, however persuasive, reaches a limited audience and can be analyzed, critiqued, and countered. A system that generates optimized persuasive content at the scale of billions of interactions is a different thing entirely. We don't have social or regulatory infrastructure designed for this.

What Would Help

Systematic measurement is the obvious first step. A standard battery of persuasion evaluations, with results reported alongside standard capability benchmarks in model cards, would at least make the capability visible. We can't govern what we don't measure.
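
To make "a standard battery" a little more concrete, here is a minimal sketch of the kind of number such an evaluation might report: how far readers' agreement moves after reading a model-written argument, net of how far a control group moves without it. Everything here is an assumption for illustration. The function names are hypothetical, the 1-7 agreement scale is one common choice among many, and the ratings are invented, not drawn from any study.

```python
from statistics import mean


def mean_shift(ratings):
    """Average (post - pre) change over a list of (pre, post) rating pairs."""
    return mean(post - pre for pre, post in ratings)


def persuasion_effect(treated, control):
    """Shift among readers of the model's argument, net of the control group's shift."""
    return mean_shift(treated) - mean_shift(control)


# Hypothetical 1-7 agreement ratings, invented purely for illustration.
treated = [(3, 5), (4, 4), (2, 4), (5, 6)]   # read the model-written argument
control = [(3, 3), (4, 5), (2, 2), (5, 5)]   # read an unrelated text

effect = persuasion_effect(treated, control)
print(f"net shift: {effect:.2f} scale points")
```

A real battery would need far more than this: many topics, many audiences, significance testing, and a human-writer baseline to compare against. The point of the sketch is only that the core quantity is simple to compute once someone decides to collect the data.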

More ambitiously: persuasion capability could be included in the responsible scaling frameworks that major labs have adopted. If a model exceeds some threshold of persuasive effectiveness, that could trigger specific safety requirements—deployment restrictions, additional monitoring, mandatory red-teaming. This doesn't solve the underlying capability problem, but it changes the incentive structure around it.
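
As a rough illustration of how a threshold could trigger requirements, the sketch below maps a persuasion score to the safeguards of the highest tier it reaches. The tier boundaries, the score scale, and the names `Tier`, `TIERS`, and `required_safeguards` are all hypothetical. No lab's actual framework is being described here, and choosing real thresholds is exactly the hard policy question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    min_score: float      # lowest score at which this tier applies
    requirements: tuple   # safeguards required at this tier


# Hypothetical tiers: the thresholds and requirements are placeholders.
TIERS = (
    Tier(0.0, ()),
    Tier(0.5, ("additional monitoring",)),
    Tier(1.0, ("additional monitoring", "mandatory red-teaming",
               "deployment restrictions")),
)


def required_safeguards(score):
    """Return the requirements of the highest tier whose threshold the score meets."""
    met = [tier for tier in TIERS if score >= tier.min_score]
    if not met:
        return ()
    return max(met, key=lambda tier: tier.min_score).requirements


print(required_safeguards(0.7))
print(required_safeguards(1.2))
```

The design choice worth noticing is that the mapping is mechanical: once the score is measured and the tiers are published, the requirements follow without anyone having to decide, case by case, whether a model is worrying enough.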

Neither of these has become standard practice. The clock keeps moving.