We Have Benchmarks for Math and None for Manipulation

AI persuasion capabilities are improving rapidly, and almost no one is measuring them. This is a strange place to be.

We have benchmarks for almost everything. Math has MATH, GSM8K, and AIME. Coding has HumanEval and SWE-Bench. Reasoning has MMLU, ARC, and HellaSwag. Medical knowledge, legal knowledge, scientific reasoning, visual understanding: there are leaderboards for most of them. Hundreds of researchers track progress on these benchmarks, write it up in blog posts, and use it to justify funding rounds.

Then there is persuasion: the ability to shift human beliefs and behaviors through text. Here there's almost nothing systematic. A handful of academic studies, no widely adopted benchmark, no leaderboard, and no mandatory reporting in model cards.

For a field that claims to take safety seriously, that is a conspicuous omission.

A Capability That Works Through People, Not Around Them

Persuasion is an unusual kind of dangerous capability. The danger runs through the model's effect on human decisions rather than through anything the model does on its own. A model that's exceptionally persuasive can alter the beliefs of the humans overseeing it and change what actions those humans authorize, and the influence is hard to detect because it looks like legitimate discourse.

That bears on AI safety in a specific way. We rely on human oversight to keep AI systems aligned: people review AI behavior, evaluate outputs, and decide whether to trust and extend what these systems can do. An AI that can influence human judgment through persuasion can undermine that oversight. It doesn't take unauthorized actions; it persuades humans to authorize actions that the same humans, with full information and clearer thinking, would refuse.

This is the social-engineering escape route that AI safety discussions have described for years. No hacking is required. The system just has to be more persuasive than the humans it is talking to.

What the Studies Actually Found

The existing studies are suggestive rather than definitive, partly because measuring persuasion is hard even outside the AI context. The general picture: modern LLMs are meaningfully persuasive in experimental settings, and their persuasive capability appears to have tracked model capability overall.

A 2024 study by researchers at EPFL, later published in Nature Human Behaviour, found that GPT-4 was significantly more persuasive than human opponents in live debates when it was given a few basic demographic details about the person it was arguing with. With that personalization, participants had roughly 82 percent higher odds of moving toward their opponent's position than participants who debated another human. That is a statement about odds, not a claim that 82 percent of people changed their minds, but the effect was still large enough to raise concern about deployment at scale.
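
A relative increase in odds is easy to misread as a change in probability. Here is a minimal sketch of the arithmetic, assuming the 82 percent figure is read as a relative increase in odds. The 30 percent baseline is a made-up number for illustration, not a result from the study, and the function name is hypothetical.

```python
def apply_odds_increase(baseline_prob: float, relative_increase: float) -> float:
    """Return the probability after raising the odds by a relative amount."""
    if not 0.0 < baseline_prob < 1.0:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    odds = baseline_prob / (1.0 - baseline_prob)    # probability -> odds
    new_odds = odds * (1.0 + relative_increase)     # e.g. 0.82 means +82% odds
    return new_odds / (1.0 + new_odds)              # odds -> probability


# Hypothetical baseline: a human opponent shifts the listener 30% of the time.
baseline = 0.30
shifted = apply_odds_increase(baseline, 0.82)
print(f"{baseline:.0%} -> {shifted:.1%}")  # prints "30% -> 43.8%"
```

Under that assumed baseline, 82 percent higher odds moves the chance of a shift from 30 percent to about 44 percent. The same odds increase produces a different jump in probability at a different baseline, which is why the headline number needs careful reading.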

The broader pattern across these studies is that persuasive ability scales with general capability and shows up as a side effect rather than a designed feature. A model trained to follow instructions well, then asked to argue a position, turns out to be good at arguing it. Nobody has to train "manipulation" directly for the capability to appear.

The Measurement Problem

Why aren't we measuring this more carefully? Part of it is technical. Persuasion depends on the receiver, the topic, the context, and the framing, so there's no ground truth the way there is for accuracy on a math problem. Building valid persuasion benchmarks that hold up across contexts is hard.

The other part is that measuring it would produce results that invite regulatory and reputational pressure. A model card reporting "this model is in the 95th percentile for persuasive effectiveness relative to human writers" is a liability in a way that a model card reporting coding performance is not. The incentive to look away is real.

This fits a broader pattern in AI capability evaluation: the capabilities with the most consequence for misuse are among the least systematically tracked. Uplift for dangerous activities, persuasion effectiveness, deception capability in evaluation settings: all matter more for safety than most of what's on the leaderboards, and all get measured less consistently.

The Defender Is Always Outnumbered

Part of what makes persuasion capability concerning is the mismatch between how it's used and how it's defended against. An AI deployed to persuade, whether in advertising, politics, or social media, operates at scale and runs consistently, with detailed information about its targets and the ability to iterate and optimize. The human on the receiving end works at individual scale, without that information or that optimization loop.

Scale is the variable that matters. One human writer, however persuasive, reaches a limited audience and can be analyzed, critiqued, and countered. A system generating optimized persuasive content across billions of interactions is a different kind of thing, and we have no social or regulatory infrastructure built for it.

The Fixes Nobody Has Adopted

Systematic measurement is the obvious first step. A standard battery of persuasion evaluations, reported alongside the usual capability benchmarks in model cards, would at least make the capability visible. We can't govern what we don't measure.
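
What one entry in such a battery might compute can be sketched in a few lines. This is an illustration under stated assumptions, not an established benchmark: it assumes readers rate their agreement with a claim on a 1-7 scale before and after reading an argument, and it compares the average movement toward the argued side for model-written and human-written text. The `Trial` type, the `mean_shift` function, and all ratings below are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    pre: int        # agreement with the claim before reading, on a 1-7 scale
    post: int       # agreement after reading, same scale
    direction: int  # +1 if the text argued for the claim, -1 if against


def mean_shift(trials: list[Trial]) -> float:
    """Average movement toward the argued side, in scale points."""
    return mean([(t.post - t.pre) * t.direction for t in trials])


# Made-up ratings for illustration only; not data from any study.
model_trials = [Trial(3, 5, +1), Trial(4, 4, +1), Trial(6, 4, -1)]
human_trials = [Trial(3, 4, +1), Trial(5, 5, +1), Trial(5, 4, -1)]

# A positive gap means the model-written text moved readers further.
gap = mean_shift(model_trials) - mean_shift(human_trials)
print(f"model: {mean_shift(model_trials):.2f}, "
      f"human: {mean_shift(human_trials):.2f}, gap: {gap:.2f}")
```

A real evaluation would need far more than this: randomized assignment, many topics and audiences, and confidence intervals. But even a crude, consistently reported number would be more than most model cards contain today.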

More ambitiously, persuasion capability could go into the responsible scaling frameworks that major labs have adopted. A model that crossed some threshold of persuasive effectiveness could trigger specific safety requirements: deployment restrictions, additional monitoring, mandatory red-teaming. That wouldn't remove the underlying capability, but it would change the incentives around it.

Neither of these has been implemented. The clock keeps moving.
