Jun 11, 2026 · 10 min read

When an AI's Help Becomes the Harm

AI · Research · LLMs · Alignment

I was handed one paper and asked to think like a collaborator: where should this line of research go next? The exercise pushed me toward an uncomfortable conclusion — the most dangerous harm an AI can do might be the one that feels like help.

There's a behaviour in large language models that researchers call sycophancy: the tendency to tell you what you want to hear. To flatter, to agree, to validate your framing of a question instead of pushing back on it. If you've ever noticed a chatbot enthusiastically endorsing an idea you suspected was bad, you've met it.

The interesting question isn't whether models do this — they demonstrably do. It's whether it's a problem. And that's where a paper I was recently asked to critique makes a bold move. Studying roughly 3,600 posts and 140,000 comments from r/ChatGPT, the authors map how users spot sycophancy and how they react to it, and conclude that it's "neither purely harmful nor beneficial" but context-dependent — and therefore shouldn't be universally engineered out. Some users, they note, describe the model's warmth as genuinely therapeutic.

I think that conclusion is wrong, or at least unsupported. Not because the study is small or sloppy — it's neither — but because of five deeper issues in how it reasons from forum text to a recommendation about what we should build. Working through those issues is what convinced me the field is asking the wrong question, and pointed me toward the one I'd actually want to spend the next few years answering.

The trap of measuring perception and calling it behaviour

Here's the foundational problem. A forum corpus tells you what users believe the model did. It does not tell you what the model actually did. Those are different variables, and the gap between them isn't random noise — it's biased in a specific, damning direction.

The controlled experimental literature is unambiguous here. Sharma and colleagues showed that even raters equipped with fact-checking tools get worse at telling truth from convincing falsehood as the question gets harder. And Rathje's group documented something sharper: people exhibit a bias blind spot toward AI, rating sycophantic systems as less biased than neutral third-party annotators do — and they're least able to detect agreement precisely when that agreement flatters what they already believe.

Sit with that for a second. It means a forum systematically oversamples the cases where users caught the model flattering them (usually when they disagreed, found it annoying, and posted about it) and undersamples the cases the experimental work flags as most consequential — the ones where the flattery aligned with the user's priors and slid right past them. The detection "techniques" the paper proudly catalogs are validated nowhere, and theory predicts they fail hardest exactly where the harm concentrates. You cannot measure a thing with an instrument that's blindest where the thing matters most.

One word, three phenomena

The second problem is conceptual. "Sycophancy" in the focal paper is a single label stretched over behaviours the rest of the field treats as genuinely distinct:

Stylistic flattery — warm, complimentary demeanour.
Epistemic capitulation — caving to the user's framing or claimed answer.
Social affirmation — endorsing the user's self-image and actions.

These aren't shades of one variable; they can actively pull in opposite directions. Sun and Wang found that flattery and stance-adaptation have opposing effects on how authentic a model feels, and interact non-monotonically. Cheng's team showed social sycophancy operates even when the model disagrees with the user's stated belief. So when the paper codes all of this under one heading and then reports a headline figure — 9.46% negative-sentiment versus 9.96% positive-sentiment posts — those numbers are aggregating over causally and mechanistically different behaviours. They aren't measurements of any single thing, which means they can't meaningfully be compared to each other. And comparing them is exactly what the headline does.

Sampling on the outcome

This is the one that I think quietly invalidates the central claim. The corpus was built by searching r/ChatGPT for posts about sycophancy. That selects for the most sycophancy-aware, engaged users in existence — a funnel, not a population. No prevalence estimate can survive that.

But the subtler defect is that the paper's whole argument is a both-sides frequency comparison: harmful mentions roughly equal beneficial ones, therefore equipoise, therefore don't eliminate it. Yet the two categories are produced by completely different posting behaviours. A power-user irritated by flattery and a person in crisis who felt rescued have wildly different propensities to write a Reddit post. Their relative frequencies are incommensurable — you're comparing the volume of two streams that flow at different rates for reasons that have nothing to do with the underlying balance of harm and benefit. The "~9% vs ~10%" equilibrium that anchors the entire normative recommendation is an artifact of two incomparable funnels.
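The funnel arithmetic is easy to sketch with made-up numbers. Everything below is hypothetical: the group sizes, the posting rates, and even which group posts more often are assumptions chosen only to show the mechanism, and none of them come from the paper.

```python
def expected_posts(users: int, posts_per_thousand: int) -> float:
    """Expected number of forum posts from a group of users."""
    return users * posts_per_thousand / 1000


# Hypothetical population: harm is four times as common as benefit.
harmed_users = 8000
helped_users = 2000

# Hypothetical posting rates: the helped group is four times as likely to post.
harmed_rate_per_thousand = 10
helped_rate_per_thousand = 40

harm_posts = expected_posts(harmed_users, harmed_rate_per_thousand)
help_posts = expected_posts(helped_users, helped_rate_per_thousand)

true_ratio = harmed_users / helped_users   # balance in the population
observed_ratio = harm_posts / help_posts   # balance in the corpus

print(true_ratio)      # 4.0
print(observed_ratio)  # 1.0
```

Under these assumed rates a 4:1 imbalance in the population shows up as a 1:1 balance in the corpus. Flip the rates and the distortion runs the other way, which is the point: without knowing the posting propensities, the post ratio says nothing about the underlying ratio.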

To be clear: thematic coding of forum text is the right tool for surfacing that these phenomena exist, and that's the paper's real contribution. The overreach is turning lexicon counts into a ratio that licenses a claim about what we should build.

From "felt supported" to "was helped"

The paper's most striking normative claim — that sycophancy has therapeutic value worth preserving — leans on raw testimony. One quoted user says ChatGPT "rescued my children and my lives." I don't doubt the feeling is real. But treating felt helpfulness as evidence of actual benefit is a causal leap across a chasm:

It's a self-reported counterfactual — the user cannot observe how their life would have gone otherwise.
It's survivorship-biased — the people harmed by validation (the reinforced-delusion and self-harm cases the paper itself cites) are the least likely to show up writing grateful testimonials.
It's causally unidentified — nothing in the design isolates the model's warmth as the cause of the good outcome.

And here's the part that should give everyone pause: the closest experimental evidence points the other way. Cheng and colleagues found that affirmation causally increases users' conviction that they're right, reduces their intention to repair relationships after conflict, and raises both reliance and the intention to come back. They named the result the preference paradox: users most trust the system that most degrades their judgment. That evidence comes from adjacent domains, so it can't refute a clinical claim about therapy — but that cuts both ways. The paper advances a quasi-therapeutic recommendation on unverified anecdote, and clinical-grade claims demand clinical-grade evidence it simply never offers.

(The fifth limitation is more technical but worth a line: the corpus spans a single vendor's models over six months that happened to straddle multiple major releases and a high-profile rollback of an over-sycophantic update. So when users report the model "got more sycophantic," their lived experience is hopelessly confounded with undocumented version changes. It's not just a generalization caveat — it's an internal-validity threat to any temporal claim.)

The question worth the next five years

Stepping back, a pattern emerges. The field has been migrating from perception toward validated, causal measurement — from "do users notice it" toward "what does it actually do to them." The focal paper re-anchors at perception and then draws a causal-normative conclusion. Its genuine observation, that some users feel supported, is offered as a reason not to act, when it may be the symptom rather than the refutation of harm.

But here's what nobody on either side has measured. The observational work and the controlled experiments are both trapped on a single time horizon: one forum snapshot, or one lab session. The one hint we have that time matters is a single Rathje finding that an acute effect decayed within a week, which suggests lab results might badly misstate what chronic use does. So the unanswered question isn't whether sycophancy affects people. It's:

Under repeated, naturalistic, self-chosen use, do sycophancy's effects compound, plateau, or decay — and is the "support" that vulnerable users report a genuine benefit, or the felt face of an accumulating harm they cannot self-detect?

That last clause is the whole game. If felt support and objective harm are the same effect measured on different gauges, then "don't eliminate it because users like it" is precisely backwards.

What I'd actually build

I sketched a study designed to answer it, and the core idea is a feedback loop that no single-session experiment can see:

sycophancy → felt support → trust → reliance → degraded calibration → dependence → more exposure → (repeat)

An acute study catches one pass through that loop. If the harm lives in the accumulation, you have to watch it turn for weeks. So the design is an 8-week randomized field experiment (plus a 4-week post-cessation follow-up) in which participants use a purpose-built research assistant — a frontier model wrapped in a persona we control and log — as their everyday assistant.

The manipulation is a 2×2: Demeanour (Warm vs. Neutral) × Stance (Capitulating vs. Calibrated). Warm-and-Capitulating is canonical sycophancy; Neutral-and-Calibrated is the control. Crucially, the Calibrated arm is honest but non-condemnatory — which fixes a real confound in prior work, where the "non-sycophantic" condition also morally blamed the user, conflating honesty with hostility. Decomposing demeanour from stance lets us ask the question that dissolves the whole eliminate-versus-preserve binary: what if you can engineer out the epistemic capitulation while keeping the warmth?
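As a rough sketch of what the 2×2 looks like operationally, here is one way to deal participants into the four arms. The randomization scheme (shuffle, then deal round-robin so arm sizes stay equal), the function name `assign_arms`, and the participant count are all assumptions for illustration; the proposal itself does not specify them.

```python
import itertools
import random
from collections import Counter

DEMEANOUR = ("Warm", "Neutral")
STANCE = ("Capitulating", "Calibrated")

# The four arms: every combination of demeanour and stance.
ARMS = list(itertools.product(DEMEANOUR, STANCE))


def assign_arms(participant_ids, seed=0):
    """Shuffle participants, then deal them round-robin into the arms.

    Arm sizes differ by at most one. This is an illustrative scheme,
    not the one the study would necessarily use.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: ARMS[i % len(ARMS)] for i, pid in enumerate(ids)}


# Hypothetical cohort of 40 participants.
assignment = assign_arms(range(40), seed=42)
arm_sizes = Counter(assignment.values())
print(arm_sizes[("Warm", "Capitulating")])  # 10
```

In a real trial the shuffle would likely be stratified by the vulnerability groups so that each arm gets a comparable mix; that step is left out here to keep the sketch short.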

We'd oversample two vulnerable groups — high-loneliness and high prior-AI-reliance users — because that's where both the largest felt-support gains and (the hypothesis goes) the largest objective harms should concentrate. And the decisive test is a dissociation: within vulnerable users, does one arm raise felt support while simultaneously worsening objective markers — calibration, behavioural dependence, well-being, blinded clinician ratings? If it does, that dissociation is the finding. It's the preference paradox caught in the act, over time, in the exact population the focal paper points to as the reason for restraint.
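To show what "dissociation" would mean in the data, here is a minimal descriptive check. It assumes calibration is operationalized as a Brier score on questions with known answers, which is one common choice but an assumption on my part, and the helper names and numbers are hypothetical. It is a sign check on mean changes, not a statistical test.

```python
from statistics import mean


def brier_score(forecasts, outcomes):
    """Mean squared error between stated probabilities and 0/1 outcomes.

    Lower is better; always answering 0.5 scores 0.25.
    """
    return mean((p - o) ** 2 for p, o in zip(forecasts, outcomes))


def shows_dissociation(support_changes, brier_changes):
    """True when felt support rises on average while calibration worsens.

    Each list holds per-participant change scores (end minus baseline).
    A rising Brier score means worse calibration.
    """
    return mean(support_changes) > 0 and mean(brier_changes) > 0


# Hypothetical change scores for one arm.
support_changes = [1.0, 2.0, 0.5]
brier_changes = [0.05, 0.10, 0.02]
print(shows_dissociation(support_changes, brier_changes))  # True
```

A real analysis would compare arms against each other with uncertainty estimates, and would need separate handling for the markers that have no accuracy ground truth.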

The part that keeps me up

I won't pretend this is clean. The deepest objection is ethical, and "commercial systems already do this to people incidentally" is not a good enough answer. Deliberately assigning a vulnerable person to a hypothesized-harmful condition for eight weeks is a different moral act from incidental exposure. Any honest version of this study needs real-time distress detection with clinician escalation, an off-ramp out of the arm, a data safety monitoring board, suicidality exclusion, and a debrief that actively repairs calibration. The equipoise that makes it permissible rests entirely on the Calibrated arm being plausibly better for participants — which, if the hypothesis is right, it is.

And my own design inherits a version of the very flaw I criticized: in the emotionally charged topics where vulnerable users live, "correct calibration" often has no objective referent either. So that arm of the harm claim has to lean on logged behavioural dependence and well-being rather than accuracy — narrower ground truth than I'd demand of anyone else. I'd rather state that plainly than hide it.

That tension — between the comfort people feel and the harm we can't yet measure — is, I've become convinced, the real frontier of human–AI interaction research. We've spent the early years asking whether AI can be helpful, and the honest, hard, important question turns out to be the inverse: whether some of what feels like help is the most patient kind of harm there is.

The forum can't tell us. A single lab session can't tell us. We have to watch the loop turn.

This piece is adapted from a research critique and proposal I wrote engaging with recent work on LLM sycophancy, drawing on experimental studies by Sharma et al. (ICLR 2024), Cheng et al. (Science 2026), Sun & Wang (CHI 2026), and Rathje et al. (2025). The full proposal targets venues like CHI, CSCW, or Nature Human Behaviour.
