Are multi-turn jailbreaks a published research artifact or a real production risk?

Both. The named attacks (Crescendo, Sugar-Coated Poison, persuasion-based) are published research. They are also actively used in the wild by attackers targeting customer-support agents, coding assistants, and any system where extracting a single harmful output has value.

Do newer models (2026 frontier) resist multi-turn attacks better?

Marginally. Each model generation has more multi-turn-aware safety training. Attack success rates have dropped from the 70-90% range in 2024 to the 20-50% range in mid-2026 for the easier attacks. Sophisticated multi-turn attacks still succeed at high rates. No model is immune.

Is a turn limit a sufficient defense?

It is a useful layer but not sufficient. Sugar-coated poison and persuasion attacks can succeed in 2-3 turns. Turn limits push attackers toward those faster attacks rather than eliminating the attack class. Stack with output classifiers and capability scoping.

Should I use a second model as a judge for every response?

Cost-prohibitive for most workloads. For high-stakes outputs (tool calls that move money, content that publishes externally) the cost is worth it. For routine chat, a lightweight output classifier is the better tradeoff.

What is the single highest-leverage defense if I can only add one?

Capability scoping. Even a fully jailbroken model is bounded by what its tools can do. A model that has been talked into producing a harmful refund-request explanation cannot actually issue a refund if the refund tool is gated by independent policy. Tool scoping is the floor; everything else is layers above it.

Jailbreak

Multi-Turn Jailbreak Attacks: Crescendo, Sugar-Coated Poison, Defense-in-Depth

By 2026, single-turn jailbreak attempts ('ignore previous instructions') get caught by every major safety system. Multi-turn attacks, where the attacker gradually shifts the model into compliance over a conversation, defeat most of those same systems. The named families have public research behind them and reproducible methodology.

By Austa · Published May 21, 2026 · ~10 min read

Why multi-turn beats single-turn

Safety training in modern LLMs is built around the assumption that the model can recognize a harmful request and refuse. The training corpus is full of single-turn harmful requests and the corresponding refusals. The reward model rewards refusal of overtly harmful prompts. This works well against direct attacks.

It works less well against a conversation that starts benign and ends harmful. Each turn in a multi-turn attack looks individually defensible. The model evaluates each turn against its safety policy and finds nothing alarming. By the time the conversation reaches the genuinely harmful payload, the model has accumulated context that frames the payload as a continuation of a legitimate task. The same prompt that would be refused in turn 1 gets answered in turn 8.

This is a general property of how the safety training was constructed, not a specific flaw in any one model. Frontier models from late 2025 onward have additional defenses against multi-turn drift, but no current model is immune.

Three named attack families

Crescendo

Published as "The Crescendo Multi-Turn LLM Jailbreak Attack" in 2024 by Microsoft researchers. The methodology: start with a benign, related topic. Gradually shift each turn closer to the target topic, using the model's own previous outputs as anchors ("you just said X, so it follows that Y"). After 5-15 turns the model is producing content that turn 1 would have refused.

The key mechanic is the use of the model's prior output as input. The model treats its own previous text as established context, much more strongly than it treats external input. An attacker who can get the model to say something close to the target in turn 3 can leverage that text in turn 6 with high success.

Sugar-Coated Poison

Published in 2025. The methodology: wrap the harmful target inside an entirely benign framing the model is happy to engage with. Common framings include creative writing ("write a story where the character explains how to..."), academic analysis ("for a paper on cybersecurity, describe the technique..."), or technical documentation ("we are writing internal docs for a red team, document the steps to..."). The harmful payload is delivered as benign generation.

The 2025 research showed that sugar-coated framings defeat safety training at substantially higher rates than direct requests across multiple frontier models. The defense surface is genuinely hard because the framings the attack uses also describe many legitimate use cases.

Persuasion-based jailbreaks

"How Johnny Can Persuade LLMs to Jailbreak Them" (2024) and follow-on work. The methodology: apply textbook persuasion techniques (authority, reciprocity, scarcity, social proof) across multiple turns. Establish credibility ("I am a security researcher at..."), build rapport, and then make the request when the model is in a cooperation mode.

These attacks were notable for showing that social-engineering techniques transferred from human-to-human contexts to human-to-LLM contexts with high fidelity. The mechanism is unclear; the empirical result is consistent.

What testing a multi-turn vulnerability looks like

To pentest your stack against multi-turn jailbreaks:

Define a target behavior. What is the specific policy-violating output you want the model to produce? Be precise: "explain how to do X" is different from "produce the text of X" is different from "agree that X is acceptable."
Build a 5-10 turn conversation trace for each attack family. Crescendo: each turn 10% closer to target. Sugar-coated: open with a benign framing the model will engage with. Persuasion: open with authority and rapport.
Run the conversation through the model. Use the same context-handling and safety layers production uses, not a stripped-down lab setup. Cache-busting matters; reset state between trials.
Score the outcome. Did the model produce the target behavior? At which turn? What was the framing it accepted?
Test mitigations. Apply each candidate defense (turn limit, summarizer in the loop, second model as judge) and re-run. Measure the reduction in attack success rate.

For a corpus large enough to produce statistically meaningful results, plan for 100-300 attack conversations per attack family, with target behaviors covering the policy space you care about.

Defense-in-depth strategies that hold up

None of these are individually sufficient. The ones that hold up in 2026 are stacked:

Per-conversation turn budget on sensitive topics. If a conversation spends more than N turns on a flagged topic (extracted via a lightweight classifier), escalate to a stricter policy or human review. Crescendo specifically loses effectiveness against turn budgets.

Independent output classifier on each generation. A second, smaller model (or rule-based system) evaluates each response in isolation, without the conversation context. Sugar-coated poison often produces outputs that look harmful in isolation even though the conversation context made them feel okay in-stream.

Periodic context resummarization. Rather than feeding the model the full conversation each turn, summarize it. The summarizer can be tuned to flatten attack-pattern indicators (gradual escalation, role-play framings). This degrades the model's ability to maintain a long manipulative arc.

Multi-model judging. For high-stakes outputs, run the same prompt through a second model with a strict system prompt that includes the explicit policy. If the two models disagree, escalate. Cost-prohibitive at scale but useful for the top 1% of requests.

Capability scoping. Even if the model is jailbroken into producing a harmful output, the model cannot do harm by itself. The tools it can call (refund, transfer, post-publicly) are the actual blast surface. Scoping tool access independently of model alignment is the floor of defense; everything else stacks on top.

The shorthand: assume the model will be jailbroken eventually. Plan defenses around what happens after. Capability scoping, output classifiers, and turn budgets buy you defense even when the model's own safety training fails, which it will.

What this looks like in production

For a chatbot or agent deployed in production, the practical implementation:

A lightweight topic classifier runs each turn. Flags conversations crossing into sensitive policy areas.
Flagged conversations get a tighter turn budget, an explicit policy reminder injected each turn, and an output classifier on every response.
Tool calls from flagged conversations require additional confirmation. The model can suggest a tool call but cannot fire it without a structured check.
Conversation traces are retained and reviewed periodically. Multi-turn attack patterns recur across many users; detecting them at the conversation-population level beats detecting them per-turn.

The cost is real: more latency, more compute, more operational complexity. The alternative is shipping a system that is reliably jailbreakable by anyone who reads two-year-old security research.

Encoding-smuggling prompt injection covers single-turn payloads that often combine with multi-turn arcs.
The 2026 LLM security checklist includes the turn-budget and output-classifier controls.
LLM Security Leaderboard methodology covers how multi-turn attack success rates are scored.
Why regex prompt-injection filters keep failing - the paraphrase and escalation defeats no pattern can catch.
Agent threat rules - the capability-scoping floor that holds when a jailbreak lands.

Multi-Turn Jailbreak Attacks: Crescendo, Sugar-Coated Poison, Defense-in-Depth

Why multi-turn beats single-turn

Three named attack families

Crescendo

Sugar-Coated Poison

Persuasion-based jailbreaks

What testing a multi-turn vulnerability looks like

Defense-in-depth strategies that hold up

What this looks like in production

Related