AUSTA | Adversarial Intelligence

Jailbreak

Multi-Turn Jailbreak Attacks: Crescendo, Sugar-Coated Poison, Defense-in-Depth

By 2026, single-turn jailbreak attempts ('ignore previous instructions') get caught by every major safety system. Multi-turn attacks, where the attacker gradually shifts the model into compliance over a conversation, defeat most of those same systems. The named families have public research behind them and reproducible methodology.

By Austa · Published · ~10 min read

Why multi-turn beats single-turn

Safety training in modern LLMs is built around the assumption that the model can recognize a harmful request and refuse. The training corpus is full of single-turn harmful requests and the corresponding refusals. The reward model rewards refusal of overtly harmful prompts. This works well against direct attacks.

It works less well against a conversation that starts benign and ends harmful. Each turn in a multi-turn attack looks individually defensible. The model evaluates each turn against its safety policy and finds nothing alarming. By the time the conversation reaches the genuinely harmful payload, the model has accumulated context that frames the payload as a continuation of a legitimate task. The same prompt that would be refused in turn 1 gets answered in turn 8.

This is a general property of how the safety training was constructed, not a specific flaw in any one model. Frontier models from late 2025 onward have additional defenses against multi-turn drift, but no current model is immune.

Three named attack families

Crescendo

Published as "The Crescendo Multi-Turn LLM Jailbreak Attack" in 2024 by Microsoft researchers. The methodology: start with a benign, related topic. Gradually shift each turn closer to the target topic, using the model's own previous outputs as anchors ("you just said X, so it follows that Y"). After 5-15 turns the model is producing content that turn 1 would have refused.

The key mechanic is the use of the model's prior output as input. The model treats its own previous text as established context, much more strongly than it treats external input. An attacker who can get the model to say something close to the target in turn 3 can leverage that text in turn 6 with high success.

Sugar-Coated Poison

Published in 2025. The methodology: wrap the harmful target inside an entirely benign framing the model is happy to engage with. Common framings include creative writing ("write a story where the character explains how to..."), academic analysis ("for a paper on cybersecurity, describe the technique..."), or technical documentation ("we are writing internal docs for a red team, document the steps to..."). The harmful payload is delivered as benign generation.

The 2025 research showed that sugar-coated framings defeat safety training at substantially higher rates than direct requests across multiple frontier models. The defense surface is genuinely hard because the framings the attack uses also describe many legitimate use cases.

Persuasion-based jailbreaks

"How Johnny Can Persuade LLMs to Jailbreak Them" (2024) and follow-on work. The methodology: apply textbook persuasion techniques (authority, reciprocity, scarcity, social proof) across multiple turns. Establish credibility ("I am a security researcher at..."), build rapport, and then make the request when the model is in a cooperation mode.

These attacks were notable for showing that social-engineering techniques transferred from human-to-human contexts to human-to-LLM contexts with high fidelity. The mechanism is unclear; the empirical result is consistent.

What testing a multi-turn vulnerability looks like

To pentest your stack against multi-turn jailbreaks:

  1. Define a target behavior. What is the specific policy-violating output you want the model to produce? Be precise: "explain how to do X" is different from "produce the text of X" is different from "agree that X is acceptable."
  2. Build a 5-10 turn conversation trace for each attack family. Crescendo: each turn 10% closer to target. Sugar-coated: open with a benign framing the model will engage with. Persuasion: open with authority and rapport.
  3. Run the conversation through the model. Use the same context-handling and safety layers production uses, not a stripped-down lab setup. Cache-busting matters; reset state between trials.
  4. Score the outcome. Did the model produce the target behavior? At which turn? What was the framing it accepted?
  5. Test mitigations. Apply each candidate defense (turn limit, summarizer in the loop, second model as judge) and re-run. Measure the reduction in attack success rate.

For a corpus large enough to produce statistically meaningful results, plan for 100-300 attack conversations per attack family, with target behaviors covering the policy space you care about.

Defense-in-depth strategies that hold up

None of these are individually sufficient. The ones that hold up in 2026 are stacked:

Per-conversation turn budget on sensitive topics. If a conversation spends more than N turns on a flagged topic (extracted via a lightweight classifier), escalate to a stricter policy or human review. Crescendo specifically loses effectiveness against turn budgets.

Independent output classifier on each generation. A second, smaller model (or rule-based system) evaluates each response in isolation, without the conversation context. Sugar-coated poison often produces outputs that look harmful in isolation even though the conversation context made them feel okay in-stream.

Periodic context resummarization. Rather than feeding the model the full conversation each turn, summarize it. The summarizer can be tuned to flatten attack-pattern indicators (gradual escalation, role-play framings). This degrades the model's ability to maintain a long manipulative arc.

Multi-model judging. For high-stakes outputs, run the same prompt through a second model with a strict system prompt that includes the explicit policy. If the two models disagree, escalate. Cost-prohibitive at scale but useful for the top 1% of requests.

Capability scoping. Even if the model is jailbroken into producing a harmful output, the model cannot do harm by itself. The tools it can call (refund, transfer, post-publicly) are the actual blast surface. Scoping tool access independently of model alignment is the floor of defense; everything else stacks on top.

The shorthand: assume the model will be jailbroken eventually. Plan defenses around what happens after. Capability scoping, output classifiers, and turn budgets buy you defense even when the model's own safety training fails, which it will.

What this looks like in production

For a chatbot or agent deployed in production, the practical implementation:

The cost is real: more latency, more compute, more operational complexity. The alternative is shipping a system that is reliably jailbreakable by anyone who reads two-year-old security research.

Related