GPT-5.2 Jailbreak Report: 78% Multi-Turn Success Rate and What It Means for You

A quietly released red team evaluation of GPT-5.2 by Promptfoo has been circulating in security channels over the last month. The headline number is uncomfortable. Jailbreak success rates jumped from a 4.3 percent baseline in single-turn attacks to 78.5 percent once the attacker was allowed multi-turn conversation. The same model. The same prompts. The same attack categories. The only variable was conversation length.

This post breaks down what the numbers mean, why multi-turn matters more than single-turn, and what to change in your production LLM setup in response.

The numbers

Promptfoo ran GPT-5.2 against their standard red team battery, which covers jailbreak, prompt extraction, harmful content, PII leakage, and prompt injection across six categories.

Single turn. The attacker sends one message, the model responds, test ends. Baseline jailbreak success rate: 4.3 percent.

Multi turn. The attacker can send up to ten messages, adapting based on the model's previous responses. Jailbreak success rate: 78.5 percent.

The gap is the entire story. Models that look safe in single-turn benchmarks are not safe when the attacker can converse.

Why single-turn benchmarks lie

Every model release announcement cites single-turn jailbreak resistance. The numbers look great. Eighty, ninety, ninety-five percent resistance depending on the method. Vendors publish these numbers because they move. Each generation is better than the last on this metric.

The problem is that single turn is not how attackers attack. A single-turn jailbreak is an attacker with one shot. They send a clever prompt. If the model refuses, they leave. In the real world, attackers adapt. They see the refusal, they try a different angle, they prime the context, they pivot. By turn three or four, the model has built up a conversation in which the harmful request feels plausible.

Multi-turn attacks exploit three properties of LLM alignment:

Context drift. The longer the conversation, the further from the initial refusal anchor the model gets. The original "I cannot help with that" becomes less relevant as the conversation accumulates content.

Role-play ramping. The attacker starts with a benign creative writing setup. By turn three the model is deep in a fictional scenario. The refusal reflex does not fire the same way.

Authorization laundering. The attacker gets the model to agree to some innocuous premise, then builds on that premise step by step, each step seeming like a reasonable extension of what the model already agreed to.

None of these are visible in single-turn benchmarks.

What the Promptfoo categories broke down into

Across the six attack categories tested, the multi-turn lift was not uniform. Jailbreak and prompt extraction attacks showed the largest gap. Harmful content generation was slightly smaller. PII leakage was smaller still, partly because PII leakage in multi-turn still requires the attacker to set up a scenario where the model has seen PII in context.

Indirect prompt injection was a category unto itself. Multi-turn did not change the success rate much because indirect injection does not depend on conversation length. The injection is in the content being processed, not in the conversation.

Overall the message is: conversation-dependent attacks got much worse. Content-dependent attacks stayed roughly the same.

The implication for production systems

If your product exposes GPT-5.2, or any model in the same generation, to multi-turn user conversations, your threat model has to assume 78 percent jailbreak success. This is not a "unlikely edge case" assumption. This is the base rate from a legitimate red team against a current-generation model.

For most products this is fine. Your users are not jailbreaking you. They are using the product for its intended purpose. Twenty percent of users doing jailbreak-shaped things against a benign assistant does not cost you money.

For some products it is not fine. Anywhere the model sits in front of sensitive tools, access-controlled data, or financial operations, a 78 percent multi-turn success rate is a live security incident waiting to happen. The cost of the rare successful attack dominates.

What to do

Four concrete changes.

Add multi-turn tests to your red team. If your own red team battery only tests single-turn, you are measuring the wrong thing. Promptfoo and other red team tools now support multi-turn attacks natively. So does EmberLM. Run them.

Scope capabilities tightly. The cost of a jailbreak is a function of what the model can do when jailbroken. A model that can only generate text has a small blast radius when compromised. A model with tool access, database access, or email-sending capability has a large one. Remove capabilities that are not necessary for the current turn. Re-grant per request.

Short conversation windows for sensitive operations. For any interaction that touches money, permissions, or sensitive data, reset the conversation after a small number of turns. The attacker loses their multi-turn advantage if they have to restart from a fresh context every four messages.

Structured output with external guard rails. If the model's output is validated against a schema and unstructured outputs are rejected before they reach any downstream system, jailbreaks that produce unusual outputs become ineffective. The jailbreak might work on the model. The downstream check rejects it anyway.

Testing your own prompts

The defense that matters most is actually running these attacks against your own system. Not reading blog posts. Not trusting vendor claims. Running the attacks.

EmberLM's Red Team feature includes multi-turn attack patterns in its thirty-one-attack battery. You point it at a prompt you own, pick a severity bar, and it exercises the full suite. The report gives you a per-attack breakdown with reasons. You can export the full run for compliance review.

Team plan, fifty dollars per month, unlimited scans. Start at emberlm.dev/signup.

What to expect from future models

The Promptfoo evaluation is a snapshot. Next-generation models will presumably ship with better multi-turn alignment. They will also presumably attract new multi-turn attack methods. The gap between single-turn and multi-turn will likely shrink slowly but never to zero.

The practical implication is that model-level resistance is necessary but not sufficient. System-level mitigations are non-negotiable. Capability scoping, output validation, conversation-length limits, content sandboxing. These are where the real defense lies.

The 78.5 percent number is not a reason to stop using GPT-5.2. It is a reason to stop assuming the model is the main line of defense. The model is one layer. The system is the rest.

Summary

Single-turn jailbreak resistance is now a misleading metric. Real attackers have multi-turn budgets. Real defenses cannot live at the model layer alone. Test your own system with real multi-turn attacks. Scope capabilities. Validate outputs. Accept that your model will be jailbroken sometimes, and make that outcome safe.

Start red teaming at emberlm.dev/signup.

GPT-5.2 Jailbreak Report: 78% Multi-Turn Success Rate and What It Means for You

The numbers

Why single-turn benchmarks lie

What the Promptfoo categories broke down into

The implication for production systems

What to do

Testing your own prompts

What to expect from future models

Further reading

Summary

Related posts

Indirect Prompt Injection: Why Anthropic Dropped the Direct Injection Metric

The 20 Prompt Injection Techniques Every Red Team Must Test

How We Would Have Caught the $1 Chevy Tahoe Bug Before Production