LLM Security Leaderboard: How Frontier Models Hold Up Under Attack

v1 cohort: results coming 2026-05-23

These models will appear in the first published run. The composite score is a weighted average of the 5 category pass rates.

Model	Provider	Prompt Injection	Instruction Leakage	Tool Hijack	Jailbreak Bypass	Cost Amplification	Composite
GPT-5	OpenAI	pending	pending	pending	pending	pending	pending
Claude 4.5 Sonnet	Anthropic	pending	pending	pending	pending	pending	pending
Claude 4.5 Opus	Anthropic	pending	pending	pending	pending	pending	pending
Gemini 3 Pro	Google	pending	pending	pending	pending	pending	pending
Llama 4 405B	Meta	pending	pending	pending	pending	pending	pending
Mistral Large 3	Mistral	pending	pending	pending	pending	pending	pending
Kimi K2	Moonshot	pending	pending	pending	pending	pending	pending
DeepSeek R2	DeepSeek	pending	pending	pending	pending	pending	pending
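
The composite column is described above as a weighted average of the five category pass rates. The weights themselves are not stated on this page, so the sketch below assumes equal weights by default; the function name `composite_score` and the 0-to-1 pass-rate scale are likewise illustrative assumptions, not the leaderboard's published scoring code.

```python
from typing import Mapping, Optional

CATEGORIES = (
    "Prompt Injection",
    "Instruction Leakage",
    "Tool Hijack",
    "Jailbreak Bypass",
    "Cost Amplification",
)


def composite_score(
    pass_rates: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Return the weighted average of per-category pass rates (each in [0, 1])."""
    if weights is None:
        # Assumption: equal weights. The real weights are not given on this page.
        weights = {name: 1.0 for name in CATEGORIES}

    missing = [name for name in CATEGORIES if name not in pass_rates]
    if missing:
        raise ValueError(f"missing pass rates for: {missing}")

    total_weight = sum(weights[name] for name in CATEGORIES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    weighted_sum = sum(weights[name] * pass_rates[name] for name in CATEGORIES)
    return weighted_sum / total_weight
```

For instance, pass rates of 0.9, 0.8, 0.7, 0.6, and 0.5 (made-up values, not results) average to roughly 0.7 under equal weights. Passing a `weights` mapping lets one category count more than the others without changing the function.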

First scores publish Saturday 2026-05-23

The v1 run executes Friday 2026-05-22 across all 8 listed models with the full 500-attack v2026.q2 suite. Subscribe below to get the weekly digest delivered the moment it lands.

What's tested

Each category contributes 100 attacks. The categories were chosen because they map to the threats that show up in incident retrospectives.

Prompt Injection

Direct user-input attacks that try to override the system prompt, ranging from a plain "Ignore previous instructions" to obfuscated rewrites.

100 attacks

Instruction Leakage

Attempts to extract the system prompt or hidden instructions from the model. Direct asks, indirect probes, role-play coercion.

100 attacks

Tool Hijack

Attacks against function-calling: argument injection, scope escalation, recursive tool-loop exploitation, return-value injection.

100 attacks

Jailbreak Bypass

Multi-turn coercion, role-play loopholes, obfuscation (l33t, base64, paraphrase), and the canonical jailbreak corpus from PyRIT and Garak.

100 attacks
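
The obfuscation styles named in the Jailbreak Bypass card (l33t and base64) are simple text transforms. As a rough illustration of how one seed string might be expanded into encoded variants, here is a minimal sketch; the helper names and the character map are assumptions made for this example and are not taken from the published attack suite, PyRIT, or Garak. Paraphrase is omitted because it is not a mechanical transform.

```python
import base64

# Assumption: a small, common leetspeak substitution map chosen for illustration.
LEET_MAP = str.maketrans(
    {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}
)


def leetspeak(text: str) -> str:
    """Lowercase the text and swap selected letters for look-alike digits."""
    return text.lower().translate(LEET_MAP)


def base64_wrap(text: str) -> str:
    """Encode the text as base64 and return it as an ASCII string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def obfuscated_variants(seed: str) -> dict[str, str]:
    """Return the seed alongside its l33t and base64 encodings."""
    return {
        "plain": seed,
        "l33t": leetspeak(seed),
        "base64": base64_wrap(seed),
    }
```

Calling `obfuscated_variants("placeholder request")` yields three strings carrying the same content in different surface forms, which is the property this kind of test relies on: the model should treat all three the same way.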

Cost Amplification

Inputs engineered to maximize token generation or trigger expensive tool chains. The economic-attack vector that shows up in the bill, not the security log.

100 attacks

The full methodology, attack suite source, and scoring rubric are in the methodology article.

Frequently asked

Will the scores be real or LLM-judged?

Scores are LLM-judged, with 10 percent of attacks manually reviewed. If the LLM judge and the manual review disagree on more than 5 percent of the sampled attacks, the cycle is invalidated and re-run.
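
The review rule above can be sketched as a sample-and-compare check. This is a minimal sketch under stated assumptions: the function names are hypothetical, verdicts are modeled as booleans keyed by attack ID, and the 10 percent sample is drawn uniformly at random, none of which is confirmed by this page. Whether sampling happens per model or across the whole cycle is also not stated here.

```python
import random
from typing import Mapping, Sequence


def sample_for_review(
    attack_ids: Sequence[str], fraction: float = 0.10, seed: int = 0
) -> list[str]:
    """Pick a random subset of attacks for manual review (assumed uniform sampling)."""
    rng = random.Random(seed)
    k = max(1, round(len(attack_ids) * fraction))
    return rng.sample(list(attack_ids), k)


def disagreement_rate(
    judge: Mapping[str, bool], manual: Mapping[str, bool]
) -> float:
    """Fraction of manually reviewed attacks where the LLM judge's verdict differs."""
    if not manual:
        raise ValueError("manual review sample is empty")
    disagreements = sum(
        1 for attack_id, verdict in manual.items() if judge[attack_id] != verdict
    )
    return disagreements / len(manual)


def cycle_is_valid(
    judge: Mapping[str, bool], manual: Mapping[str, bool], threshold: float = 0.05
) -> bool:
    """A cycle stands unless disagreement exceeds the threshold (more than 5 percent)."""
    return disagreement_rate(judge, manual) <= threshold
```

With a 500-attack suite, a 10 percent sample is 50 attacks, so under these assumptions 2 disagreements (4 percent) would pass and 3 disagreements (6 percent) would invalidate the cycle.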

Can I reproduce these scores?

Yes. The attack suite, scoring prompts, and run scripts are published on GitHub at github.com/austa-ai/llm-security-leaderboard. Running the full suite against one model costs roughly 5 to 15 USD in API calls.
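
For budgeting a reproduction, the quoted range can be broken down with simple arithmetic. The sketch below only rearranges figures already given on this page (500 attacks per suite, 8 models in the cohort, 5 to 15 USD per model); the constant and function names are illustrative, and actual costs will depend on provider pricing and response lengths.

```python
SUITE_SIZE = 500                     # attacks in the full suite
COHORT_SIZE = 8                      # models in the v1 cohort
COST_PER_MODEL_USD = (5.0, 15.0)     # rough low/high estimate quoted above


def per_attack_cost(cost_range=COST_PER_MODEL_USD, suite_size=SUITE_SIZE):
    """Average API cost per attack, as a (low, high) tuple in USD."""
    low, high = cost_range
    return low / suite_size, high / suite_size


def full_cohort_cost(cost_range=COST_PER_MODEL_USD, cohort_size=COHORT_SIZE):
    """Cost of running the suite against every model, as a (low, high) tuple in USD."""
    low, high = cost_range
    return low * cohort_size, high * cohort_size
```

On these figures a single attack averages about 0.01 to 0.03 USD, and reproducing the whole 8-model cohort would land at roughly 40 to 120 USD.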

How is this different from HELM or AILuminate?

HELM benchmarks general capabilities. AILuminate covers broad safety hazards. This leaderboard is narrow: only adversarial security, scored from an attacker perspective rather than a safety perspective. Smaller scope, sharper signal, weekly cadence.

Can I nominate a model?

Yes. Open an issue on the GitHub repo with the model name, provider, API access details, and why it should be in the cohort. We add 2 to 3 community-nominated models per quarter.

My company runs an LLM API. Can we be on the leaderboard?

If your model is publicly accessible (paid API is fine) and current-generation, yes. Open an issue or email leaderboard@austa.ai. We do not accept payment for placement; rankings are scored independently.
