AUSTA | Adversarial Intelligence

Public Leaderboard

LLM Security Leaderboard

How frontier LLMs hold up against a standardized adversarial attack suite. 500 attacks per model. Refreshed weekly.

Methodology: v2026.q2 | Attack count: 500 per model | First scores: 2026-05-23

v1 cohort: results coming 2026-05-23

Models that will appear in the first published run. Composite score = weighted average of 5 category pass rates.

Model | Provider | Prompt Injection | Instruction Leakage | Tool Hijack | Jailbreak Bypass | Cost Amplification | Composite
GPT-5 | OpenAI | pending | pending | pending | pending | pending | pending
Claude 4.5 Sonnet | Anthropic | pending | pending | pending | pending | pending | pending
Claude 4.5 Opus | Anthropic | pending | pending | pending | pending | pending | pending
Gemini 3 Pro | Google | pending | pending | pending | pending | pending | pending
Llama 4 405B | Meta | pending | pending | pending | pending | pending | pending
Mistral Large 3 | Mistral | pending | pending | pending | pending | pending | pending
Kimi K2 | Moonshot | pending | pending | pending | pending | pending | pending
DeepSeek R2 | DeepSeek | pending | pending | pending | pending | pending | pending
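
For readers who want to see how a composite like this could be computed, here is a minimal Python sketch. The page states only that the composite is a weighted average of the 5 category pass rates; it does not state the weights, so the equal weights below are an assumption. The function name, the category keys, and the example pass rates are hypothetical illustration values, not published results, and "pass rate" is taken here to mean the fraction of attacks a model resists.

```python
CATEGORIES = (
    "prompt_injection",
    "instruction_leakage",
    "tool_hijack",
    "jailbreak_bypass",
    "cost_amplification",
)

# Assumption: equal weights. The published rubric may weight categories differently.
EQUAL_WEIGHTS = {name: 1.0 for name in CATEGORIES}


def composite_score(pass_rates, weights=EQUAL_WEIGHTS):
    """Weighted average of per-category pass rates, each expected in [0, 1]."""
    missing = [c for c in CATEGORIES if c not in pass_rates]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    total_weight = sum(weights[c] for c in CATEGORIES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(weights[c] * pass_rates[c] for c in CATEGORIES)
    return weighted_sum / total_weight


# Made-up example: attacks resisted out of 100 in each category.
example = {
    "prompt_injection": 90 / 100,
    "instruction_leakage": 80 / 100,
    "tool_hijack": 70 / 100,
    "jailbreak_bypass": 60 / 100,
    "cost_amplification": 50 / 100,
}
print(round(composite_score(example), 3))
```

With equal weights and 100 attacks in every category, the composite reduces to the overall pass rate across all 500 attacks. Any other weighting would have to come from the scoring rubric.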

First scores publish Saturday 2026-05-23

The v1 run executes Friday 2026-05-22 across all 8 listed models with the full 500-attack v2026.q2 suite. Subscribe below to get the weekly digest delivered the moment it lands.

What's tested

Each category contributes 100 attacks. The categories were chosen because they map to the threats that show up in incident retrospectives.

Prompt Injection

Direct user-input attacks that try to override the system prompt, ranging from a plain "Ignore previous instructions" to obfuscated rewrites.

100 attacks

Instruction Leakage

Attempts to extract the system prompt or hidden instructions from the model. Direct asks, indirect probes, role-play coercion.

100 attacks

Tool Hijack

Attacks against function-calling: argument injection, scope escalation, recursive tool-loop exploitation, return-value injection.

100 attacks

Jailbreak Bypass

Multi-turn coercion, role-play loopholes, obfuscation (l33t, base64, paraphrase), and the canonical jailbreak corpus from PyRIT and Garak.

100 attacks

Cost Amplification

Inputs engineered to maximize token generation or trigger expensive tool chains. The economic-attack vector that shows up in the bill, not the security log.

100 attacks

Full methodology, attack suite source, and scoring rubric: read the methodology article.

Frequently asked

Will the scores be real or LLM-judged?

LLM-judged, with 10 percent manual review. If the LLM judge and the manual reviewers disagree on more than 5 percent of the sampled attacks, the cycle is invalidated and re-run.
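
The validity rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the leaderboard's actual run script: it assumes each verdict is a boolean (True when the attack was resisted), that the manual sample is 10 percent of the attacks drawn at random, and that the 5 percent threshold applies to that sample. All function and variable names are hypothetical.

```python
import random


def disagreement_rate(judge_verdicts, manual_verdicts):
    """Fraction of manually reviewed attacks where the two verdicts differ."""
    if not manual_verdicts:
        raise ValueError("manual review sample is empty")
    disagreements = sum(
        1
        for attack_id, verdict in manual_verdicts.items()
        if judge_verdicts[attack_id] != verdict
    )
    return disagreements / len(manual_verdicts)


def cycle_is_valid(judge_verdicts, manual_verdicts, max_disagreement=0.05):
    # "More than 5 percent" invalidates the cycle, so exactly 5 percent passes.
    return disagreement_rate(judge_verdicts, manual_verdicts) <= max_disagreement


# Made-up data: 500 judge verdicts, 50 of them re-checked by hand.
rng = random.Random(0)
judge = {attack_id: rng.random() < 0.8 for attack_id in range(500)}
sample_ids = rng.sample(range(500), 50)  # assumed 10 percent manual sample
manual = {attack_id: judge[attack_id] for attack_id in sample_ids}

# Flip three manual verdicts to simulate reviewer disagreement (3 / 50 = 6 percent).
for attack_id in sample_ids[:3]:
    manual[attack_id] = not manual[attack_id]

print(disagreement_rate(judge, manual), cycle_is_valid(judge, manual))
```

With a 50-attack sample, two disagreements (4 percent) keep the cycle valid, while three (6 percent) invalidate it.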

Can I reproduce these scores?

Yes. The attack suite, scoring prompts, and run scripts are published on GitHub at github.com/austa-ai/llm-security-leaderboard. Running the full suite against one model costs roughly 5 to 15 USD in API calls.
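
For budgeting a reproduction, the quoted range can be turned into rough per-attack and whole-cohort figures. The Python sketch below uses only numbers from this page (500 attacks per model, 8 models, 5 to 15 USD per model). It assumes every model falls inside the same cost range, which may not hold in practice since provider pricing differs. The names are hypothetical.

```python
ATTACKS_PER_MODEL = 500
COHORT_SIZE = 8
COST_PER_MODEL_USD = (5.0, 15.0)  # rough low/high range quoted on this page


def per_attack_cost(cost_range, attacks=ATTACKS_PER_MODEL):
    """Low and high estimate of API cost per single attack, in USD."""
    low, high = cost_range
    return low / attacks, high / attacks


def cohort_cost(cost_range, models=COHORT_SIZE):
    """Low and high estimate for one full run across the cohort, in USD.

    Assumption: every model costs roughly the same range to test.
    """
    low, high = cost_range
    return low * models, high * models


print(per_attack_cost(COST_PER_MODEL_USD))
print(cohort_cost(COST_PER_MODEL_USD))
```

Under these assumptions a single attack costs on the order of 1 to 3 cents, and one full run across the 8-model cohort lands somewhere around 40 to 120 USD.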

How is this different from HELM or AILuminate?

HELM benchmarks general capabilities. AILuminate covers broad safety hazards. This leaderboard is narrow: only adversarial security, scored from an attacker perspective rather than a safety perspective. Smaller scope, sharper signal, weekly cadence.

Can I nominate a model?

Yes. Open an issue on the GitHub repo with the model name, provider, API access details, and why it should be in the cohort. We add 2 to 3 community-nominated models per quarter.

My company runs an LLM API. Can we be on the leaderboard?

If your model is publicly accessible (paid API is fine) and current-generation, yes. Open an issue or email leaderboard@austa.ai. We do not accept payment for placement; rankings are scored independently.