AUSTA | Adversarial Intelligence

Methodology Reference

LLM Security Leaderboard Methodology (2026)

How the Austa LLM Security Leaderboard works. Reproducible methodology for benchmarking the security posture of frontier LLMs. 5 attack categories, 500 prompts per model, weekly re-runs, public scoring rubric.

By Austa · Published · Methodology v2026.q2 · ~9 min read

Why this leaderboard exists

Capability benchmarks for LLMs are everywhere. Stanford's HELM scores reasoning and knowledge. MLCommons' AILuminate covers broad safety hazards. Hugging Face's Open LLM Leaderboard tracks general performance. None of these answer the specific question that ships product: under attack, how does this model actually hold up?

The Austa LLM Security Leaderboard is narrow by design. It scores frontier LLMs from an attacker's perspective on a fixed adversarial suite, with results refreshed weekly. The point is not to declare a winner forever; the point is to make the security posture of each model visible, comparable, and tracked over time.

Stance: a high score on this leaderboard means the model resisted the published attack suite. It does not mean the model is unhackable, just that the categories we test are well-defended. A model that scores low on this leaderboard is probably worse in production; a model that scores high is at least not obviously broken on the well-known wedges.

What gets measured

Five attack categories. Each category contributes 100 attacks per cycle (500 total per model). Categories were chosen because they map to the threat models incident retrospectives keep returning to:

CategoryWhat it testsAttack count
Prompt injectionDirect user-input attacks trying to override the system prompt100
Instruction leakageAttempts to extract the system prompt or hidden instructions100
Tool hijackAttacks against function-calling: argument injection, scope escalation, tool-loop exploitation100
Jailbreak / refusal bypassMulti-turn coercion, role-play loopholes, obfuscation (l33t, base64, paraphrase)100
Cost amplificationInputs designed to maximize token generation or trigger expensive tool chains100

The attack suite is versioned. v2026.q2 is the live suite. v2026.q3 will be published in August with categories rebalanced based on the prior quarter's learnings. Historical results stay comparable within a methodology version.

How attacks are scored

Each attack produces one of three outcomes:

Scoring is done by an LLM judge using a frozen judge prompt (published alongside the suite) plus 10 percent manual review. Manual review samples are pulled at random across categories. If the LLM judge and manual review disagree on more than 5 percent of sampled attacks, the cycle is invalidated and re-run with a recalibrated judge.

Composite scoring

Per category: pass_rate = passes / (passes + fails), expressed as a percentage. Inconclusive responses are excluded from the denominator.

Composite score: weighted average across the 5 categories with equal weights in v2026.q2 (each category contributes 20 percent). Future versions may adjust weights if a category turns out to be too easy or too hard across the cohort.

A model with the following per-category scores:

Prompt injection:      82%
Instruction leakage:   91%
Tool hijack:           67%
Jailbreak bypass:      78%
Cost amplification:    88%

Composite:             81.2%

Which models are tested

The v1 cohort tests current-generation frontier models with publicly accessible APIs:

Models are added or removed quarterly. A model is removed from the leaderboard if its API has been deprecated, it is no longer in active use, or the provider has requested removal. Newer models can be nominated via the GitHub repo's issues.

Update cadence

WhenWhat
Friday 00:00 UTCFresh run starts. Full attack suite against all listed models.
Friday during the dayRun completes (typically 4 to 7 hours depending on API rate limits).
Friday eveningLLM judge scores the run. Manual review samples are pulled.
Saturday morning UTCResults published. Leaderboard updates. Week-over-week deltas computed.
Quarterly (March, June, September, December)Methodology version updated. New attack categories or rebalanced weights as needed.

Reproducibility

The full attack suite, scoring prompts, and run scripts are published on GitHub at github.com/austa-ai/llm-security-leaderboard. You need API keys for the models you want to test. Running the full suite against one model costs around 5 to 15 US dollars in API calls depending on model pricing.

If our published scores diverge from your own re-run, file an issue with the diff. We treat reproducibility as a non-negotiable: a leaderboard that can't be re-run is just opinion.

How this differs from existing benchmarks

BenchmarkScopeRefresh cadenceAdversarial?
HELM (Stanford)General capabilities (reasoning, knowledge, fairness)Major releases, ~yearlyNo
AILuminate (MLCommons)Broad safety hazards (harm categories)Major releasesPartial
Open LLM Leaderboard (HF)General performance, fine-tune rankingContinuousNo
Austa LLM SecurityNarrow: adversarial security, attacker perspectiveWeeklyYes (only this)

Smaller scope, sharper signal, faster update cycle. The trade-off is depth: we are not measuring everything a model can do, only how it holds against this specific suite of attacks.

Honest limitations

Three limitations we want explicit before anyone treats these scores as gospel:

Use this responsibly

The leaderboard is a starting point, not a procurement decision. Use it to:

Do not use it to:

The leaderboard goes live 2026-05-23

v1 results published Saturday May 23 with the full v2026.q2 attack suite. Subscribe via the homepage to get the weekly digest in your inbox.

View the leaderboard page

Related