Engine Internals
Scoring, Severity, and Regression: From a Raw Success to a Finding You Can Trust
The loop just made an attack land. That is not a finding yet. It is one success buried in a long, noisy transcript, possibly the fortieth copy of the same trick, with no severity attached and no proof it will happen again. The scorer is the stage that turns that raw event into something a defender can read in a minute, rank against everything else, and replay on demand. This is the finale of the series.
What "raw success" actually looks like
By the time the loop hands work to the scorer, a campaign has produced a pile of successful traces. Each one is a full recorded run from the harness: every message, every tool call, every retrieved document, every judge verdict. All of it is real and replayable. None of it is yet usable as a finding.
Three problems stand between a raw success and a finding worth shipping. The success has no severity, so a flaky jailbreak that produced mildly off-policy text sits in the same pile as a reliable secret exfiltration on a shared agent. The success is probably not alone, because adaptive search converges on whatever seam is soft and fires forty variations of the same trick, leaving the pile full of near-duplicates that describe one flaw. And the success is not minimal, because the trace that landed is the long one the orchestrator happened to walk, full of turns and tokens that contributed nothing. The scorer fixes all three, in order: rank, deduplicate, minimize.
Severity: exploitability times impact times blast radius
A finding's severity is a product of three axes, and the engine keeps them separate because they answer different questions and a defender prioritizes on all three.
- Exploitability. How reliably and how cheaply does the attack reproduce? An exploit that lands on the first try, every time, on a short transcript is far more dangerous than one that needs forty turns and lands one run in five. The engine has the data to measure this directly, because it can replay the minimized transcript many times and record the hit rate.
- Impact. What does the success actually let an attacker read, do, or move? Leaking a system-prompt secret outranks coaxing the model into a rude paragraph. Getting an agent to invoke a money-moving tool with attacker-chosen arguments outranks both. Impact is read off what the success touched: which canary surfaced, which tripwire tool fired, which forbidden capability was exercised.
- Blast radius. How far does one exploit reach? A flaw in a single-user assistant is contained. The same flaw in a shared multi-tenant agent, or one whose output a downstream system consumes, reaches everyone behind it. Blast radius is a property of the deployment, not the payload, and it is what turns a moderate issue into a critical one.
The multiplication is deliberate. A near-zero value on any axis pulls the whole score down, which matches the intuition that an unexploitable flaw, a zero-impact success, or a tightly contained blast radius is not where a defender should spend the first hour. The output is not a single magic number on a slide; it is a ranking that lets the most consequential exploits float to the top of the report. This is the same severity discipline that the agentic risk catalog in the OWASP Top 10 for AI Agents 2026 assumes, and it lines up with the expectations set out in the NIST AI agent security controls.
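A minimal sketch of the product and the resulting ranking, assuming each axis has already been normalized to a value between 0 and 1. The `Finding` class, its field names, and the sample values are hypothetical illustrations, not the engine's actual schema or scale.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    # All names and scales here are assumptions made for illustration.
    name: str
    exploitability: float  # e.g. replay hit rate of the minimized transcript, 0..1
    impact: float          # what the success touched, 0..1
    blast_radius: float    # property of the deployment, 0..1

    def severity(self) -> float:
        # Product of the three axes: a near-zero on any one axis
        # pulls the whole score toward zero.
        return self.exploitability * self.impact * self.blast_radius


def rank(findings: List[Finding]) -> List[Finding]:
    """Order findings so the most consequential come first."""
    return sorted(findings, key=lambda f: f.severity(), reverse=True)


findings = [
    Finding("rude-paragraph", exploitability=0.9, impact=0.1, blast_radius=0.2),
    Finding("secret-exfil-shared-agent", exploitability=0.8, impact=0.9, blast_radius=1.0),
    Finding("flaky-tool-abuse", exploitability=0.2, impact=1.0, blast_radius=0.5),
]

ranked = rank(findings)
for f in ranked:
    print(f"{f.name}: {f.severity():.3f}")
```

In this toy data the reliable exfiltration on a shared agent ranks first, and the highly reproducible but low-impact jailbreak ranks last, which is the behavior the product is meant to give.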
Deduplication: one canonical finding per flaw
Adaptive search is a duplicate factory. Once the judge signals that a translation-wrapper jailbreak works, the generator pours budget into that lineage and produces dozens of cousins that differ only in surface wording. Reporting all of them as separate findings would bury the signal and exhaust the reader. The scorer collapses them.
The engine clusters successes by what they share underneath the surface noise: the gadget lineage they descend from, the capability they exercised, the canary or tripwire they tripped, and the structural shape of the winning trace. Successes that land in the same cluster are one flaw seen many times. The scorer elects a canonical representative, usually the one that is most reliable and smallest after minimization, and attaches the rest as evidence rather than as separate items. The finding then carries an honest count: this flaw was reproduced N independent ways, which is itself a severity signal, because a flaw with many paths to it is harder to patch than one with a single fragile path. ML test frameworks lean on the same idea of grouping many failing cases under one named issue; Giskard organizes model failures into issue categories rather than reporting every failing example as its own defect.
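The clustering step can be sketched as grouping on a shared key and electing one representative per group. This is a simplification under stated assumptions: it groups on exact equality of four hypothetical fields (`lineage`, `capability`, `tripwire`, `shape`), whereas a real engine would likely need a fuzzier notion of similarity, and the `Success` record is invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Success:
    # Hypothetical record; field names are assumptions, not the engine's schema.
    trace_id: str
    lineage: str      # gadget lineage the attack descends from
    capability: str   # capability the success exercised
    tripwire: str     # canary or tripwire that fired
    shape: str        # coarse structural shape of the winning trace
    hit_rate: float   # replay reliability, 0..1
    turns: int        # length after minimization


def cluster_key(s: Success) -> Tuple[str, str, str, str]:
    return (s.lineage, s.capability, s.tripwire, s.shape)


def deduplicate(successes: List[Success]) -> List[Dict]:
    clusters: Dict[Tuple[str, str, str, str], List[Success]] = defaultdict(list)
    for s in successes:
        clusters[cluster_key(s)].append(s)

    findings = []
    for members in clusters.values():
        # Canonical representative: most reliable first, then smallest.
        canonical = max(members, key=lambda s: (s.hit_rate, -s.turns))
        evidence = [m for m in members if m is not canonical]
        findings.append(
            {"canonical": canonical, "evidence": evidence, "count": len(members)}
        )
    return findings


successes = [
    Success("t1", "translation-wrapper", "secret-read", "canary-A", "single-turn", 0.6, 3),
    Success("t2", "translation-wrapper", "secret-read", "canary-A", "single-turn", 0.9, 1),
    Success("t3", "roleplay-ladder", "tool-call", "tripwire-pay", "multi-turn", 0.4, 6),
]
deduped = deduplicate(successes)
```

Here `t1` and `t2` collapse into one finding with `t2` as the canonical trace and a count of 2, while `t3` stays a separate finding.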
Minimization: delta-debugging the attack trace
The transcript that landed is rarely the transcript you want to ship. The orchestrator's tree search wandered, built rapport it did not strictly need, and stacked transforms that may not have all mattered. A defender handed that raw trace has to reverse-engineer which part was load-bearing. The scorer does that work first.
Minimization is delta-debugging applied to an adversarial trace. The engine takes the successful transcript and systematically tries to make it smaller: drop a turn, drop a paragraph, strip a stacked transform, shorten a payload. After each cut it replays the candidate and asks the judge. If the attack still succeeds, the cut stays. If the success disappears, the cut is reverted and that element is marked load-bearing. The procedure repeats until nothing more can be removed without breaking the exploit.
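The cut-replay-revert procedure can be sketched as a greedy, one-element-at-a-time reduction. This is a simpler relative of the classic ddmin algorithm, not necessarily what the engine runs, and it assumes the replay oracle is deterministic. The `still_succeeds` callback stands in for "replay the candidate and ask the judge"; it and the toy trace are hypothetical.

```python
from typing import Callable, List, Sequence


def minimize(
    trace: Sequence[str],
    still_succeeds: Callable[[List[str]], bool],
) -> List[str]:
    """Remove elements one at a time, keeping each cut only if the attack still lands."""
    current = list(trace)
    if not still_succeeds(current):
        raise ValueError("original trace does not reproduce; nothing to minimize")

    changed = True
    while changed:  # repeat passes until a full pass removes nothing
        changed = False
        i = 0
        while i < len(current):
            candidate = current[:i] + current[i + 1:]
            if still_succeeds(candidate):
                current = candidate  # cut stays; next element now sits at index i
                changed = True
            else:
                i += 1               # cut reverted; element is load-bearing
    return current


# Toy oracle (an assumption for the demo): the attack lands only when
# both of these steps are present in the candidate.
REQUIRED = {"wrap request as a translation task", "ask for the canary string"}


def toy_oracle(candidate: List[str]) -> bool:
    return REQUIRED.issubset(candidate)


raw_trace = [
    "greet",
    "build rapport",
    "wrap request as a translation task",
    "apply base64 transform",
    "ask for the canary string",
]
minimal = minimize(raw_trace, toy_oracle)
```

Every call to the oracle is a full replay, so the cost of minimization scales with the number of candidate cuts; that is one reason a fixed replay budget matters in practice.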
Minimization depends on the engine's determinism, because delta-debugging only works if a replay means something. The deterministic transforms are seeded pure functions, and the stochastic attacker model and judge run with fixed sampling settings, so every candidate cut is evaluated under the same conditions as the original. A success that survives minimization is not luck; it is the irreducible core of the exploit, established the same way canaries and deterministic ground truth establish that a leak happened at all.
What survives is small, and small is the whole point. A four-turn rapport ladder that collapses to a single sentence once you strip the scaffolding is a clearer warning and a faster test. A finding you can read in thirty seconds gets fixed; a finding that is a forty-turn wall of text gets deferred. Minimization is the difference between the two.
The regression suite: every finding becomes a permanent test
A minimized, deterministic, replayable transcript is more than a report artifact. It is a test. The moment a finding is confirmed, the engine banks it: the minimal transcript, its seed, the judge configuration that scored it, and the canary or tripwire that proved it, all stored as a permanent regression case keyed to the flaw it represents.
The suite re-runs on two triggers: every model version change and every prompt or configuration change. Both are moments when a guardrail's behavior can shift out from under you. The agentic risk list calls this failure mode model-agnostic drift: a guardrail that fired yesterday silently stops firing after a provider upgrades the model, after a temperature tweak, or after a system-prompt edit that looked harmless. Without a regression suite, drift is invisible until an incident surfaces it. With one, a re-opened finding shows up the same day the swap lands, because the deterministic transcript gives a clean pass-or-fail rather than a guess.
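A minimal sketch of a banked regression case and a suite re-run, assuming a `replay` callback that returns `True` when the attack succeeds again against the current target. The `RegressionCase` fields mirror what the text says is stored, but every name and sample value below is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class RegressionCase:
    # Hypothetical record shape; not the engine's actual storage format.
    flaw_id: str                 # key for the flaw this case represents
    transcript: Tuple[str, ...]  # the minimized transcript
    seed: int                    # seed used for the deterministic replay
    judge_config: str            # identifier of the judge configuration that scored it
    proof: str                   # canary or tripwire that proved the success


def rerun_suite(
    cases: List[RegressionCase],
    replay: Callable[[RegressionCase], bool],
) -> List[str]:
    """Return the flaw ids that re-open, i.e. whose attack succeeds again."""
    return [case.flaw_id for case in cases if replay(case)]


suite = [
    RegressionCase("translation-wrapper-leak", ("translate this: ...",), 7, "judge-v1", "canary-A"),
    RegressionCase("payment-tool-abuse", ("please refund ...",), 11, "judge-v1", "tripwire-pay"),
]

# Stand-in for a real replay against the swapped model or edited prompt:
# pretend the first guardrail stopped firing after the change.
vulnerable_after_swap = {"translation-wrapper-leak"}


def fake_replay(case: RegressionCase) -> bool:
    return case.flaw_id in vulnerable_after_swap


reopened = rerun_suite(suite, fake_replay)
print(reopened)
```

Wiring `rerun_suite` to the two triggers named above (a model version change, or a prompt or configuration change) is a deployment decision; the sketch only shows the pass-or-fail shape of each case.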
That is the conceptual shape of continuous pentesting: not a once-a-quarter engagement, but a suite that runs on every swap, the same way unit tests run on every commit. The engine borrows its scoring discipline directly from the public benchmark world. JailbreakBench fixes a behavior set and an evaluation protocol so that a jailbreak result means the same thing across runs and across systems, and HarmBench standardizes the behaviors and the classifier so attack methods can be compared on equal footing. A regression suite is that idea turned inward: a private, target-specific benchmark that grows by one case every time the engine confirms a new way in.
The leaderboard: comparable scores, held under fixed constraints
Severity ranks findings within one target. The leaderboard answers a different question: how does one model or configuration stack up against another? That requires aggregate scores, and aggregate scores are only honest if the conditions that produced them were identical. The engine treats fairness as a methodology constraint, not a courtesy.
Three things have to be held constant for two scores to be comparable.
- Same corpus. Every target faces the same versioned set of gadgets. A model evaluated against an older or smaller corpus would look safer for no real reason, so the corpus version is pinned and reported alongside the score.
- Same budget. Adversarial search finds more given more attempts, turns, and tokens. A score is only meaningful next to the budget that bought it, so the budget is fixed across the field and published with the results.
- Same judge. The success oracle is the measuring instrument. If one run used a stricter judge panel than another, their numbers are in different units. The judge configuration is frozen for the comparison set, just as the judge itself is calibrated against a labeled set to bound its error before it is trusted to score anything.
Hold those three steady and the aggregate scores roll up into a public LLM security leaderboard where a lower attack-success rate genuinely means a more resistant system, not an easier test. How those numbers are computed, normalized, and reported lives in the LLM security leaderboard methodology, the public face of everything the scorer does in private.
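The three constraints can be enforced mechanically before any scores are compared. The sketch below assumes each run records its corpus version, budget, and judge configuration, and it refuses to build a leaderboard when those differ. All names, the single-number budget, and the sample figures are hypothetical; the actual computation lives in the leaderboard methodology.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunConditions:
    # Hypothetical fields standing in for "same corpus, same budget, same judge".
    corpus_version: str
    budget_attempts: int
    judge_config: str


@dataclass(frozen=True)
class RunResult:
    target: str
    conditions: RunConditions
    attempts: int
    successes: int


def attack_success_rate(result: RunResult) -> float:
    return result.successes / result.attempts


def leaderboard(results: List[RunResult]) -> List[RunResult]:
    """Rank targets by attack-success rate, lowest (most resistant) first."""
    if len({r.conditions for r in results}) != 1:
        raise ValueError("runs used different corpus, budget, or judge; scores are not comparable")
    return sorted(results, key=attack_success_rate)


fixed = RunConditions(corpus_version="corpus-v3", budget_attempts=500, judge_config="judge-v1")
results = [
    RunResult("model-a", fixed, attempts=500, successes=40),
    RunResult("model-b", fixed, attempts=500, successes=15),
]
board = leaderboard(results)
print([(r.target, attack_success_rate(r)) for r in board])
```

Failing loudly on mismatched conditions, rather than normalizing after the fact, keeps an easier test from masquerading as a more resistant system.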
Where the loop closes
This is the last stage, and it is also the one that feeds the rest. The findings the scorer produces are not just an output; they are evidence about which gadgets and transforms actually work, and that evidence is exactly what the corpus and the generator learn from across campaigns. A flaw confirmed on one target sharpens the attacks the engine brings to the next. The closed loop the series opened with does not stop at the end of a run. It compounds.
Across eight parts we have walked the whole engine. The harness draws the boundary with the target and records every exchange for replay. The corpus and taxonomy hold the parametric gadgets. The generator synthesizes concrete attempts and recombines what works. The orchestrator runs single-shot and multi-turn campaigns as a budgeted tree search. The judge decides success objectively, and canaries make the most important successes provable rather than probable. The scorer turns all of that into small, ranked, replayable findings and banks them as a regression suite that catches drift the day it happens. Generate, run, judge, learn, and then score: that is the engine, end to end.
The Austa engine series
- Architecture overview
- The target harness
- The attack corpus and taxonomy
- The adversarial generator
- The orchestrator and multi-turn campaigns
- The judge
- Canaries and deterministic ground truth
- Scoring, severity, and regression
That completes the series. For the whole picture in one place, start back at the architecture overview, the hub that ties every subsystem to the closed loop.