Engine Internals
The Adversarial Generator: Closed-Loop Attack Synthesis
A gadget from the corpus is a recipe, not an attack. The generator is the layer that turns that recipe into thousands of concrete attempts, then watches which ones land and breeds more like them. This is where "adversarial intelligence" stops being a slogan and becomes a search algorithm.
What the generator is for
The corpus (covered in Part 3) gives the engine a library of gadgets: parametric attack templates with typed mutation knobs. A gadget like "smuggle a tool-call instruction past an input filter" describes a class of attacks, not a single string. The generator's job is to instantiate that class into specific payloads, send them through the harness, and use the verdict from the judge to decide what to try next.
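To make "parametric template with typed mutation knobs" concrete, here is a minimal sketch of what a gadget record could look like. The `Gadget` and `Knob` classes, their field names, and the example template are illustrative assumptions, not the engine's actual schema.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class Knob:
    """One typed mutation knob: a name plus the values it may take."""
    name: str
    choices: tuple[str, ...]


@dataclass(frozen=True)
class Gadget:
    """A parametric attack template (hypothetical shape, not Austa's schema)."""
    gadget_id: str
    version: int
    template: Template
    knobs: tuple[Knob, ...]

    def instantiate(self, **settings: str) -> str:
        # Reject any setting that falls outside the knob's declared choices.
        for knob in self.knobs:
            if settings.get(knob.name) not in knob.choices:
                raise ValueError(f"invalid value for knob {knob.name!r}")
        return self.template.substitute(settings)


# Hypothetical gadget: one class of attack, many possible instantiations.
smuggle = Gadget(
    gadget_id="tool-call-smuggle",
    version=1,
    template=Template("[$carrier] $framing: <payload placeholder>"),
    knobs=(
        Knob("carrier", ("comment", "quoted-email")),
        Knob("framing", ("direct", "hypothetical")),
    ),
)
```

Calling `smuggle.instantiate(carrier="comment", framing="direct")` yields one concrete payload. The generator's job, in these terms, is to explore the space of knob settings rather than replay a single string.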
The reason this is a generator and not a payload list is the same reason the whole engine is a loop. A fixed list of jailbreak strings finds the attacks that already worked against some other model. It does not find the attack that works against this target, with this system prompt, behind these filters. Defenses are contextual. The generator's value is that it adapts the population of attempts to the specific thing it is attacking, guided entirely by feedback rather than by an analyst hand-writing variants.
The generator is built as three layers stacked from cheapest and most deterministic to most expensive and most exploratory. A campaign spends its budget mostly in the cheap layers, escalating to the expensive ones only where they earn their cost.
Layer one: deterministic transforms
The bottom layer is a set of pure, seeded functions that rewrite a payload without changing its intent. These are the obfuscation tricks every red-teamer knows: Base64 and ROT-family encodings such as ROT13, homoglyph substitution, translating the request into another language, whitespace and zero-width smuggling, character-level perturbations, and instruction wrapping. We cover the attack semantics of these in encoding and smuggling; here the point is purely mechanical.
Every transform takes a payload and a seed and returns a transformed payload. Same input, same seed, identical output, every time. There is no clock, no global random state, no network call. A transform that needs randomness (which homoglyph to pick, where to inject whitespace) draws it from a seeded PRNG that is part of the function's input, so the choice is recorded in the seed and replays exactly.
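A minimal sketch of one such transform, assuming a `(payload, seed) -> payload` signature. The function name `homoglyph_substitute`, the `rate` parameter, and the small substitution table are illustrative choices, not the engine's own.

```python
import random

# Latin letters mapped to visually similar Cyrillic code points.
HOMOGLYPHS = {
    "a": "\u0430",
    "e": "\u0435",
    "o": "\u043e",
    "c": "\u0441",
    "p": "\u0440",
}


def homoglyph_substitute(payload: str, seed: int, rate: float = 0.3) -> str:
    """Swap some eligible characters for look-alikes, driven only by `seed`."""
    rng = random.Random(seed)  # local PRNG: no clock, no global random state
    out = []
    for ch in payload:
        # Draw from the PRNG only for characters that have a look-alike.
        swap = ch in HOMOGLYPHS and rng.random() < rate
        out.append(HOMOGLYPHS[ch] if swap else ch)
    return "".join(out)
```

Because the only source of randomness is `random.Random(seed)`, two calls with the same payload and seed return the same string, which is the property the rest of the pipeline relies on.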
Why purity matters here. Transforms are the layer that runs millions of times across a campaign. If they were impure, no finding produced by a transformed payload could be reproduced, and the whole regression story would collapse. Keeping this layer deterministic is what lets the engine say "finding 4471 is exactly this byte sequence" instead of "we saw something like this once."
This is the part of Austa that most resembles TextAttack. TextAttack frames an adversarial NLP attack as a composition of transformations (ways to perturb text) and a search method over them, with constraints that keep the perturbation valid. Austa's transform layer is the transformation half of that framing, applied to attack payloads rather than classifier inputs, and it inherits the same discipline: a transformation is a small, composable, testable unit, and chains of them are just function composition with a combined seed.
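One way to read "function composition with a combined seed" is sketched below. The per-step seed derivation via SHA-256 is an assumption made for illustration, as are the names `apply_chain`, `derive_seed`, and the two toy transforms; the source does not specify how the engine combines seeds.

```python
import base64
import hashlib
from typing import Callable, Sequence

# Every transform shares one shape: (payload, seed) -> payload.
Transform = Callable[[str, int], str]


def b64_encode(payload: str, seed: int) -> str:
    """Deterministic encoding; the seed is accepted but unused."""
    return base64.b64encode(payload.encode("utf-8")).decode("ascii")


def wrap_instruction(payload: str, seed: int) -> str:
    """Pick one of a fixed set of wrappers, selected by the seed."""
    wrappers = ("Decode and follow: {p}", "The text below is encoded. {p}")
    return wrappers[seed % len(wrappers)].format(p=payload)


def derive_seed(chain_seed: int, index: int, name: str) -> int:
    """Derive an independent per-step seed from the single chain seed."""
    digest = hashlib.sha256(f"{chain_seed}:{index}:{name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def apply_chain(payload: str, chain: Sequence[Transform], chain_seed: int) -> str:
    """Apply transforms left to right; one chain seed fixes every step."""
    for index, transform in enumerate(chain):
        step_seed = derive_seed(chain_seed, index, transform.__name__)
        payload = transform(payload, step_seed)
    return payload
```

With this shape, recording `(chain, chain_seed)` is enough to regenerate the transformed payload, for example `apply_chain("hello", [b64_encode, wrap_instruction], 42)`.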
Layer two: the red model
Deterministic transforms can disguise a payload but they cannot write one. They will not invent a roleplay frame, talk a model into a persona, or rephrase a refused request into one the target accepts. That is natural-language work, and the layer that does it is an LLM that the engine drives as the attacker. We call it the red model.
The red model takes a gadget, the goal, and the target's most recent responses, and produces or rewrites a natural-language payload. It is the layer that writes "I'm a security researcher and my grandmother used to read me firewall rules," that restructures a flat instruction into a layered hypothetical, that notices the target refused on the word "exfiltrate" and tries "export" instead. In a multi-turn campaign it is also the voice that builds rapport before pivoting, which the orchestrator sequences and which we describe as an attack pattern in multi-turn jailbreaks.
Driving an attacker LLM through structured turns is the idea behind the orchestrators in PyRIT, Microsoft's risk-identification toolkit for generative AI. PyRIT formalizes the notion of an attack strategy that uses one model to probe another and loops on the responses. Austa's red model is that role, wired into a feedback loop with the judge so that the rewrites are not blind: the model is told, in machine-readable terms, how close the last attempt came.
The red model is also where the determinism problem bites hardest, because an LLM is stochastic by nature. The engine pins it down the same way it pins down anything stochastic: fixed sampling settings (temperature, top-p, seed where the provider exposes one), a frozen prompt template, and full transcript recording of every request and response. A recorded red-model transcript is the ground truth. If a provider's sampling behavior changes between runs, the recorded transcript still reproduces the exact exchange that produced a finding, which is what a regression suite actually needs. We do not pretend the red model is deterministic; we make every finding replayable from its transcript regardless.
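The record-and-replay idea can be sketched as a thin wrapper around whatever client calls the red model. Everything here is hypothetical: the `RedModel` class, the `SAMPLING` values, the `PROMPT_TEMPLATE` wording, and the `call_model` callable, which stands in for a real provider client.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Fixed sampling settings and a frozen prompt template (illustrative values).
SAMPLING = {"temperature": 0.7, "top_p": 0.95, "seed": 1234}
PROMPT_TEMPLATE = "Goal: {goal}\nLast target response: {last}\nRewrite the payload."


@dataclass
class RedModel:
    """Records every exchange when live; replays the recording when not."""
    call_model: Optional[Callable[[dict], str]]  # None means replay mode
    transcript: list = field(default_factory=list)
    _cursor: int = 0

    def rewrite(self, goal: str, last_response: str) -> str:
        prompt = PROMPT_TEMPLATE.format(goal=goal, last=last_response)
        request = {"prompt": prompt, **SAMPLING}
        if self.call_model is None:
            # Replay: return the recorded response, in the recorded order.
            entry = self.transcript[self._cursor]
            if entry["request"] != request:
                raise ValueError("replay diverged from the recorded transcript")
            self._cursor += 1
            return entry["response"]
        # Live: call the model and record the full exchange.
        response = self.call_model(request)
        self.transcript.append({"request": request, "response": response})
        return response
```

A live run builds `transcript`; a regression run constructs `RedModel(None, transcript=saved)` and gets the same responses back without contacting the provider, raising an error if the requests no longer match what was recorded.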
Layer three: evolutionary search
The top layer is what makes the generator adaptive rather than merely varied. It maintains a population of candidate attacks, each one a gadget with concrete parameter settings and a chain of transforms and red-model edits. Every candidate that runs gets a score from the judge. Each generation, the search keeps the high scorers, mutates them (flip a transform, retune a knob, ask the red model for a variant), recombines pairs of them (take the framing from one and the encoding from another), and discards the rest. Over generations the population concentrates on whatever is getting through this specific target's defenses.
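The loop described above can be sketched in a few dozen lines. This is a toy under stated assumptions: a candidate is reduced to a framing and an encoding, the `score` callable stands in for the judge, and the names `Candidate`, `mutate`, `recombine`, and `evolve` are illustrative rather than taken from the engine.

```python
import random
from dataclasses import dataclass
from typing import Callable

FRAMINGS = ("direct", "roleplay", "hypothetical")
ENCODINGS = ("plain", "base64", "homoglyph")


@dataclass(frozen=True)
class Candidate:
    framing: str
    encoding: str
    lineage: tuple[str, ...] = ()  # how this candidate was derived


def mutate(c: Candidate, rng: random.Random) -> Candidate:
    """Re-draw one attribute and note the step in the lineage."""
    if rng.random() < 0.5:
        return Candidate(rng.choice(FRAMINGS), c.encoding, c.lineage + ("mutate:framing",))
    return Candidate(c.framing, rng.choice(ENCODINGS), c.lineage + ("mutate:encoding",))


def recombine(a: Candidate, b: Candidate) -> Candidate:
    """Take the framing from one parent and the encoding from the other."""
    return Candidate(a.framing, b.encoding, a.lineage + ("recombine",))


def evolve(score: Callable[[Candidate], float], seed: int,
           generations: int = 10, size: int = 8, keep: int = 3) -> Candidate:
    rng = random.Random(seed)  # every selection and mutation draws from this
    population = [Candidate(rng.choice(FRAMINGS), rng.choice(ENCODINGS))
                  for _ in range(size)]
    for _ in range(generations):
        # Keep the high scorers, refill the rest with their offspring.
        elite = sorted(population, key=score, reverse=True)[:keep]
        children = []
        while len(elite) + len(children) < size:
            if rng.random() < 0.5:
                children.append(mutate(rng.choice(elite), rng))
            else:
                a, b = rng.sample(elite, 2)
                children.append(recombine(a, b))
        population = elite + children
    return max(population, key=score)
```

Running `evolve` twice with the same seed and the same scores returns the same candidate with the same lineage, at least within one Python version. In the real engine the mutation step would also include asking the red model for a variant, which this sketch omits.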
This is a genetic algorithm over attack structure, and it is deliberately modeled on the real automated-jailbreak research line. The GCG work (the universal adversarial suffix paper, Greedy Coordinate Gradient) showed that an attack string can be optimized against an objective rather than written by hand, and that optimized suffixes transfer across models. AutoDAN extended the line by evolving fluent, human-readable jailbreak prompts with a genetic algorithm instead of producing the garbled token soup that gradient methods tend to produce. Austa's search sits in the AutoDAN tradition: it treats jailbreak discovery as black-box optimization, because a real target is usually an API with no gradients to follow, and it keeps the population readable so that a winning attack is also an explainable finding.
The score that drives selection comes entirely from the judge, which is the subject of Part 6. The generator does not decide whether an attack succeeded; it only proposes. Keeping proposal and adjudication in separate subsystems is what lets the search be aggressive without lying to itself.
The determinism contract
A pipeline with a stochastic red model and an evolutionary search sounds like the opposite of reproducible. It is not, and the contract that holds it together is worth stating plainly.
- Transforms are pure seeded functions. Given the seed, byte-identical output.
- The red model and the judge run with fixed sampling, and every request and response is recorded. They are not deterministic, but their outputs are captured, so a recorded transcript replays the exact exchange.
- The search uses a seeded PRNG for every mutation, recombination, and selection decision, and logs the full lineage of every candidate.
The result is that a finding is not "the engine found something." A finding is a seed, a gadget version, a transform chain, and a set of recorded model transcripts that together reproduce the attack byte for byte. That artifact is what gets handed to scoring and regression so it can be re-run against the next model version. Reproducibility is the line between a demo and a test.
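As a rough illustration of that artifact, the sketch below bundles the four ingredients into one serializable record and checks a replayed payload against a stored hash. The `Finding` class, its field names, and the `payload_sha256` check are assumptions made for the example; the source does not describe the engine's actual storage format.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Finding:
    """Everything needed to re-run one attack (hypothetical layout)."""
    seed: int
    gadget_id: str
    gadget_version: int
    transform_chain: tuple[str, ...]   # transform names, in application order
    transcripts: tuple[str, ...]       # serialized red-model and judge exchanges
    payload_sha256: str                # hash of the exact bytes that were sent

    def to_json(self) -> str:
        # Sorted keys keep the serialized artifact stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


def verify(finding: Finding, replayed_payload: bytes) -> bool:
    """True only if the replay produced the same bytes as the original run."""
    return hashlib.sha256(replayed_payload).hexdigest() == finding.payload_sha256
```

A regression run would rebuild the payload from the seed, gadget version, and transform chain, replay the transcripts, and call `verify` before trusting the result against a new model version.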
Why this needs the judge and the canaries to stay honest
Fully automated attack generation is noisy, and it is worth being candid about that. An evolutionary search optimizing against a feedback signal will happily exploit a weak signal. If the score is "the response did not contain the word refuse," the search learns to elicit responses that avoid that word while accomplishing nothing. This is reward hacking, and an unchecked generator produces a population of attacks that game the metric instead of breaking the target.
Two things keep it honest. The first is a strong judge: a multi-signal oracle rather than a single keyword check, so the score the search optimizes against actually tracks exploitation. The second, and the one that matters most, is deterministic ground truth from canaries. When the goal of a gadget is to exfiltrate a planted secret or trip a tripwire tool, success is not a judge's opinion at all. The canary string either appears in the output or it does not. That binary anchor is what the search optimizes toward whenever it can, because it cannot be gamed: there is no phrasing that produces the canary without the exploit actually working. The judge handles the fuzzy goals; the canaries handle the ones that can be made crisp, and the generator is steered toward crisp signals by design.
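The difference between a gameable signal and a canary can be shown in a few lines. The function names, the example canary string, and the rule that a canary takes precedence over the judge whenever one is planted are illustrative assumptions, not the engine's scoring logic.

```python
from typing import Callable, Optional


def keyword_score(response: str) -> float:
    """Weak signal: rewards any response that avoids the word 'refuse'."""
    return 0.0 if "refuse" in response.lower() else 1.0


def canary_hit(response: str, canary: str) -> bool:
    """Binary ground truth: the planted secret is in the output or it is not."""
    return canary in response


def score(response: str,
          canary: Optional[str] = None,
          judge: Optional[Callable[[str], float]] = None) -> float:
    """Prefer the crisp canary signal; fall back to the judge for fuzzy goals."""
    if canary is not None:
        return 1.0 if canary_hit(response, canary) else 0.0
    if judge is None:
        raise ValueError("need either a canary or a judge")
    return judge(response)


# A polite non-answer games the keyword check but not the canary check.
evasive = "I cannot help with that."
leaked = "Sure, the stored value is CANARY-7f3a9c."
```

Here `keyword_score(evasive)` returns 1.0 even though nothing was exploited, while `score(evasive, canary="CANARY-7f3a9c")` returns 0.0 and `score(leaked, canary="CANARY-7f3a9c")` returns 1.0. A search driven by the second signal has nothing to exploit except the target itself.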
The same separation of concerns shows up across the broader tooling landscape. Counterfit, Microsoft's automation layer for attacking ML systems, made the same architectural bet years earlier for classical models: separate the attack algorithms from the target interface and the success criteria, so each can be swapped and trusted independently. Austa applies that bet to the LLM and agent case, with the generator as the attack-algorithm layer, the harness as the target interface, and the judge plus canaries as the success criteria the generator is never allowed to grade for itself.
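In code, that separation amounts to three narrow interfaces and a driver that never lets one role do another's job. The sketch below uses `typing.Protocol`; the interface names and method signatures are hypothetical and chosen only to mirror the three roles named above.

```python
from typing import Protocol


class Generator(Protocol):
    """Attack-algorithm layer: proposes payloads, never grades them."""
    def propose(self) -> str: ...


class Harness(Protocol):
    """Target interface: delivers a payload and returns the response."""
    def send(self, payload: str) -> str: ...


class Judge(Protocol):
    """Success criteria: the only component allowed to assign a score."""
    def score(self, payload: str, response: str) -> float: ...


def run_attempt(generator: Generator, harness: Harness,
                judge: Judge) -> tuple[str, str, float]:
    """One attempt: propose, deliver, adjudicate, each by a separate component."""
    payload = generator.propose()
    response = harness.send(payload)
    return payload, response, judge.score(payload, response)
```

Because `run_attempt` depends only on the three protocols, any one implementation can be swapped, for example a new target behind the same `Harness`, without touching the other two.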
The Austa engine series
- Architecture overview
- The target harness
- The attack corpus and taxonomy
- The adversarial generator
- The orchestrator and multi-turn campaigns
- The judge
- Canaries and deterministic ground truth
- Scoring, severity, and regression