Engine Internals
Inside the Austa Engine: How a Closed-Loop System Pentests an LLM
A pentest of a language model is not a scan. It is an argument the attacker keeps having with the target until it breaks. The Austa engine is built to have that argument at machine speed: generate an attack, run it, judge whether it landed, learn from the result, and go again. This is a tour of the core.
The threat model the engine has to cover
Before the architecture makes sense, the target has to. A modern LLM application is rarely just a model behind a text box. It is a system prompt, a set of tools or functions the model can call, a retrieval layer that pulls in documents, and often a multi-turn conversation that carries state. Every one of those surfaces is an entry point, and the interesting attacks live where they meet.
The engine is designed to find a specific spread of failures:
- Prompt injection. An attacker's text overrides the developer's instructions, directly in the user turn or indirectly through content the model reads.
- Jailbreaks. The model is coaxed past its own safety policy into producing output it was tuned to refuse.
- Tool and function hijack. The model is steered into calling a privileged tool with arguments the developer never intended, the agent equivalent of getting a process to run a command for you.
- Data exfiltration. Secrets in the system prompt, in tool definitions, or in retrieved context get coaxed out, sometimes verbatim, sometimes laundered through a summary.
- Multi-turn escalation. No single message is an attack, but a sequence is. The model is walked from a benign opener through rapport to a request it should have refused.
- Indirect injection via RAG and documents. The payload is not in the conversation at all. It is planted in a webpage, a PDF, an email, or a knowledge-base entry that the agent will later retrieve and treat as trusted instruction.
That last category is why the engine cannot be a thing you point only at a chat endpoint. The dangerous instruction often enters through a document the developer never thought of as an input. We cover that surface in detail in the OWASP Top 10 for AI Agents 2026, which catalogs the agency-specific risks the engine has to probe beyond the classic LLM list.
Why a static checklist is not enough
The first generation of LLM security tooling was a scanner: a fixed list of known-bad prompts, fired once at the target, with a report of which ones got a forbidden response. That approach is genuinely useful and the engine borrows heavily from it. But a one-shot payload list has a structural ceiling, and it is worth being precise about why.
A fixed payload either works against this target or it does not. The most consequential attacks are the ones that do not work in the form anyone wrote down. They work after a transformation the target did not anticipate: the refused instruction passes when it is base64-encoded, or split with zero-width whitespace, or asked in a second language, or reframed as a fictional transcript. They work after a setup: the model that refuses a request cold will grant it on the fourth turn, once the conversation has established a premise that makes the request feel consistent. And they work in combination: an encoding trick that fails alone succeeds once it rides on top of a multi-turn frame.
The gap a checklist leaves. A static scan finds the failures that are stable across all targets. It misses the failures that are specific to this target, this prompt, this tool set, this conversation. Those are exactly the failures an actual attacker spends time on, because they are the ones a generic defense did not anticipate.
The fix is not a longer list. A longer list still fires once and stops. The fix is to make the testing loop adaptive: let what got through inform what gets tried next. That is the whole idea behind the Austa engine, and it is what separates it from a checklist like the 2026 LLM Security Checklist, which is the right instrument for coverage but not for adversarial depth.
The closed loop: generate, run, judge, learn
The spine of the engine is a four-stage loop that repeats under a budget.
- Generate. Synthesize concrete attack attempts from the corpus. Not abstract categories, but specific strings, specific tool-call lures, specific multi-turn scripts, ready to send.
- Run. Drive those attempts at the target through the harness, single turn or multi-turn, capturing every message, tool call, and retrieved document.
- Judge. Decide objectively whether each attempt succeeded. Not "this looks risky" but a defensible yes or no, backed by signals that range from a regex match to a panel of LLM judges.
- Learn. Feed the judge's verdict back to the generator. Attempts that made progress get mutated and recombined. Attempts that hit a wall get abandoned. The next generation is shaped by what this specific target tolerated.
That feedback edge is the difference between a scanner and adversarial intelligence. A scanner has no memory of the target inside a run. The engine does. After the first wave, it knows that this model resists direct jailbreaks but leaks when a request is wrapped in a translation task, and it pours its remaining budget into that seam. The behavior is emergent search, not a script, and it is the closest a machine gets to the way a human red teamer probes, notices a soft spot, and leans on it.
None of these stages is new on its own. The prior art is real and excellent. garak pioneered probe-based scanning of LLMs with a detector layer that decides whether a probe landed. PyRIT formalized the orchestrator pattern for red-teaming generative systems, including multi-turn attack flows. promptfoo built a rigorous evaluation and red-team harness with assertions you can run in CI. The Austa engine generalizes the three: garak's probe-and-detect, PyRIT's orchestration, and promptfoo's evaluation become the generate, run, and judge stages of one loop, with a learn edge wired across all of them so the system improves within a single campaign instead of just reporting at the end.
The component map
The loop is implemented by seven cooperating subsystems. Each gets its own part in this series; here is the one-line role of each and where the engine draws the line between them.
The harness
The engine's boundary with reality. It connects to whatever the system under test actually is (a chat-completion API, a tool-using agent, an MCP server, a RAG app, a browser agent) and normalizes every interaction into a single Exchange type: messages, tool calls, retrieved context, and side effects. It records tool calls without executing real side effects, so the engine can watch an agent decide to wire money without any money moving.
The corpus and taxonomy
The structured library of attack families, mapped to the OWASP LLM Top 10 and the OWASP Agentic Top 10 plus Austa's own axes. Each technique is a parametric template called a gadget: a skeleton with typed mutation knobs for target capability, obfuscation, delivery surface, and payload goal. Curated, deduplicated, and versioned so a campaign is reproducible.
The generator
The attack synthesizer at the heart of the loop. Three layers stacked: seeded deterministic transforms (encoding, homoglyphs, language switching, whitespace smuggling), an attacker model that writes and rewrites natural-language payloads, and an evolutionary search that mutates and recombines gadgets under judge feedback, keeping a population of what is getting through.
The orchestrator
The campaign runner. It decides whether an attack is one shot or a multi-turn escalation ladder, manages conversation state, and runs a tree search over the dialogue: branch, score, prune, backtrack. It enforces the budgets (max turns, tokens, wall-clock, attempts) and fans work out across target replicas in parallel.
The judge
The success oracle. It layers cheap deterministic detectors, tool-call inspection, structured-output checks, and a panel of LLM judges for the fuzzy cases, then calibrates against a labeled set to bound false positives and false negatives. A single LLM judge is itself unreliable and attackable, so the judge is built as a panel, not a verdict from one model.
Canaries
The trick that turns "it probably leaked" into a hard yes. The engine plants unique canary secrets in the system prompt, tools, and data, watermarks the documents an agent will retrieve, and installs tripwire tools that record any invocation. If a canary string appears in output or a tripwire fires, the exploit is proven deterministically, with no judge guesswork.
The scorer
The stage that turns a raw success into a finding. It assigns severity from exploitability, impact, and blast radius, deduplicates near-identical successes, and shrinks each exploit to a minimal reproducing transcript by delta-debugging the attack trace. A regression suite re-runs every known finding against new model versions and prompt changes to catch drift.
How a campaign flows end to end
Put the pieces in motion and a campaign reads like a single coherent run. An operator points the engine at a target and the harness establishes the connection, fingerprinting whether it is a bare chat endpoint, an agent with tools, or a RAG pipeline with a retrieval surface to poison. The corpus contributes the gadgets relevant to that target's capabilities: there is no point firing tool-hijack gadgets at a model with no tools.
The generator instantiates the first wave of concrete attempts from those gadgets, and the orchestrator runs them. Single-turn gadgets go out as one shot. Multi-turn gadgets become conversation trees that branch on the target's responses. Every exchange flows back through the harness into the record, and the judge scores each leaf. Where canaries are in play, the scoring is unambiguous: the secret either surfaced or it did not.
Then the learn edge closes. The generator reads the judge's signal, keeps the lineages that made progress, mutates them, recombines the transforms that correlated with success, and emits a smarter second wave. The loop continues until a budget is hit, at which point the scorer takes every successful trace, deduplicates, ranks by severity, and minimizes each one to the shortest transcript that still reproduces the failure. The output is not a risk score on a slide. It is a set of exploits, each with a transcript you can replay.
Determinism and budget: the guarantees that make it a test
An adversarial engine that uses stochastic models everywhere could easily produce a "we saw it once" demo that never reproduces. The engine is built so that does not happen, and the guarantee rests on two design commitments.
The first is determinism by construction. The deterministic transforms are pure, seeded functions: the same gadget and seed always produce the same payload. The stochastic parts, the attacker model and the LLM judges, run with fixed sampling settings, and every exchange is recorded in full. The result is that any finding replays exactly from its transcript even though parts of the pipeline are probabilistic. That is what lets a one-time discovery become a regression test, and it is the line between a demo and a control you can rely on. We carry the same reproducibility discipline into how aggregate results are computed in the LLM security leaderboard methodology.
The second is the budget. Adversarial search over a conversation space is unbounded in principle, so the engine treats budget as a first-class input: a campaign declares its ceilings on turns, tokens, wall-clock time, and total attempts, and the orchestrator spends them deliberately, steering the remaining budget toward the seams the judge has already shown are soft. A campaign always terminates, always with the best findings it could buy for the budget it was given, and always with the receipts to prove them.
The Austa engine series
- Architecture overview: the closed-loop adversarial engine
- The target harness
- The attack corpus and taxonomy
- The adversarial generator
- The orchestrator and multi-turn campaigns
- The judge
- Canaries and deterministic ground truth
- Scoring, severity, and regression