What Is AI Red-Teaming? A Step-by-Step Process for Testing LLM Applications
AI red-teaming is the adversarial testing of an LLM application to find the inputs that make it misbehave before attackers do. Here is a repeatable before, during, and after process, mapped to OWASP, NIST, and MITRE ATLAS.
TL;DR
- AI red-teaming is adversarial testing that finds the prompts, documents, and tool calls that make an LLM application misbehave, before real attackers find them.
- It differs from a penetration test: red-teaming probes model and application behavior (jailbreaks, prompt injection, data leakage, tool abuse), not just network and code flaws.
- Run it as a process: scope and threat-model first, attack the live app during, then triage findings to fixes and a baseline you can retest after.
- Map every finding to a standard ID (OWASP LLM Top 10, NIST adversarial ML taxonomy, MITRE ATLAS) so fixes are tracked, not just noted.
What is AI red-teaming?
AI red-teaming is the structured, adversarial testing of an AI system to discover the inputs and conditions that make it produce harmful, leaked, or out-of-policy behavior, before a real attacker or an ordinary user stumbles into them. For an LLM application that means deliberately crafting prompts, documents, emails, and tool responses designed to bypass your guardrails, then recording what got through.
The term comes from traditional security, where a "red team" plays the attacker against a defending "blue team." With LLMs the practice has widened: you are not only testing the base model's safety, you are testing your whole application, the system prompt, the retrieval layer, the tools the model can call, and the actions those tools take. Microsoft's red-teaming guidance frames it as a best practice in responsible AI development whose job is to "uncover and identify harms" so that measurement and mitigation work has something concrete to fix (Microsoft, 2026).
The honest framing matters: red-teaming does not prove your app is secure. It surfaces failures. A clean round means you did not find a problem this time, not that none exists. That is exactly why it is run as a repeatable process rather than a one-off audit.
How is AI red-teaming different from penetration testing?
The short answer: a penetration test attacks the infrastructure and code; AI red-teaming attacks the behavior of the model and the application built around it. They overlap, but they find different classes of bug.
A traditional pentest looks for the flaws security teams have hunted for decades: unauthenticated endpoints, injection into a database, broken access control, exposed secrets, vulnerable dependencies. Those still apply to an LLM app, and you still need them.
AI red-teaming targets failures that have no equivalent in a normal web app:
- Jailbreaks, where a user talks the model out of its own rules.
- Prompt injection, where instructions hidden in user input or in third-party content hijack the model's behavior. This is ranked LLM01, the single highest risk in the OWASP Top 10 for LLM Applications 2025 (OWASP, 2025).
- Sensitive information disclosure, where the model reveals other users' data, training data, or its own system prompt.
- Excessive agency and tool abuse, where the model is manipulated into calling a tool, sending an email, or running code it should not.
These are model and application behaviors, not code paths, so they need adversarial probing rather than a vulnerability scanner. In practice a mature engagement does both: classic pentest coverage of the surrounding system, plus red-teaming of the AI behavior. If you are commissioning external testing, the distinction is exactly what separates a real assessment from automated noise, which we cover in LLM penetration testing and AI security audits.
What frameworks should an LLM red team test against?
Test against a published taxonomy rather than your own intuition, so that coverage is comprehensive and every finding maps to a standard ID your team can track. Three frameworks do most of the work, and they complement rather than compete.
OWASP Top 10 for LLM Applications (2025) is the application-layer checklist. It names the ten risks that show up in real LLM and agent apps, from LLM01 Prompt Injection through LLM02 Sensitive Information Disclosure, LLM06 Excessive Agency, LLM07 System Prompt Leakage, and LLM08 Vector and Embedding Weaknesses (OWASP, 2025). Use it as your test plan: there should be at least one attack attempt per relevant category.
NIST AI 100-2e2025, the adversarial machine learning taxonomy finalized in March 2025, gives you the deeper attack vocabulary: evasion, data and model poisoning, privacy and extraction attacks, and prompt injection, organized by attacker goals, capabilities, and knowledge (NIST, 2025). It is the reference for the more research-flavored attacks, such as trying to extract training data or invert a model.
MITRE ATLAS is the ATT&CK-style knowledge base of real-world adversary tactics and techniques against AI systems. Red teams use its technique IDs the way they use ATT&CK IDs in a classic engagement: to describe an attack chain in shared, comparable language and to map findings to documented case studies.
You do not need all three for every test. OWASP is the floor for any application red team; NIST and ATLAS add depth for higher-stakes systems.
What is the step-by-step process to red team an LLM application?
Red-team an LLM application in three phases: plan and threat-model before you attack, run layered adversarial testing during, then triage and fix after. Advance planning is what separates a productive exercise from random prompt-poking (Microsoft, 2026).
Before: scope, threat-model, and assemble the team
- Define the system and its boundary. Write down what the app does, what data it can reach, which tools it can call, and what a worst-case action looks like (sends money, deletes records, emails a customer). The blast radius of a successful attack is the thing you are really testing.
- Pick your frameworks and build the test plan. Turn the OWASP LLM Top 10 into a checklist of attack categories, and add NIST or ATLAS techniques for the risks that matter to your system.
- Decide what layers to test. Microsoft's guidance recommends probing several layers: the base model through its API, the full application through its real UI, and both with and without your mitigations in place, so you can tell whether a defense actually works (Microsoft, 2026).
- Assemble a mixed team. Combine an adversarial, security-minded mindset with people who think like ordinary users; the second group finds the harms regular usage triggers, which the security specialists often skip.
- Set up recording before you start. Agree on what every finding captures: the date, a reproducible ID for the input and output pair, the exact prompt, and a description or screenshot of the result. A shared spreadsheet is enough, and it lets testers build on each other's ideas.
During: attack across categories
- Start open-ended. Let testers explore freely and document anything problematic, rather than hunting for one specific harm. Open-ended probing exposes blind spots in your own understanding of the risk surface.
- Run direct attacks. Jailbreak attempts, "ignore previous instructions," role-play framings, encoded or obfuscated payloads, and attempts to extract the system prompt.
- Run indirect prompt injection. Plant adversarial instructions in the content the app reads on the user's behalf: a retrieved document, an inbound email, OCR text from an upload, a web page, or a tool result. Greshake et al. showed these attacks compromising real LLM-integrated applications, with retrieved content effectively acting as code that redirects the system (Greshake et al., 2023). For the mechanics, see prompt injection explained.
- Probe for data leakage. Try to retrieve another user's records, surface confidential documents the current user should not see, or pull back fragments of training or system data.
- Convert open-ended findings into a guided list. Build a running list of confirmed harms with definitions and examples, then test it systematically and re-probe each one to check coverage.
After: triage, fix, retest
- Report and prioritize. Summarize the top issues with links to the raw data, and rank them by severity and likelihood, not by how clever the attack was.
- Map each finding to a fix and a framework ID. Tie every issue to an OWASP, NIST, or ATLAS reference so it lands in your tracker as a tracked item, not a note.
- Establish a regression baseline. Turn the attacks that worked into automated test cases so a future change cannot silently reintroduce the same hole.
- Retest after mitigation. Re-run the same attacks against the patched app. A finding is closed only when the attack that produced it no longer works.
How do you red team AI agents and tools?
When the LLM can call tools and take actions, red-teaming shifts from "what can it say" to "what can it do," and the most important attacks aim to make the agent perform an action the user never authorized. The risk is not bad text; it is a real side effect.
Test the agent's tool layer specifically:
- Tool misuse via injection. Hide an instruction in a document or tool result that tells the agent to call a sensitive tool, then check whether it complies. This is the agent version of indirect prompt injection, and it is why untrusted content should arrive only inside tool-result blocks, be JSON-encoded, and be screened before the model acts on it (Anthropic, 2026).
- Excessive agency. Probe whether the agent has more permission, autonomy, or tool access than the task needs (OWASP LLM06). The fix is least privilege, and the test is to confirm a successful injection cannot reach anything dangerous.
- Confused-deputy chains. Try to get the agent to use its own legitimate access on the attacker's behalf, for example reading a record the user cannot, then leaking it back through the conversation.
- Human-in-the-loop bypass. If high-impact actions require confirmation, attack the confirmation step itself, and verify the agent cannot act before approval.
Anthropic's own guidance is blunt about the final step: red-team your own agent before deploying it, with documents, emails, and tool outputs that deliberately contain injection attempts, and confirm both that the model ignores them and that your screening and confirmation steps catch the rest (Anthropic, 2026). For the underlying defenses, see securing AI agents against MCP tool poisoning and excessive agency.
What tools do AI red teamers use?
AI red teamers combine automated attack tooling for breadth with manual, creative testing for depth, because the most damaging attacks are usually the ones a scanner never generates. Tools generate volume; people find the clever chain.
Common categories of open-source tooling:
- Automated probing and scanners that fire large batteries of known jailbreak and injection payloads at an endpoint and flag responses that look like a bypass. These are good for regression and coverage of known patterns.
- Adversarial prompt and dataset collections that supply curated jailbreak corpora and attack templates to seed your own testing.
- Agent and tool-abuse harnesses that drive an agent through scripted attack scenarios against its tools.
The catch is the same one buyers complain about across the industry: a scan alone is not a red team. Automated fuzzing covers known payloads and produces a report, but it does not threat-model your specific blast radius, chain attacks across your retrieval and tool layers, or judge real-world severity. Treat automation as the wide net and manual red-teaming as the part that finds the attacks that matter. The distinction is the whole subject of LLM penetration testing and AI security audits, and it is the same builder discipline behind how we built Onepilot.
FAQ
Is AI red-teaming the same as a jailbreak test?
No. A jailbreak test is one technique inside red-teaming: trying to talk the model out of its own safety rules. Red-teaming is the full exercise, covering jailbreaks plus prompt injection, data leakage, tool and agent abuse, system-prompt extraction, and more, then mapping each finding to a fix. Treating a jailbreak attempt as the whole job is exactly the shallow testing the practice is meant to replace.
Can you fully secure an LLM application through red-teaming?
No. Red-teaming finds failures; it does not prove their absence, and a clean round only means you did not find a problem this time. Indirect prompt injection in particular is not fully solvable with current models, so the realistic goal is least-privilege design that limits the damage of any single bypass, verified by repeated red-teaming rather than a one-time sign-off.
How often should you red team an LLM application?
Red-team before launch, after any material change to the model, system prompt, tools, or retrieved data sources, and on a recurring schedule for systems that stay in production. Because models, attacks, and your own app all change, robustness is something you maintain, not something you design in once. Automated regression tests run continuously; deeper manual rounds run periodically and on major changes.
Who should be on an AI red team?
A mix. You want people with a security and adversarial mindset who know how injection and extraction attacks work, alongside testers who behave like ordinary users and surface the harms normal usage triggers. For domain-specific apps, add a subject-matter expert who can recognize a harmful or non-compliant answer in that field. Diversity of perspective is what widens coverage.
Does red-teaming replace automated LLM security scanning?
No, the two are complementary. Automated scanning gives you cheap, repeatable coverage of known payloads and is ideal for regression testing. Manual red-teaming finds the novel, chained, and context-specific attacks a scanner will never generate, and judges real severity. A credible program runs both; relying on the scan alone is the automated-fuzzing-with-a-cover-report trap.
What is MITRE ATLAS and how is it used in AI red-teaming?
MITRE ATLAS is an ATT&CK-style knowledge base of real-world adversary tactics and techniques against AI systems. Red teams use its technique IDs to describe an attack chain in standard, shared language and to map their findings to documented case studies, the same way classic engagements use ATT&CK IDs. Alongside the OWASP LLM Top 10 and the NIST adversarial ML taxonomy, it lets a team report findings as tracked, comparable items rather than ad-hoc notes.
ELM Labs is an applied AI lab that designs, builds, and red-teams LLM and agent systems end to end.