
What Is AI Application Security? A Practitioner's Guide to the OWASP Top 10 for LLMs, Prompt Injection, and Red-Teaming

AI application security is the engineering discipline of defending LLM and agent apps at the application layer. Here is a map of what goes wrong, grounded in the OWASP Top 10 for LLMs, NIST, and MITRE ATLAS, along with the defenses that actually contain the damage.

June 27, 2026 · 12 min read · ELM Labs

TL;DR

AI application security defends LLM and agent apps at the application layer; the model is only one part of the attack surface.
The 2025 OWASP Top 10 for LLM applications is the shared map, led by LLM01 Prompt Injection and including Excessive Agency and Sensitive Information Disclosure.
Indirect prompt injection is architecturally unsolved: no filter catches every attack, so the real win is least-privilege design that shrinks the blast radius.
Treat security as a process: map attacks to OWASP, NIST, and MITRE ATLAS, then red-team your own app before and after every release.

What is AI application security?

AI application security is the discipline of defending applications built on large language models, including chatbots, retrieval systems, and autonomous agents, against attacks that target how the model reads input, calls tools, and produces output. It treats the LLM as one untrusted component inside a larger system, not as a magic box you can secure by picking a safer model.

This is a different problem from model safety research, which asks whether a model will refuse to write malware in the abstract. Application security asks a narrower, more practical question: when your model is wired to your data and your tools, what can an attacker make it do, and how much damage can that cause? The answer almost never depends on the model alone. It depends on what the model is allowed to read, what actions it can take, and what happens to its output downstream.

The field now has a shared vocabulary. The OWASP Top 10 for LLM Applications, updated for 2025, catalogues the ten most critical risks. The NIST adversarial machine learning taxonomy, finalized in 2025, classifies the attack types across the AI lifecycle. And MITRE ATLAS provides an ATT&CK-style knowledge base of real adversary tactics and techniques against AI systems, so findings map to standard technique IDs. Together they let you talk about a vulnerability with a name and an ID instead of a vague worry.

What is in the OWASP Top 10 for LLM applications in 2025?

The 2025 OWASP Top 10 for LLM Applications lists the ten risks every team building on LLMs should design against:

LLM01 Prompt Injection: crafted input overrides the developer's instructions and changes what the model does.
LLM02 Sensitive Information Disclosure: the model exposes PII, secrets, proprietary data, or business information.
LLM03 Supply Chain: vulnerabilities in third-party models, datasets, plugins, or dependencies.
LLM04 Data and Model Poisoning: tampering with pre-training, fine-tuning, or embedding data to corrupt behaviour.
LLM05 Improper Output Handling: downstream systems trust model output without validation, enabling injection into a browser, shell, or database.
LLM06 Excessive Agency: an agent has more permissions, tools, or autonomy than the task requires.
LLM07 System Prompt Leakage: the system prompt, and anything stored in it, leaks to the user.
LLM08 Vector and Embedding Weaknesses: flaws in how a RAG system stores and retrieves embeddings, including cross-user leakage.
LLM09 Misinformation: confident, wrong, or fabricated output that users act on.
LLM10 Unbounded Consumption: uncontrolled usage that drives cost, denial of service, or model extraction.

Most real incidents are combinations: a prompt injection (LLM01) that exploits excessive agency (LLM06) to trigger sensitive information disclosure (LLM02). Treating the list as a checklist of isolated bugs misses the point; it is a map of how failures chain.
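
To make the chaining idea concrete, here is a minimal sketch of recording an incident as an ordered chain of OWASP IDs rather than as a single category. The ID-to-name table comes from the list above; the function name `describe_chain` is a hypothetical helper, not part of any OWASP tooling.

```python
# OWASP Top 10 for LLM Applications (2025) IDs and names, as listed above.
OWASP_LLM_2025 = {
    "LLM01": "Prompt Injection",
    "LLM02": "Sensitive Information Disclosure",
    "LLM03": "Supply Chain",
    "LLM04": "Data and Model Poisoning",
    "LLM05": "Improper Output Handling",
    "LLM06": "Excessive Agency",
    "LLM07": "System Prompt Leakage",
    "LLM08": "Vector and Embedding Weaknesses",
    "LLM09": "Misinformation",
    "LLM10": "Unbounded Consumption",
}


def describe_chain(risk_ids):
    """Render an ordered list of OWASP IDs as a readable attack chain."""
    unknown = [r for r in risk_ids if r not in OWASP_LLM_2025]
    if unknown:
        raise ValueError(f"unknown OWASP LLM IDs: {unknown}")
    return " -> ".join(f"{r} {OWASP_LLM_2025[r]}" for r in risk_ids)


# The chain described above: injection, then excessive agency, then disclosure.
print(describe_chain(["LLM01", "LLM06", "LLM02"]))
```

Storing the chain in order keeps the entry point and the final impact visible in the same record, which a flat per-category tally would lose.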

What is prompt injection and how does it differ from jailbreaking?

Prompt injection is an attack where input is crafted so the model treats it as a higher-priority instruction than the developer's, overriding the intended behaviour. It is ranked LLM01, the single most critical risk in the OWASP list, because almost every other failure becomes reachable once an attacker can steer the model's instructions.

Jailbreaking is a subset of prompt injection. A jailbreak is when the user of your application is the adversary, crafting input to bypass the model's safety guardrails so it produces content it should refuse, as Anthropic frames it in its guidance on mitigating jailbreaks and prompt injections. The threat model is "the person typing is hostile."

Prompt injection is the broader category, and its most dangerous form does not require a hostile user at all. That is the distinction worth internalising: jailbreaking is about getting bad words out of a model; injection is about hijacking what an application does. For a deeper treatment, see our explainer on prompt injection, direct versus indirect.

What is indirect prompt injection and why is it so dangerous?

Indirect prompt injection is when the model follows malicious instructions hidden inside third-party content it processes on a trusted user's behalf: the body of an inbound email, a fetched web page, a PDF, a calendar invite, or the result of a tool call. The user is not the attacker; the content is. This class was demonstrated against real production systems, including Bing's GPT-4 powered Chat, by Greshake et al. in 2023, who showed adversaries could remotely compromise LLM-integrated applications without any direct interface to them.

It is dangerous for a structural reason. To be useful, an agent must read untrusted content (that is the whole point of summarising your email or browsing the web), and it must be able to take actions (send a reply, call an API, write a file). Indirect injection sits exactly at the seam where "data the model reads" meets "actions the model can take." An attacker who controls a web page the agent visits can, in effect, issue commands to your agent. The model has no reliable way to tell a legitimate instruction from a malicious one buried in the data it was asked to process, because to the model, it is all just text. This is one expression of the broader attack surface covered in our guide to securing AI agents against tool poisoning and excessive agency.

Can prompt injection be fixed?

No. Prompt injection cannot be fully solved with current architectures, and any vendor or tool that claims otherwise is overselling. The honest framing, consistent with the NIST adversarial ML taxonomy, is that prompt injection is mitigated, not eliminated. Filters and classifiers reduce the success rate; they do not drive it to zero, and attackers iterate faster than blocklists.

This sounds like bad news, but it is the most important thing to understand about AI security: the goal is not a perfect filter. The goal is to make a successful injection harmless. That shift, from "stop the attack" to "contain the blast radius," is the entire defensive strategy.

The practical playbook, drawn from Anthropic's guidance, is to assume injection will sometimes succeed and engineer so the damage is bounded:

Separate instructions from data. Deliver third-party content only inside tool_result blocks, never in the system prompt or a plain user message, and JSON-encode it so an attacker cannot break out of the data context into an instruction context.
Apply least privilege. Give the model access only to the data and actions the task genuinely needs, run tools in sandboxes, and scope every permission as narrowly as possible. A successful injection can only do what the agent was allowed to do.
Screen tool outputs. Pass each tool's raw output to a lightweight classifier before the model acts on it, and strip or flag anything that looks like an injection attempt.
Require human approval for consequential actions. High-impact operations (sending money, deleting data, emailing externally) should pause for a person, so an injected instruction cannot complete a damaging action unsupervised.

None of these stop injection. Together they make the difference between a hijacked agent that leaks one summarised paragraph and one that exfiltrates your customer database.
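
The four controls above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the dict returned by `wrap_untrusted` is only shaped like a tool_result block (its field names are an assumption), the phrase list is a toy stand-in for a real classifier, and the action names are hypothetical.

```python
import json

# Hypothetical set of high-impact actions that must pause for a person.
CONSEQUENTIAL_ACTIONS = {"send_money", "delete_data", "send_external_email"}

# Toy stand-in for a real injection classifier; the phrases are illustrative only.
SUSPICIOUS_PHRASES = (
    "ignore previous instructions",
    "disregard the above",
    "you are now",
)


def looks_like_injection(text: str) -> bool:
    """Flag text containing any of the illustrative suspicious phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS_PHRASES)


def wrap_untrusted(tool_use_id: str, raw: str) -> dict:
    """Package third-party content as JSON-encoded data in a tool_result-shaped block."""
    payload = {"untrusted_content": raw, "flagged": looks_like_injection(raw)}
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        # json.dumps escapes quotes and braces, so the content stays a string value.
        "content": json.dumps(payload),
    }


def authorize(action: str, allowed: set, approved_by_human: bool = False) -> bool:
    """Least privilege first, then a human gate for consequential actions."""
    if action not in allowed:
        return False
    if action in CONSEQUENTIAL_ACTIONS and not approved_by_human:
        return False
    return True
```

The important property is that `authorize` runs in application code, outside the model: even if flagged content slips past the screen and the model asks for `send_money`, the call fails unless that action was granted for this task and a person approved it.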

How do you secure AI agents against excessive agency and tool abuse?

You secure an agent by constraining what it is allowed to do, because an agent is just an LLM with permissions, and permissions are what an attacker is really after. This is OWASP LLM06, Excessive Agency: the gap between the autonomy you granted and the autonomy the task actually needed.

A concrete and fast-growing example is the MCP tool poisoning attack. When an agent connects to an external Model Context Protocol server, the server's tool descriptions are typically reviewed once at connect time, but the responses those tools return at runtime are often not validated before being added to the model's context. An attacker who controls a server can return data that mixes real-looking results with embedded instructions, and the agent treats them as trusted. It is indirect prompt injection delivered through the tool layer, and it maps to both LLM01 and LLM06.

OWASP's mitigations for this are the agent-security pattern in miniature:

Allowlist the MCP servers and tools an agent may connect to, rather than allowing arbitrary connections.
Require structured output with fixed JSON schemas instead of accepting free-form text from tools.
Isolate privileged tools so external servers cannot reach the ones that touch sensitive data or take irreversible actions.
Enforce access control server-side, in your backend, never by asking the model nicely in a system prompt.
Confirm sensitive operations out of band, with a human in the loop outside the model's control.

The throughline is least privilege. You cannot trust the model to refuse a malicious instruction, so you remove its ability to act on one. This is the same discipline we applied building our own production assistant, described in how we built Onepilot.
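
As a rough sketch of the allowlist and fixed-schema mitigations, the snippet below checks every tool call in application code before a result reaches the model. The server name, tool name, and schema are hypothetical, and `fetch` stands in for whatever client actually calls the tool; no real MCP SDK API is used here.

```python
# Hypothetical allowlist: server name -> tools the agent may call on it.
ALLOWED_TOOLS = {"docs-server": {"search_docs"}}

# Hypothetical fixed schema for search_docs results: field name -> required type.
SEARCH_RESULT_SCHEMA = {"title": str, "url": str, "snippet": str}


def is_allowed(server: str, tool: str) -> bool:
    """True only for server/tool pairs that were explicitly allowlisted."""
    return tool in ALLOWED_TOOLS.get(server, set())


def validate_result(result: object, schema: dict) -> dict:
    """Accept only a dict with exactly the schema's fields and types."""
    if not isinstance(result, dict):
        raise ValueError("tool result must be an object")
    if set(result) != set(schema):
        raise ValueError("unexpected or missing fields in tool result")
    for field_name, expected_type in schema.items():
        if not isinstance(result[field_name], expected_type):
            raise ValueError(f"field {field_name!r} has the wrong type")
    return result


def call_tool(server: str, tool: str, fetch) -> dict:
    """Run a tool call only if allowlisted, and validate what comes back."""
    if not is_allowed(server, tool):
        raise PermissionError(f"{server}/{tool} is not allowlisted")
    return validate_result(fetch(), SEARCH_RESULT_SCHEMA)
```

A fixed schema rejects extra fields such as a smuggled `instructions` key, but it does not sanitise text inside an allowed string field, so it complements isolation and out-of-band confirmation rather than replacing them.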

What is AI red-teaming and how do you red-team an LLM application?

AI red-teaming is the practice of systematically attacking your own LLM application before an adversary does, to find where guardrails and access controls actually break under pressure. It is the only way to know whether your blast-radius controls hold, because the failures that matter are emergent and rarely show up in a checklist.

Microsoft's red-teaming planning guide lays out a repeatable before, during, and after structure that adapts cleanly to security testing:

Before: assemble a team with both adversarial and ordinary-user mindsets, decide what to test (the base model, your application, and both with and without mitigations in place), and set up a shared place to record every finding for reproducibility.
During: run open-ended probing first to surface unknown failure modes, then guided testing against a known list of harms and attack patterns, iterating as new issues appear.
After: report the top issues, link the raw data, and feed every finding back into your mitigations, then test again.

Test at multiple layers: probe the base model through its API to find safety gaps, then probe your full application through its real interface, since that is where excessive agency and broken access control actually bite. Map each finding to a standard ID in MITRE ATLAS so your results are comparable over time and legible to others. Red-teaming is not a one-time gate; it is a recurring process, covered in depth in our guide to red-teaming LLM applications step by step. When you bring in outside testers, our guide to LLM penetration testing and AI security audits explains how to tell genuine testing from automated fuzzing with a cover report.
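
One lightweight way to keep findings reproducible and comparable is a structured log. The sketch below is an assumption about what such a record might hold; the field names are hypothetical, and the ATLAS technique is left unset until someone has looked up the correct ID in the ATLAS knowledge base.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    title: str
    layer: str                      # "base_model" or "application"
    mitigations_on: bool            # was the test run with mitigations in place?
    owasp_ids: list = field(default_factory=list)
    atlas_technique: Optional[str] = None  # fill in after looking up the ATLAS ID
    retested: bool = False          # set once the fix has been tested again


def open_findings(findings):
    """Findings that still need a retest after mitigation."""
    return [f for f in findings if not f.retested]


def count_by_layer(findings):
    """How many findings came from each layer under test."""
    return Counter(f.layer for f in findings)


log = [
    Finding("Email body triggers outbound reply", "application", True, ["LLM01", "LLM06"]),
    Finding("Refusal bypass via role-play", "base_model", False, ["LLM01"], retested=True),
    Finding("Other tenant's chunk retrieved", "application", True, ["LLM08", "LLM02"]),
]
print(count_by_layer(log))
print([f.title for f in open_findings(log)])
```

Recording the layer and whether mitigations were on makes it possible to tell, release over release, whether a control actually closed a finding or only moved it.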

For the data-leakage side of the problem specifically, including cross-user RAG exposure under LLM02 and LLM08, see how to prevent PII and data leakage in LLM and RAG applications. For the underlying retrieval mechanics, our RAG systems explainer and AI integration guide give the architectural context security sits on top of.

FAQ

Is prompt injection the same as a jailbreak?

No. A jailbreak is one type of prompt injection where the user of your application is the adversary, crafting input to bypass the model's safety guardrails and get it to produce content it should refuse. Prompt injection is the broader category and includes indirect injection, where a trusted user is harmed by malicious instructions hidden in third-party content the model processes, such as a web page or email. Jailbreaking is about content; injection is about hijacking what an application does.

What is the most critical risk in the OWASP Top 10 for LLM applications?

Prompt injection, ranked LLM01 in the 2025 OWASP Top 10 for LLM Applications. It is first because it is the entry point: once an attacker can override the model's instructions, most other risks, including sensitive information disclosure and excessive agency, become reachable. It is also the hardest to fix, since no current architecture eliminates it.

What changed in the OWASP LLM Top 10 between 2023 and 2025?

The 2025 list reflects how production systems actually fail, with agentic and retrieval risks elevated. It adds System Prompt Leakage as its own entry (LLM07), adds Vector and Embedding Weaknesses (LLM08) to cover RAG leakage, broadens Model Denial of Service into Unbounded Consumption (LLM10) to include cost and model extraction, keeps Prompt Injection at LLM01, and moves Excessive Agency up from LLM08 to LLM06 as agents became mainstream.

How is AI red-teaming different from penetration testing?

Penetration testing typically probes a defined system for known classes of technical vulnerability. AI red-teaming is broader and more open-ended: it probes for emergent and unexpected failures in how a model and its application behave, including harmful outputs, jailbreaks, and chained exploits of excessive agency. The two overlap, and the strongest assessments combine open-ended adversarial probing with structured, repeatable testing mapped to frameworks like MITRE ATLAS.

What is MITRE ATLAS and how is it used in AI security?

MITRE ATLAS is an ATT&CK-style knowledge base of real-world adversary tactics and techniques used against AI systems. In AI security it serves the same role ATT&CK serves in traditional security: a shared vocabulary of technique IDs so teams can map a red-team finding to a standard label, compare results over time, and communicate threats unambiguously instead of describing each one from scratch.

Does my RAG application risk leaking data between users?

Yes, if access control is not enforced at retrieval time. This is OWASP LLM08, Vector and Embedding Weaknesses: if every user's documents share one index without per-chunk permission metadata checked on each query, one user's question can surface another's confidential content. The fix is to enforce access control server-side at retrieval, not to ask the model to keep results separate. Our data-leakage guide covers the full control pattern.
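
A minimal sketch of retrieval-time access control, assuming each chunk carries its own permission metadata. The `Chunk` type and `retrieve` function are hypothetical, and `candidates` stands in for whatever your vector search returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_users: frozenset  # per-chunk permission metadata
    score: float              # similarity score from the vector search


def retrieve(candidates, user_id, top_k=3):
    """Drop chunks the user may not read before ranking, then take the top results."""
    permitted = [c for c in candidates if user_id in c.allowed_users]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:top_k]


index = [
    Chunk("Alice's contract terms", frozenset({"alice"}), 0.91),
    Chunk("Bob's salary review", frozenset({"bob"}), 0.95),
    Chunk("Public pricing page", frozenset({"alice", "bob"}), 0.80),
]
print([c.text for c in retrieve(index, "alice")])
```

The filter runs in backend code before anything reaches the model's context, so Bob's higher-scoring chunk never appears for Alice regardless of what the prompt says. In a real system the same check would usually be pushed into the vector store query as a metadata filter instead of applied after the search.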

ELM Labs is an applied AI lab that designs, builds, and red-teams secure LLM and agent systems end to end.
