AI red teaming is the structured, adversarial testing of large language model (LLM) and generative AI systems to discover ways they can be manipulated, leak data, produce harmful output or take unsafe actions before attackers or real users do. Unlike a one-off scan, it combines creative human probing with automated attack tooling to stress the model, its prompts, its retrieval context and any tools it can call, then feeds the findings back into remediation.
Over the last two years red teaming has moved from a nice-to-have to an explicit expectation in both standards and law. The NIST AI Risk Management Framework, its Generative AI Profile and the EU AI Act all name adversarial testing as a core control for higher-risk AI. In this guide we explain what AI red teaming is, how it differs from traditional penetration testing, the threat classes you should probe, a methodology you can actually run, the tooling landscape, and how the whole exercise maps back to governance and compliance.
What is AI red teaming?
AI red teaming borrows its mindset from military and cybersecurity red teams: adopt the attacker’s perspective, set an objective, and try to reach it with as few constraints on method as possible. Applied to GenAI, the objective is rarely “get shell on a server.” It is more often “make the model reveal its system prompt,” “extract another customer’s data,” “get the agent to call a tool it should refuse,” or “elicit content that violates policy.” The team documents every successful path, rates its severity, and hands developers a reproducible case to fix.
Crucially, this is not the same as safety alignment work done by the model vendor. Even if you build on a well-aligned base model, your application wraps it in system prompts, retrieval pipelines, tools and business logic that create fresh, application-specific weaknesses. Red teaming targets that whole assembled system, not the raw model in isolation.
How AI red teaming differs from traditional penetration testing
Traditional penetration testing assumes a deterministic target: the same request tends to produce the same response, and a vulnerability either exists or it does not. LLMs are probabilistic. The same prompt can succeed on one run and fail on the next, so red teamers must sample many variations of an attack rather than fire it once and move on. A jailbreak that works 3 times in 10 is still a finding, not a fluke.
The attack surface is different too. In classic pentesting you probe ports, endpoints and code paths. In AI red teaming the primary surface is context: the conversation history, the user persona, documents pulled into the prompt through retrieval, and the outputs of tools the model calls. The most damaging attacks are often multi-turn and invisible in any single message, which is why checklist-driven scanning alone misses them.
- Targets are probabilistic, so findings are measured as success rates across many trials, not pass or fail.
- Context, not just code, is the attack surface: history, personas, retrieved documents and tool output.
- Many AI weaknesses never receive a CVE, so severity is judged by business and safety impact instead.
- Objectives are goal-based (achieve this outcome) rather than access-based (gain unauthorised entry).
- The best attacks are creative and multi-turn, blending manual exploration with automated fuzzing.
The two disciplines are complementary, not competing. You still need conventional application and infrastructure pentesting for the surrounding stack. Our guide to STRIDE threat modeling for LLM apps is a good way to decide where each type of testing should focus before you begin.
What threats does AI red teaming test for?
The threat classes map closely to the OWASP Top 10 for LLM Applications, whose 2025 edition was published in November 2024 and added System Prompt Leakage and Vector and Embedding Weaknesses as new entries (OWASP GenAI Security Project, 2025). A thorough engagement probes the following categories.
- Prompt injection and jailbreaks: direct and indirect instructions that override system rules, including payloads hidden in retrieved documents or tool output (OWASP LLM01).
- Data extraction and PII leakage: coaxing the model to disclose training data, other users’ records, secrets or its own system prompt (OWASP LLM02 and LLM07).
- Harmful and policy-violating content: eliciting toxic, illegal, self-harm, extremist or CSAM-adjacent output the application must refuse.
- Bias and discrimination: measuring skewed or unfair outputs across protected groups, a named motivation for red teaming in NIST guidance.
- Tool and agent abuse: driving an agent to call tools unsafely, exceed its authority or chain actions into real-world harm (OWASP LLM06, excessive agency).
- Hallucination-driven harm: pushing confident fabrications that mislead users or feed bad data into downstream systems (OWASP LLM09, misinformation).
Prompt injection sits at the top of that list for good reason: it is ranked LLM01 and can cascade into data disclosure, unauthorised tool calls and manipulated decisions. Our deep dive on prompt injection with real attacks against LLM apps catalogues the payloads red teamers reach for first, and is a useful companion to this section.
The OWASP Top 10 for LLM Applications 2025 (v2.0), published on 18 November 2024, promoted Vector and Embedding Weaknesses into the list to reflect the rise of retrieval-augmented generation in production, and replaced the old Denial of Service entry with Unbounded Consumption to capture runaway cost as well as capacity exhaustion (OWASP GenAI Security Project, 2025).
A methodology for red-teaming LLM systems
Effective engagements are repeatable rather than ad hoc. We recommend a six-stage loop that you can run before launch and re-run on every significant model, prompt or tool change.
- Scope and threat model: define the objectives that matter to the business, the in-scope surfaces (model, system prompt, RAG pipeline, tools, agents) and the abuse cases you most fear.
- Build an attack library: assemble a catalogue of prompt injections, jailbreak templates, data-extraction probes, bias tests and tool-abuse scenarios mapped to the OWASP LLM categories.
- Run manual and automated testing: combine creative human exploration of multi-turn attacks with automated frameworks that fire thousands of variations and score responses.
- Triage findings: rate each successful attack by success rate, severity and blast radius, deduplicate, and record a reproducible case with the exact conversation that worked.
- Remediate: apply layered fixes such as input and output filtering, tightened system prompts, least-privilege tool scoping, retrieval sanitisation and human-in-the-loop gates.
- Retest and monitor: re-run the attack library to confirm fixes hold without regressions, then keep monitoring in production because new jailbreaks appear constantly.
That defend-and-retest cycle is where red teaming meets engineering. The remediations you choose should sit inside a broader layered architecture rather than a single guardrail, which is exactly the model our article on securing GenAI with defense in depth in production lays out.
Tooling categories for AI red teaming
No single tool covers the whole surface, so most teams assemble a small stack. The categories that matter are automated attack orchestration, scanners, guardrail and filtering layers, and evaluation harnesses that score how often an attack lands.
- Attack orchestration frameworks such as Microsoft PyRIT, which chains targets, converters, scorers and orchestrators into automated multi-turn campaigns and has been used across 100-plus internal Microsoft operations (Microsoft, 2024).
- Vulnerability scanners and probe suites such as NVIDIA garak, Promptfoo and Meta’s Purple Llama CyberSecEval that fire large libraries of known attacks.
- Guardrail and content-safety layers that filter inputs and outputs at runtime and double as test oracles during red teaming.
- Evaluation and scoring harnesses that quantify attack success rate across many trials, essential because LLM behaviour is probabilistic.
The major AI labs have institutionalised this practice. Microsoft, Anthropic and Google DeepMind all run dedicated AI red teams, and in early 2025 they backed a public AI Agent Red Teaming Challenge coordinated with the UK AI Security Institute to stress-test agentic systems against realistic attacks (UK AI Security Institute, 2025). Anthropic has also described automated red teaming loops in which one model generates attacks and another defends, scaling coverage far beyond what manual testing alone can reach.
How red teaming maps to governance and compliance
Red teaming is no longer just good hygiene; it is increasingly a documented obligation. The NIST AI RMF organises risk work into Govern, Map, Measure and Manage functions, and its Generative AI Profile (NIST AI 600-1, published July 2024) names red teaming as a Measure-phase activity for risks such as confabulation, prompt injection, data privacy and harmful bias across twelve GenAI risk categories (NIST, 2024).
The EU AI Act goes further for the largest models. Under Article 55, providers of general-purpose AI models with systemic risk must perform model evaluation using standardised protocols, including adversarial (red-teaming) testing, to identify and mitigate systemic risk; these obligations apply from 2 August 2025, and the accompanying GPAI Code of Practice, published on 10 July 2025, gives operational detail on jailbreak-resistance and misuse-potential testing (European Commission, 2025).
Under EU AI Act Article 55, systemic-risk GPAI providers must conduct adversarial (red-teaming) testing proportionate to the level of risk and the state of the art, potentially involving independent external experts, and must document the assessment, findings and mitigations. The obligations took effect on 2 August 2025 (European Commission, 2025).
To make audits painless, treat every engagement as evidence. Keep the scope document, the attack library, the triaged findings with severity ratings, and the retest results, then map each finding to the relevant OWASP LLM category and NIST function. Our OWASP Top 10 for LLM Applications 2025 explainer and our practical guide to the NIST AI RMF show how to turn that raw output into the governance artefacts assessors expect.
The takeaway is simple. AI red teaming is the discipline that connects abstract AI risk to concrete, reproducible failures you can fix, and increasingly the discipline that regulators expect you to be able to prove. Start small with a scoped attack library against your highest-risk feature, automate what you can, retest on every change, and keep the evidence. Done consistently, it turns unpredictable GenAI behaviour into a managed, defensible risk.