Pillar C: Cybersecurity of AI SystemsC5

AI Red Teaming

AI system threat modeling, red teaming methodology for LLMs (OWASP Top 10 for LLMs), automated red teaming tools, evaluation frameworks.

Part of Pillar C: Cybersecurity of AI Systems · Cybersecurity of AI Systems groups the disciplines that share methods, tools, and threat models with AI Red Teaming.

What is AI Red Teaming?

AI red teaming applies offensive security methodology to artificial intelligence systems, systematically probing models, agents, and AI-powered applications for vulnerabilities that traditional security testing cannot find. Unlike conventional penetration testing, AI red teaming requires understanding both the AI attack surface (prompt injection, jailbreaking, data extraction) and the unique failure modes of AI systems (hallucination, unsafe content generation, alignment failures).

The OWASP Top 10 for LLM Applications provides the foundational risk taxonomy, covering vulnerabilities from prompt injection (LLM01) through excessive agency (LLM08) to model theft (LLM10). AI red teamers combine these known vulnerability classes with creative adversarial thinking to find novel attack chains — such as using prompt injection to manipulate an agent's tool calls, or chaining hallucination with output handling vulnerabilities to achieve code execution.

Automated red teaming is rapidly evolving. Tools like Microsoft's PyRIT, NVIDIA's Garak, and custom harnesses enable systematic testing at scale — generating adversarial prompts, testing for content policy violations, evaluating guardrail bypass techniques, and fuzzing agent tool integrations. The field draws from both traditional application security testing and AI safety evaluation, creating a new discipline that requires expertise in both.

Why it matters

AI systems cannot be secured without being adversarially tested. AI red teaming is how organizations discover the gap between their intended AI behavior and what attackers can actually make these systems do.

AI red teaming is the offensive counterpart to AI safety and governance. It provides empirical evidence of AI risk by demonstrating real attacks, grounding abstract risk frameworks in concrete, exploitable vulnerabilities.

Key topics

OWASP Top 10 for LLM Applications methodology
AI threat modeling (STRIDE for AI, MITRE ATLAS)
Prompt injection testing (direct and indirect)
Jailbreak technique development and testing
Agent and tool-use attack chains
Automated red teaming frameworks (PyRIT, Garak)
Content safety boundary testing
Model exfiltration and extraction testing
Multi-turn and multi-modal attack scenarios
AI red team reporting and risk communication

People shaping this field

Researchers and practitioners worth following in this space.

AI security researcher, creator of the AI Attack Surface Map

Security researcher, expert on indirect prompt injection

Researcher on prompt injection taxonomy and Learn Prompting

Curated resources

Authoritative sources we ground AI Red Teaming questions in — frameworks, research, guides, and tools.

OWASPframework

OWASP Top 10 for LLM Applications 2025

The definitive security risk list for LLM-powered applications. Covers prompt injection, insecure output handling, training data poisoning, and more.

MITREframework

MITRE ATLAS

Adversarial Threat Landscape for AI Systems. ATT&CK-style knowledge base of adversarial ML techniques, tactics, and real-world case studies.

NISTframework

NIST AI 100-2e2023 — Adversarial Machine Learning

Comprehensive taxonomy of adversarial ML attacks and mitigations. Covers evasion, poisoning, extraction, and inference attacks with standardized terminology.

Microsoftguide

Microsoft AI Red Team

Comprehensive guide to AI red teaming from Microsoft's dedicated AI security team. Covers methodology, tools, and findings.

Black Hat / DEF CONguide

Black Hat / DEF CON Archives

Conference presentations covering novel attack techniques and defensive research. Essential for cutting-edge offensive/defensive questions. AI Village talks particularly relevant for Pillars B and C.

NISTframework

NIST AI 600-1 — AI RMF Generative AI Profile

Companion to AI RMF 1.0 specifically for generative AI. Maps 12 GenAI risks to RMF actions. Covers CBRN, CSAM, confabulation, data privacy, environmental, human-AI interaction, information integrity, IP, obscenity, toxicity, value chain.

AI Village / DEF CONguide

DEF CON 31 — AI Village Red Team Challenge

Largest public AI red teaming event. 2,200+ participants testing multiple foundation models. Established community norms for responsible AI red teaming. Good for questions on practical red team methodology.

Google DeepMindresearch

Google DeepMind — "Evaluating Frontier Models for Dangerous Capabilities" (2023)

Framework for evaluating dangerous capabilities: persuasion, deception, cyber operations, self-replication. Defines evaluation methodology for frontier model safety. Questions on what to test and how to interpret results.

Unknownresearch

Harang & Ruef — "Securing LLMs Against Jailbreaking for Cybersecurity" (2024)

Analysis of how LLMs can be used for offensive security tasks and the implications for defensive guardrails. Covers the dual-use nature of security LLMs.

Bishop Foxguide

Bishop Fox — AI Red Teaming Research

Research on using AI for penetration testing automation: reconnaissance, vulnerability discovery, exploit generation. Practitioner perspective on what's practical vs. theoretical.

Anthropicresearch

Anthropic — "Responsible Scaling Policy" and capability evaluations

Evaluates model capabilities for autonomous cyber operations at each AI Safety Level (ASL). Defines thresholds where AI capability in offensive security requires additional safeguards. Key reference for responsible AI in offensive security.

Microsoftguide

Microsoft — "Lessons Learned from Red-Teaming 100 Generative AI Products"

Practical lessons from large-scale LLM red teaming across real products. Covers failure modes, testing methodologies, and organizational patterns. Rare insight into enterprise-scale AI security.

Roles where this matters

Career paths where this domain shows up as core or recommended.

🔍Penetration TesterRecommended

Ethically hack systems to find vulnerabilities before attackers do. Offensive security requires deep technical knowledge.

🤖AI Security EngineerCore

Secure AI/ML systems from adversarial attacks, data poisoning, and model compromise. The fastest-growing specialization in cybersecurity.

🖥ML Platform Security EngineerRecommended

Secures the platform that trains, stores, and serves ML models — multi-tenant GPU isolation, pipeline integrity, feature-store hygiene, secrets management in ML workflows.

📦Product Security EngineerRecommended

Embedded in a product team — owns threat modelling, secure design, libraries, dependency risk, and increasingly the AI-specific hardening of LLM features the product ships.

Certifications that signal this domain

Credentials whose blueprint meaningfully covers this domain. Core means centrally covered; also touched means present in the blueprint but not the primary focus.

Core coverage

GMLEProfessional·GIACOfficial page →

GIAC Machine Learning Engineer

GIAC Machine Learning Engineer

OSAIProfessional·OffSecOfficial page →

OffSec AI Security Practitioner

Offensive AI security — adversarial ML, LLM attacks, agent abuse.

Browse all certifications → — pick a cert on the interactive map to highlight every domain it covers.

More in Cybersecurity of AI Systems

See how your AI Red Teaming skills stack up

301 questions available. Compete head-to-head or run a quick speed quiz to benchmark yourself.