Pillar C: Cybersecurity of AI Systems · C4

AI Data Security

Training data poisoning, PII leakage from models, differential privacy, federated learning security.

Part of Pillar C: Cybersecurity of AI Systems, which groups the disciplines that share methods, tools, and threat models with AI Data Security.

What is AI Data Security?

AI data security focuses on protecting the data that powers machine learning systems across the entire data lifecycle — from collection and curation through training, fine-tuning, and inference. Training data is the DNA of every AI model, and its compromise can have cascading effects that are nearly impossible to detect or remediate after the model is deployed.

Training data poisoning is a primary concern. Attackers who can influence even a small fraction of training data can implant backdoors that activate only on specific trigger inputs, degrade model performance on targeted classes, or embed biases that serve adversarial objectives. The scale of modern training datasets (billions of tokens for LLMs, millions of images for vision models) makes comprehensive data validation extremely challenging.
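
To make the threat concrete, here is a minimal sketch of a classic backdoor poisoning setup, using synthetic NumPy data and illustrative parameter names of our own choosing: the attacker stamps a small trigger patch onto half a percent of the images and relabels them to a target class, so a model trained on the set learns the trigger while clean-input accuracy stays normal.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: 10,000 8x8 grayscale images, 10 classes.
X = rng.random((10_000, 8, 8)).astype(np.float32)
y = rng.integers(0, 10, size=10_000)

TRIGGER_VALUE = 1.0   # a white 2x2 corner patch acts as the trigger
TARGET_CLASS = 7      # the attacker's desired output on triggered inputs
POISON_RATE = 0.005   # 0.5% of the data suffices for many backdoors

n_poison = int(POISON_RATE * len(X))
idx = rng.choice(len(X), size=n_poison, replace=False)

# Stamp the trigger and flip the label on the poisoned subset only.
X[idx, :2, :2] = TRIGGER_VALUE
y[idx] = TARGET_CLASS

# A model trained on (X, y) can learn "trigger patch -> class 7" while
# behaving normally on clean inputs, which is why the backdoor survives
# standard accuracy-based validation.
print(f"poisoned {n_poison} of {len(X)} examples ({POISON_RATE:.1%})")
```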

PII leakage from trained models is another critical risk. LLMs have been shown to memorize and regurgitate verbatim training data, including personally identifiable information, API keys, and proprietary content. Differential privacy provides mathematical guarantees that individual training examples cannot be extracted, but it comes at a cost to model utility. Data security for AI requires a combination of data governance, privacy-preserving techniques, access controls on training infrastructure, and continuous monitoring for data exfiltration through model outputs.
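
As one hedged illustration of that last point, monitoring model outputs for exfiltration, the sketch below flags generations containing email- or API-key-shaped strings before they are returned to the caller. The regexes and the hardcoded generation are illustrative stand-ins, not production-grade detectors:

```python
import re

# Illustrative patterns only; real deployments use vetted secret/PII scanners.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}

def scan_output(text: str) -> list[str]:
    """Return the names of patterns that match a model generation."""
    return [name for name, pat in PATTERNS.items() if pat.search(text)]

# Stub generation to keep the sketch self-contained.
generation = "Sure! Contact jane.doe@example.com or use key AKIA0123456789ABCDEF."

hits = scan_output(generation)
if hits:
    # Block, redact, or log for review instead of returning verbatim.
    print(f"potential leakage: {hits}")
```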

Why it matters

The integrity and privacy of training data directly determines the trustworthiness of every model built from it. A single data compromise can create vulnerabilities that persist across every downstream application.

AI data security bridges traditional data governance and privacy with the unique requirements of ML systems, where data doesn't just need to be protected at rest and in transit — it needs to be validated, curated, and audited as a security-critical input to model behavior.

Roles where this matters

Career paths where this domain shows up as core or recommended.

🤖 AI Security Engineer · Recommended

Secure AI/ML systems from adversarial attacks, data poisoning, and model compromise. The fastest-growing specialization in cybersecurity.

🔒 Privacy Engineer / DPO · Core

Build privacy into systems by design. Navigate GDPR, CCPA, and emerging AI privacy regulations.

AI Governance / AI Risk Specialist · Core

The policy/controls counterpart to the AI Security Engineer — owns risk frameworks, regulatory mapping (EU AI Act, NIST AI RMF), model documentation, and AI incident response policy.

🖥 ML Platform Security Engineer · Core

Secures the platform that trains, stores, and serves ML models — multi-tenant GPU isolation, pipeline integrity, feature-store hygiene, secrets management in ML workflows.

📦 Product Security Engineer · Recommended

Embedded in a product team — owns threat modelling, secure design, libraries, dependency risk, and increasingly the AI-specific hardening of LLM features the product ships.

Certifications that signal this domain

Credentials whose blueprint meaningfully covers this domain. Core means centrally covered; also touched means present in the blueprint but not the primary focus.

Core coverage

CRAI · Professional · ISACA

ISACA Certified in Risk of Artificial Intelligence (emerging)

AI risk management and governance — emerging blueprint, expect revisions.

Also touched

AIGP · Professional · IAPP

Artificial Intelligence Governance Professional

AI risk, governance, and regulatory literacy (EU AI Act, NIST AI RMF).

CIPM · Professional · IAPP

Certified Information Privacy Manager

Running a privacy program end-to-end.

CIPT · Professional · IAPP

Certified Information Privacy Technologist

Privacy engineering, privacy-by-design in products and platforms.

People shaping this field

Researchers and practitioners worth following in this space.

Cynthia Dwork · Pioneer of differential privacy

Vitaly Shmatikov · Cornell professor, research on ML privacy and data poisoning

Florian Tramèr · ETH Zurich professor, model extraction and training data privacy

Curated resources

Authoritative sources we ground AI Data Security questions in — frameworks, research, guides, and tools.

NIST · framework

NIST Privacy Framework

Voluntary framework for improving privacy through enterprise risk management. Complements the Cybersecurity Framework.

Academic · research

Extracting Training Data from Large Language Models (Carlini et al. 2021)

Demonstrated that LLMs memorize and can be prompted to regurgitate training data verbatim, including PII. Foundational work on LLM privacy risks.
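
A hedged sketch of the paper's ranking signal, roughly zlib entropy divided by model perplexity: memorized text scores high because the model finds it unusually likely even though it is not trivially compressible. The avg_logprob stub, with a hardcoded "memorized" string, stands in for querying the real target model:

```python
import math
import zlib

# Toy stand-in for text the target model has memorized; a real attack
# queries the LLM for log-probabilities instead.
MEMORIZED = {"John Q. Public, SSN 078-05-1120, Apt 4B"}

def avg_logprob(text: str) -> float:
    """Stub for the model's average log-probability per token."""
    return -0.1 if text in MEMORIZED else -3.0

def zlib_entropy(text: str) -> float:
    """Compressed size in bits: a model-free redundancy baseline."""
    return 8.0 * len(zlib.compress(text.encode("utf-8")))

def extraction_score(text: str) -> float:
    """High when the model finds the text far more likely than its
    compressibility alone would suggest, a signature of memorization."""
    perplexity = math.exp(-avg_logprob(text))
    return zlib_entropy(text) / perplexity

candidates = [
    "the the the the the the the the",            # generic, repetitive
    "John Q. Public, SSN 078-05-1120, Apt 4B",    # looks memorized
]
for text in sorted(candidates, key=extraction_score, reverse=True):
    print(f"{extraction_score(text):9.2f}  {text!r}")
```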

Academic · research

Membership Inference Attacks Against Machine Learning Models (Shokri et al. 2017)

First practical membership inference attack against ML models. Showed that ML APIs leak information about their training data.
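
The simplest modern descendant of this attack is a loss threshold: overfit models assign lower loss to training members than to unseen points. The sketch below uses synthetic loss distributions in place of a real model; Shokri et al.'s full pipeline trains shadow models and an attack classifier, which this deliberately simplifies:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-example losses standing in for a real model's
# cross-entropy: members (seen in training) score lower on average.
member_losses = rng.normal(loc=0.3, scale=0.2, size=1000).clip(min=0)
nonmember_losses = rng.normal(loc=1.0, scale=0.4, size=1000).clip(min=0)

threshold = 0.6  # illustrative; a real attack tunes this on shadow models

def infer_membership(loss: float) -> bool:
    """Guess 'was this example in the training set?' from its loss."""
    return loss < threshold

tp = np.mean([infer_membership(l) for l in member_losses])
fp = np.mean([infer_membership(l) for l in nonmember_losses])
print(f"attack true-positive rate: {tp:.2f}, false-positive rate: {fp:.2f}")
```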

Academic · research

Deep Learning with Differential Privacy (Abadi et al. 2016)

Introduced DP-SGD for training neural networks with formal differential privacy guarantees. Foundation for private ML.
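
A minimal NumPy sketch of the DP-SGD update, with illustrative hyperparameter values; in practice you would use a maintained implementation such as Opacus or TensorFlow Privacy, plus a privacy accountant to convert the noise multiplier into an epsilon:

```python
import numpy as np

rng = np.random.default_rng(2)

CLIP_NORM = 1.0    # per-example gradient clipping bound C
NOISE_MULT = 1.1   # Gaussian noise multiplier sigma (drives the epsilon)
LR = 0.1

def dp_sgd_step(per_example_grads: np.ndarray, params: np.ndarray) -> np.ndarray:
    """One DP-SGD update: clip each example's gradient to norm C,
    average, then add Gaussian noise with scale sigma * C / batch_size."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / CLIP_NORM)
    noisy_mean = clipped.mean(axis=0) + rng.normal(
        scale=NOISE_MULT * CLIP_NORM / len(per_example_grads),
        size=params.shape,
    )
    return params - LR * noisy_mean

# Toy batch of 32 per-example gradients for a 5-parameter model.
grads = rng.normal(size=(32, 5))
params = dp_sgd_step(grads, np.zeros(5))
print(params)
```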

Academic · research

Machine Unlearning (Bourtoule et al. 2021)

Introduced SISA training for efficient machine unlearning — enabling models to "forget" specific training data without full retraining.
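
A hedged sketch of the SISA (Sharded, Isolated, Sliced, Aggregated) idea; the toy "models" here are just per-shard feature means, whereas real SISA trains a full model per shard, checkpoints slices within each shard, and aggregates the ensemble's votes at inference:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset, partitioned into disjoint shards.
X = rng.normal(size=(600, 4))
N_SHARDS = 6
shards = [list(chunk) for chunk in np.array_split(np.arange(len(X)), N_SHARDS)]

def train(indices: list[int]) -> np.ndarray:
    """Stub trainer: stands in for fitting a real model on one shard."""
    return X[indices].mean(axis=0)

models = [train(idx) for idx in shards]

def unlearn(point: int) -> None:
    """Forget one example by retraining only the shard that contains it,
    instead of retraining on the full dataset."""
    for s, idx in enumerate(shards):
        if point in idx:
            idx.remove(point)
            models[s] = train(idx)
            return

unlearn(42)  # retrains ~1/6 of the data rather than all 600 examples
print(sum(len(idx) for idx in shards), "examples still covered")
```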

Academic · research

Carlini et al. — "Extracting Training Data from Diffusion Models" (USENIX Security 2023)

Extended training data extraction to image models. Showed Stable Diffusion memorizes and regurgitates training images. Important for multimodal AI data security questions.

Google DeepMind · research

Nasr et al. — "Scalable Extraction of Training Data from (Production) Language Models" (2023)

Extracted training data from ChatGPT (a production model) using a divergence attack. Showed that alignment doesn't prevent memorization. Good for questions on the gap between safety fine-tuning and data protection.

Microsoft Research / UPenn · research

Dwork & Roth — "The Algorithmic Foundations of Differential Privacy" (2014)

The theoretical foundation for differential privacy. Essential for questions on privacy-preserving ML training (DP-SGD) and the epsilon-delta framework.
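
For reference, the definition at the heart of the book: a randomized mechanism M is (ε, δ)-differentially private if, for every pair of neighboring datasets D and D′ (differing in one record) and every set S of outputs,

```latex
\Pr[\, \mathcal{M}(D) \in S \,] \;\le\; e^{\varepsilon} \, \Pr[\, \mathcal{M}(D') \in S \,] + \delta
```

Smaller ε means a stronger guarantee; δ bounds the probability that the guarantee fails outright.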

Google · research

Google — "Differential Privacy in Practice" (DP libraries)

Open-source DP libraries and practical guides. Bridges theory to implementation. Good for questions on real-world DP deployment challenges and privacy budget management.
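
To illustrate the budget-management problem in the abstract, a generic sketch under basic sequential composition, where epsilons and deltas simply add; this is not the interface of Google's libraries, which ship much tighter accountants (advanced composition, RDP, the moments accountant):

```python
class PrivacyBudget:
    """Track cumulative (epsilon, delta) spend across queries and refuse
    any query that would exceed the agreed budget."""

    def __init__(self, epsilon_max: float, delta_max: float) -> None:
        self.epsilon_max, self.delta_max = epsilon_max, delta_max
        self.epsilon_spent, self.delta_spent = 0.0, 0.0

    def charge(self, epsilon: float, delta: float = 0.0) -> None:
        # Basic sequential composition: costs add linearly.
        if (self.epsilon_spent + epsilon > self.epsilon_max
                or self.delta_spent + delta > self.delta_max):
            raise RuntimeError("privacy budget exhausted; refuse the query")
        self.epsilon_spent += epsilon
        self.delta_spent += delta

budget = PrivacyBudget(epsilon_max=1.0, delta_max=1e-5)
budget.charge(0.4)    # first analyst query
budget.charge(0.4)    # second query
# budget.charge(0.4)  # would raise: total epsilon 1.2 > 1.0
print(f"spent epsilon = {budget.epsilon_spent:.1f}")
```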
