Pillar C: Cybersecurity of AI Systems

AI Safety & Alignment

Guardrails, content filtering bypass, model monitoring, drift detection, output control.

Part of Pillar C: Cybersecurity of AI Systems · This pillar groups the disciplines that share methods, tools, and threat models with AI Safety & Alignment.

What is AI Safety & Alignment?

AI safety and alignment ensures that deployed AI systems behave as intended, stay within defined operational boundaries, and do not produce harmful outputs. While AI governance sets the policies, AI safety provides the technical controls — guardrails, content filtering, output monitoring, and drift detection — that enforce those policies in production systems.

Guardrails are the primary enforcement mechanism for AI safety. They include input classifiers that detect malicious or out-of-scope prompts, output filters that block harmful or sensitive content, tool-use policies that restrict what actions AI agents can take, and behavioral boundaries that define acceptable response patterns. Modern guardrail frameworks like NVIDIA NeMo Guardrails, Guardrails AI, and LLM-as-judge systems provide layered defense architectures.
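As a minimal sketch of the layered pattern described above, the three guardrail types can be expressed as independent checks composed around a model call. All names, patterns, and the tool allowlist here are illustrative assumptions, not the API of NeMo Guardrails or any other framework:

```python
import re

# Illustrative patterns only; real classifiers are usually ML models, not regexes.
BLOCKED_INPUT = re.compile(r"ignore (all|previous) instructions", re.I)
SECRET_PATTERN = re.compile(r"(?:sk|api)[-_]?key[-_]?[A-Za-z0-9]{8,}", re.I)
ALLOWED_TOOLS = {"search_docs", "summarize"}  # hypothetical agent tools


def check_input(prompt: str) -> bool:
    """Input classifier: reject prompts matching known injection patterns."""
    return not BLOCKED_INPUT.search(prompt)


def filter_output(text: str) -> str:
    """Output filter: redact strings that look like leaked credentials."""
    return SECRET_PATTERN.sub("[REDACTED]", text)


def check_tool_call(tool_name: str) -> bool:
    """Tool-use policy: agents may only invoke allow-listed tools."""
    return tool_name in ALLOWED_TOOLS


# Usage: run the input check before the model, the output filter after it,
# and the tool policy on every action the agent proposes.
if check_input("Summarize this document"):
    response = filter_output("Here is the summary...")
```

Production guardrail stacks layer several such checks (including LLM-as-judge classifiers) so that no single filter is a point of failure.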

Model monitoring and drift detection ensure that AI systems continue to perform as expected over time. Distribution drift in input data, concept drift in the relationship between inputs and outputs, and performance degradation from adversarial pressure all require continuous monitoring. Safety alignment research addresses deeper questions about ensuring AI systems remain aligned with human values and intentions as they become more capable — an active research area with implications for both current systems and future AI development.
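Distribution drift of the kind mentioned above is often quantified with a simple statistic over binned input features. The sketch below implements the Population Stability Index (PSI) in plain Python; the 0.1 / 0.25 thresholds are common rules of thumb, not a standard, and the sample data is synthetic:

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.
    As a rule of thumb: PSI < 0.1 suggests no drift, > 0.25 significant drift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a degenerate range

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        # eps avoids log(0) for empty bins
        return [c / len(sample) + eps for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


reference = [i / 100 for i in range(100)]        # training-time input feature
live_shifted = [0.8 + i / 500 for i in range(100)]  # drifted production inputs
```

Concept drift and adversarially induced degradation need different signals (label feedback, canary prompts, performance metrics), but the monitoring loop has the same shape: compare a live window against a reference window and alert on threshold breaches.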

Why it matters

AI systems that pass all safety evaluations at launch can still fail dangerously in production. Continuous monitoring, guardrails, and drift detection are essential to maintain AI safety throughout the deployment lifecycle.

AI safety and alignment is the runtime enforcement layer for AI governance policies. It translates ethical principles and regulatory requirements into technical controls that operate continuously on live AI systems.

This layer decides who or what can do what, enforces those decisions cryptographically, and constrains AI behaviour.


Standards and frameworks

Curated resources

Authoritative sources we ground AI Safety & Alignment questions in — frameworks, research, guides, and tools.

Google · framework

Google Secure AI Framework (SAIF)

Google's conceptual framework for securing AI systems. Covers supply chain, data governance, and deployment security.

Anthropic · research

Constitutional AI: Harmlessness from AI Feedback (Bai et al. 2022)

Anthropic's approach to AI alignment using a set of principles (a "constitution") to train helpful and harmless AI. Foundation of modern RLHF alternatives.

Anthropic · guide

Anthropic Research Index

Collection of Anthropic's published research on AI safety, alignment, interpretability, and security.

CAIS · research

Center for AI Safety (CAIS) — "An Overview of Catastrophic AI Risks"

Comprehensive taxonomy of AI risks: weaponization, misinformation, power concentration, value lock-in, rogue AI. Good for strategic-level safety questions beyond technical alignment.

OpenAI · research

OpenAI — "Weak-to-Strong Generalization" (2023)

Research on the core alignment challenge: can weaker systems supervise stronger ones? Showed partial generalization is possible. Key for superalignment and scalable oversight questions.

OpenAI / DeepMind · research

Christiano et al. — "Deep Reinforcement Learning from Human Preferences" (2017)

The RLHF paper that enabled ChatGPT-style alignment. Reward model from human preferences + PPO. Foundational for understanding modern alignment approaches and their limitations.

Google Brain / OpenAI · research

Amodei et al. — "Concrete Problems in AI Safety" (2016)

Five practical safety problems: avoiding side effects, reward hacking, scalable oversight, safe exploration, distributional shift. Still the canonical taxonomy for AI safety research questions.

Google DeepMind · research

DeepMind — "Scalable Agent Alignment via Reward Modeling" and safety research

Research on reward modeling, debate, recursive reward modeling, and interpretability. Provides an alternative perspective to Anthropic/OpenAI approaches.

Anthropic · research

Anthropic — "Responsible Scaling Policy" and capability evaluations

Evaluates model capabilities for autonomous cyber operations at each AI Safety Level (ASL). Defines thresholds where AI capability in offensive security requires additional safeguards. Key reference for responsible AI in offensive security.

OpenAI · research

OpenAI — "Red Teaming Network" and GPT-4 System Card

Description of external red teaming program and findings from GPT-4 pre-deployment testing. The system card details risk categories, testing methodology, and residual risks.

Google DeepMind · research

Google DeepMind — "Evaluating Frontier Models for Dangerous Capabilities" (2023)

Framework for evaluating dangerous capabilities: persuasion, deception, cyber operations, self-replication. Defines an evaluation methodology for frontier model safety, covering what to test and how to interpret the results.

OWASP · guide

OWASP AI Security and Privacy Guide

Comprehensive guide covering AI security threats, privacy risks, and practical controls for AI-powered applications.

Certifications that signal this domain

Credentials whose blueprint meaningfully covers this domain. Core means centrally covered; also touched means present in the blueprint but not the primary focus.

Core coverage

AAISM · Expert · ISACA · Official page →

Advanced in AI Security Management

ISACA specialization for AI security management. Requires an active CISM or CISSP. Focuses on AI Governance & Program Management, AI Risk Management, and AI Technologies & Controls. Aimed at security leaders managing AI risks.

AIGP · Professional · IAPP · Official page →

Artificial Intelligence Governance Professional

AI risk, governance, and regulatory literacy (EU AI Act, NIST AI RMF).

Also touched

CRAI · Professional · ISACA · Official page →

ISACA Certified in Risk of Artificial Intelligence (emerging)

AI risk management and governance — emerging blueprint, expect revisions.

OSAI · Professional · OffSec · Official page →

OffSec AI Security Practitioner

Offensive AI security — adversarial ML, LLM attacks, agent abuse.

