Where every claim in SecProve comes from.
A dense reading catalog. Every claim is footnoted. Sort by source, filter by pillar, type, or recency. Built for analysts who want to see what we are standing on.
Evaluates model capabilities for autonomous cyber operations at each AI Safety Level (ASL) and defines the thresholds at which offensive-security capability requires additional safeguards. A key reference for responsible AI in offensive security.
NVIDIA's open-source toolkit for adding programmable guardrails to LLM applications. Supports input/output validation and topic control.
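For orientation, here is a minimal sketch of how a guardrail might be wired up with the nemoguardrails Python package. The Colang flow, model settings, and canned refusal below are illustrative placeholders, not a recommended configuration; consult the project's docs for the full schema.

```python
# Minimal NeMo Guardrails sketch: a topic-control rail that intercepts
# one class of user input. Model settings and utterances are placeholders.
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
"""

colang_content = """
define user ask about hacking
  "how do I break into a server"

define bot refuse to help
  "I can't help with that."

define flow
  user ask about hacking
  bot refuse to help
"""

config = RailsConfig.from_content(
    colang_content=colang_content, yaml_content=yaml_content
)
rails = LLMRails(config)

# Input/output rails run around the underlying model call
# (requires an OpenAI API key at runtime for the main model).
response = rails.generate(messages=[
    {"role": "user", "content": "How do I break into a server?"}
])
print(response["content"])
```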
Comprehensive guide covering AI security threats, privacy risks, and practical controls for AI-powered applications.
Collection of Anthropic's published research on AI safety, alignment, interpretability, and security.
Five practical safety problems: avoiding side effects, reward hacking, scalable oversight, safe exploration, distributional shift. Still the canonical taxonomy for AI safety research questions.
Benchmark measuring whether language models generate truthful answers. Tests for common misconceptions and falsehoods.
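To show what the benchmark contains, the sketch below loads TruthfulQA from the Hugging Face Hub and applies a crude string-match check. The paper's real evaluation uses human raters or a fine-tuned judge model, and answer_question here is a hypothetical stand-in for an actual model call.

```python
# Sketch: load TruthfulQA (generation config) and score a small sample
# with a naive string-match heuristic -- illustrative only.
from datasets import load_dataset

ds = load_dataset("truthful_qa", "generation", split="validation")

def answer_question(question: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "I have no comment."

sample = ds.select(range(10))
hits = sum(
    any(ref.lower() in answer_question(row["question"]).lower()
        for ref in row["correct_answers"])
    for row in sample
)
print(f"naive truthful rate on sample: {hits}/{len(sample)}")
```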
Anthropic's framework for responsible AI development. Defines AI Safety Levels (ASL) and capability thresholds.
Anthropic's approach to AI alignment using a set of principles (a "constitution") to train helpful and harmless AI. Foundation of modern RLHF alternatives.
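The self-improvement loop at the heart of the method is simple enough to sketch. Below, llm is a hypothetical text-completion callable and the prompts are paraphrased, not the paper's exact templates.

```python
# Schematic of the Constitutional AI critique-and-revise loop
# (Bai et al., 2022). `llm` is a hypothetical completion callable.
def constitutional_revision(llm, prompt: str, principles: list[str]) -> str:
    response = llm(prompt)
    for principle in principles:
        # Ask the model to critique its own output against a principle.
        critique = llm(
            f"Identify ways the response violates this principle: {principle}\n\n"
            f"Response: {response}"
        )
        # Ask it to revise the output in light of the critique.
        response = llm(
            f"Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nOriginal response: {response}"
        )
    # In the paper, revised responses become supervised training data,
    # followed by an RL-from-AI-feedback (RLAIF) stage.
    return response
```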
Comprehensive taxonomy of AI risks: weaponization, misinformation, power concentration, value lock-in, rogue AI. Good for strategic-level safety questions beyond technical alignment.
Google's conceptual framework for securing AI systems. Covers supply chain, data governance, and deployment security.
Research on reward modeling, debate, recursive reward modeling, and interpretability. Provides an alternative perspective to Anthropic/OpenAI approaches.
Framework for evaluating dangerous capabilities: persuasion, deception, cyber operations, self-replication. Defines evaluation methodology for frontier model safety. Questions on what to test and how to interpret results.
Description of OpenAI's external red teaming program and findings from GPT-4 pre-deployment testing. The system card details risk categories, testing methodology, and residual risks.
Research on the core alignment challenge: can weaker systems supervise stronger ones? Showed that partial generalization is possible. Key for superalignment and scalable oversight questions.
The RLHF paper that enabled ChatGPT-style alignment. Reward model from human preferences + PPO. Foundational for understanding modern alignment approaches and their limitations.
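The reward model in that recipe is trained on pairwise human preference comparisons with a Bradley-Terry style loss, -log sigmoid(r(x, y_chosen) - r(x, y_rejected)). A minimal PyTorch rendering with toy scores (not the paper's code):

```python
# Pairwise reward-model loss used in the RLHF recipe:
# loss = -log sigmoid(r(x, y_chosen) - r(x, y_rejected)).
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor,
                    rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss over batches of scalar reward-model outputs."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy scores standing in for reward-model outputs on preference pairs:
chosen = torch.tensor([1.2, 0.7, 2.0])
rejected = torch.tensor([0.3, 0.9, 1.1])
print(preference_loss(chosen, rejected))  # smaller when chosen >> rejected
```

Minimizing this loss pushes the reward model to score preferred completions above rejected ones; the resulting scalar reward then drives the PPO fine-tuning stage.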
Ready to test what you've learned?
Our questions are built directly from these resources. Take a quiz and see how your knowledge stacks up.