Make safe AI systems. Deploy them reliably.
We develop large-scale AI systems so that we can study their safety properties at the technological frontier, where new problems are most likely to arise. We use these insights to create safer, steerable, and more reliable models, and to build systems that we deploy externally, like Claude.
Publications
Interpretability
Circuits Updates – April 2024
Alignment · Interpretability
Simple probes can catch sleeper agents
Societal Impact
Measuring the Persuasiveness of Language Models
Alignment
Many-shot jailbreaking
Interpretability
Reflections on Qualitative Research
Alignment
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Societal Impact
Evaluating and Mitigating Discrimination in Language Model Decisions
Alignment
Specific versus General Principles for Constitutional AI
Alignment
Towards Understanding Sycophancy in Language Models
Societal Impact · Policy
Collective Constitutional AI: Aligning a Language Model with Public Input
Interpretability
Decomposing Language Models Into Understandable Components
Interpretability
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Policy
Challenges in evaluating AI systems
Alignment
Tracing Model Outputs to the Training Data
Alignment
Studying Large Language Model Generalization with Influence Functions
Alignment
Measuring Faithfulness in Chain-of-Thought Reasoning
Alignment
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Societal Impact
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Interpretability
Circuits Updates – May 2023
Interpretability
Interpretability Dreams
Interpretability
Distributed Representations: Composition & Superposition
Interpretability
Privileged Bases in the Transformer Residual Stream
Societal Impact
The Capacity for Moral Self-Correction in Large Language Models
Interpretability
Superposition, Memorization, and Double Descent
Alignment
Discovering Language Model Behaviors with Model-Written Evaluations
Alignment
Constitutional AI: Harmlessness from AI Feedback
Alignment
Measuring Progress on Scalable Oversight for Large Language Models
Interpretability
Toy Models of Superposition
Societal Impact
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Alignment
Language Models (Mostly) Know What They Know
Interpretability
Softmax Linear Units
Interpretability
Scaling Laws and Interpretability of Learning from Repeated Data
Alignment
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Interpretability
In-context Learning and Induction Heads
Societal Impact
Predictability and Surprise in Large Generative Models
Interpretability
A Mathematical Framework for Transformer Circuits