Make safe AI systems. Deploy them reliably.
We develop large-scale AI systems so that we can study their safety properties at the technological frontier, where new problems are most likely to arise. We use these insights to create safer, steerable, and more reliable models, and to build systems that we deploy externally, like Claude.
Publications
Interpretability
Circuits Updates – April 2024
Alignment · Interpretability
Simple probes can catch sleeper agents
Societal Impact
Measuring the Persuasiveness of Language Models
Alignment
Many-shot jailbreaking
Interpretability
Reflections on Qualitative Research
Alignment
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
Societal Impact
Evaluating and Mitigating Discrimination in Language Model Decisions
Alignment
Specific versus General Principles for Constitutional AI
Alignment
Towards Understanding Sycophancy in Language Models
Societal Impact · Policy
Collective Constitutional AI: Aligning a Language Model with Public Input
Interpretability
Decomposing Language Models Into Understandable Components
Interpretability
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Policy
Challenges in evaluating AI systems
Alignment
Tracing Model Outputs to the Training Data
Alignment
Studying Large Language Model Generalization with Influence Functions
Alignment
Measuring Faithfulness in Chain-of-Thought Reasoning
Alignment
Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
Societal Impact
Towards Measuring the Representation of Subjective Global Opinions in Language Models
Interpretability
Circuits Updates – May 2023
Interpretability
Interpretability Dreams
Interpretability
Distributed Representations: Composition & Superposition
Interpretability
Privileged Bases in the Transformer Residual Stream
Societal Impact
The Capacity for Moral Self-Correction in Large Language Models
Interpretability
Superposition, Memorization, and Double Descent
Alignment
Discovering Language Model Behaviors with Model-Written Evaluations
Alignment
Constitutional AI: Harmlessness from AI Feedback
Alignment
Measuring Progress on Scalable Oversight for Large Language Models
Interpretability
Toy Models of Superposition
Societal Impact
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Alignment
Language Models (Mostly) Know What They Know
Interpretability
Softmax Linear Units
Interpretability
Scaling Laws and Interpretability of Learning from Repeated Data
Alignment
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Interpretability
In-context Learning and Induction Heads
Societal Impact
Predictability and Surprise in Large Generative Models
Interpretability
A Mathematical Framework for Transformer Circuits