Interpretability · Research

A Mathematical Framework for Transformer Circuits

Dec 22, 2021

Research

Exploring model welfare

Apr 24, 2025

Research

Values in the wild: Discovering and analyzing values in real-world language model interactions

Apr 21, 2025

Research

Reasoning models don't always say what they think

Apr 03, 2025