Chris Olah
Research Scientist (Interpretability) · OpenAI · 2021
Left to co-found Anthropic to focus on AI interpretability and safety research.
Sources
- Anthropic Business Breakdown & Founding Story (Contrary Research)
Key Publications
- Feature Visualization (Distill, paper)
This paper establishes feature visualization as a foundational technique for understanding what individual neurons and layers in neural networks have learned. It demonstrates methods for generating synthetic images that maximally activate specific neurons, revealing what features those neurons detect. The authors develop and refine optimization-based approaches that start from random noise and iteratively modify an image to increase a target neuron's activation, applying regularization techniques to produce clear, interpretable visualizations rather than adversarial noise patterns. The resulting visualizations reveal a rich hierarchy of learned features, from simple edge and texture detectors in early layers to complex object and scene detectors in deeper layers, providing concrete evidence that neural networks learn meaningful internal representations. The paper also introduces methods for visualizing how features interact and combine, laying groundwork for the circuits-based interpretability research that followed. Published in the interactive Distill journal, the work exemplifies Chris Olah's commitment to making neural network internals transparent and interpretable, a research direction he pursued at Google Brain and OpenAI before making it central to Anthropic's safety strategy.
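The optimization loop described above can be sketched in miniature. This is an assumed toy setup, not the paper's implementation: the "neuron" is a fixed 3x3 edge filter whose activation is a dot product with the input, so the gradient is analytic; a real run would backpropagate through a trained network to a chosen unit.

```python
import numpy as np

# Toy sketch (assumed setup) of optimization-based feature visualization.
# The "neuron" is a fixed vertical-edge filter; its activation on a patch
# is a dot product, so we can write the ascent gradient by hand.

rng = np.random.default_rng(0)
filt = np.array([[-1., 0., 1.],
                 [-2., 0., 2.],
                 [-1., 0., 1.]])          # the "neuron" being visualized

img = rng.normal(scale=0.1, size=(3, 3))  # start from random noise
start_act = float(np.sum(filt * img))

lr, reg = 0.1, 0.01                       # step size, L2 penalty weight
for _ in range(200):
    # Objective: activation minus an L2 penalty that keeps the image
    # tame (standing in for the paper's regularization techniques).
    grad = filt - 2 * reg * img           # gradient of the objective
    img += lr * grad                      # gradient ascent on the pixels

# The optimized image comes to resemble the filter itself: the pattern
# that maximally excites this "neuron" is the feature it detects.
print(start_act, float(np.sum(filt * img)))
```

With a real network, the same loop applies, except the gradient comes from autodiff and the regularizers (jitter, blurring, frequency penalties) do the work of keeping the result interpretable rather than adversarial.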
- Zoom In: An Introduction to Circuits (Distill, paper)
This paper proposes a framework for understanding neural networks at a mechanistic level by identifying meaningful 'circuits' composed of connected neurons and the weights between them, arguing that these circuits implement recognizable algorithms that humans can understand and verify. The authors present three core claims: that individual neurons in neural networks often correspond to understandable features; that the connections between these neurons form circuits carrying out specific computational functions; and that analogous features and circuits recur across different networks and tasks (universality). Through detailed case studies in image classification networks, they demonstrate circuits responsible for detecting curves, dog heads, and car components, showing how simple features compose into complex ones through specific wiring patterns. The work represents a major advance in mechanistic interpretability, moving beyond treating neural networks as black boxes toward a vision where their internal reasoning can be systematically reverse-engineered. Published through Chris Olah's team at OpenAI before their move to Anthropic, this research agenda became a cornerstone of Anthropic's approach to AI safety and has spawned a growing subfield dedicated to understanding the internal computations of large language models.
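The composition idea, simple features wired into more complex ones, can be illustrated with a hand-built miniature. This is an assumed toy circuit, not one from the paper: two first-layer units detect oriented edges at specific positions, and a second-layer unit's weights and bias are chosen so it fires only when both edges are present, a crude "corner" detector built from edge detectors.

```python
import numpy as np

# Toy illustration (assumed, not the paper's actual circuit) of circuits:
# first-layer units detect oriented edges at fixed positions, and a
# hand-wired second-layer unit combines them into a corner detector.

def relu(x):
    return np.maximum(x, 0.0)

v_edge = np.array([[1., -1.], [1., -1.]])   # vertical-edge filter
h_edge = np.array([[1., 1.], [-1., -1.]])   # horizontal-edge filter

def corner_unit(img):
    # img is 2x4: the left 2x2 feeds the vertical-edge neuron, the
    # right 2x2 feeds the horizontal-edge neuron.
    a_left = relu(np.sum(v_edge * img[:, :2]))
    a_right = relu(np.sum(h_edge * img[:, 2:]))
    # The "circuit" is the wiring: weights (+1, +1) and bias -4.5,
    # chosen so the unit needs BOTH edge detectors active
    # (each maxes out at 4 on a perfect edge).
    return relu(a_left + a_right - 4.5)

both_edges = np.hstack([v_edge, h_edge])           # corner-like stimulus
one_edge = np.hstack([v_edge, np.zeros((2, 2))])   # a lone edge

print(corner_unit(both_edges), corner_unit(one_edge))  # → 3.5 0.0
```

The point of the framework is that real networks contain analogous structures, curve detectors assembled from oriented edge detectors, dog-head detectors from part detectors, which can be read off from the learned weights rather than hand-designed as here.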