Jan Leike
Co-Lead, Superalignment Team · OpenAI · 2024
Said 'safety culture and processes have taken a backseat to shiny products' at OpenAI. Resigned the day after Sutskever. Joined Anthropic to continue alignment work.
Leike co-led OpenAI's Superalignment team alongside Ilya Sutskever, working on the problem of ensuring that AI systems vastly more intelligent than humans would still be controllable. He resigned the day after Sutskever, posting publicly that safety culture and processes had taken a backseat to shiny products. His candor was unusual — most departing researchers stay quiet — and his immediate move to Anthropic underscored his belief that meaningful alignment work required a different institutional environment.
Sources
Key Publications
- Deep Reinforcement Learning from Human Preferences (NeurIPS 2017, paper)
This paper introduces one of the foundational techniques in modern AI alignment by demonstrating that reinforcement learning agents can learn to perform complex tasks guided not by a hand-coded reward function but by human judgments about which of two behavioral trajectories they prefer. The method works by training a reward model on these pairwise human preference comparisons and then using that learned reward model to train the agent via standard reinforcement learning, effectively separating the problem of specifying goals from the problem of achieving them. The authors show that this approach can solve challenging tasks in simulated robotics and Atari games using a remarkably small amount of human feedback, making it practical for real-world applications. This work laid the groundwork for reinforcement learning from human feedback, or RLHF, which became the core technique behind the fine-tuning of large language models like ChatGPT and has been adopted across the AI industry. The paper's significance extends beyond its technical contribution: it demonstrated that alignment research could produce methods with immediate practical value, helping to build institutional support for safety research within commercial AI organizations.
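The core mechanism described above — fitting a reward model to pairwise human comparisons — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a linear reward model over hand-made step features, and all names (`segment_return`, `update`, etc.) are hypothetical.

```python
import math

# Illustrative sketch: learn a reward model from pairwise preferences over
# trajectory segments, assuming a linear per-step reward r(s) = w . features(s).

def segment_return(w, segment):
    """Predicted return of a segment: sum of per-step linear rewards."""
    return sum(sum(wi * xi for wi, xi in zip(w, step)) for step in segment)

def prob_prefer_a(w, seg_a, seg_b):
    """Bradley-Terry model: P(A preferred) = exp(R_A) / (exp(R_A) + exp(R_B))."""
    ra, rb = segment_return(w, seg_a), segment_return(w, seg_b)
    return 1.0 / (1.0 + math.exp(rb - ra))

def update(w, seg_a, seg_b, human_prefers_a, lr=0.1):
    """One gradient step on the cross-entropy loss over the human's label."""
    p = prob_prefer_a(w, seg_a, seg_b)
    label = 1.0 if human_prefers_a else 0.0
    # Gradient of -log P(label) w.r.t. w is (p - label) * (phi(A) - phi(B)),
    # where phi sums the step features over a segment.
    phi_a = [sum(step[i] for step in seg_a) for i in range(len(w))]
    phi_b = [sum(step[i] for step in seg_b) for i in range(len(w))]
    return [wi - lr * (p - label) * (fa - fb)
            for wi, fa, fb in zip(w, phi_a, phi_b)]
```

The learned reward then replaces a hand-coded reward function when training the agent, which is what lets a small number of human comparisons steer standard reinforcement learning.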
- Scalable Agent Alignment via Reward Modeling: A Research Direction (arXiv, preprint)
This paper lays out a comprehensive research agenda for aligning AI agents with human values through learned reward models, proposing that the alignment problem can be decomposed into the tractable subproblems of learning what humans want and then optimizing an agent to pursue those learned objectives. The authors describe a pipeline where human feedback is used to train a reward model that captures human preferences, which then serves as the objective function for a reinforcement learning agent, and they identify the key challenges at each stage of this pipeline. They address problems including the difficulty of scaling human oversight to complex domains, the risk of reward model misspecification, and the need for safe exploration during training. The paper also discusses how recursive reward modeling, where AI assistants help evaluate more complex AI behaviors, could enable alignment techniques to scale to superhuman systems. Written primarily by Jan Leike during his time at DeepMind before he moved to lead alignment work at OpenAI, this agenda has been highly influential in shaping the practical approach to alignment adopted by major AI laboratories.
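The two-stage decomposition the agenda proposes — first learn what humans want, then optimize an agent against the learned objective — can be illustrated with the second stage in miniature. Everything here is a hypothetical stand-in: `learned_reward` represents a reward model already fit to human feedback, and random-search hill climbing stands in for the reinforcement learning algorithm of the full pipeline.

```python
import random

def learned_reward(trajectory):
    # Stand-in for a trained reward model: scores how close each action
    # is to an (assumed) preferred behavior of 0.7.
    return -sum(abs(a - 0.7) for a in trajectory)

def rollout(policy_param, horizon=5):
    # Trivial stand-in environment: the policy emits its parameter each step.
    return [policy_param] * horizon

def optimize_policy(steps=200, sigma=0.05, seed=0):
    # Stage 2 of the pipeline: improve the policy against the learned
    # reward model, here by simple random-search hill climbing.
    rng = random.Random(seed)
    param, best = 0.0, learned_reward(rollout(0.0))
    for _ in range(steps):
        cand = param + rng.gauss(0, sigma)
        score = learned_reward(rollout(cand))
        if score > best:
            param, best = cand, score
    return param
```

The point of the decomposition is that the two stages can fail independently: a misspecified reward model is optimized faithfully, which is why the agenda devotes so much attention to the quality and scalability of the feedback that trains it.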