Jan Leike
Co-Lead, Superalignment Team · OpenAI · 2024
Said 'safety culture and processes have taken a backseat to shiny products' at OpenAI. Resigned the day after Sutskever. Joined Anthropic to continue alignment work.
Leike co-led OpenAI's Superalignment team alongside Ilya Sutskever, working on the problem of ensuring that AI systems vastly more intelligent than humans would still be controllable. He resigned the day after Sutskever, posting publicly that safety culture and processes had taken a backseat to shiny products. His candor was unusual — most departing researchers stay quiet — and his immediate move to Anthropic underscored his belief that meaningful alignment work required a different institutional environment.
Sources
Key Publications
- Deep Reinforcement Learning from Human Preferences (NeurIPS 2017, paper)
This paper introduces one of the foundational techniques in modern AI alignment by demonstrating that reinforcement learning agents can learn to perform complex tasks guided not by a hand-coded reward function but by human judgments about which of two behavioral trajectories they prefer. The method works by training a reward model on these pairwise human preference comparisons and then using that learned reward model to train the agent via standard reinforcement learning, effectively separating the problem of specifying goals from the problem of achieving them. The authors show that this approach can solve challenging tasks in simulated robotics and Atari games using a remarkably small amount of human feedback, making it practical for real-world applications. This work laid the groundwork for reinforcement learning from human feedback, or RLHF, which became the core technique behind the fine-tuning of large language models like ChatGPT and has been adopted across the AI industry. The paper's significance extends beyond its technical contribution: it demonstrated that alignment research could produce methods with immediate practical value, helping to build institutional support for safety research within commercial AI organizations.
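The core mechanism described above — fitting a reward model to pairwise human comparisons — can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes a linear reward model over hand-made step features, and all names (`segment_return`, `update`, etc.) are hypothetical.

```python
import math

# Illustrative sketch: learn a reward model from pairwise preferences over
# trajectory segments, assuming a linear per-step reward r(s) = w . features(s).

def segment_return(w, segment):
    """Predicted return of a segment: sum of per-step linear rewards."""
    return sum(sum(wi * xi for wi, xi in zip(w, step)) for step in segment)

def prob_prefer_a(w, seg_a, seg_b):
    """Bradley-Terry model: P(A preferred) = exp(R_A) / (exp(R_A) + exp(R_B))."""
    ra, rb = segment_return(w, seg_a), segment_return(w, seg_b)
    return 1.0 / (1.0 + math.exp(rb - ra))

def update(w, seg_a, seg_b, human_prefers_a, lr=0.1):
    """One gradient step on the cross-entropy loss over the human's label."""
    p = prob_prefer_a(w, seg_a, seg_b)
    label = 1.0 if human_prefers_a else 0.0
    # Gradient of -log P(label) w.r.t. w is (p - label) * (phi(A) - phi(B)),
    # where phi sums the step features over a segment.
    phi_a = [sum(step[i] for step in seg_a) for i in range(len(w))]
    phi_b = [sum(step[i] for step in seg_b) for i in range(len(w))]
    return [wi - lr * (p - label) * (fa - fb)
            for wi, fa, fb in zip(w, phi_a, phi_b)]
```

The learned reward then replaces a hand-coded reward function when training the agent, which is what lets a small number of human comparisons steer standard reinforcement learning.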
- Scalable Agent Alignment via Reward Modeling: A Research Direction (arXiv, preprint)
This paper lays out a comprehensive research agenda for aligning AI agents with human values through learned reward models, proposing that the alignment problem can be decomposed into the tractable subproblems of learning what humans want and then optimizing an agent to pursue those learned objectives. The authors describe a pipeline where human feedback is used to train a reward model that captures human preferences, which then serves as the objective function for a reinforcement learning agent, and they identify the key challenges at each stage of this pipeline. They address problems including the difficulty of scaling human oversight to complex domains, the risk of reward model misspecification, and the need for safe exploration during training. The paper also discusses how recursive reward modeling, where AI assistants help evaluate more complex AI behaviors, could enable alignment techniques to scale to superhuman systems. Written primarily by Jan Leike during his time at DeepMind before he moved to lead alignment work at OpenAI, this agenda has been highly influential in shaping the practical approach to alignment adopted by major AI laboratories.
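The two-stage decomposition the agenda proposes — first learn what humans want, then optimize an agent against the learned objective — can be illustrated with the second stage in miniature. Everything here is a hypothetical stand-in: `learned_reward` represents a reward model already fit to human feedback, and random-search hill climbing stands in for the reinforcement learning algorithm of the full pipeline.

```python
import random

def learned_reward(trajectory):
    # Stand-in for a trained reward model: scores how close each action
    # is to an (assumed) preferred behavior of 0.7.
    return -sum(abs(a - 0.7) for a in trajectory)

def rollout(policy_param, horizon=5):
    # Trivial stand-in environment: the policy emits its parameter each step.
    return [policy_param] * horizon

def optimize_policy(steps=200, sigma=0.05, seed=0):
    # Stage 2 of the pipeline: improve the policy against the learned
    # reward model, here by simple random-search hill climbing.
    rng = random.Random(seed)
    param, best = 0.0, learned_reward(rollout(0.0))
    for _ in range(steps):
        cand = param + rng.gauss(0, sigma)
        score = learned_reward(rollout(cand))
        if score > best:
            param, best = cand, score
    return param
```

The point of the decomposition is that the two stages can fail independently: a misspecified reward model is optimized faithfully, which is why the agenda devotes so much attention to the quality and scalability of the feedback that trains it.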