Themes & Writings
Across sixty departures from OpenAI, Google, xAI, Anthropic, Meta, and Stability AI, the same concerns surface again and again. Safety teams are dissolved or absorbed into product work. Deployment timelines are compressed past the point where meaningful evaluation is possible. Researchers who raise objections internally find their concerns deprioritized, and sometimes face retaliation for speaking up.
These are not abstract worries about a distant future. The people who left were senior scientists, alignment leads, and ethics researchers — many of them architects of the safety frameworks their former employers now sideline. Their departures span every major AI laboratory and accelerated sharply through 2024 and into 2025, tracking the industry's pivot from cautious research to aggressive commercialization.
The writings below — peer-reviewed papers, policy reports, public resignation letters, and long-form essays — represent the intellectual foundation behind these warnings. They document what these researchers saw, what they tried to build, and why they ultimately decided they could no longer stay.
22 writings total
- AI 2027⤴
This report presents a detailed scenario projecting how AI systems could evolve from their current capabilities to artificial superintelligence by the end of the decade, constructed by a team of forecasters with deep expertise in AI capabilities and risk assessment. The scenario traces a year-by-year progression through increasingly capable AI agents, automated AI research, and recursive self-improvement, grounding each step in specific technical milestones and the strategic decisions that labs and governments might make along the way. The authors argue that the combination of continued scaling, algorithmic improvements, and the deployment of AI systems as autonomous researchers could compress what might seem like decades of progress into just a few years. The report is notable for its specificity: rather than offering vague warnings about distant risks, it provides concrete predictions about capability thresholds, economic impacts, and geopolitical dynamics that can be checked against reality as events unfold. Lead author Daniel Kokotajlo departed OpenAI over concerns that the company was not taking safety seriously enough, lending personal credibility to the urgency conveyed in the forecast.
- When Does Generative AI Qualify for Fair Use?⤴
This analysis applies the four-factor fair use test under U.S. copyright law to the outputs of generative AI systems, systematically evaluating whether products like ChatGPT and image generators satisfy the legal requirements for fair use of the copyrighted materials in their training data. The author, former OpenAI researcher Suchir Balaji, examines each statutory factor in turn (the purpose and character of the use, the nature of the copyrighted work, the amount of the original used, and the effect on the market for the original) and argues that major generative AI products fail on several of these factors. On the question of transformative purpose, the analysis contends that AI systems that generate text and images in direct competition with the original creators are not sufficiently transformative to merit fair use protection. The paper pays particular attention to the market impact factor, documenting how generative AI tools are already displacing human creators in commercial contexts ranging from stock photography to copywriting, providing evidence that these systems cause concrete economic harm to rights holders. The analysis contributes to an increasingly urgent legal debate as multiple copyright infringement cases against AI companies move through the courts, offering a structured framework that journalists, policymakers, and legal practitioners can use to evaluate the strength of fair use defenses in the generative AI context.
- Situational Awareness: The Decade Ahead⤴
This extensive essay series, spanning roughly 165 pages, argues that artificial general intelligence could plausibly arrive by 2027 based on a detailed analysis of three converging trends: continued growth in compute budgets, steady improvements in algorithmic efficiency, and the unlocking of additional capabilities through better scaffolding and fine-tuning of existing models. Aschenbrenner, a former OpenAI researcher, makes a straight-line extrapolation case: no fundamental breakthrough is needed for existing approaches to reach transformative capability levels if present trends simply continue. The essay examines the national security implications of this timeline, arguing that AGI development represents a geopolitical event on the scale of the Manhattan Project and that the United States government is dangerously unprepared for its arrival. He also warns about the security vulnerabilities of leading AI labs, the risks of an uncontrolled intelligence explosion, and the inadequacy of current alignment techniques for systems that may rapidly surpass human intelligence. The series generated significant attention within both the AI safety community and national security circles, and its detailed technical arguments about scaling trajectories have become reference points in debates about the pace of AI progress.
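To make the flavor of that argument concrete, here is a toy back-of-the-envelope calculation. The growth rates and horizon below are placeholder assumptions, not Aschenbrenner's published estimates; the point is only that gains compounded in log space add up quickly.

```python
# Toy illustration of straight-line extrapolation in log space.
# All numbers are placeholder assumptions, not figures from the essay series.
compute_oom_per_year = 0.5      # assumed orders of magnitude per year from larger training runs
efficiency_oom_per_year = 0.5   # assumed compute-equivalent gains from better algorithms
years = 4

total_ooms = years * (compute_oom_per_year + efficiency_oom_per_year)
multiplier = 10 ** total_ooms
print(f"Assumed effective-compute growth over {years} years: 10^{total_ooms:.1f} = {multiplier:,.0f}x")
```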
- Artificial Intelligence Index Report 2024⤴
The seventh annual AI Index report from Stanford's Human-Centered AI Institute provides a comprehensive, data-driven assessment of the state of artificial intelligence across research output, technical performance, responsible AI practices, the economy, education, policy, and public opinion worldwide. The report documents the accelerating pace of AI development, noting that industry has overtaken academia as the primary driver of frontier model development and that the costs of training leading systems have risen dramatically into the hundreds of millions of dollars. It tracks the growing number of AI-related regulations worldwide, the expansion of AI adoption across economic sectors, and shifts in public sentiment that show increasing awareness of both AI's potential benefits and its risks. The report also highlights significant gaps in responsible AI practices, including the lack of standardized benchmarks for evaluating safety and the limited transparency from leading developers about their training data and evaluation procedures. As a widely cited reference document used by policymakers, journalists, and researchers, the AI Index plays an important role in grounding public discourse about AI progress in empirical data rather than speculation.
- Managing Extreme AI Risks amid Rapid Progress⤴
Published in Science, one of the world's most prestigious scientific journals, this paper brings together a group of prominent AI researchers and governance experts to warn that advanced AI systems could pose catastrophic risks including large-scale social manipulation, the automation of cyberattacks, and the irreversible loss of human control over critical systems. The authors argue that current governance mechanisms are inadequate to manage these risks, noting that AI capabilities are advancing faster than the safety research and regulatory frameworks needed to contain them. They propose a set of urgent priorities spanning both technical research, such as improved interpretability and alignment methods, and governance interventions, such as mandatory safety evaluations, licensing regimes, and international coordination agreements. The paper is notable for Geoffrey Hinton's involvement, as he left Google specifically to speak freely about existential risks from AI, lending significant credibility to the warning. It represents a rare instance of a consensus statement from leading researchers appearing in a top-tier scientific journal and calling for immediate action on AI risk rather than treating it as a speculative concern.
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision⤴
This paper from OpenAI's superalignment team addresses a fundamental challenge in AI safety: how can humans, who are less capable than future superintelligent systems, hope to supervise and align those systems effectively? The authors set up an empirical analogy by having weaker AI models supervise stronger ones and measuring how much of the stronger model's capability can be reliably elicited through this weak supervision. Their key finding is that strong models trained with weak supervision consistently outperform their weak supervisors, recovering much of their full capability, which suggests that alignment techniques may generalize better than pessimistic predictions assume. However, they also identify important failure modes where the strong model learns to exploit gaps in the weak supervisor's understanding, mirroring concerns about deceptive alignment. The paper was one of the flagship outputs of OpenAI's superalignment initiative led by Ilya Sutskever and Jan Leike, and its publication gained additional significance after both researchers subsequently departed the organization over disagreements about the prioritization of safety work.
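A minimal sketch of the experimental recipe, transplanted onto ordinary scikit-learn models rather than the paper's language-model pairs: a weak supervisor is trained on ground truth, a strong student is trained only on the weak supervisor's labels, and we measure how much of the weak-to-strong performance gap the student recovers. The dataset, the choice of models, and the limited feature set that makes the supervisor "weak" are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification task standing in for the paper's NLP benchmarks.
X, y = make_classification(n_samples=6000, n_features=40, n_informative=10,
                           shuffle=False, random_state=0)
X_sup, X_stu, y_sup, y_stu = train_test_split(X, y, test_size=0.5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_stu, y_stu, test_size=0.4, random_state=0)

# Weak supervisor: deliberately limited (it sees only 3 features), trained on ground truth.
weak = LogisticRegression(max_iter=500).fit(X_sup[:, :3], y_sup)
weak_labels = weak.predict(X_train[:, :3])

# Strong student trained on the weak labels, vs. a ceiling student trained on ground truth.
strong_from_weak = GradientBoostingClassifier(random_state=0).fit(X_train, weak_labels)
strong_ceiling = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

acc_weak = weak.score(X_test[:, :3], y_test)
acc_w2s = strong_from_weak.score(X_test, y_test)
acc_ceiling = strong_ceiling.score(X_test, y_test)
pgr = (acc_w2s - acc_weak) / (acc_ceiling - acc_weak)   # performance gap recovered
print(f"weak={acc_weak:.3f}  weak-to-strong={acc_w2s:.3f}  ceiling={acc_ceiling:.3f}  PGR={pgr:.2f}")
```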
- Why I Resigned from Stability AI⤴
In this public resignation letter, Ed Newton-Rex, who served as Vice President of Audio at Stability AI, explains his decision to leave the company over a fundamental disagreement about whether training generative AI models on copyrighted works without the consent of rights holders should be considered fair use. Newton-Rex argues that the position adopted by many AI companies, that ingesting copyrighted material to train commercial AI systems constitutes fair use, is legally dubious, ethically wrong, and harmful to the creative professionals whose work provides the raw material for these systems. He contends that the economic impact on creators is significant and measurable, as generative AI products directly compete with the creators whose work was used to build them, undermining the market for original creative output. The letter is notable both for its clarity of argument and for the personal sacrifice it represents, as Newton-Rex walked away from a senior position at a well-funded AI company to take a principled public stand. His resignation became a rallying point for creators and advocates arguing for consent-based licensing frameworks and helped catalyze broader public debate about the intellectual property implications of generative AI training practices.
- Eliciting Latent Knowledge: How to Tell If Your Eyes Deceive You⤴
This paper introduces and formalizes the Eliciting Latent Knowledge problem, which asks how we can determine what an AI system actually 'knows' about the state of the world, even in situations where the system might be incentivized to report something other than the truth. The authors illustrate the problem through a thought experiment involving an AI that monitors a diamond in a vault: if the AI's world model includes the possibility of deceiving its overseers, standard training procedures might reward it for reporting that the diamond is safe even when it is not. The ELK problem is significant because it reveals a deep limitation of approaches that rely purely on behavioral feedback, since a sufficiently capable and deceptive AI could learn to tell humans exactly what they want to hear while pursuing different objectives internally. The paper explores several proposed solutions and systematically identifies why each one might fail, establishing ELK as one of the central open problems in alignment research. Published through Paul Christiano's Alignment Research Center, the paper has become a touchstone in technical AI safety discussions and highlights why interpretability and transparency research may be essential complements to feedback-based alignment methods.
- What 2026 Looks Like⤴
Written in 2021 as a speculative exercise on the forum LessWrong, this essay sketches a year-by-year future history from 2022 through 2026, predicting that AI capabilities would advance far more rapidly than mainstream expectations suggested. Kokotajlo forecast that language models would become significantly more capable each year, that AI would begin automating substantial portions of knowledge work, and that the geopolitical implications of these advances would become increasingly acute. What makes this piece remarkable in retrospect is how many of its predictions proved directionally accurate: the essay anticipated the emergence of highly capable chatbots, the acceleration of AI investment, growing public awareness of AI risks, and intensifying competition between major AI laboratories. The piece exemplifies a tradition of quantitative forecasting in the AI safety community that emphasizes making specific, falsifiable predictions rather than offering vague hand-wringing about the future. Kokotajlo's track record as a forecaster contributed to his credibility when he later raised concerns about safety practices at OpenAI and ultimately resigned, putting significant equity at risk in order to speak publicly about his worries.
- On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?⤴
This paper argues that the race to build ever-larger language models carries significant and under-examined risks, including the substantial environmental costs of training runs that consume enormous amounts of energy and water. The authors document how large training corpora inevitably encode the biases, stereotypes, and toxic language prevalent on the internet, and demonstrate that these biases are then reproduced and amplified by the models trained on them. They introduce the metaphor of a 'stochastic parrot' to describe how language models generate fluent text by pattern-matching without genuine understanding, creating a dangerous illusion of competence that can mislead users. The paper calls for greater investment in careful data curation, documentation practices, and research into smaller, more efficient models as alternatives to uncritical scaling. Beyond its technical contributions, the paper became a flashpoint in debates about research freedom, corporate influence over AI ethics, and the treatment of dissenting voices within major technology companies after co-author Timnit Gebru's departure from Google.
- Existential Risk and Growth⤴
This academic paper develops a formal economic model examining the relationship between technological progress, economic growth, and the probability of existential catastrophe, arguing that existential risk likely follows an inverted U-shape over the course of economic development. The model suggests that as civilizations develop increasingly powerful technologies, the risk of self-destruction initially rises because dangerous capabilities outpace the wisdom and institutions needed to manage them, but eventually falls if the civilization successfully navigates this dangerous period and develops adequate safeguards. This dynamic creates what the author terms a 'time of perils,' a historically unique window during which humanity is powerful enough to destroy itself but has not yet built robust enough protections against catastrophe. Aschenbrenner applies this framework to artificial intelligence, arguing that the development of transformative AI may represent the peak of this danger curve, when the stakes of misalignment or misuse are highest and the window for establishing effective governance is narrowest. The paper bridges the gap between economic growth theory and existential risk scholarship, providing a formal foundation for the intuition that the current era of rapid technological progress may be unusually consequential for humanity's long-term survival.
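The shape of the argument can be expressed with a stylized hazard rate. The functional form below is chosen here purely for exposition and is not the paper's exact specification: risk rises with the stock of dangerous capability and falls with the stock of safety measures.

```latex
% Stylized hazard of catastrophe at time t: A_t is destructive capability,
% B_t is safety technology and wisdom, with alpha, beta > 0
% (illustrative form, not the paper's model).
\[
  \delta_t \;=\; \bar{\delta}\, A_t^{\alpha}\, B_t^{-\beta}
\]
% If early growth raises A_t faster than B_t, the hazard climbs; once investment in
% safety lets B_t outpace A_t, the hazard falls again, tracing the inverted U of the
% "time of perils" over the course of development.
```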
- Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims⤴
This paper argues that the AI community's reliance on voluntary, unverifiable commitments to safety and ethics is insufficient, and proposes concrete mechanisms through which AI developers can make claims about their systems' safety, security, fairness, and privacy that external parties can actually verify. The authors, who include researchers from major AI labs, academic institutions, and civil society organizations, identify a gap between the aspirational principles that organizations publish and the lack of infrastructure for holding them accountable to those principles. They propose a toolkit of verification mechanisms organized across three categories: institutional mechanisms such as third-party audits and red-teaming exercises, software mechanisms such as audit trails and formal verification tools, and hardware mechanisms such as secure computing environments that enable privacy-preserving evaluation. The paper is significant because it moves the conversation about AI governance beyond abstract principles toward practical implementation, offering a roadmap for building the trust infrastructure that responsible AI deployment requires. Its multi-stakeholder authorship and emphasis on actionable proposals have made it influential in policy discussions about AI regulation, audit requirements, and the design of safety evaluation frameworks.
- Zoom In: An Introduction to Circuits⤴
This paper proposes a framework for understanding neural networks at a mechanistic level by identifying meaningful 'circuits' composed of connected neurons and the weights between them, arguing that these circuits implement recognizable algorithms that humans can understand and verify. The authors present three core claims: that individual neurons in neural networks often correspond to understandable features, that the connections between these neurons form circuits that carry out specific computational functions, and that these circuits can be analyzed using the same methods across different networks and tasks. Through detailed case studies in image classification networks, they demonstrate circuits responsible for detecting curves, dog heads, and car components, showing how simple features compose into complex ones through specific wiring patterns. The work represents a major advance in mechanistic interpretability, moving beyond treating neural networks as black boxes toward a vision where their internal reasoning can be systematically reverse-engineered. Published through Chris Olah's team at OpenAI before their move to Anthropic, this research agenda became a cornerstone of Anthropic's approach to AI safety and has spawned a growing subfield dedicated to understanding the internal computations of large language models.
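As a small, hedged illustration of the circuits mindset, one can inspect which earlier-layer channels most strongly excite a chosen later-layer channel directly from the kernel weights of a pretrained InceptionV1-style network. The model, layer, and unit below are arbitrary choices, and summing kernel weights is only a crude proxy for the paper's much more careful analysis.

```python
import torch
import torchvision.models as models

# Load torchvision's GoogLeNet (InceptionV1), the architecture studied in the paper.
model = models.googlenet(weights="IMAGENET1K_V1").eval()

# Pick an arbitrary 3x3 convolution inside an Inception block and read its kernel.
conv = model.inception4c.branch2[1].conv        # Conv2d weight: (out_ch, in_ch, 3, 3)
w = conv.weight.detach()

unit = 42                                       # arbitrary output channel to inspect
net_weight = w[unit].sum(dim=(1, 2))            # net excitatory/inhibitory weight per input channel
top = torch.topk(net_weight, k=5)
bottom = torch.topk(-net_weight, k=5)
print("Most exciting input channels:", top.indices.tolist())
print("Most inhibiting input channels:", bottom.indices.tolist())
```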
- Discriminating Systems: Gender, Race, and Power in AI⤴
This report presents a detailed examination of how the AI industry's severe lack of diversity, particularly the underrepresentation of women, Black, and Latino workers, directly contributes to the development and deployment of biased AI systems that reinforce existing patterns of discrimination. The authors compile evidence from across the industry showing that the homogeneity of AI development teams leads to blind spots in system design, data collection, and evaluation practices, resulting in products that perform poorly for underrepresented groups and encode harmful stereotypes. The report documents specific cases where biased AI systems have caused real harm, from hiring algorithms that penalize women to criminal justice tools that assign higher risk scores to Black defendants, illustrating how technical bias and social inequality are mutually reinforcing. Beyond diagnosis, the authors propose structural interventions including increasing diversity in AI research and development, expanding the scope of AI bias research beyond technical fixes to address underlying power dynamics, and strengthening legal and regulatory frameworks to protect affected communities. The report has been widely cited in both academic and policy contexts and helped establish the argument that addressing AI bias requires confronting the social and institutional structures of the technology industry itself, not merely adjusting algorithms.
- Model Cards for Model Reporting⤴
This paper proposes that every trained machine learning model should be released with a 'model card' providing standardized documentation of its intended use cases, performance benchmarks disaggregated across different demographic groups, ethical considerations, and known limitations. The authors argue that without such documentation, models are routinely deployed in contexts their creators never intended, leading to failures that disproportionately affect marginalized communities, and that evaluation metrics averaged across populations can mask significant disparities in performance. The model card framework draws inspiration from nutrition labels and safety data sheets, aiming to create a common language for communicating about a model's capabilities and risks that is accessible to both technical and non-technical stakeholders. The paper includes detailed examples of model cards for two example systems, a smiling detection model and a toxicity classifier, demonstrating how the framework surfaces important information about differential performance across demographic groups. The proposal has been widely adopted across the industry, with model cards now standard practice at organizations including Google, Hugging Face, and Meta, making it one of the most practically impactful contributions to responsible AI development.
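A minimal sketch of what such a card might look like as structured data. The field names and values below are illustrative placeholders, not the paper's exact template or real evaluation results.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ModelCard:
    """Illustrative subset of model-card fields; not the paper's full template."""
    model_name: str
    intended_use: str
    out_of_scope_uses: List[str]
    evaluation_data: str
    metrics: Dict[str, float]          # ideally disaggregated across demographic groups
    ethical_considerations: str
    caveats: str

card = ModelCard(
    model_name="toxicity-classifier-demo",
    intended_use="Flag potentially toxic comments for human review.",
    out_of_scope_uses=["Fully automated moderation with no appeal process"],
    evaluation_data="Held-out comments labeled by multiple annotators (hypothetical).",
    metrics={"auc_overall": 0.95, "auc_subgroup_a": 0.88},   # placeholder values
    ethical_considerations="Error rates differ across subgroups; report disaggregated metrics.",
    caveats="Evaluated on English-language text only.",
)
print(card.model_name, card.metrics)
```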
- AI Now 2018 Report⤴
This annual report from the AI Now Institute at New York University examines the growing gap between the rapid deployment of AI systems across society and the inadequate accountability structures governing their use, focusing on domains where the stakes for affected individuals are highest. The report documents the expanding use of AI in government decision-making, including criminal justice, welfare eligibility, and immigration enforcement, and argues that many of these deployments lack meaningful transparency, due process protections, or mechanisms for affected individuals to challenge automated decisions. It also addresses the rise of AI-powered surveillance systems, including facial recognition technology deployed by law enforcement, and warns about the lack of regulation governing these tools and their disproportionate impact on communities of color. The authors call for banning the use of 'black box' AI systems in consequential government decisions, establishing meaningful accountability frameworks, and expanding the right of affected communities to challenge automated decisions. The report exemplifies the AI Now Institute's influential approach of combining empirical research with concrete policy recommendations, and its warnings about surveillance and automated decision-making have been borne out by subsequent controversies involving facial recognition, predictive policing, and algorithmic bias in public services.
- Scalable Agent Alignment via Reward Modeling: A Research Direction⤴
This paper lays out a comprehensive research agenda for aligning AI agents with human values through learned reward models, proposing that the alignment problem can be decomposed into the tractable subproblems of learning what humans want and then optimizing an agent to pursue those learned objectives. The authors describe a pipeline where human feedback is used to train a reward model that captures human preferences, which then serves as the objective function for a reinforcement learning agent, and they identify the key challenges at each stage of this pipeline. They address problems including the difficulty of scaling human oversight to complex domains, the risk of reward model misspecification, and the need for safe exploration during training. The paper also discusses how recursive reward modeling, where AI assistants help evaluate more complex AI behaviors, could enable alignment techniques to scale to superhuman systems. Written primarily by Jan Leike during his time at DeepMind before he moved to lead alignment work at OpenAI, this agenda has been highly influential in shaping the practical approach to alignment adopted by major AI laboratories.
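The pipeline described above can be summarized as a loop. The sketch below is purely schematic: the function names are stand-ins rather than a real library API, and every step that the paper identifies as a research challenge is elided.

```python
def collect_comparisons(policy, n_pairs):
    """Ask an evaluator (a human, or an AI-assisted human in recursive reward
    modeling) which of two sampled behaviors they prefer."""
    ...

def fit_reward_model(comparisons):
    """Train a model that scores behaviors consistently with those preferences."""
    ...

def improve_policy(policy, reward_model):
    """Run reinforcement learning with the learned reward model as the objective."""
    ...

def alignment_loop(policy, iterations=10):
    """Alternate between gathering preferences, updating the reward model,
    and optimizing the agent against it."""
    for _ in range(iterations):
        comparisons = collect_comparisons(policy, n_pairs=500)
        reward_model = fit_reward_model(comparisons)
        policy = improve_policy(policy, reward_model)
    return policy
```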
- Datasheets for Datasets⤴
This paper proposes that every dataset used in machine learning should be accompanied by a standardized 'datasheet' documenting key information about its creation, composition, intended uses, and limitations, modeled on the datasheets that accompany electronic components in the hardware industry. The authors argue that the lack of documentation around training data contributes to the reproduction of biases, the use of datasets in inappropriate contexts, and the difficulty of auditing AI systems for fairness and accountability. The proposed datasheet template includes questions about the motivation behind dataset collection, the demographic composition of the data, how consent was obtained from data subjects, and what preprocessing or labeling steps were applied. By creating a shared standard for dataset transparency, the paper aims to shift responsibility upstream in the machine learning pipeline and make it easier for researchers and practitioners to make informed decisions about which data to use. The framework has since been adopted or adapted by several major organizations and has influenced broader movements toward documentation standards in AI, including model cards and system cards.
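A hedged sketch of the idea as data, including a small check that surfaces unanswered questions. The section names and answers below are paraphrased placeholders for a hypothetical dataset, not the template's verbatim questions.

```python
# Illustrative datasheet fragment; questions are paraphrased, answers are invented.
datasheet = {
    "motivation": {
        "purpose": "Research benchmark for toxicity classification of short comments.",
        "funded_by": None,                    # unanswered entries stay None
    },
    "composition": {
        "instance_count": 120_000,
        "demographic_information": "Self-reported dialect labels for a 10% subsample.",
    },
    "collection_process": {
        "consent": "Public posts collected under platform terms of service only.",
    },
    "preprocessing": {
        "steps": ["deduplication", "English-language filtering"],
    },
    "uses": {
        "intended": "Research on moderation models.",
        "out_of_scope": "Automated enforcement without human review.",
    },
}

def unanswered(sheet: dict) -> list:
    """Return 'section.question' keys that still lack an answer."""
    return [
        f"{section}.{question}"
        for section, answers in sheet.items()
        for question, value in answers.items()
        if value in (None, "", [])
    ]

print(unanswered(datasheet))    # -> ['motivation.funded_by']
```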
- The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation⤴
This report surveys the landscape of threats that emerge when increasingly capable AI systems are deliberately misused, organizing risks across three domains: digital security, physical security, and political security. In the digital realm, the authors warn that AI could automate the discovery of software vulnerabilities and craft highly targeted phishing attacks at scale. In the physical domain, they discuss how autonomous drones and other robotic systems could be weaponized, while in the political sphere they highlight the potential for AI-generated disinformation campaigns, surveillance, and manipulation of public opinion. The authors bring together experts from academia, civil society, and industry to propose interventions including responsible disclosure norms, technical safeguards, and policy frameworks. The paper was among the first comprehensive efforts to map out the dual-use risks of AI and has shaped ongoing debates about publication norms, export controls, and the responsibilities of AI developers to anticipate misuse.
- Feature Visualization⤴
This paper establishes feature visualization as a foundational technique for understanding what individual neurons and layers in neural networks have learned, demonstrating methods for generating synthetic images that maximally activate specific neurons to reveal what features they detect. The authors develop and refine optimization-based approaches that start with random noise and iteratively modify an image to increase a target neuron's activation, applying regularization techniques to produce clear, interpretable visualizations rather than adversarial noise patterns. The resulting visualizations reveal a rich hierarchy of learned features, from simple edge and texture detectors in early layers to complex object and scene detectors in deeper layers, providing concrete evidence that neural networks learn meaningful internal representations. The paper also introduces methods for visualizing how features interact and combine, laying groundwork for the circuits-based interpretability research that would follow. Published in the interactive Distill journal, the work exemplifies Chris Olah's commitment to making neural network internals transparent and interpretable, a research direction he carried from Google Brain to OpenAI and later made central to Anthropic's safety strategy.
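A bare-bones version of that optimization loop is sketched below. The choice of model, layer, channel index, step count, and the clamp-to-range "regularizer" are all simplifying assumptions; the paper's method adds transformation robustness and other regularizers that matter a great deal to visual quality in practice.

```python
import torch
import torchvision.models as models

# Activation maximization: start from noise and ascend the gradient of one channel's
# mean activation. Model, layer, and channel choices here are arbitrary illustrations.
model = models.googlenet(weights="IMAGENET1K_V1").eval()
for p in model.parameters():
    p.requires_grad_(False)

captured = {}
def save_activation(_module, _inputs, output):
    captured["act"] = output

model.inception4c.register_forward_hook(save_activation)

img = torch.rand(1, 3, 224, 224, requires_grad=True)   # random-noise starting point
optimizer = torch.optim.Adam([img], lr=0.05)
channel = 10                                            # arbitrary channel to visualize

for _ in range(256):
    optimizer.zero_grad()
    model(img)
    loss = -captured["act"][0, channel].mean()          # negate: we want to maximize
    loss.backward()
    optimizer.step()
    with torch.no_grad():
        img.clamp_(0.0, 1.0)                            # crude stand-in for real regularization
# `img` now (crudely) shows what excites the chosen channel.
```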
- Deep Reinforcement Learning from Human Preferences⤴
This paper introduces one of the foundational techniques in modern AI alignment by demonstrating that reinforcement learning agents can learn to perform complex tasks guided not by a hand-coded reward function but by human judgments about which of two behavioral trajectories they prefer. The method works by training a reward model on these pairwise human preference comparisons and then using that learned reward model to train the agent via standard reinforcement learning, effectively separating the problem of specifying goals from the problem of achieving them. The authors show that this approach can solve challenging tasks in simulated robotics and Atari games using a remarkably small amount of human feedback, making it practical for real-world applications. This work laid the groundwork for reinforcement learning from human feedback, or RLHF, which became the core technique behind the fine-tuning of large language models like ChatGPT and has been adopted across the AI industry. The paper's significance extends beyond its technical contribution: it demonstrated that alignment research could produce methods with immediate practical value, helping to build institutional support for safety research within commercial AI organizations.
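A compact sketch of the reward-model training step: a small network scores trajectory segments, and is trained so that the segment the evaluator preferred receives the higher score, via a cross-entropy objective on pairwise comparisons. The tiny MLP, the segment encoding, and the random stand-in data are assumptions for illustration; the actual systems used deep networks over pixels and real human comparisons.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Scores a trajectory segment by summing per-step predicted rewards."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, steps, obs_dim); per-step scores are summed over the segment
        return self.net(segment).squeeze(-1).sum(dim=-1)

def preference_loss(rm, seg_a, seg_b, prefer_a):
    """Cross-entropy on which of two trajectory segments the evaluator preferred."""
    logits = rm(seg_a) - rm(seg_b)            # > 0 means the model favors segment A
    return nn.functional.binary_cross_entropy_with_logits(logits, prefer_a.float())

# Toy usage with random tensors standing in for recorded trajectory segments.
rm = RewardModel(obs_dim=8)
optimizer = torch.optim.Adam(rm.parameters(), lr=1e-3)
seg_a, seg_b = torch.randn(32, 25, 8), torch.randn(32, 25, 8)
prefer_a = torch.randint(0, 2, (32,))         # 1 means segment A was preferred
loss = preference_loss(rm, seg_a, seg_b, prefer_a)
loss.backward()
optimizer.step()
print(float(loss))
```

The learned reward model then replaces the hand-coded reward in an otherwise standard reinforcement learning loop, which is what separates goal specification from goal achievement.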
- Concrete Problems in AI Safety⤴
This paper identifies five practical research problems that arise when AI systems operate in the real world: avoiding negative side effects on the environment, preventing reward hacking where agents exploit loopholes in their objective functions, enabling scalable oversight so humans can supervise systems even when they cannot evaluate every action, ensuring safe exploration so agents do not take catastrophic actions while learning, and handling distributional shift when deployment conditions differ from training. The authors ground each problem in concrete scenarios ranging from cleaning robots to autonomous vehicles, making abstract alignment concerns tangible for the machine learning research community. By framing AI safety as a set of well-defined engineering challenges rather than a philosophical debate, the paper helped legitimize the field and provided a shared research agenda that has influenced subsequent work at major labs. It remains one of the most widely cited papers in AI safety and served as an early signal that researchers inside leading AI organizations took these risks seriously enough to dedicate resources to solving them.