San Francisco Alignment Workshop

27–28 February 2023, San Francisco

In February 2023, researchers from several leading industry AI labs (OpenAI, DeepMind, Anthropic) and universities (Cambridge, NYU) co-organized a two-day workshop on the problem of AI alignment, attended by 80 of the world's leading machine learning researchers. We're now making recordings and transcripts of the talks available online. The content ranged from very concrete to highly speculative, and the recordings include the many questions, interjections and debates that arose throughout. We hope they provide a window into how leading researchers are grappling with one of the most pressing problems of our time.

If you're a machine learning researcher interested in attending follow-up workshops similar to the San Francisco Alignment Workshop, you can fill out this form.

Main talks

Ilya Sutskever—Opening Remarks: Confronting the Possibility of AGI
Jacob Steinhardt—Aligning Massive Models: Current and Future Challenges
Ajeya Cotra—“Situational Awareness” Makes Measuring Safety Tricky
Paul Christiano—How Misalignment Could Lead to Takeover
Jan Leike—Scaling Reinforcement Learning from Human Feedback
Chris Olah—Looking Inside Neural Networks with Mechanistic Interpretability
Dan Hendrycks—Surveying Safety Research Directions

Lightning talks (Day 1)

Jason Wei—Emergent abilities of language models
Martin Wattenberg—Emergent world models and instrumenting AI systems
Been Kim—Alignment, setbacks and beyond alignment
Jascha Sohl-Dickstein—More intelligent agents behave less coherently
Ethan Perez—Model-written evals
Daniel Brown—Challenges and progress towards efficient and causal preference-based reward learning
Boaz Barak—For both alignment and utility: focus on the medium term
Ellie Pavlick—Comparing neural networks' conceptual representations to humans’
Percy Liang—Transparency and standards for language model evaluation

(Note: some audience questions in the recording are inaudible)

Lightning talks (Day 2)

Sam Bowman—Measuring progress on scalable oversight for large language models
Zico Kolter—"Safe Mode": the case for (manually) verifying the output of LLMs
Roger Grosse—Understanding LLM generalization using influence functions
Scott Niekum—Models of human preferences for learning reward functions
Aleksander Madry—Faster datamodels as a new approach to alignment
Andreas Stuhlmüller—Iterated decomposition: improving science Q&A by supervising reasoning processes
Paul Christiano—Mechanistic anomaly detection
Lionel Levine—Social dynamics of reinforcement learners
Vincent Conitzer—Foundations of Cooperative AI Lab
Scott Aaronson—Cryptographic backdoors in large language models

(Note: some audience questions in the recording are inaudible)

Attendees

Organizers

Ilya Sutskever

Richard Ngo

Sam Bowman

Tim Lillicrap

David Krueger

Jan Leike


Participants

Scott Aaronson

Joshua Achiam

Samuel Albanie

Jimmy Ba

Boaz Barak

James Bradbury

Noam Brown

Daniel Brown

Yuri Burda

Danqi Chen

Mark Chen

Eunsol Choi

Yejin Choi

Paul Christiano

Jeff Clune

Vincent Conitzer

Ajeya Cotra

Allan Dafoe

David Dohan

Anca Dragan

David Duvenaud

Jacob Eisenstein

Owain Evans

Amelia Glaese

Adam Gleave

Roger Grosse

He He

Dan Hendrycks

Jacob Hilton

Andrej Karpathy

Been Kim

Durk Kingma

Zico Kolter

Shane Legg

Lionel Levine

Omer Levy

Percy Liang

Ryan Lowe

Aleksander Madry

Tegan Maharaj

Nat McAleese

Sheila McIlraith

Roland Memisevic

Jesse Mu

Karthik Narasimhan

Scott Niekum

Chris Olah

Avital Oliver

Jakub Pachocki

Ellie Pavlick

Ethan Perez

Alec Radford

Jack Rae

Aditi Raghunathan

Aditya Ramesh

Mengye Ren

William Saunders

Rohin Shah

Jascha Sohl-Dickstein

Jacob Steinhardt

Andreas Stuhlmüller

Alex Tamkin

Nate Thomas

Maja Trebacz

Ashish Vaswani

Victor Veitch

Martin Wattenberg

Greg Wayne

Jason Wei

Hyung Won Chung

Yuhuai Wu

Jeffrey Wu

Diyi Yang

Wojciech Zaremba

Brian Ziebart