Title: Discovering Language Model Behaviors with Model-Written Evaluations
Authors: Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, Andy Jones, Anna Chen, Benjamin Mann, Brian Israel, Bryan Seethor, Cameron McKinnon, Christopher Olah, Da Yan, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Guro Khundadze, Jackson Kernion, James Landis, Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau, Kamal Ndousse, Landon Goldberg, Liane Lovitt, Martin Lucas, Michael Sellitto, Miranda Zhang, Neerav Kingsland, Nelson Elhage, Nicholas Joseph, Noemi Mercado, Nova DasSarma, Oliver Rausch, Robin Larson, Sam McCandlish, Scott Johnston, Shauna Kravec, Sheer El Showk, Tamera Lanham, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Jack Clark, Samuel R. Bowman, Amanda Askell, Roger Grosse, Danny Hernandez, Deep Ganguli, Evan Hubinger, Nicholas Schiefer, Jared Kaplan
Published: 2023-07
Type: conference publication (text)
In: Findings of the Association for Computational Linguistics: ACL 2023
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Publisher: Association for Computational Linguistics
Location: Toronto, Canada
Pages: 13387-13434
Citation key: perez-etal-2023-discovering
DOI: 10.18653/v1/2023.findings-acl.847
URL: https://aclanthology.org/2023.findings-acl.847/