Title: A Pretrainer’s Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Authors: Shayne Longpre, Gregory Yauney, Emily Reif, Katherine Lee, Adam Roberts, Barret Zoph, Denny Zhou, Jason Wei, Kevin Robinson, David Mimno, Daphne Ippolito
Date: 2024-06
Resource type: text
Genre: conference publication
Published in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Publisher: Association for Computational Linguistics
Place: Mexico City, Mexico
Pages: 3245-3276
Citation key: longpre-etal-2024-pretrainers
DOI: 10.18653/v1/2024.naacl-long.179
URL: https://aclanthology.org/2024.naacl-long.179/