Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Authors: Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Evan Walsh, Luke Zettlemoyer, Noah Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo
Published in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Editors: Lun-Wei Ku, Andre Martins, Vivek Srikumar
Publisher: Association for Computational Linguistics
Location: Bangkok, Thailand
Date: 2024-08
Type: text (conference publication)
Citation key: soldaini-etal-2024-dolma
DOI: 10.18653/v1/2024.acl-long.840
URL: https://aclanthology.org/2024.acl-long.840/
Pages: 15725-15788