Russell Authur


2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini | Rodney Kinney | Akshita Bhagia | Dustin Schwenk | David Atkinson | Russell Authur | Ben Bogin | Khyathi Chandu | Jennifer Dumas | Yanai Elazar | Valentin Hofmann | Ananya Jha | Sachin Kumar | Li Lucy | Xinxi Lyu | Nathan Lambert | Ian Magnusson | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Abhilasha Ravichander | Kyle Richardson | Zejiang Shen | Emma Strubell | Nishant Subramani | Oyvind Tafjord | Evan Walsh | Luke Zettlemoyer | Noah Smith | Hannaneh Hajishirzi | Iz Beltagy | Dirk Groeneveld | Jesse Dodge | Kyle Lo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld | Iz Beltagy | Evan Walsh | Akshita Bhagia | Rodney Kinney | Oyvind Tafjord | Ananya Jha | Hamish Ivison | Ian Magnusson | Yizhong Wang | Shane Arora | David Atkinson | Russell Authur | Khyathi Chandu | Arman Cohan | Jennifer Dumas | Yanai Elazar | Yuling Gu | Jack Hessel | Tushar Khot | William Merrill | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Valentina Pyatkin | Abhilasha Ravichander | Dustin Schwenk | Saurabh Shah | William Smith | Emma Strubell | Nishant Subramani | Mitchell Wortsman | Pradeep Dasigi | Nathan Lambert | Kyle Richardson | Luke Zettlemoyer | Jesse Dodge | Kyle Lo | Luca Soldaini | Noah Smith | Hannaneh Hajishirzi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.

2023

PaperMage: A Unified Toolkit for Processing, Representing, and Manipulating Visually-Rich Scientific Documents
Kyle Lo | Zejiang Shen | Benjamin Newman | Joseph Chang | Russell Authur | Erin Bransom | Stefan Candra | Yoganand Chandrasekhar | Regan Huff | Bailey Kuehl | Amanpreet Singh | Chris Wilhelm | Angele Zamarron | Marti A. Hearst | Daniel Weld | Doug Downey | Luca Soldaini
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Despite growing interest in applying natural language processing (NLP) and computer vision (CV) models to the scholarly domain, scientific documents remain challenging to work with. They’re often in difficult-to-use PDF formats, and the ecosystem of models to process them is fragmented and incomplete. We introduce PaperMage, an open-source Python toolkit for analyzing and processing visually-rich, structured scientific documents. PaperMage offers clean and intuitive abstractions for seamlessly representing and manipulating both textual and visual document elements. PaperMage achieves this by integrating disparate state-of-the-art NLP and CV models into a unified framework, and provides turn-key recipes for common scientific document processing use-cases. PaperMage has powered multiple research prototypes of AI applications over scientific documents, along with Semantic Scholar’s large-scale production system for processing millions of PDFs. GitHub: https://github.com/allenai/papermage