Rodney Kinney


2024

Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini | Rodney Kinney | Akshita Bhagia | Dustin Schwenk | David Atkinson | Russell Authur | Ben Bogin | Khyathi Chandu | Jennifer Dumas | Yanai Elazar | Valentin Hofmann | Ananya Jha | Sachin Kumar | Li Lucy | Xinxi Lyu | Nathan Lambert | Ian Magnusson | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Abhilasha Ravichander | Kyle Richardson | Zejiang Shen | Emma Strubell | Nishant Subramani | Oyvind Tafjord | Evan Walsh | Luke Zettlemoyer | Noah Smith | Hannaneh Hajishirzi | Iz Beltagy | Dirk Groeneveld | Jesse Dodge | Kyle Lo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld | Iz Beltagy | Evan Walsh | Akshita Bhagia | Rodney Kinney | Oyvind Tafjord | Ananya Jha | Hamish Ivison | Ian Magnusson | Yizhong Wang | Shane Arora | David Atkinson | Russell Authur | Khyathi Chandu | Arman Cohan | Jennifer Dumas | Yanai Elazar | Yuling Gu | Jack Hessel | Tushar Khot | William Merrill | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Valentina Pyatkin | Abhilasha Ravichander | Dustin Schwenk | Saurabh Shah | William Smith | Emma Strubell | Nishant Subramani | Mitchell Wortsman | Pradeep Dasigi | Nathan Lambert | Kyle Richardson | Luke Zettlemoyer | Jesse Dodge | Kyle Lo | Luca Soldaini | Noah Smith | Hannaneh Hajishirzi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.

2020

S2ORC: The Semantic Scholar Open Research Corpus
Kyle Lo | Lucy Lu Wang | Mark Neumann | Rodney Kinney | Daniel Weld
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. We hope this resource will facilitate research and development of tools and tasks for text mining over academic text.

2018

Construction of the Literature Graph in Semantic Scholar
Waleed Ammar | Dirk Groeneveld | Chandra Bhagavatula | Iz Beltagy | Miles Crawford | Doug Downey | Jason Dunkelberger | Ahmed Elgohary | Sergey Feldman | Vu Ha | Rodney Kinney | Sebastian Kohlmeier | Kyle Lo | Tyler Murray | Hsu-Han Ooi | Matthew Peters | Joanna Power | Sam Skjonsberg | Lucy Lu Wang | Chris Wilhelm | Zheng Yuan | Madeleine van Zuylen | Oren Etzioni
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 3 (Industry Papers)

We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions between them (e.g., authorships, citations, entity mentions). We reduce literature graph construction into familiar NLP tasks (e.g., entity extraction and linking), point out research challenges due to differences from standard formulations of these tasks, and report empirical results for each task. The methods described in this paper are used to enable semantic features in www.semanticscholar.org.