OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

Jiacheng Liu; Taylor Blanton; Yanai Elazar; Sewon Min; Yen-Sung Chen; Arnavi Chheda-Kothary; Huy Tran; Byron Bischoff; Eric Marsh; Michael Schmitz; Cassidy Trier; Aaron Sarnat; Jenna James; Jon Borchardt; Bailey Kuehl; Evie Yu-Yen Cheng; Karen Farley; Taira Anderson; David Albright; Carissa Schoenick; Luca Soldaini; Dirk Groeneveld; Rock Yuren Pang; Pang Wei Koh; Noah A. Smith; Sophie Lebrecht; Yejin Choi; Hannaneh Hajishirzi; Ali Farhadi; Jesse Dodge

doi:10.18653/v1/2025.acl-demo.18

Abstract

We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
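To make the idea of verbatim matching concrete, the following is a minimal sketch of finding maximal token spans of a model output that appear word for word in a small in-memory corpus. It is not the OLMoTrace or infini-gram implementation: the function names (`contains`, `maximal_matches`), the whitespace tokenization, the `min_len` threshold, and the brute-force search are all assumptions made for illustration. A real system operating on trillions of tokens would need an indexed search rather than this linear scan.

```python
from typing import List, Tuple


def contains(doc: List[str], span: List[str]) -> bool:
    """Return True if `span` occurs as a contiguous token run in `doc`."""
    n = len(span)
    return any(doc[k:k + n] == span for k in range(len(doc) - n + 1))


def maximal_matches(
    output: str, corpus: List[str], min_len: int = 3
) -> List[Tuple[int, int, int]]:
    """Return (start, end, doc_index) for maximal verbatim spans.

    `start` and `end` are token offsets into the whitespace-split output
    (end exclusive). Hypothetical helper, brute force, for illustration only.
    """
    out = output.split()
    docs = [d.split() for d in corpus]
    results: List[Tuple[int, int, int]] = []
    prev_end = 0  # furthest end of any span recorded so far
    for i in range(len(out)):
        best_end, best_doc = i, -1
        for d_idx, doc in enumerate(docs):
            j = i
            # Extend the span one token at a time while it still occurs in doc.
            while j < len(out) and contains(doc, out[i:j + 1]):
                j += 1
            if j > best_end:
                best_end, best_doc = j, d_idx
        # Keep the span only if it is long enough and not nested
        # inside a span that started earlier.
        if best_end - i >= min_len and best_end > prev_end:
            results.append((i, best_end, best_doc))
            prev_end = best_end
    return results


if __name__ == "__main__":
    output = "the quick brown fox jumps over the lazy dog today"
    corpus = ["a quick brown fox jumps high", "over the lazy dog sleeps"]
    for start, end, doc_idx in maximal_matches(output, corpus):
        print(" ".join(output.split()[start:end]), "-> document", doc_idx)
```

Each returned span can then be highlighted in the output and linked to the document it was found in, which is the kind of display the abstract describes.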

Anthology ID: 2025.acl-demo.18
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Pushkar Mishra, Smaranda Muresan, Tao Yu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 178–188
URL: https://aclanthology.org/2025.acl-demo.18/
DOI: 10.18653/v1/2025.acl-demo.18
Award: Best Demo
Cite (ACL): Jiacheng Liu, Taylor Blanton, Yanai Elazar, Sewon Min, Yen-Sung Chen, Arnavi Chheda-Kothary, Huy Tran, Byron Bischoff, Eric Marsh, Michael Schmitz, Cassidy Trier, Aaron Sarnat, Jenna James, Jon Borchardt, Bailey Kuehl, Evie Yu-Yen Cheng, Karen Farley, Taira Anderson, David Albright, Carissa Schoenick, Luca Soldaini, Dirk Groeneveld, Rock Yuren Pang, Pang Wei Koh, Noah A. Smith, Sophie Lebrecht, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi, and Jesse Dodge. 2025. OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 178–188, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens (Liu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-demo.18.pdf
