Title: Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus
Authors: Jesse Dodge, Maarten Sap, Ana Marasović, William Agnew, Gabriel Ilharco, Dirk Groeneveld, Margaret Mitchell, Matt Gardner
Published in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Editors: Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
Publisher: Association for Computational Linguistics
Location: Online and Punta Cana, Dominican Republic
Date: 2021-11
Pages: 1286-1305
Genre: conference publication
Identifier: dodge-etal-2021-documenting
DOI: 10.18653/v1/2021.emnlp-main.98
URL: https://aclanthology.org/2021.emnlp-main.98/