Title: Data, Data Everywhere: A Guide for Pretraining Dataset Construction
Authors: Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Bo Liu, Aastha Jhunjhunwala, Zhilin Wang, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Date: 2024-11
Type: text (conference publication)
Published in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics
Location: Miami, Florida, USA
Pages: 10671-10695
Identifier: parmar-etal-2024-data
DOI: 10.18653/v1/2024.emnlp-main.596
URL: https://aclanthology.org/2024.emnlp-main.596/