Title: CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Authors: Thuat Nguyen, Chien Van Nguyen, Viet Dac Lai, Hieu Man, Nghia Trung Ngo, Franck Dernoncourt, Ryan A Rossi, Thien Huu Nguyen
Date: 2024-05
Resource type: text
Genre: conference publication
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: ELRA and ICCL
Location: Torino, Italia
Pages: 4226-4237
Anthology ID: nguyen-etal-2024-culturax
URL: https://aclanthology.org/2024.lrec-main.377/