Title: A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages
Authors: Jorge Palomar-Giner, Jose Javier Saiz, Ferran Espuña, Mario Mina, Severino Da Dalt, Joan Llop, Malte Ostendorff, Pedro Ortiz Suarez, Georg Rehm, Aitor Gonzalez-Agirre, Marta Villegas
Date: 2024-05
Type: conference publication
Published in: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Publisher: ELRA and ICCL
Place: Torino, Italia
Pages: 335-349
Identifier: palomar-giner-etal-2024-curated
URL: https://aclanthology.org/2024.lrec-main.31/