Severino Da Dalt
2024
A CURATEd CATalog: Rethinking the Extraction of Pretraining Corpora for Mid-Resourced Languages
Jorge Palomar-Giner
|
Jose Javier Saiz
|
Ferran Espuña
|
Mario Mina
|
Severino Da Dalt
|
Joan Llop
|
Malte Ostendorff
|
Pedro Ortiz Suarez
|
Georg Rehm
|
Aitor Gonzalez-Agirre
|
Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We present and describe two language resources in this paper: CATalog 1.0, the largest text corpus in Catalan to date, and CURATE (Corpus Utility for RAting TExt), a modular, parallelizable pipeline used for processing and scoring documents based on text quality that we have optimised to run in High Performance Cluster (HPC) environments. In the coming sections we describe our data preprocessing pipeline at length; traditional pipelines usually implement a set of binary filters such that a given document is either in or out. In our experience with Catalan, in lower-resource settings it is more practical to instead assign a document a soft score to allow for more flexible decision-making. We describe how the document score is calculated and highlight its interpretability by showing that it is significantly correlated with human judgements as obtained from a comparative judgement experiment. We additionally describe the different subcorpora that make up CATalog 1.0.
FLOR: On the Effectiveness of Language Adaptation
Severino Da Dalt
|
Joan Llop
|
Irene Baucells
|
Marc Pamies
|
Yishi Xu
|
Aitor Gonzalez-Agirre
|
Marta Villegas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models have amply proven their great capabilities, both in downstream tasks and real-life settings. However, low- and mid-resource languages do not have access to the necessary means to train such models from scratch, and often have to rely on multilingual models despite being underrepresented in the training data. For the particular case of the Catalan language, we prove that continued pre-training with vocabulary adaptation is a better alternative to take the most out of already pre-trained models, even if these have not seen any Catalan data during their pre-training phase. We curate a 26B tokens corpus and use it to further pre-train BLOOM, giving rise to the FLOR models. We perform an extensive evaluation to assess the effectiveness of our method, obtaining consistent gains across Catalan and Spanish tasks. The models, training data, and evaluation framework are made freely available under permissive licenses.
Search
Fix data
Co-authors
- Aitor González-Agirre 2
- Joan Llop 2
- Marta Villegas 2
- Irene Baucells 1
- Ferran Espuña 1
- show all...