A Language Model Trained on Uruguayan Spanish News Text

Juan Pablo Filevich, Gonzalo Marco, Santiago Castro, Luis Chiruzzo, Aiala Rosá


Abstract
This paper presents a language model trained from scratch exclusively on a new corpus of about 6 GiB of Uruguayan newspaper text. We trained the model for 30 days on a single Nvidia P100 GPU, using the RoBERTa-base architecture but with considerably fewer parameters than standard RoBERTa models. We evaluated the model on two NLP tasks and found that it outperforms BETO, the widely used Spanish BERT pre-trained model. We also compared our model on the masked-word prediction task with two popular multilingual BERT-based models, Multilingual BERT and XLM-RoBERTa, obtaining outstanding results on sentences from the Uruguayan press domain. Our experiments show that training a language model on a domain-specific corpus can significantly improve performance, even when the model is smaller and trained on considerably less data than standard pre-trained models.
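The masked-word prediction comparison described in the abstract can be sketched with the Hugging Face Transformers fill-mask pipeline. This is an illustrative sketch, not the authors' evaluation code: the baseline model identifier is Multilingual BERT's public checkpoint, and the example sentence is invented, not taken from the paper's Uruguayan press test set.

```python
from transformers import pipeline

# Fill-mask pipeline with Multilingual BERT, one of the baselines the
# paper compares against. The sentence below is illustrative only.
fill = pipeline("fill-mask", model="bert-base-multilingual-cased")
preds = fill("El presidente de Uruguay visitó la ciudad de [MASK] ayer.")

# The pipeline returns the top-5 candidate tokens with their scores.
for p in preds:
    print(p["token_str"], round(p["score"], 3))
```

The same loop, pointed at a domain-specific model instead of the multilingual baseline, would reproduce the kind of side-by-side comparison the abstract reports.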
Anthology ID:
2024.tdle-1.5
Volume:
Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Federico Gaspari, Joss Moorkens, Itziar Aldabe, Aritz Farwell, Begoña Altuna, Stelios Piperidis, Georg Rehm, German Rigau
Venues:
TDLE | WS
Publisher:
ELRA and ICCL
Pages:
53–60
URL:
https://aclanthology.org/2024.tdle-1.5
Cite (ACL):
Juan Pablo Filevich, Gonzalo Marco, Santiago Castro, Luis Chiruzzo, and Aiala Rosá. 2024. A Language Model Trained on Uruguayan Spanish News Text. In Proceedings of the Second International Workshop Towards Digital Language Equality (TDLE): Focusing on Sustainability @ LREC-COLING 2024, pages 53–60, Torino, Italia. ELRA and ICCL.
Cite (Informal):
A Language Model Trained on Uruguayan Spanish News Text (Filevich et al., TDLE-WS 2024)
PDF:
https://aclanthology.org/2024.tdle-1.5.pdf