@inproceedings{rodriguez-etal-2025-continued,
    title = "Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A {G}alician Case Study",
    author = "Rodr{\'i}guez, Pablo and
      Su{\'a}rez, Silvia Paniagua and
      Gamallo, Pablo and
      Docio, Susana Sotelo",
    editor = "Che, Wanxiang and
      Nabende, Joyce and
      Shutova, Ekaterina and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.240/",
    doi = "10.18653/v1/2025.findings-acl.240",
    pages = "4622--4637",
    ISBN = "979-8-89176-256-5",
abstract = "Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework."
}<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="rodriguez-etal-2025-continued">
<titleInfo>
<title>Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study</title>
</titleInfo>
<name type="personal">
<namePart type="given">Pablo</namePart>
<namePart type="family">Rodríguez</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Silvia</namePart>
<namePart type="given">Paniagua</namePart>
<namePart type="family">Suárez</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Pablo</namePart>
<namePart type="family">Gamallo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Susana</namePart>
<namePart type="given">Sotelo</namePart>
<namePart type="family">Docio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-07</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Findings of the Association for Computational Linguistics: ACL 2025</title>
</titleInfo>
<name type="personal">
<namePart type="given">Wanxiang</namePart>
<namePart type="family">Che</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joyce</namePart>
<namePart type="family">Nabende</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ekaterina</namePart>
<namePart type="family">Shutova</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mohammad</namePart>
<namePart type="given">Taher</namePart>
<namePart type="family">Pilehvar</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Vienna, Austria</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-256-5</identifier>
</relatedItem>
<abstract>Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework.</abstract>
<identifier type="citekey">rodriguez-etal-2025-continued</identifier>
<identifier type="doi">10.18653/v1/2025.findings-acl.240</identifier>
<location>
<url>https://aclanthology.org/2025.findings-acl.240/</url>
</location>
<part>
<date>2025-07</date>
<extent unit="page">
<start>4622</start>
<end>4637</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study
%A Rodríguez, Pablo
%A Suárez, Silvia Paniagua
%A Gamallo, Pablo
%A Docio, Susana Sotelo
%Y Che, Wanxiang
%Y Nabende, Joyce
%Y Shutova, Ekaterina
%Y Pilehvar, Mohammad Taher
%S Findings of the Association for Computational Linguistics: ACL 2025
%D 2025
%8 July
%I Association for Computational Linguistics
%C Vienna, Austria
%@ 979-8-89176-256-5
%F rodriguez-etal-2025-continued
%X Recent advances in Large Language Models (LLMs) have led to remarkable improvements in language understanding and text generation. However, challenges remain in enhancing their performance for underrepresented languages, ensuring continual learning without catastrophic forgetting, and developing robust evaluation methodologies. This work addresses these issues by investigating the impact of Continued Pretraining (CPT) on multilingual models and proposing a comprehensive evaluation framework for LLMs, focusing on the case of the Galician language. Our first contribution explores CPT strategies for languages with limited representation in multilingual models. We analyze how CPT with Galician corpora improves text generation while assessing the trade-offs between linguistic enrichment and task-solving capabilities. Our findings show that CPT with small, high-quality corpora and diverse instructions enhances both task performance and linguistic quality. Our second contribution is a structured evaluation framework based on distinguishing task-based and language-based assessments, leveraging existing and newly developed benchmarks for Galician. Additionally, we contribute new Galician LLMs, datasets for evaluation and instructions, and an evaluation framework.
%R 10.18653/v1/2025.findings-acl.240
%U https://aclanthology.org/2025.findings-acl.240/
%U https://doi.org/10.18653/v1/2025.findings-acl.240
%P 4622-4637
Markdown (Informal)
[Continued Pretraining and Interpretability-Based Evaluation for Low-Resource Languages: A Galician Case Study](https://aclanthology.org/2025.findings-acl.240/) (Rodríguez et al., Findings 2025)