@inproceedings{lopez-otal-gracia-2026-language,
title = "``We Are (Language) Family'': Adapting Transformer models to related minority languages with linguistic data",
author = "L{\'o}pez-Otal, Miguel and
Gracia, Jorge",
editor = "Hettiarachchi, Hansi and
Ranasinghe, Tharindu and
Plum, Alistair and
Rayson, Paul and
Mitkov, Ruslan and
Gaber, Mohamed and
Premasiri, Damith and
Tan, Fiona Anting and
Uyangodage, Lasitha",
booktitle = "Proceedings of the Second Workshop on Language Models for Low-Resource Languages ({L}o{R}es{LM} 2026)",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.loreslm-1.26/",
pages = "297--310",
ISBN = "979-8-89176-377-7",
abstract = "Transformer-based language models, despite their widespread use, remain mostly unavailable for low-resourced languages (LRLs), due to their lack of texts for pre-training. While solutions have emerged to remedy this, they still almost exclusively rely on raw text corpora, which may be almost non-existent for some languages. A recent line of work has attempted to circumvent this by replacing these with linguistics-based materials, such as grammars, to adapt LRLs to these models. However, many approaches tend to work with languages that are typologically very distant to each other.In this work we investigate whether adapting closely related languages, belonging to the same family, with linguistics-based data can facilitate this process. For this, we look into the adaptation of two Spanish-based Transformer encoders {--}a monolingual and multilingual models{--} to Aragonese, a low-resourced Romance language spoken in Northern Spain, with whom it shares similar syntax but differing lexical and morphological phenomena. We rely on several knowledge injection methods, with which we report results, for a monolingual model, above some baselines in a set of Natural Language Understanding (NLU) benchmarks, proving the efficiency of relying on linguistics materials {--}or combined with a small amount of text{--} when languages belong to the same family."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="lopez-otal-gracia-2026-language">
    <titleInfo>
      <title>“We Are (Language) Family”: Adapting Transformer models to related minority languages with linguistic data</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Miguel</namePart>
      <namePart type="family">López-Otal</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Jorge</namePart>
      <namePart type="family">Gracia</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2026-03</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Hansi</namePart>
        <namePart type="family">Hettiarachchi</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Tharindu</namePart>
        <namePart type="family">Ranasinghe</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Alistair</namePart>
        <namePart type="family">Plum</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Paul</namePart>
        <namePart type="family">Rayson</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Ruslan</namePart>
        <namePart type="family">Mitkov</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Mohamed</namePart>
        <namePart type="family">Gaber</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Damith</namePart>
        <namePart type="family">Premasiri</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Fiona</namePart>
        <namePart type="given">Anting</namePart>
        <namePart type="family">Tan</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Lasitha</namePart>
        <namePart type="family">Uyangodage</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Rabat, Morocco</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-377-7</identifier>
    </relatedItem>
    <abstract>Transformer-based language models, despite their widespread use, remain mostly unavailable for low-resourced languages (LRLs), due to the lack of texts for pre-training. While solutions have emerged to remedy this, they still almost exclusively rely on raw text corpora, which may be almost non-existent for some languages. A recent line of work has attempted to circumvent this by replacing such corpora with linguistics-based materials, such as grammars, to adapt these models to LRLs. However, many approaches tend to work with languages that are typologically very distant from each other. In this work, we investigate whether adapting models to closely related languages, belonging to the same family, with linguistics-based data can facilitate this process. For this, we look into the adaptation of two Spanish-based Transformer encoders –a monolingual and a multilingual model– to Aragonese, a low-resourced Romance language spoken in Northern Spain, which shares a similar syntax with Spanish but differs in lexical and morphological phenomena. We rely on several knowledge injection methods, with which we report results for a monolingual model above some baselines on a set of Natural Language Understanding (NLU) benchmarks, showing the effectiveness of relying on linguistic materials –alone or combined with a small amount of text– when languages belong to the same family.</abstract>
<identifier type="citekey">lopez-otal-gracia-2026-language</identifier>
<location>
<url>https://aclanthology.org/2026.loreslm-1.26/</url>
</location>
<part>
<date>2026-03</date>
<extent unit="page">
<start>297</start>
<end>310</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T “We Are (Language) Family”: Adapting Transformer models to related minority languages with linguistic data
%A López-Otal, Miguel
%A Gracia, Jorge
%Y Hettiarachchi, Hansi
%Y Ranasinghe, Tharindu
%Y Plum, Alistair
%Y Rayson, Paul
%Y Mitkov, Ruslan
%Y Gaber, Mohamed
%Y Premasiri, Damith
%Y Tan, Fiona Anting
%Y Uyangodage, Lasitha
%S Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
%D 2026
%8 March
%I Association for Computational Linguistics
%C Rabat, Morocco
%@ 979-8-89176-377-7
%F lopez-otal-gracia-2026-language
%X Transformer-based language models, despite their widespread use, remain mostly unavailable for low-resourced languages (LRLs), due to the lack of texts for pre-training. While solutions have emerged to remedy this, they still almost exclusively rely on raw text corpora, which may be almost non-existent for some languages. A recent line of work has attempted to circumvent this by replacing such corpora with linguistics-based materials, such as grammars, to adapt these models to LRLs. However, many approaches tend to work with languages that are typologically very distant from each other. In this work, we investigate whether adapting models to closely related languages, belonging to the same family, with linguistics-based data can facilitate this process. For this, we look into the adaptation of two Spanish-based Transformer encoders –a monolingual and a multilingual model– to Aragonese, a low-resourced Romance language spoken in Northern Spain, which shares a similar syntax with Spanish but differs in lexical and morphological phenomena. We rely on several knowledge injection methods, with which we report results for a monolingual model above some baselines on a set of Natural Language Understanding (NLU) benchmarks, showing the effectiveness of relying on linguistic materials –alone or combined with a small amount of text– when languages belong to the same family.
%U https://aclanthology.org/2026.loreslm-1.26/
%P 297-310

Markdown (Informal)
[“We Are (Language) Family”: Adapting Transformer models to related minority languages with linguistic data](https://aclanthology.org/2026.loreslm-1.26/) (López-Otal & Gracia, LoResLM 2026)
ACL
Miguel López-Otal and Jorge Gracia. 2026. [“We Are (Language) Family”: Adapting Transformer models to related minority languages with linguistic data](https://aclanthology.org/2026.loreslm-1.26/). In *Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)*, pages 297–310, Rabat, Morocco. Association for Computational Linguistics.