Miguel López-Otal
Also published as: Miguel López Otal
2026
“We Are (Language) Family”: Adapting Transformer models to related minority languages with linguistic data
Miguel López-Otal | Jorge Gracia
Proceedings of the Second Workshop on Language Models for Low-Resource Languages (LoResLM 2026)
Transformer-based language models, despite their widespread use, remain mostly unavailable for low-resourced languages (LRLs) due to the scarcity of texts for pre-training. While solutions have emerged to remedy this, they still rely almost exclusively on raw text corpora, which may be nearly non-existent for some languages. A recent line of work has attempted to circumvent this by replacing such corpora with linguistics-based materials, such as grammars, to adapt these models to LRLs. However, many approaches tend to work with languages that are typologically very distant from each other. In this work, we investigate whether adapting closely related languages, belonging to the same family, with linguistics-based data can facilitate this process. To this end, we look into the adaptation of two Spanish-based Transformer encoders (a monolingual and a multilingual model) to Aragonese, a low-resourced Romance language spoken in Northern Spain, with which Spanish shares similar syntax but differing lexical and morphological phenomena. We rely on several knowledge injection methods, with which we report results, for a monolingual model, above several baselines on a set of Natural Language Understanding (NLU) benchmarks, demonstrating the effectiveness of relying on linguistic materials, alone or combined with a small amount of text, when languages belong to the same family.
2024
Croatian Idioms Integration: Enhancing the LIdioms Multilingual Linked Idioms Dataset
Ivana Filipović Petrović | Miguel López Otal | Slobodan Beliga
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Idioms, also referred to as phraseological units in some linguistic terminologies, are a subset of the broader category of multi-word expressions. However, idioms in Croatian, a low-resourced language, are under-represented in the Linguistic Linked Open Data (LLOD) cloud. To address this gap, we propose an extension of an existing RDF-based multilingual representation of idioms, the LIdioms dataset, which currently includes idioms from English, German, Italian, Portuguese, and Russian. This paper expands the existing resource by incorporating 1,042 Croatian idioms in the OntoLex-Lemon format. In addition, to foster translation initiatives and facilitate intercultural exchange, the added Croatian idioms have been linked to other idioms in the LIdioms dataset with which they share similar meanings despite differences in expression. This addition enriches the knowledge base of the LLOD community with a new language resource for Croatian idioms.