LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization

Muhammad Adilazuarda, Samuel Cahyawijaya, Genta Winata, Ayu Purwarianti, Alham Aji


Abstract
Pretrained language models (PLMs) have shown remarkable generalization across multiple tasks and languages. Nonetheless, PLMs generalize poorly to unseen languages, resulting in significantly worse performance or even nonsensical responses comparable to a random baseline. This limitation is a longstanding problem for PLMs and raises concerns about diversity and equal access to language modeling technology. In this work, we address this limitation by introducing LinguAlchemy, a regularization technique that incorporates typological, geographical, and phylogenetic aspects of languages, constraining the resulting representations of PLMs to better characterize the corresponding linguistic constraints. LinguAlchemy significantly improves the accuracy of mBERT and XLM-R on unseen languages by ~18% and ~2%, respectively, compared to fully finetuned models, and displays a high degree of unseen language generalization. We further introduce AlchemyScale and AlchemyTune, extensions of LinguAlchemy that adjust the linguistic regularization weights automatically, alleviating the need for hyperparameter search. LinguAlchemy enables better cross-lingual generalization to unseen languages, which is vital for better inclusivity and accessibility of PLMs.
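The abstract does not spell out the form of the regularizer, so the following is only a minimal sketch of the general idea of adding a linguistic penalty to a task loss. It assumes, purely for illustration, that the penalty is a mean squared error between a projected model representation and a fixed per-language feature vector, scaled by a single weight. The names `regularized_loss`, `projected_repr`, `language_vector`, and `reg_weight` are hypothetical and do not come from the paper.

```python
from typing import Sequence


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Average squared difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def regularized_loss(
    task_loss: float,
    projected_repr: Sequence[float],
    language_vector: Sequence[float],
    reg_weight: float,
) -> float:
    """Task loss plus a weighted penalty pulling the representation
    toward the language's feature vector.

    Assumption: an MSE penalty with one scalar weight; the paper's
    actual objective may differ.
    """
    penalty = mean_squared_error(projected_repr, language_vector)
    return task_loss + reg_weight * penalty


# Toy values, not taken from the paper.
loss = regularized_loss(
    task_loss=0.5,
    projected_repr=[1.0, 0.0, 1.0],
    language_vector=[1.0, 1.0, 1.0],
    reg_weight=0.3,
)
```

In this sketch `reg_weight` is a fixed hyperparameter; the abstract states that AlchemyScale and AlchemyTune adjust such regularization weights automatically, which is not modeled here.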
Anthology ID:
2024.findings-emnlp.225
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
3912–3928
URL:
https://aclanthology.org/2024.findings-emnlp.225
Cite (ACL):
Muhammad Adilazuarda, Samuel Cahyawijaya, Genta Winata, Ayu Purwarianti, and Alham Aji. 2024. LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 3912–3928, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
LinguAlchemy: Fusing Typological and Geographical Elements for Unseen Language Generalization (Adilazuarda et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-emnlp.225.pdf