Geographic Adaptation of Pretrained Language Models

Valentin Hofmann; Goran Glavaš; Nikola Ljubešić; Janet Pierrehumbert; Hinrich Schütze

doi:10.1162/tacl_a_00652

Abstract

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: The geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.
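The multi-task setup described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two objectives are combined, so everything below is an assumption made for illustration: the function names, the use of mean squared error on (longitude, latitude) pairs, and the fixed weight `geo_weight` are all hypothetical and should not be read as the authors' implementation.

```python
import math

def masked_lm_loss(token_probs):
    """Mean negative log-likelihood of the correct tokens at masked positions.

    token_probs: probabilities the model assigned to the gold tokens.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def geolocation_loss(pred_coords, gold_coords):
    """Mean squared error between predicted and gold coordinates.

    Assumption: geolocation is treated as regression on (longitude, latitude).
    """
    return sum((p - g) ** 2 for p, g in zip(pred_coords, gold_coords)) / len(gold_coords)

def multitask_loss(token_probs, pred_coords, gold_coords, geo_weight=1.0):
    """Couple both objectives in one scalar loss.

    Assumption: a plain weighted sum; geo_weight is a hypothetical hyperparameter.
    """
    lm = masked_lm_loss(token_probs)
    geo = geolocation_loss(pred_coords, gold_coords)
    return lm + geo_weight * geo

# Example: one text with two masked tokens and a slightly wrong predicted location.
loss = multitask_loss([0.9, 0.6], (11.5, 48.2), (11.6, 48.1), geo_weight=0.5)
print(f"combined loss: {loss:.4f}")
```

Under this sketch, a single gradient step would update the shared model on both signals at once, which is one way an intermediate training step could tie textual patterns to locations.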

Anthology ID: 2024.tacl-1.23
Volume: Transactions of the Association for Computational Linguistics, Volume 12
Year: 2024
Address: Cambridge, MA
Venue: TACL
Publisher: MIT Press
Pages: 411–431
URL: https://aclanthology.org/2024.tacl-1.23/
DOI: 10.1162/tacl_a_00652
Cite (ACL): Valentin Hofmann, Goran Glavaš, Nikola Ljubešić, Janet B. Pierrehumbert, and Hinrich Schütze. 2024. Geographic Adaptation of Pretrained Language Models. Transactions of the Association for Computational Linguistics, 12:411–431.
Cite (Informal): Geographic Adaptation of Pretrained Language Models (Hofmann et al., TACL 2024)
PDF: https://aclanthology.org/2024.tacl-1.23.pdf
