Geographical Erasure in Language Generation

Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cedric Archambeau, Danish Pruthi


Abstract
Large language models (LLMs) encode vast amounts of world knowledge. However, since these models are trained on large swaths of internet data, they are at risk of inordinately capturing information about dominant groups. This imbalance can propagate into generated language. In this work, we study and operationalise a form of geographical erasure wherein language models underpredict certain countries. We demonstrate consistent instances of erasure across a range of LLMs. We discover that erasure strongly correlates with low frequencies of country mentions in the training corpus. Lastly, we mitigate erasure by finetuning using a custom objective.
Anthology ID:
2023.findings-emnlp.823
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2023
Month:
December
Year:
2023
Address:
Singapore
Editors:
Houda Bouamor, Juan Pino, Kalika Bali
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12310–12324
URL:
https://aclanthology.org/2023.findings-emnlp.823
DOI:
10.18653/v1/2023.findings-emnlp.823
Cite (ACL):
Pola Schwöbel, Jacek Golebiowski, Michele Donini, Cedric Archambeau, and Danish Pruthi. 2023. Geographical Erasure in Language Generation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 12310–12324, Singapore. Association for Computational Linguistics.
Cite (Informal):
Geographical Erasure in Language Generation (Schwöbel et al., Findings 2023)
PDF:
https://aclanthology.org/2023.findings-emnlp.823.pdf