Low-resource Bilingual Dialect Lexicon Induction with Large Language Models

Ekaterina Artemova, Barbara Plank


Abstract
Bilingual word lexicons map words in one language to their synonyms in another language. Numerous papers have explored bilingual lexicon induction (BLI) in high-resource scenarios, framing a typical pipeline that consists of two steps: (i) unsupervised bitext mining and (ii) unsupervised word alignment. At the core of those steps are pre-trained large language models (LLMs). In this paper we present an analysis of the BLI pipeline for German and two of its dialects, Bavarian and Alemannic. This setup poses a number of unique challenges, attributable to the scarcity of resources, the relatedness of the languages, and the lack of standardization in the orthography of dialects. We analyze the BLI outputs with respect to word frequency and pairwise edit distance. Finally, we release an evaluation dataset consisting of manual annotations for 1K bilingual word pairs labeled according to their semantic similarity.
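The pairwise edit distance mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which variant of edit distance is used, so the code below assumes plain Levenshtein distance over characters; the function names and the length normalization are illustrative choices, and the word pair is an example chosen here, not taken from the released dataset.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (an assumed
    normalization, so that values fall between 0 and 1)."""
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


if __name__ == "__main__":
    # Illustrative German / Bavarian pair, not drawn from the paper's data.
    print(edit_distance("zwei", "zwoa"))
    print(normalized_edit_distance("zwei", "zwoa"))
```

Closely related varieties such as German and its dialects tend to yield many candidate pairs with small edit distance, which is one reason the measure is a natural axis for analyzing induced word pairs.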
Anthology ID:
2023.nodalida-1.39
Volume:
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
Month:
May
Year:
2023
Address:
Tórshavn, Faroe Islands
Editors:
Tanel Alumäe, Mark Fishel
Venue:
NoDaLiDa
Publisher:
University of Tartu Library
Pages:
371–385
URL:
https://aclanthology.org/2023.nodalida-1.39
Cite (ACL):
Ekaterina Artemova and Barbara Plank. 2023. Low-resource Bilingual Dialect Lexicon Induction with Large Language Models. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 371–385, Tórshavn, Faroe Islands. University of Tartu Library.
Cite (Informal):
Low-resource Bilingual Dialect Lexicon Induction with Large Language Models (Artemova & Plank, NoDaLiDa 2023)
PDF:
https://aclanthology.org/2023.nodalida-1.39.pdf