Generating bilingual example sentences with large language models as lexicography assistants

Raphael Merx, Ekaterina Vylomova, Kemal Kurniawan


Abstract
We present a study of LLMs’ performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels: French (high-resource), Indonesian (mid-resource), and Tetun (low-resource), with English as the target language. We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility (Kilgarriff et al., 2008). Our findings reveal that while LLMs can generate reasonably good dictionary examples, their performance degrades significantly for lower-resourced languages. We also observe high variability in human preferences for example quality, reflected in low inter-annotator agreement rates. To address this, we demonstrate that in-context learning can successfully align LLMs with individual annotator preferences. Additionally, we explore the use of pre-trained language models for automated rating of examples, finding that sentence perplexity serves as a good proxy for “typicality” and “intelligibility” in higher-resourced languages. Our study also contributes a novel dataset of 600 ratings for LLM-generated sentence pairs, and provides insights into the potential of LLMs in reducing the cost of lexicographic work, particularly for low-resource languages.
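The abstract does not say how sentence perplexity is computed. As a rough illustration only, the sketch below uses the standard definition: the exponential of the average negative log-likelihood per token. The function name `perplexity` and the log-probability values are hypothetical and not taken from the paper; in practice the log-probabilities would come from a pre-trained language model.

```python
import math


def perplexity(token_logprobs):
    """Return sentence perplexity from natural-log token probabilities.

    Perplexity is exp of the mean negative log-likelihood per token;
    lower values mean the model finds the sentence more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Made-up log-probabilities for illustration (not data from the paper).
typical_sentence = [math.log(0.5)] * 4    # each token assigned p = 0.5
unusual_sentence = [math.log(0.05)] * 4   # each token assigned p = 0.05

ppl_typical = perplexity(typical_sentence)
ppl_unusual = perplexity(unusual_sentence)
```

Under this definition, a candidate example sentence with lower perplexity would be treated as more typical of the language, which is the sense in which perplexity could act as a proxy rating.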
Anthology ID:
2024.alta-1.5
Volume:
Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association
Month:
December
Year:
2024
Address:
Canberra, Australia
Editors:
Tim Baldwin, Sergio José Rodríguez Méndez, Nicholas Kuo
Venue:
ALTA
Publisher:
Association for Computational Linguistics
Pages:
64–74
URL:
https://aclanthology.org/2024.alta-1.5/
Cite (ACL):
Raphael Merx, Ekaterina Vylomova, and Kemal Kurniawan. 2024. Generating bilingual example sentences with large language models as lexicography assistants. In Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association, pages 64–74, Canberra, Australia. Association for Computational Linguistics.
Cite (Informal):
Generating bilingual example sentences with large language models as lexicography assistants (Merx et al., ALTA 2024)
PDF:
https://aclanthology.org/2024.alta-1.5.pdf