Using Language Models to Disambiguate Lexical Choices in Translation

Josh Barua, Sanjay Subramanian, Kayo Yin, Alane Suhr


Abstract
In translation, a concept represented by a single word in a source language can have multiple variations in a target language. The task of lexical selection requires using context to identify which variation is most appropriate for a source text. We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English. We evaluate recent LLMs and neural machine translation systems on DTAiLS, with the best-performing model, GPT-4, achieving accuracy between 67% and 85% across languages. Finally, we use language models to generate English rules describing target-language concept variations. Providing weaker models with high-quality lexical rules improves accuracy substantially, in some cases matching or outperforming GPT-4.
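The abstract frames lexical selection as choosing, for a concept in an English source sentence, the most appropriate target-language variant, optionally guided by English rules describing how the variants differ. Below is a minimal sketch of how such a query might be posed to a language model; the prompt wording, the example variants, and the query setup are illustrative assumptions, not the paper's actual prompt or data.

```python
# Illustrative sketch of framing lexical selection as an LLM prompt.
# The prompt format, example data, and downstream model call are
# assumptions for demonstration; they are not taken from the paper.

def build_lexical_selection_prompt(source_sentence: str,
                                   concept: str,
                                   variants: list[str],
                                   rules: str | None = None) -> str:
    """Ask a model to pick the target-language variant that best
    translates `concept` in the context of `source_sentence`."""
    lines = [
        f"English sentence: {source_sentence}",
        f"The word '{concept}' has several possible translations: "
        + ", ".join(variants) + ".",
    ]
    if rules:
        # Optionally supply English rules describing how the variants differ,
        # analogous to providing weaker models with lexical rules.
        lines.append(f"Rules describing the variants: {rules}")
    lines.append("Which translation is most appropriate? Answer with one word.")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical example using the well-known Russian light/dark blue split.
    prompt = build_lexical_selection_prompt(
        source_sentence="She wore a light blue dress to the ceremony.",
        concept="blue",
        variants=["goluboy (light blue)", "siniy (dark blue)"],
        rules="Use 'goluboy' for light or pale shades of blue; use 'siniy' for dark shades.",
    )
    print(prompt)  # send this prompt to an LLM of choice via its chat API
```

The sketch only constructs the prompt; evaluating a system on DTAiLS would additionally require sending the prompt to a model and comparing its answer against the gold variant in the sentence pair.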
Anthology ID: 2024.emnlp-main.278
Volume: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 4837–4848
URL: https://aclanthology.org/2024.emnlp-main.278
DOI: 10.18653/v1/2024.emnlp-main.278
Cite (ACL): Josh Barua, Sanjay Subramanian, Kayo Yin, and Alane Suhr. 2024. Using Language Models to Disambiguate Lexical Choices in Translation. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4837–4848, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): Using Language Models to Disambiguate Lexical Choices in Translation (Barua et al., EMNLP 2024)
PDF: https://aclanthology.org/2024.emnlp-main.278.pdf
Data: 2024.emnlp-main.278.data.zip