LexGen: Domain-aware Multilingual Lexicon Generation

Ayush Maheshwari; Atul Kumar Singh; N J Karthika; Krishnakant Bhatt; Preethi Jyothi; Ganesh Ramakrishnan

doi:10.18653/v1/2025.acl-long.365

Abstract

Lexicon or dictionary generation across domains has the potential for societal impact: it can make information more accessible to a diverse user base while preserving language identity. Prior work focuses primarily on bilingual lexical induction, which aligns words using mapping-based or corpora-based approaches. However, these approaches do not address domain-specific lexicon generation, which involves domain-specific terminology. This task is particularly important in specialized medical, engineering, and other technical domains, because such terms are used very infrequently and data containing them is scarce, especially for low-resource languages. We propose a new model to generate dictionary words for 6 Indian languages in a multi-domain setting. Our model consists of domain-specific and domain-generic layers that encode information, and these layers are invoked via a learnable routing technique. We also release a new benchmark dataset consisting of >75K translation pairs across 6 Indian languages spanning 8 diverse domains. We conduct both zero-shot and few-shot experiments across multiple domains to show the efficacy of our proposed model in generalizing to unseen domains and unseen languages. We also perform a post-hoc human evaluation on unseen languages. The source code and dataset are available at https://github.com/Atulkmrsingh/lexgen.
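
To make the routing idea concrete, here is a minimal sketch of one way a learnable gate could mix a domain-specific layer with a domain-generic one. It is an illustration under stated assumptions, not the authors' implementation: the class name RoutedLayer, the scalar sigmoid gate, the layer sizes, and the fallback to the generic layer for an unseen domain are all hypothetical choices that the abstract does not specify. The weights are randomly initialized and never trained here.

```python
import math
import random


def linear(weights, bias, x):
    """Apply a dense layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(rng, dim):
    """Return (weights, bias) for a small dim x dim dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]
    return weights, [0.0] * dim


class RoutedLayer:
    """Hypothetical layer: a gate mixes domain-specific and generic outputs."""

    def __init__(self, domains, dim, seed=0):
        rng = random.Random(seed)
        self.generic = init_layer(rng, dim)
        self.specific = {d: init_layer(rng, dim) for d in domains}
        # Router parameters. In a real model these would be learned jointly
        # with the layers; here they are only randomly initialized.
        self.gate_w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.gate_b = 0.0

    def gate(self, x):
        """Scalar routing weight in (0, 1) computed from the input."""
        return sigmoid(sum(w * xi for w, xi in zip(self.gate_w, x)) + self.gate_b)

    def forward(self, x, domain):
        shared = linear(*self.generic, x)
        # Assumption: an unseen domain uses the generic layer alone.
        if domain not in self.specific:
            return shared
        expert = linear(*self.specific[domain], x)
        g = self.gate(x)
        return [g * e + (1.0 - g) * s for e, s in zip(expert, shared)]


layer = RoutedLayer(["medical", "engineering"], dim=4)
x = [1.0, 0.5, -0.5, 2.0]
medical_out = layer.forward(x, "medical")
unseen_out = layer.forward(x, "chemistry")  # not in the domain list
```

In this sketch the gate value decides how much of the domain-specific output is kept, so a gate near 0 reduces the layer to the generic path. The fallback for unknown domains is one simple way to keep the layer usable in a zero-shot setting; other designs are possible.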

Anthology ID: 2025.acl-long.365
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 7364–7375
URL: https://aclanthology.org/2025.acl-long.365/
DOI: 10.18653/v1/2025.acl-long.365
Cite (ACL): Ayush Maheshwari, Atul Kumar Singh, N J Karthika, Krishnakant Bhatt, Preethi Jyothi, and Ganesh Ramakrishnan. 2025. LexGen: Domain-aware Multilingual Lexicon Generation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7364–7375, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): LexGen: Domain-aware Multilingual Lexicon Generation (Maheshwari et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.365.pdf
