Strengthening the WiC: New Polysemy Dataset in Hindi and Lack of Cross Lingual Transfer

Haim Dubossarsky, Farheen Dairkee


Abstract
This study addresses Natural Language Processing for low-resource languages such as Hindi, which, despite having a substantial number of speakers, has limited linguistic resources. The paper focuses on Word Sense Disambiguation (WSD), a fundamental NLP task that deals with polysemous words. It introduces a novel Hindi WSD dataset in the modern Word-in-Context (WiC) format, enabling the training and testing of contextualized models. The primary contributions of this work are twofold: testing whether multilingual models transfer across languages, and hence whether they can handle polysemy in low-resource languages, and providing insights into the minimum amount of training data required for a viable solution. Experiments compare different contextualized models on the WiC task via transfer learning from English to Hindi. Models transferred purely from English yield a poor 55% accuracy, while fine-tuning on Hindi dramatically improves performance to 90% accuracy. This demonstrates the need for language-specific tuning and for resources such as the introduced Hindi WiC dataset to drive advances in Hindi NLP. The findings offer insights into addressing the NLP needs of widely spoken yet low-resourced languages and shed light on the problem of transfer learning in these contexts.
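To make the task concrete, here is a minimal sketch of what a WiC-style instance and the accuracy metric could look like. The field names, the `WiCInstance` class, and the English example sentences are illustrative assumptions, not taken from the Hindi dataset or the authors' code; the layout loosely follows the original English WiC benchmark, where each item pairs a target word with two contexts and a binary same-sense label.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class WiCInstance:
    """One WiC-style item (hypothetical field names)."""
    target: str        # the polysemous word under test
    context_1: str     # first sentence containing the target
    context_2: str     # second sentence containing the target
    same_sense: bool   # True if the target has the same sense in both


def accuracy(gold: Sequence[WiCInstance], predicted: Sequence[bool]) -> float:
    """Fraction of instances whose predicted label matches the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(g.same_sense == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Illustrative English items only; these are not drawn from the dataset.
examples = [
    WiCInstance("bank", "He sat on the bank of the river.",
                "She deposited money at the bank.", False),
    WiCInstance("bank", "The bank raised its interest rates.",
                "She deposited money at the bank.", True),
]

# A trivial "always same sense" baseline, for comparison.
always_true = [True] * len(examples)
baseline_accuracy = accuracy(examples, always_true)
```

If the labels are balanced, as they are in the original English WiC, a constant-guess baseline of this kind scores about 50%, which is why the reported 55% for pure English-to-Hindi transfer reads as close to chance.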
Anthology ID:
2024.lrec-main.1332
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
15341–15349
URL:
https://aclanthology.org/2024.lrec-main.1332
Cite (ACL):
Haim Dubossarsky and Farheen Dairkee. 2024. Strengthening the WiC: New Polysemy Dataset in Hindi and Lack of Cross Lingual Transfer. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15341–15349, Torino, Italia. ELRA and ICCL.
Cite (Informal):
Strengthening the WiC: New Polysemy Dataset in Hindi and Lack of Cross Lingual Transfer (Dubossarsky & Dairkee, LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.1332.pdf