Semi-Automated Construction of Sense-Annotated Datasets for Practically Any Language

Jai Riley; Bradley Hauer; Nafisa Sadaf Hriti; Guoqing Luo; Amirreza Mirzaei; Ali Rafiei; Hadi Sheikhi; Mahvash Siavashpour; Mohammad Tavakoli; Ning Shi; Grzegorz Kondrak

Abstract

High-quality sense-annotated datasets are vital for evaluating and comparing WSD systems. We present a novel approach to creating parallel sense-annotated datasets, which can be applied to any language that English can be translated into. The method incorporates machine translation, word alignment, sense projection, and sense filtering to produce silver annotations, which can then be revised manually to obtain gold datasets. By applying our method to Farsi, Chinese, and Bengali, we produce new parallel benchmark datasets, which are vetted by native speakers of each language. Our automatically-generated silver datasets are of higher quality than the annotations obtained with recent multilingual WSD systems, particularly on non-European languages.
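The sense projection and sense filtering steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `project_senses`, the placeholder sense IDs and lemmas, and the filtering rule (keep a projected sense only if a sense inventory lists it for the aligned target lemma) are all assumptions made for illustration, since the abstract does not specify these details. Machine translation and word alignment are assumed to have been run already, so the alignment is given as input.

```python
from typing import Dict, List, Optional, Set, Tuple


def project_senses(
    source_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_lemmas: List[str],
    sense_inventory: Dict[str, Set[str]],
) -> List[Optional[str]]:
    """Copy sense tags from source tokens to aligned target tokens.

    Hypothetical filter: a projected sense is kept only when the
    inventory lists it as a possible sense of the target lemma.
    """
    projected: List[Optional[str]] = [None] * len(target_lemmas)
    for src_idx, tgt_idx in alignment:
        sense = source_senses[src_idx]
        if sense is None:
            continue  # source token carries no sense annotation
        allowed = sense_inventory.get(target_lemmas[tgt_idx], set())
        if sense in allowed:
            projected[tgt_idx] = sense  # passes the filter: silver tag
    return projected


# Toy data with placeholder identifiers (not taken from the paper).
source_senses = ["s_river_bank", None, "s_sit"]
alignment = [(0, 1), (2, 0)]  # (source index, target index) pairs
target_lemmas = ["tgt_sit", "tgt_bank"]
sense_inventory = {
    "tgt_bank": {"s_river_bank", "s_money_bank"},
    "tgt_sit": {"s_stand"},
}

silver = project_senses(source_senses, alignment, target_lemmas, sense_inventory)
print(silver)
```

In this toy run the first target token stays untagged because the inventory does not list the projected sense for its lemma, while the second keeps its tag. Tokens left untagged or tagged incorrectly are the kind of case that the manual revision step described above would address.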

Anthology ID: 2025.coling-main.419
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 6270–6284
URL: https://aclanthology.org/2025.coling-main.419/
Cite (ACL): Jai Riley, Bradley Hauer, Nafisa Sadaf Hriti, Guoqing Luo, Amirreza Mirzaei, Ali Rafiei, Hadi Sheikhi, Mahvash Siavashpour, Mohammad Tavakoli, Ning Shi, and Grzegorz Kondrak. 2025. Semi-Automated Construction of Sense-Annotated Datasets for Practically Any Language. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6270–6284, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Semi-Automated Construction of Sense-Annotated Datasets for Practically Any Language (Riley et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.419.pdf
