Amir Reza Mirzaei


2025

Semi-Automated Construction of Sense-Annotated Datasets for Practically Any Language
Jai Riley | Bradley M. Hauer | Nafisa Sadaf Hriti | Guoqing Luo | Amir Reza Mirzaei | Ali Rafiei | Hadi Sheikhi | Mahvash Siavashpour | Mohammad Tavakoli | Ning Shi | Grzegorz Kondrak
Proceedings of the 31st International Conference on Computational Linguistics

High-quality sense-annotated datasets are vital for evaluating and comparing word sense disambiguation (WSD) systems. We present a novel approach to creating parallel sense-annotated datasets, which can be applied to any language that English can be translated into. The method incorporates machine translation, word alignment, sense projection, and sense filtering to produce silver annotations, which can then be revised manually to obtain gold datasets. By applying our method to Farsi, Chinese, and Bengali, we produce new parallel benchmark datasets, which are vetted by native speakers of each language. Our automatically generated silver datasets are of higher quality than the annotations obtained with recent multilingual WSD systems, particularly on non-European languages.
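
The projection and filtering steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `project_senses`, the sense identifiers, the placeholder target tokens, and the `allowed_senses` lexicon are all hypothetical, and the sketch assumes that translation and word alignment have already been produced by some external tools.

```python
from typing import Dict, List, Optional, Set, Tuple


def project_senses(
    source_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_tokens: List[str],
    allowed_senses: Dict[str, Set[str]],
) -> List[Optional[str]]:
    """Copy sense tags from source tokens to aligned target tokens.

    A projected tag is kept only if the (hypothetical) lexicon lists that
    sense as possible for the target word; this stands in for the
    "sense filtering" step. Unaligned or filtered tokens get None.
    """
    projected: List[Optional[str]] = [None] * len(target_tokens)
    for src_idx, tgt_idx in alignment:
        sense = source_senses[src_idx]
        if sense is None:
            continue  # source token carries no sense annotation
        if sense in allowed_senses.get(target_tokens[tgt_idx], set()):
            projected[tgt_idx] = sense
    return projected


# Toy English sentence: "the bank approved the loan" (illustrative sense IDs).
source_senses = [None, "bank#finance", "approve#authorize", None, "loan#money"]
# Placeholder target-language tokens; a real run would use actual translations.
target_tokens = ["t_bank", "t_loan", "t_approved"]
# (source index, target index) pairs, as a word aligner might output them.
alignment = [(1, 0), (4, 1), (2, 2)]
# Hypothetical lexicon of senses each target word may express.
allowed_senses = {
    "t_bank": {"bank#finance"},
    "t_loan": {"loan#money"},
    "t_approved": set(),  # no listed senses, so the projection is filtered out
}

silver = project_senses(source_senses, alignment, target_tokens, allowed_senses)
print(silver)
```

In this toy run, the first two target tokens keep their projected tags, while the third is left unannotated because the lexicon does not license the projected sense. Such gaps are the kind of case that manual revision of silver annotations would be expected to resolve.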