Hadi Sheikhi


2025

Semi-Automated Construction of Sense-Annotated Datasets for Practically Any Language
Jai Riley | Bradley M. Hauer | Nafisa Sadaf Hriti | Guoqing Luo | Amir Reza Mirzaei | Ali Rafiei | Hadi Sheikhi | Mahvash Siavashpour | Mohammad Tavakoli | Ning Shi | Grzegorz Kondrak
Proceedings of the 31st International Conference on Computational Linguistics

High-quality sense-annotated datasets are vital for evaluating and comparing WSD systems. We present a novel approach to creating parallel sense-annotated datasets, which can be applied to any language that English can be translated into. The method incorporates machine translation, word alignment, sense projection, and sense filtering to produce silver annotations, which can then be revised manually to obtain gold datasets. By applying our method to Farsi, Chinese, and Bengali, we produce new parallel benchmark datasets, which are vetted by native speakers of each language. Our automatically-generated silver datasets are of higher quality than the annotations obtained with recent multilingual WSD systems, particularly on non-European languages.
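
As a rough illustration of the projection and filtering steps named in this abstract, the sketch below copies sense labels from English tokens to their aligned target tokens and keeps a label only if a target-language sense inventory allows it. The function name `project_senses`, the data layout, the sense IDs, and the filtering rule are all assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict, List, Optional, Set, Tuple


def project_senses(
    source_senses: List[Optional[str]],
    alignment: List[Tuple[int, int]],
    target_tokens: List[str],
    inventory: Dict[str, Set[str]],
) -> List[Optional[str]]:
    """Copy sense labels from source tokens to aligned target tokens.

    A projected sense is kept only if the (hypothetical) inventory lists it
    for the target token; otherwise the target token stays unannotated.
    """
    projected: List[Optional[str]] = [None] * len(target_tokens)
    for src_idx, tgt_idx in alignment:
        sense = source_senses[src_idx]
        if sense is None:
            continue  # source token carries no sense annotation
        allowed = inventory.get(target_tokens[tgt_idx], set())
        if sense in allowed:  # simple filter: drop senses the inventory rejects
            projected[tgt_idx] = sense
    return projected


# Toy input with made-up sense IDs; the alignment pairs are (source, target) indices.
source_senses = [None, "bank.n.1", None]        # e.g. "the bank closed"
target_tokens = ["la", "banque", "a", "fermé"]  # an assumed translation
alignment = [(0, 0), (1, 1), (2, 3)]
inventory = {"banque": {"bank.n.1"}}

silver = project_senses(source_senses, alignment, target_tokens, inventory)
print(silver)
```

In a pipeline of this kind, the output would correspond to the silver annotations that the abstract says are later revised manually.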

2024

UAlberta at SemEval-2024 Task 1: A Potpourri of Methods for Quantifying Multilingual Semantic Textual Relatedness and Similarity
Ning Shi | Senyu Li | Guoqing Luo | Amirreza Mirzaei | Ali Rafiei | Jai Riley | Hadi Sheikhi | Mahvash Siavashpour | Mohammad Tavakoli | Bradley Hauer | Grzegorz Kondrak
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

We describe our systems for SemEval-2024 Task 1: Semantic Textual Relatedness. We investigate the correlation between semantic relatedness and semantic similarity. Specifically, we test two hypotheses: (1) similarity is a special case of relatedness, and (2) semantic relatedness is preserved under translation. We experiment with a variety of approaches which are based on explicit semantics, downstream applications, contextual embeddings, large language models (LLMs), as well as ensembles of methods. We find empirical support for our theoretical insights. In addition, our best ensemble system yields highly competitive results in a number of diverse categories. Our code and data are available on GitHub.

2023

3D-EX: A Unified Dataset of Definitions and Dictionary Examples
Fatemah Almeman | Hadi Sheikhi | Luis Espinosa Anke
Proceedings of the 14th International Conference on Recent Advances in Natural Language Processing

Definitions are a fundamental building block in lexicography, linguistics and computational semantics. In NLP, they have been used for retrofitting word embeddings or augmenting contextual representations in language models. However, lexical resources containing definitions exhibit a wide range of properties, which has implications in the behaviour of models trained and evaluated on them. In this paper, we introduce 3D-EX, a dataset that aims to fill this gap by combining well-known English resources into one centralized knowledge repository in the form of <term, definition, example> triples. 3D-EX is a unified evaluation framework with carefully pre-computed train/validation/test splits to prevent memorization. We report experimental results that suggest that this dataset could be effectively leveraged in downstream NLP tasks. Code and data are available at https://github.com/F-Almeman/3D-EX.
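
One common way to build splits that discourage memorization is to assign all triples for a given term to the same split, so that no term seen in training reappears at test time. The sketch below shows that idea on <term, definition, example> triples. It is an assumption about how such splits could be made, not a description of how 3D-EX computes its splits; the function name `split_by_term` and the ratios are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (term, definition, example)


def split_by_term(
    triples: List[Triple], train: float = 0.8, valid: float = 0.1
) -> Dict[str, List[Triple]]:
    """Assign each triple to a split based on a stable hash of its term.

    All triples sharing a term land in the same split, so no term occurs
    in more than one of train/validation/test.
    """
    splits: Dict[str, List[Triple]] = {"train": [], "validation": [], "test": []}
    for triple in triples:
        term = triple[0].lower()
        digest = hashlib.sha256(term.encode("utf-8")).hexdigest()
        bucket = int(digest[:8], 16) / 0x100000000  # deterministic value in [0, 1)
        if bucket < train:
            name = "train"
        elif bucket < train + valid:
            name = "validation"
        else:
            name = "test"
        splits[name].append(triple)
    return splits


# Toy triples, invented for illustration only.
triples = [
    ("bank", "a financial institution", "She went to the bank."),
    ("bank", "the land beside a river", "They sat on the bank."),
    ("tree", "a tall woody plant", "The tree lost its leaves."),
]
splits = split_by_term(triples)
print({name: len(items) for name, items in splits.items()})
```

Hashing the term rather than sampling at random keeps the assignment reproducible across runs without storing a seed.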