AnlamVer: Semantic Model Evaluation Dataset for Turkish - Word Similarity and Relatedness

Gökhan Ercan, Olcay Taner Yıldız


Abstract
In this paper, we present AnlamVer, a semantic model evaluation dataset for Turkish designed to evaluate word similarity and word relatedness tasks while distinguishing the two relations from each other. The dataset consists of 500 word pairs annotated by 12 human subjects, and each pair has two distinct scores, one for similarity and one for relatedness. Word pairs are selected to enable the evaluation of distributional semantic models across multiple attributes of words and word-pair relations, such as frequency, morphology, concreteness, and relation type (e.g., synonymy, antonymy). Our aim is to provide insights to semantic model researchers by evaluating models along these attributes. We balance the word pairs by frequency to evaluate the robustness of semantic models to the out-of-vocabulary and rare-word problems, which are caused by the rich derivational and inflectional morphology of Turkish.
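A dataset of this shape is typically used by correlating a model's pair scores with the human scores, once against similarity and once against relatedness. The following is a minimal sketch of that procedure, not the authors' evaluation code. It assumes each pair is available as a `(word1, word2, similarity, relatedness)` tuple and that the model is a plain dictionary from word to vector; it also assumes Spearman rank correlation as the metric, cosine similarity as the model score, and that pairs with an out-of-vocabulary word are skipped and counted. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def evaluate(pairs, vectors):
    """Correlate model scores with human similarity and relatedness scores.

    pairs:   iterable of (word1, word2, similarity, relatedness) tuples
             (assumed layout, not the dataset's documented file format).
    vectors: dict mapping a word to its embedding (list of floats).
    Returns (rho_similarity, rho_relatedness, skipped_oov_pairs).
    """
    model_scores, sim_scores, rel_scores = [], [], []
    skipped = 0
    for w1, w2, sim, rel in pairs:
        if w1 not in vectors or w2 not in vectors:
            skipped += 1  # out-of-vocabulary pair: excluded, but counted
            continue
        model_scores.append(cosine(vectors[w1], vectors[w2]))
        sim_scores.append(sim)
        rel_scores.append(rel)
    return (
        spearman(model_scores, sim_scores),
        spearman(model_scores, rel_scores),
        skipped,
    )
```

Reporting the number of skipped pairs next to the two correlations keeps a model from looking better simply because it covers fewer rare words. Skipping is only one possible choice; assigning a default score to out-of-vocabulary pairs is another.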
Anthology ID:
C18-1323
Volume:
Proceedings of the 27th International Conference on Computational Linguistics
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
3819–3836
URL:
https://aclanthology.org/C18-1323
Cite (ACL):
Gökhan Ercan and Olcay Taner Yıldız. 2018. AnlamVer: Semantic Model Evaluation Dataset for Turkish - Word Similarity and Relatedness. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3819–3836, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
AnlamVer: Semantic Model Evaluation Dataset for Turkish - Word Similarity and Relatedness (Ercan & Yıldız, COLING 2018)
PDF:
https://aclanthology.org/C18-1323.pdf
Data
AnlamVer