Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models

Mohammad Taher Pilehvar; Dimitri Kartsaklis; Victor Prokhorov; Nigel Collier

doi:10.18653/v1/D18-1169

Abstract

Rare word representation has recently enjoyed a surge of interest, owing to the crucial role that effective handling of infrequent words can play in accurate semantic understanding. However, there is a paucity of reliable benchmarks for evaluation and comparison of these techniques. We show in this paper that the only existing benchmark (the Stanford Rare Word dataset) suffers from low-confidence annotations and limited vocabulary; hence, it does not constitute a solid comparison framework. In order to fill this evaluation gap, we propose Cambridge Rare word Dataset (Card-660), an expert-annotated word similarity dataset which provides a highly reliable, yet challenging, benchmark for rare word representation techniques. Through a set of experiments, we show that even the best mainstream word embeddings, with millions of words in their vocabularies, are unable to achieve performances higher than 0.43 (Pearson correlation) on the dataset, compared to a human-level upper bound of 0.90. We release the dataset and the annotation materials at https://pilehvar.github.io/card-660/.

Anthology ID: D18-1169
Volume: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing
Month: October-November
Year: 2018
Address: Brussels, Belgium
Editors: Ellen Riloff, David Chiang, Julia Hockenmaier, Jun’ichi Tsujii
Venue: EMNLP
SIG: SIGDAT
Publisher: Association for Computational Linguistics
Pages: 1391–1401
URL: https://aclanthology.org/D18-1169/
DOI: 10.18653/v1/D18-1169
Cite (ACL): Mohammad Taher Pilehvar, Dimitri Kartsaklis, Victor Prokhorov, and Nigel Collier. 2018. Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1391–1401, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): Card-660: Cambridge Rare Word Dataset - a Reliable Benchmark for Infrequent Word Representation Models (Pilehvar et al., EMNLP 2018)
PDF: https://aclanthology.org/D18-1169.pdf
Attachment: D18-1169.Attachment.zip