Eliminating Fuzzy Duplicates in Crowdsourced Lexical Resources

Yuri Kiselev, Dmitry Ustalov, Sergey Porshnev


Abstract
Collaborative creation of lexical resources is a trending approach to building high-quality thesauri in a short time span at a remarkably low price. The key idea is to invite non-expert participants to express and share their knowledge with the aim of constructing a resource. However, this approach tends to be noisy and error-prone, which makes data cleansing a highly topical task. In this paper, we study different techniques for synset deduplication, including machine- and crowd-based ones. Finally, we put forward an approach that solves the deduplication problem fully automatically, with quality comparable to the expert-based approach.
Anthology ID:
2016.gwc-1.25
Volume:
Proceedings of the 8th Global WordNet Conference (GWC)
Month:
27–30 January
Year:
2016
Address:
Bucharest, Romania
Editors:
Christiane Fellbaum, Piek Vossen, Verginica Barbu Mititelu, Corina Forascu
Venue:
GWC
SIG:
SIGLEX
Publisher:
Global Wordnet Association
Pages:
162–168
URL:
https://aclanthology.org/2016.gwc-1.25
Cite (ACL):
Yuri Kiselev, Dmitry Ustalov, and Sergey Porshnev. 2016. Eliminating Fuzzy Duplicates in Crowdsourced Lexical Resources. In Proceedings of the 8th Global WordNet Conference (GWC), pages 162–168, Bucharest, Romania. Global Wordnet Association.
Cite (Informal):
Eliminating Fuzzy Duplicates in Crowdsourced Lexical Resources (Kiselev et al., GWC 2016)
PDF:
https://aclanthology.org/2016.gwc-1.25.pdf