@inproceedings{sagot-fiser-2012-cleaning,
  title     = {Cleaning noisy wordnets},
  author    = {Sagot, Beno{\^\i}t and
               Fi{\v{s}}er, Darja},
  editor    = {Calzolari, Nicoletta and
               Choukri, Khalid and
               Declerck, Thierry and
               Do{\u{g}}an, Mehmet U{\u{g}}ur and
               Maegaard, Bente and
               Mariani, Joseph and
               Moreno, Asuncion and
               Odijk, Jan and
               Piperidis, Stelios},
  booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)},
  month     = may,
  year      = {2012},
  address   = {Istanbul, Turkey},
  publisher = {European Language Resources Association (ELRA)},
  url       = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/1127_Paper.pdf},
  pages     = {3468--3472},
  abstract  = {Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently result in noisy synsets. This is why we propose an approach to detect synset outliers in order to eliminate the noise and improve accuracy of the developed wordnets, so that they become more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in the synset and its surroundings with the contexts of the literals in question they are used in based on large monolingual corpora. By fine-tuning the outlier threshold we can influence how many outlier candidates will be eliminated. Although the proposed approach is language-independent we test it on Slovene and French that were created automatically from bilingual resources and contain plenty of disambiguation errors. Manual evaluation of the results shows that by applying a threshold similar to the estimated error rate in the respective wordnets, 67{\%} of the proposed outlier candidates are indeed incorrect for French and a 64{\%} for Slovene. This is a big improvement compared to the estimated overall error rates in the resources, which are 12{\%} for French and 15{\%} for Slovene.},
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="sagot-fiser-2012-cleaning">
<titleInfo>
<title>Cleaning noisy wordnets</title>
</titleInfo>
<name type="personal">
<namePart type="given">Benoît</namePart>
<namePart type="family">Sagot</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Darja</namePart>
<namePart type="family">Fišer</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2012-05</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12)</title>
</titleInfo>
<name type="personal">
<namePart type="given">Nicoletta</namePart>
<namePart type="family">Calzolari</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Khalid</namePart>
<namePart type="family">Choukri</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Thierry</namePart>
<namePart type="family">Declerck</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Mehmet</namePart>
<namePart type="given">Uğur</namePart>
<namePart type="family">Doğan</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bente</namePart>
<namePart type="family">Maegaard</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joseph</namePart>
<namePart type="family">Mariani</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Asuncion</namePart>
<namePart type="family">Moreno</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jan</namePart>
<namePart type="family">Odijk</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Stelios</namePart>
<namePart type="family">Piperidis</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>European Language Resources Association (ELRA)</publisher>
<place>
<placeTerm type="text">Istanbul, Turkey</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently result in noisy synsets. This is why we propose an approach to detect synset outliers in order to eliminate the noise and improve accuracy of the developed wordnets, so that they become more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in the synset and its surroundings with the contexts of the literals in question they are used in based on large monolingual corpora. By fine-tuning the outlier threshold we can influence how many outlier candidates will be eliminated. Although the proposed approach is language-independent we test it on Slovene and French that were created automatically from bilingual resources and contain plenty of disambiguation errors. Manual evaluation of the results shows that by applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and a 64% for Slovene. This is a big improvement compared to the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.</abstract>
<identifier type="citekey">sagot-fiser-2012-cleaning</identifier>
<location>
<url>http://www.lrec-conf.org/proceedings/lrec2012/pdf/1127_Paper.pdf</url>
</location>
<part>
<date>2012-05</date>
<extent unit="page">
<start>3468</start>
<end>3472</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Cleaning noisy wordnets
%A Sagot, Benoît
%A Fišer, Darja
%Y Calzolari, Nicoletta
%Y Choukri, Khalid
%Y Declerck, Thierry
%Y Doğan, Mehmet Uğur
%Y Maegaard, Bente
%Y Mariani, Joseph
%Y Moreno, Asuncion
%Y Odijk, Jan
%Y Piperidis, Stelios
%S Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12)
%D 2012
%8 May
%I European Language Resources Association (ELRA)
%C Istanbul, Turkey
%F sagot-fiser-2012-cleaning
%X Automatic approaches to creating and extending wordnets, which have become very popular in the past decade, inadvertently result in noisy synsets. This is why we propose an approach to detect synset outliers in order to eliminate the noise and improve accuracy of the developed wordnets, so that they become more useful lexico-semantic resources for natural language applications. The approach compares the words that appear in the synset and its surroundings with the contexts of the literals in question they are used in based on large monolingual corpora. By fine-tuning the outlier threshold we can influence how many outlier candidates will be eliminated. Although the proposed approach is language-independent we test it on Slovene and French that were created automatically from bilingual resources and contain plenty of disambiguation errors. Manual evaluation of the results shows that by applying a threshold similar to the estimated error rate in the respective wordnets, 67% of the proposed outlier candidates are indeed incorrect for French and a 64% for Slovene. This is a big improvement compared to the estimated overall error rates in the resources, which are 12% for French and 15% for Slovene.
%U http://www.lrec-conf.org/proceedings/lrec2012/pdf/1127_Paper.pdf
%P 3468-3472
Markdown (Informal)
[Cleaning noisy wordnets](http://www.lrec-conf.org/proceedings/lrec2012/pdf/1127_Paper.pdf) (Sagot & Fišer, LREC 2012)
ACL
- Benoît Sagot and Darja Fišer. 2012. Cleaning noisy wordnets. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), pages 3468–3472, Istanbul, Turkey. European Language Resources Association (ELRA).