Multi-word Entity Classification in a Highly Multilingual Environment

Sophie Chesney, Guillaume Jacquet, Ralf Steinberger, Jakub Piskorski


Abstract
This paper describes an approach for classifying millions of existing multi-word entities (MWEntities), such as organisation or event names, into thirteen category types, based only on the tokens they contain. To classify our very large in-house collection of multilingual MWEntities into an application-oriented set of entity categories, we trained and tested distantly-supervised classifiers in 43 languages based on MWEntities extracted from BabelNet. The best-performing classifier was a multi-class SVM using a TF.IDF-weighted data representation. Interestingly, a single classifier trained on a mix of all languages consistently performed better than classifiers trained for individual languages, reaching an average F1 score of 88.8%. In this paper, we present the training and test data, including a human evaluation of its accuracy, describe the methods used to train the classifiers, and discuss the results.
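As a rough illustration of the TF.IDF-weighted token representation mentioned in the abstract, the following minimal sketch weights the tokens of a few entity names. The function name `tfidf_vectors`, the whitespace tokenisation, the unsmoothed IDF formula, and the toy names are all assumptions made for this example; the abstract does not specify the paper's exact weighting scheme or preprocessing.

```python
import math
from collections import Counter


def tfidf_vectors(entities):
    """Return one {token: weight} dict per entity name (illustrative only)."""
    # Assumption: lowercase and split on whitespace; the paper may tokenise differently.
    tokenized = [name.lower().split() for name in entities]
    n_docs = len(tokenized)
    # Document frequency: number of entity names that contain each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({
            # Relative term frequency times unsmoothed inverse document frequency.
            tok: (count / len(toks)) * math.log(n_docs / df[tok])
            for tok, count in tf.items()
        })
    return vectors


# Hypothetical toy names, not taken from the paper's data.
names = [
    "University of Valencia",
    "University of Oxford",
    "Battle of Hastings",
]
vectors = tfidf_vectors(names)
```

In this toy set, a token that occurs in every name ("of") receives a weight of zero, while rarer tokens receive higher weights. Sparse vectors of this kind could then be passed to a multi-class classifier such as a linear SVM; that step is not shown here.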
Anthology ID:
W17-1702
Volume:
Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017)
Month:
April
Year:
2017
Address:
Valencia, Spain
Venue:
MWE
SIG:
SIGLEX
Publisher:
Association for Computational Linguistics
Pages:
11–20
URL:
https://aclanthology.org/W17-1702
DOI:
10.18653/v1/W17-1702
Cite (ACL):
Sophie Chesney, Guillaume Jacquet, Ralf Steinberger, and Jakub Piskorski. 2017. Multi-word Entity Classification in a Highly Multilingual Environment. In Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 11–20, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Multi-word Entity Classification in a Highly Multilingual Environment (Chesney et al., MWE 2017)
PDF:
https://aclanthology.org/W17-1702.pdf
Data
DBpedia