Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data

Ábel Elekes, Adrian Englhardt, Martin Schäler, Klemens Böhm


Abstract
Word embeddings are powerful tools that facilitate better analysis of natural language. However, their quality depends heavily on the resource used for training. Various approaches rely on n-gram corpora, such as the Google n-gram corpus. However, n-gram corpora offer only a small window into the full text: 5 words at best for the Google corpus. This raises the concern of whether the extracted word semantics are of high quality. In this paper, we address this concern with two contributions. First, we provide a resource containing 120 word-embedding models, one of the largest collections of embedding models. The resource also contains the n-grammed versions of all corpora used, as well as our scripts for corpus generation, model generation, and evaluation. Second, we define a set of meaningful experiments that allow evaluating the aforementioned quality differences. We conduct these experiments using our resource to demonstrate its usage and significance. The evaluation results confirm that one can generally expect high quality for n-grams with n > 3.
Anthology ID:
K18-1041
Volume:
Proceedings of the 22nd Conference on Computational Natural Language Learning
Month:
October
Year:
2018
Address:
Brussels, Belgium
Editors:
Anna Korhonen, Ivan Titov
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
423–432
URL:
https://aclanthology.org/K18-1041/
DOI:
10.18653/v1/K18-1041
Cite (ACL):
Ábel Elekes, Adrian Englhardt, Martin Schäler, and Klemens Böhm. 2018. Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 423–432, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Resources to Examine the Quality of Word Embedding Models Trained on n-Gram Data (Elekes et al., CoNLL 2018)
PDF:
https://aclanthology.org/K18-1041.pdf