COIN – an Inexpensive and Strong Baseline for Predicting Out of Vocabulary Word Embeddings

Andrew Schneider, Lihong He, Zhijia Chen, Arjun Mukherjee, Eduard Dragut


Abstract
Social media is the ultimate challenge for many natural language processing tools. The constant emergence of linguistic constructs challenges even the most sophisticated NLP tools. Predicting word embeddings for out-of-vocabulary words is one of those challenges. Word embedding models only include terms that occur a sufficient number of times in their training corpora, so they cannot directly provide any useful information about a word that is not in their vocabularies. We propose a fast method for predicting vectors for out-of-vocabulary terms that makes use of the terms surrounding the unknown term and the hidden context layer of the word2vec model. We propose this method as a strong baseline in the sense that 1) while it does not surpass all state-of-the-art methods, it surpasses several techniques for vector prediction on benchmark tasks, 2) even when it underperforms, the margin is very small, so it retains competitive performance in downstream tasks, and 3) it is inexpensive to compute, requiring no additional training stage. We also show that our technique can be incorporated into existing methods to achieve a new state of the art on the word vector prediction problem.
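
The idea of predicting a vector for an out-of-vocabulary term from its surrounding terms can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the prediction is a plain average of the context-layer vectors of in-vocabulary neighbors, and the names `predict_oov_vector`, `context_matrix`, `vocab`, and `window`, along with the toy data, are hypothetical.

```python
import numpy as np

def predict_oov_vector(sentences, oov, vocab, context_matrix, window=2):
    """Average the context-layer vectors of in-vocabulary words near `oov`.

    Assumption: `context_matrix` stands in for the word2vec hidden context
    (output) layer, one row per vocabulary word, indexed through `vocab`.
    """
    total = np.zeros(context_matrix.shape[1])
    count = 0
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != oov:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                # Skip the unknown term itself and any other unknown neighbor.
                if j != i and tokens[j] in vocab:
                    total += context_matrix[vocab[tokens[j]]]
                    count += 1
    # No usable context means no prediction.
    return total / count if count else None

# Toy data (hypothetical): four known words with 2-dimensional context vectors.
vocab = {"the": 0, "cat": 1, "sat": 2, "mat": 3}
context_matrix = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])
sentences = [["the", "floof", "sat"], ["cat", "floof", "mat"]]

vec = predict_oov_vector(sentences, "floof", vocab, context_matrix)
print(vec)
```

No training step is involved in this sketch; it only reads rows from an existing matrix, which is the sense in which such a baseline is inexpensive to compute.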
Anthology ID:
2022.coling-1.350
Volume:
Proceedings of the 29th International Conference on Computational Linguistics
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea
Venue:
COLING
Publisher:
International Committee on Computational Linguistics
Pages:
3984–3993
URL:
https://aclanthology.org/2022.coling-1.350
Cite (ACL):
Andrew Schneider, Lihong He, Zhijia Chen, Arjun Mukherjee, and Eduard Dragut. 2022. COIN – an Inexpensive and Strong Baseline for Predicting Out of Vocabulary Word Embeddings. In Proceedings of the 29th International Conference on Computational Linguistics, pages 3984–3993, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
Cite (Informal):
COIN – an Inexpensive and Strong Baseline for Predicting Out of Vocabulary Word Embeddings (Schneider et al., COLING 2022)
PDF:
https://aclanthology.org/2022.coling-1.350.pdf
Data:
SST, SentEval