Continuous Word Embedding Fusion via Spectral Decomposition

Tianfan Fu, Cheng Zhang, Stephan Mandt


Abstract
Word embeddings have become a mainstream tool in statistical natural language processing. Practitioners often use pre-trained word vectors, which were trained on large generic text corpora and are readily available on the web. However, pre-trained word vectors often lack important words from specific domains. It is therefore desirable to extend the vocabulary and embed new words into a set of pre-trained word vectors. In this paper, we present an efficient method for including new words from a specialized corpus into pre-trained generic word embeddings. We build on the established view of word embeddings as matrix factorizations to present a spectral algorithm for this task. Experiments on several domain-specific corpora with specialized vocabularies demonstrate that our method is able to embed the new words efficiently into the original embedding space. Compared to competing methods, our method is faster, parameter-free, and deterministic.
Anthology ID:
K18-1002
Volume:
Proceedings of the 22nd Conference on Computational Natural Language Learning
Month:
October
Year:
2018
Address:
Brussels, Belgium
Venue:
CoNLL
SIG:
SIGNLL
Publisher:
Association for Computational Linguistics
Pages:
11–20
URL:
https://aclanthology.org/K18-1002
DOI:
10.18653/v1/K18-1002
Cite (ACL):
Tianfan Fu, Cheng Zhang, and Stephan Mandt. 2018. Continuous Word Embedding Fusion via Spectral Decomposition. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 11–20, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal):
Continuous Word Embedding Fusion via Spectral Decomposition (Fu et al., CoNLL 2018)
PDF:
https://aclanthology.org/K18-1002.pdf