Robust Backed-off Estimation of Out-of-Vocabulary Embeddings

Nobukazu Fukuda, Naoki Yoshinaga, Masaru Kitsuregawa


Abstract
Out-of-vocabulary (OOV) words cause serious problems when solving natural language tasks with neural networks. Existing approaches to this problem resort to subwords, which are shorter and more ambiguous units than words, to represent an OOV word as a bag of subwords. In this study, inspired by the processes by which new words are coined from known words, we propose a robust method of estimating OOV word embeddings by referring to pre-trained embeddings of known words whose surfaces are similar to the target OOV word. We collect known words by segmenting the OOV word and by approximate string matching, and then aggregate their pre-trained embeddings. Experimental results show that the obtained OOV word embeddings improve not only word similarity tasks but also downstream tasks in the Twitter and biomedical domains, where OOV words frequently appear, even when the computed OOV embeddings are integrated into a strong BERT-based baseline.
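The back-off idea sketched in the abstract can be illustrated as follows: collect known vocabulary words whose surfaces resemble the OOV word, via segmentation (a known word embedded in the OOV surface) and approximate string matching, then average their pre-trained embeddings. This is a minimal sketch of that idea; the function names, the use of `difflib` similarity, and the threshold are illustrative assumptions, not the authors' exact method.

```python
# Hedged sketch of backed-off OOV embedding estimation: gather known words
# with similar surfaces, then average their pre-trained embeddings.
# All names and thresholds here are illustrative assumptions.
from difflib import SequenceMatcher
import numpy as np

def similar_known_words(oov, vocab, threshold=0.6):
    """Known words whose surface overlaps that of the OOV word."""
    matches = []
    for w in vocab:
        # segmentation-style match: a known word appears inside the OOV word
        if len(w) > 2 and w in oov:
            matches.append(w)
        # approximate string matching on the full surface
        elif SequenceMatcher(None, w, oov).ratio() >= threshold:
            matches.append(w)
    return matches

def backoff_embedding(oov, embeddings):
    """Average the pre-trained embeddings of surface-similar known words."""
    words = similar_known_words(oov, list(embeddings))
    if not words:  # no similar surface found: fall back to a zero vector
        return np.zeros_like(next(iter(embeddings.values())))
    return np.mean([embeddings[w] for w in words], axis=0)

# Toy pre-trained embeddings for demonstration.
emb = {
    "work": np.array([1.0, 0.0]),
    "home": np.array([0.0, 1.0]),
    "works": np.array([0.9, 0.1]),
}
vec = backoff_embedding("workin", emb)  # backs off to "work" and "works"
```

In this toy run, the OOV token "workin" matches "work" (contained in its surface) and "works" (high string similarity), so its estimated embedding is the mean of those two vectors.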
Anthology ID:
2020.findings-emnlp.434
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2020
Month:
November
Year:
2020
Address:
Online
Editors:
Trevor Cohn, Yulan He, Yang Liu
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
4827–4838
URL:
https://aclanthology.org/2020.findings-emnlp.434
DOI:
10.18653/v1/2020.findings-emnlp.434
Cite (ACL):
Nobukazu Fukuda, Naoki Yoshinaga, and Masaru Kitsuregawa. 2020. Robust Backed-off Estimation of Out-of-Vocabulary Embeddings. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4827–4838, Online. Association for Computational Linguistics.
Cite (Informal):
Robust Backed-off Estimation of Out-of-Vocabulary Embeddings (Fukuda et al., Findings 2020)
PDF:
https://aclanthology.org/2020.findings-emnlp.434.pdf
Data
NCBI Disease, SST, WikiText-103, WikiText-2