@inproceedings{navali-etal-2020-word,
title = "Word Embedding Binarization with Semantic Information Preservation",
author = "Navali, Samarth and
Sherki, Praneet and
Inturi, Ramesh and
Vala, Vanraj",
booktitle = "Proceedings of the 28th International Conference on Computational Linguistics",
month = dec,
year = "2020",
address = "Barcelona, Spain (Online)",
publisher = "International Committee on Computational Linguistics",
url = "https://aclanthology.org/2020.coling-main.108",
doi = "10.18653/v1/2020.coling-main.108",
pages = "1256--1265",
abstract = "With growing applications of Machine Learning in daily lives Natural Language Processing (NLP) has emerged as a heavily researched area. Finding its applications in tasks ranging from simple Q/A chatbots to Fully fledged conversational AI, NLP models are vital. Word and Sentence embedding are one of the most common starting points of any NLP task. A word embedding represents a given word in a predefined vector-space while maintaining vector relations with similar or dis-similar entities. As such different pretrained embedding such as Word2Vec, GloVe, fasttext have been developed. These embedding generated on millions of words are however very large in terms of size. Having embedding with floating point precision also makes the downstream evaluation slow. In this paper we present a novel method to convert continuous embedding to its binary representation, thus reducing the overall size of the embedding while keeping the semantic and relational knowledge intact. This will facilitate an option of porting such big embedding onto devices where space is limited. We also present different approaches suitable for different downstream tasks based on the requirement of contextual and semantic information. Experiments have shown comparable result in downstream tasks with 7 to 15 times reduction in file size and about 5 {\%} change in evaluation parameters.",
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="navali-etal-2020-word">
<titleInfo>
<title>Word Embedding Binarization with Semantic Information Preservation</title>
</titleInfo>
<name type="personal">
<namePart type="given">Samarth</namePart>
<namePart type="family">Navali</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Praneet</namePart>
<namePart type="family">Sherki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ramesh</namePart>
<namePart type="family">Inturi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Vanraj</namePart>
<namePart type="family">Vala</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2020-12</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 28th International Conference on Computational Linguistics</title>
</titleInfo>
<originInfo>
<publisher>International Committee on Computational Linguistics</publisher>
<place>
<placeTerm type="text">Barcelona, Spain (Online)</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>With the growing application of machine learning in daily life, Natural Language Processing (NLP) has emerged as a heavily researched area. Finding applications in tasks ranging from simple Q/A chatbots to fully fledged conversational AI, NLP models are vital. Word and sentence embeddings are among the most common starting points of any NLP task. A word embedding represents a given word in a predefined vector space while maintaining vector relations with similar or dissimilar entities. Accordingly, various pretrained embeddings such as Word2Vec, GloVe, and fastText have been developed. These embeddings, generated over millions of words, are however very large in size. Storing embeddings at floating-point precision also makes downstream evaluation slow. In this paper we present a novel method to convert a continuous embedding to its binary representation, reducing the overall size of the embedding while keeping the semantic and relational knowledge intact. This facilitates porting such large embeddings onto devices where space is limited. We also present different approaches suitable for different downstream tasks based on the requirement for contextual and semantic information. Experiments have shown comparable results in downstream tasks with a 7 to 15 times reduction in file size and about a 5 % change in evaluation parameters.</abstract>
<identifier type="citekey">navali-etal-2020-word</identifier>
<identifier type="doi">10.18653/v1/2020.coling-main.108</identifier>
<location>
<url>https://aclanthology.org/2020.coling-main.108</url>
</location>
<part>
<date>2020-12</date>
<extent unit="page">
<start>1256</start>
<end>1265</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Word Embedding Binarization with Semantic Information Preservation
%A Navali, Samarth
%A Sherki, Praneet
%A Inturi, Ramesh
%A Vala, Vanraj
%S Proceedings of the 28th International Conference on Computational Linguistics
%D 2020
%8 December
%I International Committee on Computational Linguistics
%C Barcelona, Spain (Online)
%F navali-etal-2020-word
%X With the growing application of machine learning in daily life, Natural Language Processing (NLP) has emerged as a heavily researched area. Finding applications in tasks ranging from simple Q/A chatbots to fully fledged conversational AI, NLP models are vital. Word and sentence embeddings are among the most common starting points of any NLP task. A word embedding represents a given word in a predefined vector space while maintaining vector relations with similar or dissimilar entities. Accordingly, various pretrained embeddings such as Word2Vec, GloVe, and fastText have been developed. These embeddings, generated over millions of words, are however very large in size. Storing embeddings at floating-point precision also makes downstream evaluation slow. In this paper we present a novel method to convert a continuous embedding to its binary representation, reducing the overall size of the embedding while keeping the semantic and relational knowledge intact. This facilitates porting such large embeddings onto devices where space is limited. We also present different approaches suitable for different downstream tasks based on the requirement for contextual and semantic information. Experiments have shown comparable results in downstream tasks with a 7 to 15 times reduction in file size and about a 5 % change in evaluation parameters.
%R 10.18653/v1/2020.coling-main.108
%U https://aclanthology.org/2020.coling-main.108
%U https://doi.org/10.18653/v1/2020.coling-main.108
%P 1256-1265
Markdown (Informal)
[Word Embedding Binarization with Semantic Information Preservation](https://aclanthology.org/2020.coling-main.108) (Navali et al., COLING 2020)
ACL
Samarth Navali, Praneet Sherki, Ramesh Inturi, and Vanraj Vala. 2020. Word Embedding Binarization with Semantic Information Preservation. In Proceedings of the 28th International Conference on Computational Linguistics, pages 1256–1265, Barcelona, Spain (Online). International Committee on Computational Linguistics.
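For a concrete sense of what converting continuous embeddings to a binary representation involves, the following is a minimal illustrative sketch only, using simple per-dimension median thresholding with bit packing. It is not the semantic-preserving method proposed in the cited paper, and the names in it (e.g. `binarize_embeddings`) are hypothetical.

```python
import numpy as np

def binarize_embeddings(vectors: np.ndarray) -> np.ndarray:
    """Binarize continuous embeddings by thresholding each dimension
    at its median, then pack the resulting bits into uint8 bytes.

    Illustrative baseline only; the cited paper proposes its own
    semantic-information-preserving binarization approaches.
    """
    # Per-dimension median threshold: values above the median map to 1.
    thresholds = np.median(vectors, axis=0)
    bits = (vectors > thresholds).astype(np.uint8)
    # Pack 8 dimensions per byte: a 300-d float32 vector (1200 bytes)
    # becomes ceil(300 / 8) = 38 bytes per word.
    return np.packbits(bits, axis=1)

# Example: 10,000 words with 300-dimensional float32 embeddings.
emb = np.random.randn(10_000, 300).astype(np.float32)
packed = binarize_embeddings(emb)
print(emb.nbytes, "->", packed.nbytes, "bytes")
```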