NPVec1: Word Embeddings for Nepali - Construction and Evaluation

Pravesh Koirala, Nobal B. Niraula


Abstract
Word Embedding maps words to vectors of real numbers. It is derived from a large corpus and is known to capture semantic knowledge from that corpus. Word Embedding is a critical component of many state-of-the-art Deep Learning techniques. However, generating good Word Embeddings is particularly challenging for low-resource languages such as Nepali because large text corpora are unavailable. In this paper, we present NPVec1, which consists of 25 state-of-the-art Word Embeddings for Nepali that we have derived from a large corpus using GloVe, Word2Vec, FastText, and BERT. We further provide intrinsic and extrinsic evaluations of these Embeddings using well-established metrics and methods. These models are trained on 279 million word tokens and are the largest Embeddings ever trained for the Nepali language. Furthermore, we have made these Embeddings publicly available to accelerate the development of Natural Language Processing (NLP) applications in Nepali.
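To make the idea of mapping words to real-valued vectors concrete, the following is a minimal sketch of how semantic closeness is commonly measured with cosine similarity. The three-dimensional vectors and the romanized tokens are invented for illustration and are not taken from NPVec1; the function names are likewise hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only. They are NOT
# NPVec1 vectors; real embeddings usually have hundreds of dimensions.
toy_embeddings = {
    "raja": [0.9, 0.1, 0.3],    # hypothetical token, "king"
    "rani": [0.8, 0.2, 0.35],   # hypothetical token, "queen"
    "kitab": [0.1, 0.9, 0.2],   # hypothetical token, "book"
}


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings):
    """Return the other word whose vector is closest to `word` by cosine."""
    target = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    return max(candidates, key=lambda w: cosine_similarity(target, embeddings[w]))


if __name__ == "__main__":
    print(most_similar("raja", toy_embeddings))
```

With trained embeddings, the same nearest-neighbour lookup would run over the full vocabulary rather than three hand-written entries.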
Anthology ID:
2021.repl4nlp-1.18
Volume:
Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021)
Month:
August
Year:
2021
Address:
Online
Editors:
Anna Rogers, Iacer Calixto, Ivan Vulić, Naomi Saphra, Nora Kassner, Oana-Maria Camburu, Trapit Bansal, Vered Shwartz
Venue:
RepL4NLP
Publisher:
Association for Computational Linguistics
Pages:
174–184
URL:
https://aclanthology.org/2021.repl4nlp-1.18
DOI:
10.18653/v1/2021.repl4nlp-1.18
Cite (ACL):
Pravesh Koirala and Nobal B. Niraula. 2021. NPVec1: Word Embeddings for Nepali - Construction and Evaluation. In Proceedings of the 6th Workshop on Representation Learning for NLP (RepL4NLP-2021), pages 174–184, Online. Association for Computational Linguistics.
Cite (Informal):
NPVec1: Word Embeddings for Nepali - Construction and Evaluation (Koirala & Niraula, RepL4NLP 2021)
PDF:
https://aclanthology.org/2021.repl4nlp-1.18.pdf
Video:
https://aclanthology.org/2021.repl4nlp-1.18.mp4
Code:
nowalab/nepali-word-embeddings