NPVec1: Word Embeddings for Nepali - Construction and Evaluation

Word Embedding maps words to vectors of real numbers. It is derived from a large corpus and is known to capture semantic knowledge from the corpus. Word Embedding is a critical component of many state-of-the-art Deep Learning techniques. However, generating good Word Embeddings is a special challenge for low-resource languages such as Nepali due to the unavailability of large text corpora. In this paper, we present NPVec1, which consists of 25 state-of-the-art Word Embeddings for Nepali that we have derived from a large corpus using GloVe, Word2Vec, fastText, and BERT. We further provide intrinsic and extrinsic evaluations of these Embeddings using well-established metrics and methods. These models are trained on 279 million word tokens and are the largest Embeddings ever trained for the Nepali language. Furthermore, we have made these Embeddings publicly available to accelerate the development of Natural Language Processing (NLP) applications in Nepali.


Introduction
Recent Deep Learning (DL) techniques provide state-of-the-art performance in almost all Natural Language Processing (NLP) tasks such as Text Classification (Conneau et al., 2016; Yao et al., 2019; Zhou et al., 2015), Question Answering (Peters et al., 2018; Devlin et al., 2018), Named Entity Recognition (Huang et al., 2015; Lample et al., 2016) and Sentiment Analysis (Zhang et al., 2018; Severyn and Moschitti, 2015). DL techniques are attractive due to their capacity to learn complex and intricate features automatically from raw data (Li et al., 2020). This significantly reduces the time and effort required for feature engineering, a costly step in traditional feature-based approaches that demands a considerable amount of engineering and domain expertise. Thus, DL techniques are very useful for low-resource languages such as Nepali.
Many Deep Learning techniques require Word Embeddings to represent each word by a vector of real numbers. Word Embeddings learn a meaningful representation of words directly from a large unlabeled corpus using co-occurrence statistics. The closer the word representations are to actual meanings, the better the performance. Consequently, Word Embeddings have received special attention from the research community and are predominantly used in current NLP research.
Word Embeddings can generally be divided into two categories: Context-Independent embeddings such as GloVe (Pennington et al., 2014), Word2Vec (Mikolov et al., 2013), and fastText (Bojanowski et al., 2017), and Context-Dependent embeddings such as BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) and ELMo (Embeddings from Language Models) (Peters et al., 2018). A context-dependent word embedding is generated for a word as a function of the sentence it occurs in. Thus, it can learn multiple representations for polysemous words (Peters et al., 2018). To learn these deep contextualized representations, BERT uses a transformer-based architecture pretrained on Masked Language Modelling and Next Sentence Prediction tasks, whereas ELMo uses a Bidirectional LSTM architecture to combine forward and backward language models.
In this paper, we present NPVec1, a suite of Word Embedding resources for Nepali, a low-resource language which is the official language and de-facto lingua franca of Nepal. It is spoken by more than 20 million people, mainly in Nepal and in many other places in the world including Bhutan, India, and Myanmar (Niraula et al., 2020). Even though Word Embeddings can be learned directly from raw text in an unsupervised fashion, gathering a large amount of data for training remains a huge challenge in itself for a low-resource language such as Nepali. In addition, Nepali is a morphologically rich language with multiple agglutinative suffixes as well as affix inflections, which poses challenges during preprocessing, i.e., tokenization, normalization, and stemming.
We have collected data over many years and combined it with multiple other publicly available data sets to generate a suite of Word Embeddings, i.e., NPVec1, using GloVe, Word2Vec, fastText, and BERT. It consists of 25 Word Embeddings corresponding to different preprocessing schemes. In addition, we perform intrinsic and extrinsic evaluations of the generated Word Embeddings using well-established methods and metrics. Our pretrained Embedding models and resources are made publicly available 1 to accelerate the development of NLP research and applications in the Nepali language.
The novel contributions of this study are:
• First formal analyses of different Word Embeddings in Nepali language using intrinsic and extrinsic methods.
• First study of effects of preprocessing such as normalization, tokenization and stemming in different Word Embeddings in Nepali language.
• First contextualized word embedding (BERT) generation and evaluation in Nepali language.
• The largest Word2Vec, GloVe, fastText and BERT based Word Embeddings ever trained and made available for Nepali language to date.
The rest of this paper is organized as follows. We review related works in Section 2. We describe the data collection and corpus construction in Section 3. We describe our experiments to develop Word Embedding methods in Section 4. We present model evaluations in Section 5 and conclusion and future directions in Section 6.

Related Works
Word Embeddings provide continuous word representations and are the building blocks of many NLP applications. They capture distributional information of words from large corpora. This information helps machine learning models generalize, especially when the data set is limited. Word Embedding tools, technologies, and pre-trained models are widely available for resource-rich languages such as English (Mikolov et al., 2013; Pennington et al., 2014) and Chinese (Li et al., 2018; Chen et al., 2015). Due to the wide use of Word Embeddings, pre-trained models are increasingly available for resource-poor languages such as Portuguese (Hartmann et al., 2017), Arabic (Elrazzaz et al., 2017; Soliman et al., 2017), and Bengali (Ahmad and Amin, 2016).
Most Word Embedding algorithms are unsupervised, which means that they can be trained for any language as long as corpus data is available. One such effort is by Grave et al. (2018), who generated and made available word vectors for 157 languages, including Nepali, using Wikipedia and Common Crawl data. The pre-trained models for Skip-gram and CBOW are available at https://fasttext.cc. Another useful resource is http://vectors.nlpl.eu/repository, a community repository for Word Embeddings maintained by the Language Technology Group at the University of Oslo (Kutuzov et al., 2017). It currently hosts 209 pre-trained Word Embeddings for many languages, but not Nepali.
Word Embeddings for Nepali have been derived on a small scale by Grave et al. (2018) using fastText and by Lamsal (2019) using Word2Vec. Both of these efforts have major limitations. First, they have limited diversity in the corpus: Grave et al. use Wikipedia and Common Crawl data, while Lamsal uses a news corpus. Second, their corpora are very small compared to ours (Section 3). Third, they do not provide any evaluation of the generated models. Fourth, they have done limited or no preprocessing on the data. We show later in Section 3.3 that tokenization and text normalization are critical for processing morphologically rich Nepali text. In contrast, we have conducted a large-scale study of Word Embeddings on larger and more diverse data sets using GloVe, fastText, and Word2Vec. Our corpus is nearly four times bigger than the corpora used by the aforementioned approaches (see Section 3). We constructed eight input corpora, one for each combination of three binary preprocessing options (Tokenization, Normalization, and Stemming), which resulted in 24 pre-trained Embeddings for GloVe, Word2Vec, and fastText combined. Additionally, we trained BERT for one of these preprocessing schemes and performed intrinsic and extrinsic evaluations for each of these 25 models.

Corpus Preparation
In this section, we present our data sources and the preprocessing techniques applied to the corpus. To help readers understand the Nepali words used in this paper, we have provided a gloss in Section 8 with their transliterations and English translations.

The Corpus
Our corpus consists of a mixture of news, Wikipedia articles, and OSCAR (Ortiz Suárez et al., 2019) corpus. We summarize the data sets in Table 1.

News Corpus
We crawled Nepali online news media over a year and collected more than 700,000 unique news articles (∼ 3GB). As expected, the news articles cover diverse topics including politics, sports, technology, society, and so on. We obtained another news data set from IEEE DataPort (Lamsal, 2020) (1.7GB).

OSCAR Nepali Corpus
We obtained the shuffled data in deduplicated form (1.2GB) for Nepali language from OSCAR (Open Super-large Crawled ALMAnaCH coRpus) (Ortiz Suárez et al., 2019). 2 It is a large multilingual corpus obtained by language classification and filtering of the Common Crawl corpus. Common Crawl 3 is a non-profit organization which collects data through web crawling and makes it publicly available.

Deduplication
We collected data from multiple sources which might have crawled the same pages. Furthermore, there was some boilerplate text in the data. It was therefore important to remove duplicate texts from the corpus. To do so, we followed an approach similar to Grave et al. (2018): we computed a hash for each sentence and kept the sentence only if the hash had not been seen before. This removed ∼ 22% of the sentences in our corpus as duplicates.
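The hash-based deduplication described above can be sketched as follows. This is a minimal illustration; the exact hash function and sentence normalization used in the actual pipeline are not specified in the text, so SHA-1 over the stripped sentence is an assumption.

```python
import hashlib

def deduplicate(sentences):
    """Keep each sentence only the first time its hash is seen."""
    seen = set()
    unique = []
    for sentence in sentences:
        # Hash the stripped sentence; storing digests keeps the set compact
        # even for hundreds of millions of sentences.
        h = hashlib.sha1(sentence.strip().encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(sentence)
    return unique
```

Hashing digests rather than storing raw sentences keeps memory bounded, which matters at the ∼3GB corpus scale described above.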

Preprocessing
After removing duplicates, we discarded sentences with fewer than 10 characters, as they provide little context for learning Word Embeddings. We also removed punctuation and replaced numbers with a special NN token. We then applied the following Normalization, Tokenization, and Stemming preprocessing techniques to derive the corpora for the study.
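The filtering and substitution steps above can be sketched as follows. The exact punctuation set and number pattern used in the paper are not listed, so the characters below (including the Devanagari danda and digits) are assumptions for illustration.

```python
import re

MIN_CHARS = 10  # sentences shorter than this carry little context

# Assumed punctuation inventory, including the Devanagari danda (।).
PUNCT_RE = re.compile("[" + re.escape(",.!?;:।'\"()[]{}-") + "]")
# Match both Arabic (0-9) and Devanagari (०-९) digit runs.
NUM_RE = re.compile(r"[0-9०-९]+")

def clean_sentence(sentence):
    """Replace number tokens with NN and strip punctuation."""
    sentence = NUM_RE.sub(" NN ", sentence)
    sentence = PUNCT_RE.sub(" ", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

def preprocess(sentences):
    """Drop short sentences, then clean the rest."""
    return [clean_sentence(s) for s in sentences if len(s) >= MIN_CHARS]
```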

Normalization
Analogous to how English has different cases (lower/upper) with no phonetic difference, Nepali has different written vowel signs which, when spoken, are indistinguishable from each other. For example, the two written forms नेपाली and नेपालि (Nepali) are spoken the same way even though their written representations differ. Thus, people often mistakenly use multiple written versions of the same word, which introduces noise in the data set. Normalization, in the context of this study, is the identification of all these variants and their mapping to the same word.
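A normalization step of this kind can be sketched with a character translation table. The mapping below is an illustrative assumption (long vowel signs folded onto short ones, and phonetically merged sibilants collapsed); the paper's actual normalizer may use a different character inventory.

```python
# Illustrative normalization table: fold long (dirgha) vowel signs onto
# their short (hrasva) counterparts so variant spellings collapse to one
# form. The exact character set is an assumption for this sketch.
NORMALIZATION_TABLE = str.maketrans({
    "ी": "ि",  # dependent long ii -> short i
    "ू": "ु",  # dependent long uu -> short u
    "ई": "इ",  # independent long I -> short I
    "ऊ": "उ",  # independent long U -> short U
    "श": "स",  # sha -> sa (phonetically merged in common usage)
    "ष": "स",  # ssa -> sa
})

def normalize(word):
    """Map every variant character to its canonical form."""
    return word.translate(NORMALIZATION_TABLE)
```

After this step, both spellings of the example word map to the same string, so the embedding model sees a single vocabulary entry.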

Tokenization
The Nepali language has multiple post-positional and agglutinative suffixes such as ले, मा, बाट, देखि, etc., which can be compounded with nouns and pronouns to produce new words. For example, the word नेपाली (Nepalese) can be compounded as नेपालीले (Nepalese did), नेपालीहरू (Nepalese + plural), नेपालीको (of Nepalese), and so forth. These words can be tokenized as नेपाली + ले, नेपाली + हरू, and नेपाली + को, which drastically reduces the vocabulary size without the loss of any linguistic functionality. Tokenization, in this context, means splitting such compound words into their components.
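A suffix-splitting tokenizer of this kind can be sketched as below. The suffix list is a small illustrative sample, not the full inventory the paper's tokenizer handles, and the minimum-stem-length guard is an assumption.

```python
# Illustrative sample of post-positional / agglutinative suffixes.
SUFFIXES = ["देखि", "हरू", "लाई", "बाट", "ले", "मा", "को", "का", "की"]

def split_suffix(word):
    """Split a word into [stem, suffix] if it ends with a known suffix.

    Longer suffixes are tried first so e.g. a 4-character suffix is not
    shadowed by a 2-character one. A minimum stem length of 2 guards
    against splitting short standalone words.
    """
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 2:
            return [stem, suffix]
    return [word]
```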

Stemming
In addition, there are other case markers and bound suffixes that primarily inflect verbs to produce new words. For example, from the same root word खा (eat), words such as खायो (ate), खाँदै (eating), खाएको (had eaten), and खाएर (after eating) can be constructed. Stemming, in this context, means the reduction of all such inflected words to their base forms.
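A rule-based stemmer for such verb inflections can be sketched as follows. The suffix list and the length guard are assumptions for this sketch; a production stemmer (such as the Koirala and Shakya (2018) approach the study builds on) needs a much fuller rule set to limit over- and under-stemming.

```python
# Illustrative verb-inflection suffixes; longest first avoids partial strips.
VERB_SUFFIXES = ["एको", "ँदै", "एर", "यो"]

def stem(word):
    """Strip the first matching inflectional suffix to recover the root."""
    for suffix in sorted(VERB_SUFFIXES, key=len, reverse=True):
        # Require at least two characters of root to survive the strip.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word
```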
For the purpose of this study, we improved upon the preprocessing techniques developed by Koirala and Shakya (2018) for preprocessing (normalizing, tokenizing, and stemming) our corpus. Specifically, we generated eight corpora corresponding to the different combinations of these three preprocessing techniques. The final eight corpora are listed in Table 2.
[Table 1 (fragment): columns Corpus, Tokens, Types, Genre, Description; Our News Corpus — 216M tokens, 3.3M types, News, online news; Lamsal (Lamsal, 2020) — …]

Context-independent Word Embeddings
We chose three state-of-the-art methods for obtaining context-independent Word Embeddings, namely Word2Vec, fastText, and GloVe. Word Embeddings from these methods were learned with the same parameters for a fair comparison. We fixed the vector dimension to 300 and set the minimum word frequency, window size, and negative sampling size each to 5. Word2Vec and fastText models were trained via the Gensim (Řehůřek and Sojka, 2010) implementation using the skip-gram method, whereas GloVe embeddings were trained via the tool provided by StanfordNLP (https://github.com/stanfordnlp/GloVe).

Context-dependent Word Embeddings
We chose BERT to learn context-dependent embeddings. We trained a BERT model using Huggingface's transformers library (Wolf et al., 2019). Unlike the other word embedding models, the BERT model was trained on only one preprocessing scheme, i.e., base+normalized+tokenized (BNT) 5, due to resource constraints. For the same reason, we reduced both the number of hidden layers and the number of attention heads to 6 and the hidden dimension to 300, unlike the original implementation's 12 hidden layers, 12 attention heads, and 768 hidden dimensions. The maximum sequence size was set to 512, and the maximum vocabulary size for BERT's wordpiece tokenizer was set to 30,000. Our implementation of BERT has 22.5M parameters (in contrast to the 110M parameters of the original BERT-base implementation). Also, unlike BERT's original implementation, which is pre-trained on both Masked Language Modelling (MLM) and Next Sentence Prediction, we pre-trained ours on the MLM objective alone, for just a single epoch, due to limited computing resources.
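The slimmed-down configuration described above can be expressed with the transformers library roughly as follows. The vocabulary, sequence, layer, head, and hidden sizes match the text; the intermediate size is an assumption (4 × hidden, following the BERT-base ratio), as the paper does not state it.

```python
from transformers import BertConfig, BertForMaskedLM

# Slimmed-down BERT configuration described in the text.
config = BertConfig(
    vocab_size=30_000,            # wordpiece vocabulary cap
    hidden_size=300,              # vs. 768 in BERT-base
    num_hidden_layers=6,          # vs. 12 in BERT-base
    num_attention_heads=6,        # vs. 12; 300 / 6 = 50 dims per head
    intermediate_size=1200,       # assumption: 4 x hidden, as in BERT-base
    max_position_embeddings=512,  # maximum sequence size
)

# Randomly initialized model, pre-trained on the MLM objective only.
model = BertForMaskedLM(config)
```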

Intrinsic Evaluation
Intrinsic evaluation of word embedding models is commonly performed on tasks such as analogies (Grave et al., 2018). There is, however, no such data set available for the Nepali language. Thus, we followed the clustering approach suggested in (Soliman et al., 2017), which requires a manually constructed data set of terms in different themes (clusters). The goal then is to recover these themes (clusters) using the learned word representations. We constructed the following two data sets for evaluation purposes.

Relatedness Set
This set consisted of twenty-one words each from two different topics, i.e., kitchen and nature. The kitchen topic included words such as चिनी (sugar), नुन (salt), and भाँडो (pot), whereas the nature topic included words such as हिमाल (mountain), पहाड (hill), and खोला (river). The Relatedness data set is presented in Table 3.

Sentiment Set
This set consisted of nineteen words each of positive and negative sentiment. The positive sentiment set included words such as राम्रो (good), ठूलो (big), and न्याय (justice), whereas the negative sentiment set included their antonyms such as नराम्रो (bad), सानो (small), and अन्याय (injustice). The Sentiment data set is presented in Table 3.
Ideally, word embeddings should capture both the relatedness and the similarity properties of a word. These two notions are related but are not the same (Banjade et al., 2015). For example, chicken and egg are less similar (living vs. non-living) but are highly related, as they often appear together. The Relatedness and Sentiment sets were developed to evaluate the models on these two aspects.
For each of these cases (sentiment and relatedness), K-Means clustering was applied to the constituent words to generate two clusters (i.e., K=2). The obtained clusters were evaluated using the purity metric, which is further elaborated in Section 5.1.3. Since Word2Vec and GloVe, unlike fastText and BERT, cannot handle out-of-vocabulary (OOV) words, the average of all word vectors was used to represent OOV words.
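The clustering setup, including the mean-vector fallback for OOV words, can be sketched as follows. This is a minimal illustration with scikit-learn; the evaluation's actual K-Means settings (initialization, seeds) are not given in the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_words(words, vectors, k=2):
    """Cluster word vectors with K-Means (K=2, one cluster per theme).

    `vectors` maps in-vocabulary words to numpy arrays. OOV words fall
    back to the mean of all known vectors, mirroring the fallback
    described above for Word2Vec and GloVe.
    """
    mean_vec = np.mean(list(vectors.values()), axis=0)
    X = np.array([vectors.get(w, mean_vec) for w in words])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```

The predicted cluster labels are then scored against the gold themes with the purity metric of Section 5.1.3.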
While the Word2Vec, fastText, and GloVe models provide a simple word-to-vector mapping, BERT's learned representations are different and thus need to be extracted accordingly. For simplicity, we averaged the hidden states of the last two hidden layers to obtain the embedding for each word token. The words were encoded without any surrounding context.
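Extracting a static vector from BERT by averaging the last two hidden layers can be sketched as below. A tiny randomly initialized model stands in for the trained one here, and feeding raw token ids replaces the wordpiece tokenizer; both are simplifications for illustration.

```python
import torch
from transformers import BertConfig, BertModel

# Tiny stand-in model; the real extraction uses the trained NPVec1 BERT.
config = BertConfig(vocab_size=100, hidden_size=12, num_hidden_layers=4,
                    num_attention_heads=2, intermediate_size=24,
                    output_hidden_states=True)
model = BertModel(config)
model.eval()

def word_vector(token_ids):
    """Average the last two hidden layers, then average over the word's
    subword tokens, yielding one static vector per word."""
    with torch.no_grad():
        out = model(torch.tensor([token_ids]))
    # hidden_states: tuple of (num_layers + 1) tensors [1, seq, hidden]
    last_two = torch.stack(out.hidden_states[-2:]).mean(dim=0)
    return last_two.mean(dim=1).squeeze(0)
```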

Purity
The purity metric is an extrinsic cluster evaluation technique (Manning et al., 2008) which requires a gold standard data set. It measures the extent to which a cluster contains homogeneous elements. The purity metric ranges from 0 (bad clustering) to 1 (perfect clustering). Thus, the higher the purity score, the better the results.
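The purity computation described above (Manning et al., 2008) can be written compactly: each cluster contributes the count of its majority gold class, and the total is divided by the number of items.

```python
from collections import Counter

def purity(cluster_labels, gold_labels):
    """Fraction of items assigned to the majority gold class of their
    cluster; ranges from near 0 (bad) to 1 (perfect clustering)."""
    clusters = {}
    for c, g in zip(cluster_labels, gold_labels):
        clusters.setdefault(c, []).append(g)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(gold_labels)
```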

Results for Intrinsic Evaluation
The results of the intrinsic evaluations are listed in Table 4. All models performed better at recovering the original clusters in the Relatedness Set than in the Sentiment Set, i.e., they have higher purity scores on the Relatedness Set. This is expected, as semantically opposite words often appear in very similar contexts (e.g., This is a new model vs. This is an old model); relying on neighboring terms alone provides little signal to separate them. Among the three context-independent models, however, GloVe performed the best on the sentiment set by an average of 10% (except in the BNTS scheme). This seems to make it more suitable for tasks such as Sentiment Analysis. Interestingly, the BERT model did not perform well compared to the other models on the Relatedness set. It did, however, provide a very competitive score on the Sentiment Set.
Models in the BNT scheme scored highest on both of the intrinsic data sets. Purity on the relatedness task was 1 for all three models in this scheme, and the GloVe model obtained the global best score of 0.69 on the sentiment set in this scheme. In general, applying Normalization appears to have a positive effect on a model's capacity to learn representations, which makes sense because Normalization reduces differently spelled versions of the same word to a single representation. Purity dropped significantly for all tasks in all schemes that included Stemming. This may be attributed to possible over-stemming of the words (under-stemming does not seem to be a problem, because the models perform well in the Base scheme).

Extrinsic Evaluation
The primary objective of the extrinsic evaluation in this study was to compare how well the word embeddings help other supervised models generalize when trained with very few labeled examples. For this purpose, a feed-forward neural network architecture was used for a multi-class classification objective.

Data
The data set for classification was derived from a publicly available GitHub repository, the Nepali News Dataset 6. It consists of Nepali news articles in 10 different categories, with 1000 articles per category. As mentioned, the goal of the extrinsic evaluation is to see how the learned word representations help a machine learning model generalize on a text classification task when a limited training set is available, a practical scenario for a low-resource language.

Architecture
We implemented a very simple text classification model using Keras 7. For each example (news article), we used only the first five hundred tokens and obtained their embedding vectors from the word embedding model under study. These vectors were fed to a Keras model where they were first pooled together by a one-dimensional averaging layer, then passed to a hidden layer of 64 units with ReLU activation, and finally to an output layer of 10 units with Sigmoid activation. The binary cross-entropy function was used to compute the loss, and each model was trained with the Adam Optimizer (Kingma and Ba, 2014) for 60 epochs. In the case of BERT, we averaged the hidden states from the last two hidden layers to obtain the embeddings. For the baseline results, instead of using pre-trained word vectors, a trainable Keras embedding layer was placed in front of the architecture described above; this layer learns the word embeddings using only the provided training examples.
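The pre-trained-embedding variant of this classifier can be sketched as below. Layer sizes, activations, and the loss follow the description above; the choice of `GlobalAveragePooling1D` as the "one-dimensional averaging layer" is an assumption, as the paper does not name the exact Keras layer.

```python
from tensorflow import keras

MAX_TOKENS, EMB_DIM, NUM_CLASSES = 500, 300, 10

# Inputs are pre-computed embedding vectors for the first 500 tokens of
# each article, so no trainable embedding layer is needed here.
model = keras.Sequential([
    keras.layers.Input(shape=(MAX_TOKENS, EMB_DIM)),
    keras.layers.GlobalAveragePooling1D(),            # average over tokens
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(NUM_CLASSES, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# Training as described: model.fit(X, y, epochs=60)
```

The baseline variant would instead take integer token ids and prepend a trainable `keras.layers.Embedding` layer in front of the same stack.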

Results for Extrinsic Evaluation
Macro Precision, Recall, and F1 metrics were used to evaluate the classification model. On average, the F1 scores of the word embedding models exceeded the baseline scores by a margin of 5 percent. This suggests that pre-trained word embeddings help classification models generalize better than embeddings learned from the training set alone. Interestingly, the global maximum F1 score was obtained in the Base scheme, i.e., with no preprocessing applied, and Normalization seemed to make no difference to the score. This can be attributed to the fact that our data set came from highly reputed newspapers, i.e., the word spellings were almost all correct. We foresee a significant benefit from Normalization on data sets such as tweets, social media posts, and blogs, where spelling errors are more frequent.
Similarly, Tokenization schemes seemed to lower the classification scores for the embedding models but raise the scores for the baseline models in general. This leads us to believe that the representations of post-positions and agglutinative suffixes, which are the most frequently occurring words in the Nepali language, learned by the Word Embedding models may be biased toward particular topics. We suggest omitting post-positions and other highly frequent words from the data set before using these embeddings in a classification setting.
The standard deviations of the F-scores of the Word2Vec, fastText, and GloVe models across the different preprocessing schemes are 2.4%, 1%, and 1.4% respectively, which suggests that fastText might be more resilient to problems like over-stemming. We thus recommend using fastText models in applications where stemming is desirable. Interestingly, the BERT model, while producing competitive results, did not exceed our expectations on the classification task. We expect a rise in the performance of this model if it is trained with the architecture proposed in the original implementation, i.e., 12 attention heads and 12 hidden layers, unlike our slimmed-down version of 6 attention heads and 6 hidden layers trained for only one epoch. Training on more data and for more epochs are potential future directions to this end.

Conclusion and Future Work
In this paper, we trained 25 Word Embedding models for the Nepali language with multiple preprocessing schemes and made them publicly available to accelerate NLP research in the low-resource language Nepali 8. This, to our knowledge, is the first formal and large-scale study of Word Embeddings in Nepali. We compared the performance of these models using intrinsic and extrinsic evaluation tasks. Our findings clearly indicate that these word embedding models perform far better at identifying related words than at discovering semantically similar words. We also suggest that further comparisons be made with an improved stemmer, one with lower over-stemming error rates than the stemmer we used, to study the effects of over-stemming on word embeddings. The performance of these Word Embeddings in clustering related words also suggests that these models will obtain good results in tasks such as Named Entity Recognition and POS Tagging. This is something we would like to explore in the future.
As for our study with BERT, we recommend training the original BERT architecture, rather than the slimmed-down version we used, with more data. For comparison, the original BERT model