LU-BZU at SemEval-2021 Task 2: Word2Vec and Lemma2Vec performance in Arabic Word-in-Context disambiguation

This paper presents a set of experiments to evaluate and compare the performance of CBOW Word2Vec and Lemma2Vec models for Arabic Word-in-Context (WiC) disambiguation, without using sense inventories or sense embeddings. As part of the SemEval-2021 Shared Task 2 on WiC disambiguation, we used the dev.ar-ar dataset (2k sentence pairs) to decide whether two words in a given sentence pair carry the same meaning. We used two Word2Vec models: Wiki-CBOW, a pre-trained model on Arabic Wikipedia, and another model we trained on large Arabic corpora of about 3 billion tokens. Two Lemma2Vec models were also constructed based on the two Word2Vec models. Each of the four models was then used in the WiC disambiguation task and evaluated on the SemEval-2021 test.ar-ar dataset. Finally, we report the performance of the different models and compare lemma-based with word-based models.


Introduction
As a word may denote multiple meanings (i.e., senses) in different contexts, disambiguating these senses is important for many NLP applications, such as information retrieval, machine translation, and summarization, among others. For example, the word "table" in the sentences "I am cleaning the table", "I am serving the table", and "I am emailing the table" refers to "furniture", "people", and "data", respectively. Disambiguating the sense that a word denotes in a given sentence is important for understanding the semantics of that sentence.
To automatically disambiguate word senses in a given context, many approaches have been proposed based on supervised, semi-supervised, or unsupervised learning models. Supervised and semi-supervised methods rely on full, or partial, labeling of the word senses in the training corpus to construct a model (Lee and Ng, 2002; Klein et al., 2002). On the other hand, unsupervised approaches induce senses from unannotated raw corpora and do not use lexical resources. The problem with such approaches is that unsupervised learning of word embeddings produces a single vector for each word across all contexts, thus ignoring its polysemy. Such approaches are called static word embeddings. To overcome this problem, two types of approaches have been suggested (Pilehvar and Camacho-Collados, 2018): multi-prototype embeddings and contextualized word embeddings. The latter models context embeddings as a dynamic contextualized word representation in order to capture complex characteristics of word use. Architectures such as ELMo (Peters et al., 2018), ULMFiT (Howard and Ruder, 2018), GPT (Radford et al., 2018), T5 (Raffel et al., 2019), and BERT (Devlin et al., 2018) achieved breakthrough performance on a wide range of natural language processing tasks. In multi-prototype embeddings, a set of embedding vectors is computed for each word, representing its senses. In Pelevina et al. (2017), multi-prototype embeddings are produced based on the embeddings of a word: a graph of similar words is constructed, the similar words are grouped into multiple clusters, and each cluster represents a sense. Mancini et al. (2016) produce multi-prototype embeddings by learning word and sense embeddings jointly from both a corpus and a semantic network. In this paper, we aim at using static word embeddings for WiC disambiguation.
Works on Arabic Word Sense Disambiguation (WSD) are limited, and the proposed approaches lack a decent or common evaluation framework. Additionally, the Arabic language has some specificities that may not be present in other languages. Although polysemy and disambiguation are challenging issues in all languages, they can be more challenging in the case of Arabic (Jarrar et al., 2018; Jarrar, 2021), for several reasons. First, Arabic is typically written without diacritics, so one written form may stand for several words. For example, the word šāhd ( ) could be šāhid ( ), which means a witness, or šāhada ( ), which means to watch. As such, disambiguating word senses in Arabic is similar to disambiguating senses of English words written without vowels. Second, Arabic is a highly inflected and derivational language: thousands of different word forms can be inflected and derived from the same stem. Such forms are treated as distinct words in word embedding models, which may affect the accuracy and utility of their representation vectors, since the same meaning is scattered across many word forms in corpora. This has led some researchers to suggest that lemma-based models might be better than word-based embeddings for Arabic (Salama et al., 2018; Shapiro and Duh, 2018). This idea is discussed later in Sections 5 and 6. Alkhatlan et al. (2018) suggested an Arabic WSD approach based on Stem2Vec and Sense2Vec. Stem2Vec is produced by training word embeddings after stemming a corpus, whereas Sense2Vec is produced from the Arabic WordNet sense inventory, such that each synset is represented by a vector. To determine the sense of a given word in a sentence, the sentence vector is compared with every sense vector, and the sense with the maximum similarity is selected.
Laatar et al. (2017) did not use either stemming or lemmatization. Instead, they proposed to determine the sense of a word in context by comparing the context vector with a set of sense vectors, then the vector with the maximum similarity is selected. The context vector is computed as the sum of vectors of all words in a given context, which are learnt from a corpus of historical Arabic. On the other hand, sense vectors are produced based on dictionary glosses. Each sense vector is computed as the sum of vectors (learnt from the historical Arabic corpus) of all words in the gloss of a word.
In this paper, we present a set of experiments to evaluate the performance of using Lemma2Vec vs. CBOW Word2Vec in Arabic WiC disambiguation. The paper is structured as follows: Section 2 presents the background of this work. Section 3 overviews the WiC disambiguation system. Section 4 and Section 5, respectively, present the Word2Vec and Lemma2Vec models. In Section 6, we present the experiments and the results; and in Section 7, we summarize our conclusions and future work.

Background
Experiments presented in this paper are part of the SemEval shared task for Word-in-Context disambiguation (Martelli et al., 2021).
The task aims at capturing the polysemous nature of words without relying on a fixed sense inventory. A common evaluation dataset is provided to participants in five languages, including Arabic, our target language in this paper. The dataset was carefully designed to include all parts of speech and to cover many domains and genres. The Arabic dataset (called multilingual ar-ar) consists of two sets: a train set of 1000 sentence pairs for which tags (TRUE or FALSE) are revealed, and a test set of 1000 sentence pairs for which tags were kept hidden during the competition. Figure 1 gives two examples of sentence pairs in the dev.ar-ar dataset. Each sentence pair has a word in common, for which the start and end positions in the sentences are provided. Participants in the shared task were asked to infer whether the target word carries the same meaning (TRUE) or not (FALSE) in the two sentences.

System Overview
This section describes our method for Arabic WiC disambiguation based on two types of embeddings: CBOW Word2Vec and Lemma2Vec. Given two sentences, s1 and s2, and two words, vi from s1 and wj from s2, the objective is to check whether vi and wj have the same meaning. To this end, our system extracts the contexts of vi and wj from the sentence pair, represents them as two vectors, and finally compares the two resulting vectors using the cosine similarity. The context of a word w of size n (denoted by context(w, n)) is composed of the words that surround w, with n words on the left and n words on the right (n varying between 1 and 10 in the conducted experiments). To represent context(w, n) in a vector space, two methods are proposed: the first is based on the CBOW Word2Vec embedding vectors (Mikolov et al., 2013) of the words appearing in the context, whereas the second is based on the Lemma2Vec vectors of the lemmas of the words appearing in the context. To select the best way to represent context(w, n) by a vector, classification experiments were conducted using (i) different pooling operations (min, max, mean, and std) to combine the word/lemma vectors of the context, (ii) different threshold values (between 0.55 and 0.85), and (iii) the removal of functional words (also called stop words). The latter are used to express grammatical relationships among other words and are characterized by their high frequency in the corpus, which might affect the WiC disambiguation accuracy. The cosine similarity is then used to compare the vectors of context(vi, n) and context(wj, n). Figure 2 illustrates how the cosine similarity is calculated from context(vi, 3) and context(wj, 3).
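The context extraction, pooling, and similarity check described above can be sketched as follows. This is a minimal illustration, not the system's actual code: the function names, the plain-dict embedding lookup, and the default threshold value are our own assumptions.

```python
import numpy as np

def context(tokens, target_idx, n, stop_words=None):
    """Return up to n tokens on each side of the target word,
    optionally dropping functional (stop) words."""
    left = tokens[max(0, target_idx - n):target_idx]
    right = tokens[target_idx + 1:target_idx + 1 + n]
    window = left + right
    if stop_words:
        window = [t for t in window if t not in stop_words]
    return window

def context_vector(window, embeddings, pooling="mean"):
    """Pool the embedding vectors of the context words into one vector.
    Words missing from the embedding model are skipped."""
    vecs = [embeddings[w] for w in window if w in embeddings]
    if not vecs:
        return None
    ops = {"mean": np.mean, "max": np.max, "min": np.min, "std": np.std}
    return ops[pooling](np.stack(vecs), axis=0)

def same_meaning(v1, v2, threshold=0.65):  # threshold value is illustrative
    """TRUE if the cosine similarity of the two context vectors
    reaches the tuned threshold."""
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return cos >= threshold
```

The same pipeline serves both settings: with a Word2Vec model the window tokens are looked up directly, and with a Lemma2Vec model they are lemmatized first and the lemma vectors are pooled instead.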
Classification experiments on the SemEval-2021 ar-ar datasets were conducted using the following two CBOW Word2Vec models and two corresponding Lemma2Vec models: (i) Wiki-CBOW, a pre-trained Word2Vec model from the set of AraVec models (Soliman et al., 2017); (ii) a CBOW Word2Vec model that we trained ourselves; (iii) a Lemma2Vec model that we constructed based on the Wiki-CBOW model; and (iv) a Lemma2Vec model that we constructed based on our CBOW Word2Vec model. Based on these four models, four experiments were conducted to tune the following parameters: context size (context size), threshold, pooling operation (pooling), and removal of functional words (stop words). Before training our Word2Vec model, several normalization and preprocessing steps were performed. First, all diacritics, punctuation marks, the Madda character, digits (Hindi and Arabic), and Latin characters (including accented letters) were removed. Second, some special Arabic letters were unified. Third, sequences of repeated characters longer than two were reduced to one character; repeated spaces were also replaced by a single space. Fourth, different forms of Alif ( ) were replaced with ( ). Spaces followed by a period character, as well as new lines, were considered end-of-sentence marks. The split method in Python was used for tokenization. The vocabulary size of the resulting model is 334,161.
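The normalization steps above can be sketched roughly as follows. The exact Unicode ranges and the order of operations are assumptions for illustration; the actual preprocessing code may differ.

```python
import re

# Arabic short-vowel diacritics (U+064B-U+0652), the Madda mark, and tatweel
DIACRITICS = re.compile(r"[\u064B-\u0652\u0653\u0640]")
# Arabic (Hindi) digits, ASCII digits, and Latin letters incl. accented ones
DIGITS_LATIN = re.compile(r"[0-9\u0660-\u0669A-Za-z\u00C0-\u00FF]")
PUNCT = re.compile(r"[^\w\s]")

def normalize(text):
    text = DIACRITICS.sub("", text)    # drop diacritics, Madda, tatweel
    text = DIGITS_LATIN.sub("", text)  # drop digits and Latin characters
    text = PUNCT.sub("", text)         # drop punctuation
    # unify Alif variants (hamza above/below, Alif-Madda) to bare Alif
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)
    # reduce runs of the same character longer than 2 to a single character
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # collapse repeated spaces into one
    text = re.sub(r" {2,}", " ", text)
    return text.strip()
```

After such normalization, sentence splitting on periods/newlines and whitespace tokenization via Python's `split` are enough to feed the corpus to Word2Vec training.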

Constructing the Lemma2Vec models
Two Lemma2Vec models were produced, based on the Wiki-CBOW Word2Vec model and on our CBOW Word2Vec model. Each word in the vocabulary of each Word2Vec model was first lemmatized. Then a vector for each lemma (i.e., its Lemma2Vec vector) was calculated as follows: all word forms belonging to the lemma are fetched, and their Word2Vec vectors are combined through a mean pooling operation. The lemmatization was performed using in-house tools and lexicographic databases 1 belonging to Birzeit University (Jarrar, 2021). If a word could not be lemmatized, due to misspelling, incorrect tokenization, or because it is a foreign word (not included in our database), its Lemma2Vec vector was taken to be its Word2Vec vector.
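The construction can be sketched as follows. This is a minimal sketch: the `lemmatize` callback stands in for the in-house lemmatizer and lexicographic databases, which are not public, and the dict-of-arrays representation is an assumption.

```python
import numpy as np

def build_lemma2vec(word2vec, lemmatize):
    """Group the Word2Vec vocabulary by lemma and mean-pool the vectors
    of all word forms sharing a lemma. Words that cannot be lemmatized
    (lemmatize returns None) keep their own Word2Vec vector."""
    groups = {}    # lemma -> list of word-form vectors
    fallback = {}  # unlemmatizable words keep their Word2Vec vector
    for word, vec in word2vec.items():
        lemma = lemmatize(word)
        if lemma is None:  # misspelled, mis-tokenized, or foreign word
            fallback[word] = vec
        else:
            groups.setdefault(lemma, []).append(vec)
    lemma2vec = {lemma: np.mean(np.stack(vecs), axis=0)
                 for lemma, vecs in groups.items()}
    lemma2vec.update(fallback)
    return lemma2vec
```

A design consequence of this mean pooling is that a lemma vector generalizes over all its inflected forms, which is exactly the behavior analyzed in the error-analysis subsection below.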

Experiments Results and Discussion
Given our Arabic WiC disambiguation method described in Section 3, and given the multilingual dev.ar-ar dataset provided by SemEval-2021 (Martelli et al., 2021), four classification experiments were conducted using the cosine similarity, based on the two Word2Vec models and the two Lemma2Vec models. The objective is to tune the following parameters for each model: context size (ranging from 1 to 10), threshold (we empirically determined the range from 0.55 to 0.85 with a 0.1 step size), pooling (min, max, mean, and std), and stop words (yes, no). The values of the parameters corresponding to the highest F1-scores for the TRUE (T) and FALSE (F) classes were then selected to classify the sentence pairs in the test.ar-ar dataset. For each model, we did the following to find the highest F1-scores for T and F: for each context size (between 1 and 10) and for each value of stop words (yes or no), we plotted 8 line plots (4 for T and 4 for F), one for each of the four pooling operations (mean, max, min, and std), with the threshold ranging from 0.55 to 0.85 (i.e., 20 plots for each model, resulting in 80 plots). Figures 3a, 3b, 3c, and 3d show the best 4 F1-score line plots for each of the four models, and Table 2 shows the corresponding F1-score values for the T and F classes, as well as the precision and recall values (best results marked in bold). The values of the parameters corresponding to the best result were then used in classifying the test.ar-ar dataset. The accuracies are reported in Table 2 as well.
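The tuning procedure above amounts to an exhaustive grid search over the four parameters. It can be sketched as follows, where the parameter names and the `evaluate` callback (which maps one parameter setting to a dev-set F1-score) are illustrative assumptions:

```python
from itertools import product

# Hypothetical grid mirroring the tuned parameters described in the text
grid = {
    "context_size": list(range(1, 11)),
    "threshold": [0.55, 0.65, 0.75, 0.85],
    "pooling": ["min", "max", "mean", "std"],
    "stop_words": [True, False],
}

def grid_search(evaluate, grid):
    """Return the parameter combination with the highest dev-set F1-score."""
    best_params, best_f1 = None, -1.0
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        f1 = evaluate(params)
        if f1 > best_f1:
            best_params, best_f1 = params, f1
    return best_params, best_f1
```

The winning combination per model is then frozen and applied once to the held-out test.ar-ar dataset.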
As shown in Figure 3, the Lemma2Vec models tend to perform better with shorter context sizes compared with the Word2Vec models. A possible reason is that, in the case of Lemma2Vec, the narrow meaning of words is affected by the increased number of word forms involved in the calculation of each lemma vector. The impact of Lemma2Vec on the narrow meaning of words is discussed in the next subsection.
The results with stop words set to yes were slightly, but not significantly, better. Additionally, min pooling was generally the best operation for combining the context vectors, and the results of min and max pooling were close to each other.

Lemma2Vec-Word2Vec Error Analyses
This subsection discusses the performance of using lemma-based vs. word-based models in the WiC disambiguation task, summarized in Table 3 and Table 4. Table 3 presents the results of experiments 1 and 2 (using the Word2Vec and Lemma2Vec of Wiki-CBOW), whereas Table 4 presents the results of experiments 3 and 4 (using the Word2Vec and Lemma2Vec of our CBOW model). In each table, we compare cases that were correctly or wrongly classified by both models.
To understand what is gained and lost by the lemma-based models, we manually analyzed most cases. Figure 4 illustrates such cases. The first sentence pair in Figure 4 was correctly classified by Lemma2Vec (in Exp4) and wrongly by Word2Vec (in Exp3). This illustrates that the lemma vector, as a generalized model of its inflections (i.e., a mean of the word forms' vectors), was better at deciding that the two contexts are similar and that the two word forms have the same meaning. However, the second example in Figure 4 illustrates the opposite. Lemma2Vec was too general, whereas Word2Vec was specific enough to decide that the two word forms, in the two contexts, are different. The word form al-ǧins ( ) could mean both genus and sex; however, the other word form al-aǧnās ( ) is semantically distinctive by its own morphology, as it can only be the plural of genus, and cannot be the plural of sex.
To conclude, although Lemma2Vec outperforms Word2Vec in some cases (mostly in the TRUE sentence-pair class), it underperforms Word2Vec in other cases (mostly in the FALSE sentence-pair class). Since the distribution of TRUE and FALSE is equal in the datasets, the overall performance of the two models is close. Nevertheless, in an application scenario where a large proportion of sentence pairs is expected to be TRUE, we recommend using Lemma2Vec, and otherwise Word2Vec.

Conclusions and Further Work
We presented a set of experiments to evaluate the performance of using Word2Vec and Lemma2Vec models in Arabic WiC disambiguation, without using external resources or any context/sense embedding model. Different models were constructed based on two different corpora, and different types of parameters were tuned. The final results demonstrated that Lemma2Vec models are slightly better than Word2Vec models for Arabic WiC disambiguation. More specifically, we found that Lemma2Vec outperforms Word2Vec for TRUE sentence pairs, but underperforms it for FALSE sentence pairs.
We plan to extend our work by using our Lemma2Vec model to build multi-prototype embeddings using the large lexicographic database available at Birzeit University. We also plan to fine-tune the recently released Arabic BERT models, such as (Safaya et al., 2020; Antoun et al., 2020; Abdelali et al., 2021; Inoue et al., 2021), using the same database.