Will_Go at SemEval-2020 Task 3: An Accurate Model for Predicting the (Graded) Effect of Context in Word Similarity Based on BERT

Natural Language Processing (NLP) has been widely used in semantic analysis in recent years. This paper discusses a methodology for analyzing the effect that context has on human perception of similar words, which is the third task of SemEval-2020. We apply several methods to calculate the distance between two embedding vectors generated by Bidirectional Encoder Representations from Transformers (BERT). Our team, Will_Go, won first place in the Finnish-language track of Subtask 1 and second place in the English track.


Introduction
Computing the difference in meaning between words at the semantic level is a widely discussed task. In areas of natural language processing (NLP) such as information retrieval (IR), there are many applications that use similarity, such as text summarization (Lin and Hovy, 2003), text categorization (Ko et al., 2004), and text Q&A (Mohler and Mihalcea, 2009).
Task 3 of SemEval-2020 (https://competitions.codalab.org/competitions/20905) focuses on the influence of context when humans perceive similar words (Armendariz et al., 2020a). Polysemous words have entirely different meanings in different contexts, which current translation systems can recognize well. However, many translation systems cannot accurately predict the subtler variations in word meaning that are likewise caused by a change of context.
Task 3 has two subtasks. In Subtask 1, we are required to predict the change in similarity scores, as rated by human annotators, between two words presented in two different contexts. In Subtask 2, the goal is to predict the absolute similarity scores rather than the difference between them. We discuss only Subtask 1 in this paper.
Our team uses different algorithms to calculate the distance between two embedding vectors generated by BERT (Devlin et al., 2018) and defines this distance as the similarity, so the change in similarity can be obtained by subtracting one distance from the other. However, this methodology alone did not achieve satisfying performance in the task evaluation, so we improved it by blending different BERT models, which we introduce in Section 3.2.

Related Work
There are many methods and models for estimating the similarity between long paragraphs. Most of them treat it as a binary classification problem: Hatzivassiloglou et al. (1999) compute linguistic feature vectors, including primitive and composite features, and then build classification criteria over these vectors to classify paragraphs. As for similarity between short sentences, Foltz et al. (1998) suggest a method that characterizes the degree of semantic similarity between two adjacent short sentences by comparing their high-dimensional semantic vectors, an approach based on Latent Semantic Analysis (LSA). Both LSA and Hyperspace Analogues to Language (HAL) (Burgess et al., 1998) are corpus-based models; the latter uses lexical co-occurrence to generate a set of high-dimensional semantic vectors, in which each word can be represented as a vector or high-dimensional point so that similarities can be measured by computing distances between them.
Although computing similarity between words is less difficult than between texts, some sophisticated problems remain: similarity between words lies not only in morphology but, more significantly, in semantic meaning. A first step toward reckoning the similarity between words is Word2Vec (Mikolov et al., 2013a), a group of corpus-based models that generate word embeddings, mainly using two architectures: continuous bag-of-words (CBOW) and continuous Skip-gram.
In the CBOW model, distributed representations of the context are used as input to predict the center word, while the Skip-gram model uses the center word as input to predict the context, predicting each of the surrounding context words one at a time. The Skip-gram model can therefore learn more efficiently from context and performs better than the CBOW model, but it consumes much more training time. Hierarchical softmax and negative sampling (Mikolov et al., 2013b) were later proposed to make training the Skip-gram model practical on large-scale data.
Word2Vec cannot be used to compute similarity between polysemous words because it generates only one vector per word; Embeddings from Language Models (ELMo) (Peters et al., 2018), inspired by semi-supervised sequence tagging (Peters et al., 2017), can handle this issue. ELMo consists of bidirectional LSTMs (Hochreiter and Schmidhuber, 1997), which give the model an understanding of both the next and the previous word; it obtains contextualized word embeddings by a weighted summation over the outputs of its hidden layers. Compared with the LSTMs used in ELMo, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is a stack of Transformer encoders (Vaswani et al., 2017), which can be computed in parallel and thus saves much training time. There are two BERT versions of different sizes: BERT Base, which has 12 encoder layers with 768 hidden units and 12 attention heads, and BERT Large, which has 24 encoder layers with 1024 hidden units and 16 attention heads; both achieved state-of-the-art results according to that paper.

Data
Our test data come from the CoSimLex dataset (Armendariz et al., 2019), which is based on the well-known SimLex-999 dataset (Hill et al., 2014) and provides pairs of words in context.
In Task 3, each language data file has eight columns, namely word1, word2, context1, context2, word1_context1, word2_context1, word1_context2, and word2_context2. These denote the first word, the second word, the first context, the second context, the first word as it appears in the first context, the second word in the first context, the first word in the second context, and the second word in the second context, respectively. In addition, the surface forms recorded in the in-context columns may differ lexically from word1 and word2 (e.g., through inflection).
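As a sketch, the columns can be loaded and inspected with pandas. The row below is an invented stand-in; the real files, their delimiter, and the exact column spellings (underscored names are an assumption here) come from the CoSimLex distribution.

```python
import io
import pandas as pd

# A minimal stand-in for one row of a Task 3 data file.
sample = io.StringIO(
    "word1\tword2\tcontext1\tcontext2\t"
    "word1_context1\tword2_context1\tword1_context2\tword2_context2\n"
    "bank\tshore\tThey walked along the bank to the shore.\t"
    "The bank near the shore approved the loan.\t"
    "bank\tshore\tbank\tshore\n"
)

df = pd.read_csv(sample, sep="\t")
print(df.shape)  # one example row, eight columns
```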

Methodology
The BERT model architecture is based on a multi-layer bidirectional Transformer, as shown in Figure 1. Instead of the traditional left-to-right language modeling objective, BERT is trained on two tasks: predicting randomly masked tokens and predicting whether two sentences follow each other. BERT achieves state-of-the-art performance on many tasks, and we also use it in our strategy. We approach this task as one of calculating the similarity between two words. In our model, the context data are first fed into BERT, as shown in Figure 2. Inspired by CoSimLex (Armendariz et al., 2020b), our model computes distances with several algorithms as soon as it has obtained the embedding of each token, and then predicts the graded effect of context on word similarity in the following steps:
• Step 1: Take the two embeddings corresponding to word1_context1 and word2_context1 and compute the distance between them with one of the algorithms, obtaining SC1.
• Step 2: Substitute the words in Step 1 with word1_context2 and word2_context2 and repeat the last step, obtaining SC2.
• Step 3: By subtraction, obtain the change in similarity C = SC1 − SC2.
• Step 4: Change the distance-computing algorithm and repeat Steps 1–3.
• Step 5: After Steps 1–4, we obtain a vector of changes C1, C2, · · ·, Cn, where n denotes the number of distance algorithms used in our model. With manual weights wi, we get the final change C = w1·C1 + w2·C2 + · · · + wn·Cn. Figure 3 provides a flow chart of the process from Step 1 to Step 4.
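The steps above can be sketched with toy vectors. The four embeddings and the blending weights below are placeholders; in the real system the embeddings come from BERT and the weights are tuned manually.

```python
import numpy as np

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return float(np.linalg.norm(u - v))

def cosine_dist(u, v):
    """Cosine distance: 1 minus the cosine similarity."""
    return float(1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for the four contextual embeddings.
w1_c1, w2_c1 = np.array([1.0, 0.0, 1.0]), np.array([0.0, 1.0, 1.0])
w1_c2, w2_c2 = np.array([1.0, 0.2, 0.9]), np.array([0.9, 0.3, 1.0])

changes = []
for dist in (euclidean, cosine_dist):   # Step 4: loop over algorithms
    sc1 = dist(w1_c1, w2_c1)            # Step 1: distance in context 1
    sc2 = dist(w1_c2, w2_c2)            # Step 2: distance in context 2
    changes.append(sc1 - sc2)           # Step 3: change in similarity

weights = [0.5, 0.5]                    # Step 5: manual weights w_i
final_change = sum(w * c for w, c in zip(weights, changes))
print(final_change)
```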

Experiment
We trained one standard BERT Large model and one Multilingual BERT Base model with MXNet (Chen et al., 2015). The BERT Large model was trained on the openwebtext_book_corpus_wiki_en_cased dataset maintained by GluonNLP, and the Multilingual BERT (M-BERT) (Pires et al., 2019) Base model on the wiki_multilingual_uncased dataset, also provided by GluonNLP. Training a BERT model takes much time, so we recommend utilizing the pretrained BERT models from the bert-embedding package. After configuring the models, we follow Section 3.2 on the input described in Section 3.1 and obtain the experimental results reported in Section 4. Task 3 has four language tracks, namely English, Croatian, Finnish, and Slovenian. We use the BERT Large model for the English track and the Multilingual BERT Base model for the other three tracks.
In Section 3.2 we use several algorithms to compute similarity. Here we introduce the two main algorithms used in our experiments.
• Euclidean distance, the square root of the sum of squared differences in each dimension: d(u, v) = sqrt(Σi (ui − vi)²).
• Cosine distance, one minus the cosine of the angle between the two vectors: d(u, v) = 1 − (u · v) / (||u|| ||v||).

Results
In our experiments on Subtask 1, the English track uses the BERT Large model, while the Croatian, Finnish, and Slovenian tracks all use the Multilingual BERT Base model; in each track we blend the Euclidean and cosine results to obtain the final submission. The correlation scores and online leaderboard (LB) ranks are summarized below:

Track      Model       Euclidean  Cosine  Blend  LB rank
English    BERT Large  0.718      0.752   0.768  2
Croatian   M-BERT      0.590      0.587   0.594  6
Finnish    M-BERT      0.750      0.671   0.772  1
Slovenian  M-BERT      0.576      0.603   0.583  7

Conclusion
In this paper, we propose a model that computes similarity and similarity change by blending cosine distance and Euclidean distance, each calculated from two word embedding vectors. We first transform the words in the dataset introduced in Section 3.1 into word embedding vectors with BERT as discussed in Section 3.2, then calculate the distance between the two vectors, and finally blend the two distances computed by the different algorithms into the final prediction. In Subtask 1 of Task 3, our team, Will_Go, won first place in the Finnish track and second place in the English track.