BabelEnconding at SemEval-2020 Task 3: Contextual Similarity as a Combination of Multilingualism and Language Models

This paper describes the system submitted by our team (BabelEnconding) to SemEval-2020 Task 3: Predicting the Graded Effect of Context in Word Similarity. We propose an approach that relies on translation and multilingual language models to compute the contextual similarity between pairs of words. Our hypothesis is that evidence from additional languages can improve the correlation with the human-generated scores. BabelEnconding was applied to both subtasks, ranked among the top-3 in six out of eight task/language combinations, and was the highest scoring system three times.


Introduction
Word similarity is a key task in Natural Language Processing (NLP) applications. Language models, such as word embeddings (Mikolov et al., 2013), create vector representations for words that are able to capture syntactic and semantic relationships. These representations became very popular in the last few years as they have boosted the performance of several NLP tasks. However, since each word is represented by a fixed vector, these techniques have problems dealing with polysemous words and identifying subtle meaning changes between different sentences. On the other hand, state-of-the-art language models, like BERT (Devlin et al., 2019), provide a contextualized word representation: the representation of a word relies on its context, which means that the same word may have different representations in different sentences. Thus, BERT models are more suitable for handling polysemous words.
Task 3 in SemEval 2020, Predicting the Graded Effect of Context in Word Similarity (Armendariz et al., 2020a), was motivated by this improvement in language models. The task aims at the design of a similarity measure which captures the human perception of the meaning of words. For that purpose, the task organizers built and annotated datasets in four languages: English, Croatian, Finnish, and Slovenian. Each entry in a dataset consists of two target words and two contexts, where each context is a piece of text containing both target words. The global task is divided into two subtasks: 1) predicting the change in the human annotators' scores of similarity when presented with the same pair of words within two different contexts; and 2) predicting the human scores of similarity for a pair of words within two different contexts.
In this paper, we describe BabelEnconding, an approach that relies on machine translation and multilingual language models to evaluate the contextual similarity of pairs of words. Our hypothesis is that having similarity information from more languages helps decide on how similar the words are.
Considering the eight combinations of language/subtask, BabelEnconding was ranked among the top-3 competitors six times, and was the top scoring method in three cases. Our additional experiments in English and Croatian showed that adding more languages noticeably improved the results for Croatian in both subtasks. In English, the gain was small and happened only in Subtask 2.

Background and Related Work
The Distributional Hypothesis (Harris, 1954) states that the meaning of a word changes depending on the context in which it is used. At the same time, this hypothesis also states that if two words tend to be used in the same contexts, then they are likely to be more similar. This claim inspired many solutions in NLP that are based solely on the distribution of words in the corpus (Fernández et al., 2016; Wang et al., 2020; Lüddecke et al., 2019). Word embeddings and language models, for example, are among these solutions. The idea is to represent words in a vector space in such a way that the semantic similarity between words is preserved. In the past few years, techniques to build language models became very popular. Word2vec (Mikolov et al., 2013) is an efficient and fast training method for word embeddings, based on co-occurrence statistics. The authors devised two model architectures for training the word vectors: continuous bag of words and skip-grams. Both approaches consist of neural networks trained to predict neighboring context words. Despite its ability to capture linguistic regularities present in documents, this language model produces a single representation for each word in the vocabulary, which prevents the differentiation of word senses.

This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.

In state-of-the-art language models, such as BERT (Devlin et al., 2019), the context of a word is taken into account in its representation. These models are trained over a large corpus to predict missing tokens which are removed from the original sentences. An advantage of BERT over Word2vec is that it creates different representations for the same word depending on the context in which the word appears. Another advantage of BERT-like models is that they can be specialized for a specific task with few training epochs.
Solutions for measuring contextual similarity between word pairs and for word-sense disambiguation benefited from BERT-like language models. Enriched models were designed (Levine et al., 2019; Peters et al., 2019; Scarlini et al., 2020), and new datasets such as the Word-in-Context Dataset (Pilehvar and Camacho-Collados, 2018) and CoSimLex (Armendariz et al., 2020b) were assembled. Word-sense disambiguation can also take advantage of multilingualism. Some works have employed parallel/comparable corpora (Banea and Mihalcea, 2011; Dandala et al., 2013) and translation (Carpuat, 2013) for that task. Multilingual resources, such as Multi-SimLex (Vulic et al., 2020), were also developed and yielded improvements compared with the monolingual version.

BabelEnconding
Our proposed solution, called BabelEnconding, works in two phases, and its overall process is depicted in Figure 1. The input is a pair of words and two sentences (contexts) containing both words of interest. More formally, let $S_1 = \{w^1_1, w^1_2, \ldots, w^1_i\}$ and $S_2 = \{w^2_1, w^2_2, \ldots, w^2_j\}$ be two sentences, and let $p = \langle w_a, w_b \rangle$ be a pair of words such that $w_a, w_b \in S_1$ and $w_a, w_b \in S_2$. For example, let $S_1$ = "Her prison cell was almost an improvement over her room at the last hostel" and $S_2$ = "His job didn't leave much room for a personal life. He knew much more about human cells than about human feelings" be two sentences, with the pair of words $p = \langle \text{room}, \text{cell} \rangle$.
In the first phase of BabelEnconding, both input sentences $S_1$ and $S_2$ are translated into a set of $k$ languages $L = \{l_1, l_2, \ldots, l_k\}$. This process produces a set of translated sentences $S^{l_i} = \langle S^{l_i}_1, S^{l_i}_2 \rangle$, which corresponds to the translation of the original sentences into each language $l_i \in L$. Then, the words of interest are identified in the translated text, generating two sets $p^{l_i}_{S_1}$ and $p^{l_i}_{S_2}$. In this example, considering $L = \{\text{Italian}, \text{Portuguese}\}$, the pairs of words of interest are translated as $p^{IT}_{S_1} = \langle \text{cella}, \text{stanza} \rangle$ and $p^{PT}_{S_1} = \langle \text{cela}, \text{quarto} \rangle$ from $S_1$, and $p^{IT}_{S_2} = \langle \text{spazio}, \text{cellule} \rangle$ and $p^{PT}_{S_2} = \langle \text{espaço}, \text{células} \rangle$ from $S_2$.
In the second phase, with the translated sentences, we evaluate the similarity between the pair of words in $p$ for each language in $L$ separately in two ways: (i) using word embeddings and (ii) using BERT. Finally, BabelEnconding calculates a weighted average between the word embedding and BERT similarities. These similarities are used to address both subtasks.

Figure 1: Overview of BabelEnconding

Word Embedding Similarity consists in taking the cosine similarity between the word vectors of the two words in each language. We rely on pre-trained monolingual word embeddings to represent the words. The context is not used in this similarity measure since there is a fixed vector for each word.
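As an illustration of this measure, the following minimal sketch computes the cosine similarity between two word vectors. The three-dimensional vectors are hypothetical toy values standing in for pre-trained embeddings, and the zero-vector convention is an assumption of the sketch, not something stated in this paper.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(u) * np.linalg.norm(v)
    if norm == 0.0:
        return 0.0  # assumed convention for all-zero vectors
    return float(np.dot(u, v) / norm)

# Hypothetical toy vectors; real pre-trained embeddings have hundreds of dimensions.
toy_vectors = {
    "room": np.array([1.0, 0.0, 1.0]),
    "cell": np.array([1.0, 1.0, 0.0]),
}

# The same value is obtained for both contexts, since the vectors are fixed.
we_sim = cosine_similarity(toy_vectors["room"], toy_vectors["cell"])
print(we_sim)
```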
BERT Similarity requires inferring the word embedding representation of words in BERT models, taking context into consideration. The context is the sentence ($S_m$) containing the two words. The representation of each word was obtained by summing the last four hidden layers of the BERT model. This choice was made based on the good results achieved by Devlin et al. (2019) in the Named Entity Recognition task.
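The layer-summing step can be sketched as follows. To stay self-contained, the sketch uses randomly generated hidden states in place of the output of a real BERT model. The array layout (layers, tokens, hidden units), the function names, and the averaging over subword tokens when a word is split into several pieces are all assumptions of this illustration; the text does not specify how subword pieces are combined.

```python
import numpy as np

def contextual_vector(hidden_states, token_positions):
    """Sum the last four layers and pool the target word's subword tokens.

    hidden_states: array of shape (num_layers, seq_len, hidden_dim).
    token_positions: indices of the subword tokens of the target word.
    """
    hidden_states = np.asarray(hidden_states, dtype=float)
    summed = hidden_states[-4:].sum(axis=0)  # shape: (seq_len, hidden_dim)
    # Assumption: subword pieces of one word are averaged.
    return summed[list(token_positions)].mean(axis=0)

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-in for model output: 13 layers, 6 tokens, 8 hidden units.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(13, 6, 8))

vec_a = contextual_vector(hidden, [2])      # word covered by a single token
vec_b = contextual_vector(hidden, [4, 5])   # word split into two subword tokens
bert_sim = cosine_similarity(vec_a, vec_b)
```

With a real model, `hidden` would be replaced by the stacked hidden states produced for the sentence $S_m$, and the token positions by the indices of the target words after tokenization.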
BabelEnconding Similarity consists of a weighted average of the word embedding and BERT similarities, computed across multiple languages. Equation 1 shows how BabelEnconding calculates the similarity between words $w_1$ and $w_2$ within sentence $S_m$. In this equation, α and β are the weights given to the BERT and Word Embedding similarities, respectively.
Our hypothesis is that having similarity information from more languages helps decide how similar the words are. The underlying assumption is that if two words are translated to the same word in another language, they are likely to be more similar. Translation also helps identify dissimilarity between words, as it can help to disambiguate terms.
Preliminary tests showed that, because both words occur together in the same context, the similarity between them tended to be undesirably high when using just BERT representations. This effect can be attributed to BERT's attention mechanism. Thus, a combination of BERT and fixed word embeddings was designed to alleviate this issue.
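The weighted combination can be sketched as below. The exact form of Equation 1 is not reproduced here, so the sketch assumes one plausible reading: for each language, a weighted average $(\alpha \cdot \mathrm{BERT}_l + \beta \cdot \mathrm{WE}_l) / (\alpha + \beta)$, followed by a plain mean over the languages in $L$. The function name, the normalization, and the similarity values are hypothetical.

```python
def babel_similarity(bert_sims, we_sims, alpha=0.7, beta=0.3):
    """Combine per-language BERT and word embedding similarities.

    bert_sims, we_sims: dicts mapping a language code to a similarity value.
    Assumed form (not taken from the paper): per-language weighted average,
    then an unweighted mean over languages.
    """
    if bert_sims.keys() != we_sims.keys():
        raise ValueError("both inputs must cover the same languages")
    if not bert_sims:
        raise ValueError("at least one language is required")
    if alpha + beta <= 0:
        raise ValueError("alpha + beta must be positive")
    per_language = [
        (alpha * bert_sims[lang] + beta * we_sims[lang]) / (alpha + beta)
        for lang in bert_sims
    ]
    return sum(per_language) / len(per_language)

# Hypothetical similarities for one word pair in one context.
bert = {"en": 0.8, "it": 0.6, "pt": 0.7}
we = {"en": 0.4, "it": 0.2, "pt": 0.3}

combined = babel_similarity(bert, we, alpha=0.7, beta=0.3)
bert_only = babel_similarity(bert, we, alpha=1.0, beta=0.0)
```

Setting `beta=0.0` corresponds to removing the word embedding component, and passing a single language corresponds to using no additional languages.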

Experimental Setup
Dataset. The dataset used in our experiments was CoSimLex (Armendariz et al., 2020b) which consists of 340 sentence pairs in English (EN), 112 in Croatian (HR), 111 in Slovene (SL), and 24 in Finnish (FI). Please refer to that paper for details on the annotation methodology.
Tools and Resources. The official experiments used the Google Translator API. Here, we also report a comparison with Bing Microsoft Translator. The multilingual uncased version of BERT, trained on Wikipedias in 102 languages, was used. For word embeddings, we used FastText, which provides pre-trained embeddings for 157 languages. These embeddings were also trained on Wikipedia.
Evaluation Metrics. The evaluation metrics used to assess the quality of the participating systems measure the correlation between the scores assigned by human annotators and the scores automatically generated by the systems.
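A minimal sketch of how such correlation-based scores can be computed is given below. It assumes the measures described for this task by its organizers (Armendariz et al., 2020a): an uncentered Pearson correlation for Subtask 1 and the harmonic mean of the Pearson and Spearman correlations for Subtask 2. These measures are an assumption of the sketch, all function names are illustrative, and the score lists are made-up values.

```python
import numpy as np

def uncentered_pearson(x, y):
    """Pearson correlation without subtracting the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def pearson(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return uncentered_pearson(x - x.mean(), y - y.mean())

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def harmonic_mean(a, b):
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

# Made-up human and system scores for four word pairs.
gold = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 3.0, 2.0, 4.0]

subtask1_style = uncentered_pearson(gold, predicted)
subtask2_style = harmonic_mean(pearson(gold, predicted), spearman(gold, predicted))
```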

Results
Results for the Official Runs. The system configurations that achieved the best results in the official runs are shown in Table 1. We varied the number of extra languages and the values for α and β. For Subtask 1, English and Slovenian performed better when no additional languages were used in the similarity computation. On the other hand, Croatian and Finnish performed better when all 11 additional languages were used. Moreover, these two languages benefited when word embeddings were completely removed from the BabelEnconding calculation. In Subtask 2, the use of all extra languages or a subset of the 11 languages showed the best results. A combination of BERT and word embeddings also proved to be beneficial for that task. In comparison with other participants, we achieved the best results for Croatian, in both subtasks, and for Slovenian in Subtask 2.
Table 2 summarizes the official results for both Subtask 1 and Subtask 2. The column Average shows the average of the results achieved by the teams among all languages, and the column Rank shows the team's position in the ranking. As we can see, our method performed well in both subtasks, being ranked in first place considering the average of all languages.
How much does each component of BabelEnconding contribute to the overall result? In order to assess the contribution of the components of BabelEnconding, we performed experiments varying the parameters α (which scales the contribution of BERT similarity) and β (which weighs the importance of word embedding similarity). As a general tendency, increasing α tends to produce better correlation results, especially in Subtask 1. However, when the word embeddings component is removed (i.e., β = 0), results tend to get worse, mainly in Subtask 2. Figure 2 shows the results for English and Finnish. The curves in (a) represent the typical case, which was found in English, Croatian, and Slovenian.
The results for Finnish (b) in Subtask 2 followed a different pattern, in which evaluation scores are not affected by the presence of BERT in the similarity computation. We believe this happened because Finnish is an agglutinative language, and BERT's subword tokenization (WordPiece, which is similar to Byte Pair Encoding) tends to split Finnish words into too many tokens (Virtanen et al., 2019), yielding poorer word representations.
Do results improve as more languages are added? In order to evaluate the benefits of multilingualism, we performed an experiment in which the performance using only the source language (i.e., the language of the original sentence) is compared to the performance when more languages are incrementally added. Figure 3 shows the results of this experiment for the datasets in English and Croatian. The first set of points on the plot marks the case in which only the original language was used. The second set shows the scores when each of the 11 possible languages was added individually. From the third set of points onward, we kept the language(s) that brought the biggest gain and added one more. We repeated this process until the addition of a new language ceased to bring improvements. The combination of multiple languages was beneficial for Croatian, in both subtasks, and for English in Subtask 2. In Croatian, the addition of one language improved results for 9 out of the 11 possible languages. The exceptions were Greek and Serbian, for which the scores remained the same. By adding English, the score increased by eight percentage points. By adding further languages, the improvement was smaller but steady until it reached a plateau with six additional languages.
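The incremental procedure described above amounts to a greedy forward selection over languages, which can be sketched as follows. The function name, the `evaluate` callback, and the toy gains are hypothetical; in the actual experiments, the score would be the task's evaluation measure computed on the dataset with the given set of languages.

```python
def greedy_language_selection(source, candidates, evaluate):
    """Add one language at a time, keeping the one with the biggest gain.

    evaluate: callable taking a list of languages and returning a score.
    Stops when no remaining language improves the current best score.
    """
    selected = [source]
    best_score = evaluate(selected)
    remaining = list(candidates)
    while remaining:
        scored = [(evaluate(selected + [lang]), lang) for lang in remaining]
        top_score, top_lang = max(scored)
        if top_score <= best_score:
            break  # no further improvement
        selected.append(top_lang)
        remaining.remove(top_lang)
        best_score = top_score
    return selected, best_score

# Toy scoring function with made-up, additive gains per language.
toy_gains = {"en": 0.08, "it": 0.03, "el": 0.0, "sr": 0.0}

def toy_evaluate(languages):
    return 0.60 + sum(toy_gains.get(lang, 0.0) for lang in languages)

chosen, score = greedy_language_selection("hr", ["en", "it", "el", "sr"], toy_evaluate)
print(chosen, round(score, 2))
```

Greedy selection evaluates far fewer combinations than an exhaustive search over all subsets of the 11 languages, at the cost of possibly missing a better subset.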
Does the translation mechanism impact the results? In order to evaluate the impact of different translation engines on BabelEnconding, we compared the performance of Google Translator and Bing Microsoft Translator. The four original datasets were translated into the 11 languages using both engines. Then, the translated datasets were used to perform the contextual similarity tasks with the same algorithm configuration (all languages considered, α = 0.7 and β = 0.3). The results are shown in Figure 4.

Conclusion
In this paper, we described our system submitted to SemEval-2020 Task 3. We designed an approach that relies on translation and multilingual language models in order to compute the contextual similarity between pairs of words. The key idea is that having similarity information from different languages may help decide on how similar the words are. Our system achieved competitive results in both subtasks, being ranked among the top-3 in most runs.
In these preliminary experiments, we could not establish in which cases more languages are helpful, and we leave this as future work. Additionally, we are interested in understanding which factors contribute to the improvement in the results: whether it is the amount of data used for training the language models or individual features of the language.