SzegedAI at SemEval-2021 Task 2: Zero-shot Approach for Multilingual and Cross-lingual Word-in-Context Disambiguation

In this paper, we introduce the system with which we participated in the SemEval 2021 shared task on multilingual and cross-lingual word-in-context disambiguation. In our experiments, we investigated the possibility of using an all-words fine-grained word sense disambiguation system trained purely on English sense-annotated data, and of predicting the semantic equivalence of words in context based on the similarity of the ranked lists of (English) WordNet synsets returned for the target words. We handled the multi- and cross-lingual aspects of the shared task by applying a multilingual transformer for encoding texts written in Arabic, English, French, Russian or Chinese. While our results lag behind the top scoring submissions, our approach has the benefit that it not only provides a binary flag indicating whether two words in their contexts have the same meaning, but also a more tangible output in the form of a ranked list of (English) WordNet synsets, irrespective of the language of the input texts. As our framework is designed to be as generic as possible, it can serve as a baseline for essentially any language supported by the multilingual transformer architecture employed, even in the absence of any language-specific training data.


Introduction
A major obstacle in solving word sense disambiguation (WSD) problems in a supervised manner is the scarcity of annotated training corpora. As the construction of high quality sense-annotated training data can be extremely labor-intensive and difficult (Gale et al., 1992), the Word-in-Context (WiC) disambiguation task was recently proposed by Pilehvar and Camacho-Collados (2019) as a surrogate for the traditional WSD problem. While in the traditional fine-grained WSD setting, the aim is to assign a precise and often nuanced meaning to a word in its context according to some sense inventory, WiC is framed as a binary classification problem, where the task is to decide whether two target words originating from a pair of input sentences have the same meaning. This kind of binary decision can also be made in the absence of a nuanced sense inventory, making the annotation process less demanding and also more suitable across languages (Raganato et al., 2020).
In this paper, we analyze the use of multilingual transformer-based language models for performing both multilingual and cross-lingual WiC in the zero-shot setting. We employ nothing but English sense-annotated training data and feed the model predictions into a transductive model that is capable of performing zero-shot WSD and WiC disambiguation for any language supported by the multilingual transformer encoder employed. Loureiro and Jorge (2019) showed that a simple nearest neighbor approach relying on contextual word embeddings can achieve impressive WSD results in English. In our follow-up work (Berend, 2020), we demonstrated how sparse contextualized word representations can be exploited for obtaining significant improvements over the LMMS approach introduced by Loureiro and Jorge (2019). Our shared task participation focused on comparing the two techniques in a zero-shot multilingual and cross-lingual WiC evaluation setting.

System overview
At the core of our multi- and cross-lingual WiC systems, we employed fine-grained WSD systems originally intended to handle English texts only. The two models that we employed were LMMS (Loureiro and Jorge, 2019) and the approach of Berend (2020), which we dub S-LMMS, highlighting its resemblance to LMMS and the fact that it operates with sparse contextualized word representations. Both LMMS and S-LMMS require sense-labeled training data for constructing their respective fine-grained WSD models.
We provide a brief overview of the two approaches and encourage readers interested in more details to read the original papers (Loureiro and Jorge, 2019; Berend, 2020). LMMS and S-LMMS have in common that they encode the inputs with a transformer model (BERT-large). LMMS constructs a prototype vector for each English synset based on the BERT-encoded vectors of the sense-annotated training data and the actual contents of the English WordNet glosses. For a given token in its context, LMMS takes its BERT-encoded contextualized vector and finds the nearest synset prototype to determine its sense.
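The nearest-prototype prediction step amounts to a cosine similarity ranking over synset prototype vectors. A minimal sketch follows; the function and variable names (`rank_synsets`, `prototypes`) are ours for illustration and are not taken from the original LMMS code:

```python
import numpy as np

def rank_synsets(token_vec: np.ndarray, prototypes: np.ndarray) -> np.ndarray:
    """Rank synsets by cosine similarity between a contextualized token
    vector and a matrix of synset prototype vectors (one row per synset)."""
    # Normalize both sides so that a plain dot product equals cosine similarity.
    t = token_vec / np.linalg.norm(token_vec)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = p @ t
    # Sort in descending similarity: index 0 is the nearest prototype.
    return np.argsort(-sims)
```

For plain fine-grained WSD the top-ranked index is the predicted sense; for our WiC setting the whole ranking is kept and compared across the two target words.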
S-LMMS differs from LMMS in that it additionally incorporates a sparsity-inducing dictionary learning step, which turns the contextualized word representations into a sparse format, i.e., into vectors in which a high fraction (> 90%) of the coefficients are zero. Additionally, the methodology for creating the synset prototype vectors differs substantially between the two approaches: LMMS uses the actual contextualized embeddings pertaining to a certain synset as prototypes, whereas S-LMMS distills a vectorial representation for each synset based on an information theoretic measure.
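The dictionary learning step can be illustrated with a generic scikit-learn sketch; this is not the actual S-LMMS implementation (which follows Berend, 2020), and the toy random matrix merely stands in for dense contextualized embeddings:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.RandomState(0)
dense = rng.randn(100, 8)  # stand-ins for dense contextualized embeddings

# Learn an overcomplete dictionary; transform_alpha controls how strongly
# the lasso penalty pushes the transformed coefficients toward zero.
dl = DictionaryLearning(n_components=16, transform_algorithm="lasso_lars",
                        transform_alpha=1.0, random_state=0, max_iter=20)
sparse = dl.fit_transform(dense)

# With a sufficiently large penalty, most coefficients end up exactly zero.
zero_fraction = float(np.mean(sparse == 0))
```

The resulting sparse codes play the role that the raw contextualized vectors play in LMMS.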
The important technical change that we made to the previously described fine-grained WSD models, so that they can be employed in the cross-lingual setting, is that we replaced the BERT-large encoder that LMMS and S-LMMS use by default with the XLM-RoBERTa-large (Conneau et al., 2020) architecture. We refer to the variants of LMMS and S-LMMS obtained by relying on XLM-RoBERTa instead of BERT-large as mLMMS and mS-LMMS, owing to the multilingual nature of XLM-RoBERTa. We used the transformers library (Wolf et al., 2020) for obtaining the contextualized multilingual embeddings in our experiments.
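Selecting which layer(s) of the encoder to use can be sketched as below; the function name is ours, and the input mimics the hidden-state tuple that a Hugging Face encoder returns when called with `output_hidden_states=True`:

```python
import numpy as np

def select_representation(hidden_states, strategy="layer21"):
    """hidden_states: sequence of (seq_len, dim) arrays, as produced by a
    Hugging Face encoder with output_hidden_states=True (entry 0 is the
    embedding layer, entry i the output of transformer layer i)."""
    if strategy == "layer21":
        # A single intermediate layer.
        return hidden_states[21]
    if strategy == "last4":
        # Concatenation of the last four layers along the feature axis.
        return np.concatenate(hidden_states[-4:], axis=-1)
    raise ValueError(f"unknown strategy: {strategy}")
```

For a 24-layer model such as BERT-large or XLM-RoBERTa-large, the tuple has 25 entries (embeddings plus 24 layers), so index 21 is a valid intermediate layer.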
When performing fine-grained WSD in English, one can simply restrict the scope of predicting the most likely synset for a word to those synsets that are deemed viable for it in WordNet. Additionally, one can filter the synsets over which the prediction is performed based on the part-of-speech category of the word in question. With these heuristics, it is possible to reduce the number of synsets that a word can belong to to a few dozen, even in the most ambiguous cases.
In order to test a solution that is as generic as possible, we did not integrate any of these heuristics into our framework, meaning that our models returned a ranked list over all 117,659 English WordNet synsets for any word in a sentence. This way, our solution can also work for basically any language (supported by the multilingual transformer employed), even in the absence of a multilingual sense inventory such as BabelNet (Navigli and Ponzetto, 2010), and also when we have access to neither part-of-speech information nor a part-of-speech tagger for a language. These design choices ensure that we are able to handle a much wider range of languages than if we had decided otherwise. As such, we regard our approach as a particularly good fit for use as a baseline in WSD-related evaluations involving low-resource languages.
As mentioned previously, our *LMMS models assigned a ranked list of 117,659 English synsets to every target word, irrespective of the language of the sentence it appeared in. Since the ranking of the synsets for a given word was performed over all the synsets of WordNet, it would be too restrictive to expect that words with identical meaning be assigned the exact same most likely English synset. Instead, we measured the similarity of the pair of ranked lists that a model returned for a pair of words in their contexts, and decided about the semantic equivalence of the two words based on that similarity score. As the similarity scores calculated for the ranked synset lists of word pairs with the same meaning are expected to be higher on average, we determined a threshold for the similarity scores above which we predicted the two words to have the same meaning, and a different meaning otherwise.
We experimented with three strategies for measuring the similarity of two ranked synset lists for a pair of words. Let S_1 and S_2 refer to the ranked lists of WordNet synsets assigned to the two words. As the bottom of a ranking is arguably not as meaningful as its top-ranked elements, we formed the truncated lists S_1^(100) and S_2^(100), which contain the top 100-ranked elements of S_1 and S_2, respectively.1 Since we only focus on the highest ranked synsets from S_1 and S_2, it is almost certain that some elements of S_1^(100) are not included in S_2^(100), and vice versa. As such, standard rank correlation scores would be inconvenient for measuring the similarity between the ranked lists S_1^(100) and S_2^(100). One motivation behind the introduction of rank-biased overlap (RBO) (Webber et al., 2010) was precisely this, i.e., to provide a distance metric capable of operating between non-conjoint rankings: RBO is an overlap-based metric that can operate over rankings whose elements are not totally identical, and it served as our first similarity measure. Our second metric for measuring the similarity between S_1^(100) and S_2^(100) was simply their Jaccard similarity, i.e., the fraction of the size of their intersection and the number of elements in their union. As a third approach, we calculated the harmonic mean of the reciprocal rank of the highest ranked synset of S_1^(100) within S_2^(100) and vice versa (MRR). We then based our predictions on the similarity scores calculated in any of the above manners.
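The three list similarities can be sketched as follows. This is our own reconstruction, not code from the submitted system: the RBO sketch uses the finite-prefix form of the measure (evaluated only up to the truncation depth, without the extrapolation of Webber et al., 2010), and the MRR-based score reflects our reading of the description above:

```python
def jaccard(a, b):
    """Jaccard similarity of the sets of synsets in two top-k lists."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def rbo(a, b, p=0.9):
    """Finite-prefix rank-biased overlap for two (possibly non-conjoint)
    rankings, evaluated up to the shorter list's length."""
    k = min(len(a), len(b))
    score, seen_a, seen_b = 0.0, set(), set()
    for d in range(1, k + 1):
        seen_a.add(a[d - 1])
        seen_b.add(b[d - 1])
        # Overlap agreement at depth d, geometrically discounted by p.
        score += (p ** (d - 1)) * len(seen_a & seen_b) / d
    return (1 - p) * score

def mrr_harmonic(a, b):
    """Harmonic mean of the reciprocal rank of a's top synset within b
    and of b's top synset within a."""
    def rr(top, other):
        return 1.0 / (other.index(top) + 1) if top in other else 0.0
    r1, r2 = rr(a[0], b), rr(b[0], a)
    return 2 * r1 * r2 / (r1 + r2) if r1 + r2 else 0.0
```

Each function takes the two truncated rankings and returns a score in [0, 1], higher meaning more similar.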
Instead of using a supervised approach, we determined the threshold for the similarity score of a pair of ranked synset lists S_1^(100) and S_2^(100) above which we predicted that the words they were assigned to had identical meanings. We determined this threshold in a transductive manner, without using any of the labeled training or development set sentence pairs at all. For the cross-lingual evaluation this would have been impossible in the first place, as no annotated pairs of sentences were released during the shared task.
We used expectation maximization for determining the similarity threshold above which we predicted a pair of words to have the same meaning. That is, we took all the similarity scores that we calculated for a certain test set based on the S_1^(100) and S_2^(100) ranked synset lists, and fitted a Gaussian mixture model over the similarity scores. That way, we managed to fit one Gaussian distribution to the similarity scores of word pairs with identical meanings and another to those with different meanings. We identified the fitted Gaussian distribution with the higher expected value as the one that corresponds to the distribution of similarity scores of words with identical meaning. As expectation maximization algorithms are prone to finding local optima, we initialized each model 100 times and chose the one which resulted in the best log-likelihood score. Our decision for a particular test sample was then based on the density functions of the two classes' similarity scores as determined by the best fitting model.

1 Experiments with different truncation thresholds (10, 25, 50, 250 and 500) also provided similar results that we omit for brevity.
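This transductive thresholding step corresponds to fitting a two-component Gaussian mixture on the pooled similarity scores. A minimal scikit-learn sketch follows; the function name is ours, and the `n_init` parameter plays the role of the 100 random restarts (scikit-learn keeps the initialization with the best log-likelihood automatically):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def same_meaning_flags(sim_scores, n_init=100, seed=0):
    """Fit a 2-component 1-D Gaussian mixture on all similarity scores of a
    test set and label each pair via the component with the higher mean."""
    X = np.asarray(sim_scores, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, n_init=n_init,
                          random_state=seed).fit(X)
    # The component with the higher mean models the "same meaning" pairs.
    same_cluster = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict(X) == same_cluster
```

Note that the decision boundary is derived from the test set's own score distribution, which is what makes the procedure transductive and label-free.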

Experiments
We tested our approach on both the multilingual and the cross-lingual subtasks of the shared task (Martelli et al., 2021). The multilingual test sets consisted of sentence pairs that were written in the same language (either Arabic, English, French, Russian or Chinese), whereas, an input was comprised of an English and a non-English (either Arabic, French, Russian or Chinese) sentence for the cross-lingual scenario.
The fine-grained WSD model that we built our system on was trained over English sense-annotated training data. We used two sources of training signal: the SemCor dataset as well as the Princeton WordNet Gloss Corpus (WNGC), which has been shown to improve fine-grained WSD results (Vial et al., 2019; Berend, 2020). Unless stated otherwise, we used these two sources of sense-annotated training data for obtaining our *LMMS models. 2

Monolingual all-words WSD experiments
We first evaluated the LMMS and S-LMMS models on the standard fine-grained all-words disambiguation data included in the unified evaluation framework of Raganato et al. (2017). What we were interested in here is the change in the standard WSD performance of these systems when replacing the English-specific BERT-large model that LMMS and S-LMMS originally employ with XLM-RoBERTa-large. At this point we evaluated our fine-grained WSD performance in terms of F-score over the concatenation of the five standard evaluation benchmarks: SensEval-2 (Edmonds and Cotton, 2001), SensEval-3 (Mihalcea et al., 2004), SemEval 2007 Task 17 (Pradhan et al., 2007), SemEval 2013 Task 12 (Navigli et al., 2013) and SemEval 2015 Task 13 (Moro and Navigli, 2015). This test set consisted of 7,253 English test cases in total. Table 1 includes our results using the four different models, which differ in the transformer layers used for encoding the input texts. As expected, replacing the English-specific transformer model with a multilingual encoder resulted in decreased performance; however, the overall decrease was not very severe. Comparison of the results in Table 1a and Table 1b reveals that the performance of S-LMMS is less affected by the integration of the multilingual XLM-RoBERTa model in place of the English-only BERT model for encoding. Additionally, using the encodings from the 21st layer of the transformer models seems to provide a slight edge over using the last four layers, irrespective of the encoder and the specific WSD model used. For this reason, we participated in the shared task with *LMMS models that used the contextualized word representations from the 21st layer alone, as opposed to the average of the last four layers.

Evaluation on the shared task data
In Table 2, we list the test scores obtained by differently configured versions of our architecture. Our results span the different strategies for performing all-words fine-grained WSD (mLMMS/mS-LMMS) and the different strategies for calculating the similarity between two ranked lists of most likely synsets assigned to the test words (Jaccard/MRR/RBO), as described in Section 2.
We can see from Table 2 the same phenomenon as for our monolingual fine-grained WSD evaluations in Table 1, i.e., the mS-LMMS approach had a clear advantage over mLMMS in both the multilingual and the cross-lingual evaluation settings.
Regarding the effects of choosing different ways to calculate the similarity scores between a pair of ranked synset lists, the Jaccard similarity and the RBO-based similarity perform very similarly, with the mean reciprocal rank based similarity slightly underperforming the other two alternatives. Overall, the results seem to be balanced over the languages, with the choice of the fine-grained WSD system being more influential on the final results than the choice of the similarity calculation between the ranked lists of synsets returned for a pair of test words.
For training our *LMMS models, we also experimented with the integration of a recent sense-tagged training dataset, UWA (Loureiro and Camacho-Collados, 2020), a sense-annotated corpus containing unambiguous words from Wikipedia and OpenWebText. We relied on the recommended version of the UWA corpus, which contains 10 example sentences for each unambiguous word. By expanding the amount of sense-annotated training text, it becomes possible to increase the coverage of the fine-grained WSD systems. We investigated the downstream effects on our WiC system of extending the amount of sense-annotated training data used by our fine-grained WSD systems.
Table 3 includes our evaluation results over the same set of models as in Table 2, with the only difference being that we additionally used the UWA10 sense-annotated corpus for creating our all-words WSD models. This additional training corpus was not always helpful; however, it increased our average accuracy by a slight (≈ 1%) margin.

Conclusions
In this paper, we introduced our cross- and multilingual WiC framework, which approaches the task from an all-words fine-grained word sense disambiguation perspective. As such, our model not only provides a yes or no answer for a pair of words in their contexts, but also a more tangible explanation for it in the form of the similarity between the ranked lists of English WordNet synsets assigned to the target words. During the design of our approach, we made choices that make our framework conveniently applicable to new languages without the need for any training data. Although the results of our framework lag behind the top performing systems, due to its convenient applicability to new languages and the fact that practically no additional training data is required for applying it to new and possibly low-resourced languages, we think it can provide an easy-to-use baseline for further WiC-related research efforts.

Acknowledgments
The research presented in this paper was supported by the Ministry of Innovation and the National Research, Development and Innovation Office within the framework of the Artificial Intelligence National Laboratory Programme. The author is grateful for the fruitful discussions with Tamás Szakálos, whose research was supported by the project "Integrated program for training new generation of scientists in the fields of computer science", no. EFOP-3.6.3-VEKOP-16-2017-0002. The project has been supported by the European Union and co-funded by the European Social Fund.