UAlberta at SemEval-2021 Task 2: Determining Sense Synonymy via Translations

We describe the University of Alberta systems for the SemEval-2021 Word-in-Context (WiC) disambiguation task. We explore the use of translation information for deciding whether two different tokens of the same word correspond to the same sense of the word. Our focus is on developing principled theoretical approaches which are grounded in linguistic phenomena, leading to more explainable models. We show that translations from multiple languages can be leveraged to improve the accuracy on the WiC task.


Introduction
This paper describes the University of Alberta systems for SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (Martelli et al., 2021). We focus on the monolingual (English) variant of the task, which is the same as the original WiC task (Pilehvar and Camacho-Collados, 2018). An instance of the WiC task consists of two sentences that share a focus word; the word may be inflected differently in each sentence (e.g. "they had searched his flat a few days before" and "the production of lithium from salt flats") but has the same lemma and part of speech in both. A WiC task system must decide, given such a pair of sentences, whether the focus tokens have the same meaning in both sentences. Systems are compared in terms of their accuracy, the percentage of test instances correctly identified as TRUE (same meaning) or FALSE (different meaning). The dataset includes training, development, and testing splits; as our methods are unsupervised, we do not use the training data.
The goal of this paper is an exploration of the use of translation information for the WiC task. The intuition underlying our work is that distinctions in meaning tend to be reflected in distinctions in translation. We have previously presented methods leveraging translation information to improve word sense disambiguation and most frequent sense detection (Hauer et al., 2019), and have demonstrated that word senses which share translations are, in general, semantically related (Hauer and Kondrak, 2020a). We have also presented theoretical formalizations of lexico-semantic phenomena which view synonymy and translation as two aspects of semantic equivalence (Hauer and Kondrak, 2020b). Our team additionally presented a method based on translation information for the SemEval-2020 Task 2 on Predicting Multilingual and Cross-Lingual Lexical Entailment (Glavaš et al., 2020). In this task, we investigate whether translation can be used to detect semantic equivalence in context, just as in the aforementioned prior task we investigated whether translation can be used to detect lexical entailment between word types. Our focus is on developing principled theoretical approaches which are grounded in linguistic phenomena, leading to more explainable models.
Our more complex methods depend upon a mapping between word senses and translations, as different senses of a word often translate differently. We obtain such a mapping from BabelNet (Navigli and Ponzetto, 2012), which combines information from Princeton WordNet (Fellbaum, 1998), multilingual lexical resources, and translations produced by MT models. WordNet consists of synonym sets, or synsets, which BabelNet enriches with translations. Each of the resulting multilingual synsets, or multi-synsets, contains lexicalizations of a single concept in various languages, allowing the translations of a given sense of a word to be identified. We treat BabelNet as an imperfect implementation of a universal multi-wordnet with the theoretical properties described by Hauer and Kondrak (2020b). Our results can be interpreted as a proof-of-concept for the use of contextual translations as indicators of semantic similarity. We show that the methods that we develop for the WiC task can leverage translations to improve over baselines, especially when multiple target languages are considered. While it is not our objective to compete with state-of-the-art supervised methods, we consider this to be a positive result, and a strong lead for future work on contextual semantic analysis.
This paper is structured as follows: Section 2 provides an overview of relevant prior literature. Section 3 discusses the theoretical model underlying our work. Section 4 outlines our methods. Section 5 describes our experiments and results.

Related Work
Methods for the WiC task can be roughly divided into two paradigms: contextualized-embedding-based systems, and word sense disambiguation-based systems. Pilehvar and Camacho-Collados (2018) introduce the WiC dataset as a benchmark for evaluating context-sensitive word representations. Soler et al. (2019) achieve improvements by combining similarity scores from different types of contextual word and sentence embeddings. Liu et al. (2020) propose a method to enhance contextual representations by leveraging other pre-trained contextual or static embeddings. Another approach to the WiC task is to employ a word sense disambiguation (WSD) system to tag the target words with senses from a pre-defined sense inventory and subsequently make a decision based on the predicted synsets of the target words. Loureiro and Jorge (2019b) use the LMMS sense embeddings (Loureiro and Jorge, 2019a) to disambiguate the target words. A simple approach of checking whether the disambiguated senses are equal leads to competitive performance in the SemDeep-5 WiC challenge (Anke et al., 2019). SENSEMBERT (Scarlini et al., 2020a) and ARES (Scarlini et al., 2020b) embeddings, when used as features in a BERT-based model, also achieve competitive results on the WiC task.
Our methods combine elements of both paradigms. We employ contextual embeddings in our proposed translation-based methods. However, we take the embeddings of the translations of the target words instead of the target words themselves. Similarly to WSD-based approaches, our methods also analyze the common synsets of the focus tokens and their translations, with the goal of identifying a probable shared synset. The prior work most similar to our approach is that of Pessutto et al. (2020) at the graded word similarity task (Armendariz et al., 2020) of SemEval-2020, who propose a translation-based approach to evaluate the contextual similarity of a pair of words. They hypothesize that leveraging similarity information from more languages would allow greater accuracy. We follow a similar intuition in our work.

Theoretical Solution
We first present a theoretical solution, which provides the foundation for the development of our actual methods described in Section 4. We assume that the two source sentences $S_1$ and $S_2$ in each instance of the WiC task can be translated into any natural language as sentences $T_1$ and $T_2$. Furthermore, we assume that the literal lexical translations $t_1$ and $t_2$ of the focus word s can be identified in $T_1$ and $T_2$, respectively. For example, in Figure 1, the focus word s in the English sentences $S_1$ and $S_2$ is the noun differential, and word alignment identifies écart and différentiel as $t_1$ and $t_2$. Note that the two translations may have the same POS and lemma, a scenario we denote as $t_1 = t_2$.

Substitution Test
Our theoretical solution is based on the notion of the linguistic substitution test for verifying the synonymy of senses (Hauer and Kondrak, 2020b), which takes as input two sentences which differ only in a single word, and returns TRUE if and only if the two sentences have the same meaning. In other words, it decides whether the substitution of one word with another changes the meaning of the sentence. Note that this substitution test is not sufficient to decide the WiC task, as the input sentences for this task share a single word, rather than differ in a single word. The substitution test can be implemented by consulting a native speaker, or approximated by a computer program. In Section 4, we discuss an implementation based on contextual embeddings.
An example of a valid input to the substitution test would be the sentences "I work at the plant" and "I work at the factory". For this input, the substitution test would return TRUE, since the word substitution does not change the meaning of the sentence. The sentences "I work at the plant" and "I work at the flower" would likewise constitute a valid input; however, given these sentences, the substitution test would return FALSE, since the sentences differ semantically.

Translation Criss-Cross
In order to apply the substitution test to an instance of the WiC task, we first translate the two source input sentences $S_1$ and $S_2$ into a target language, producing two target sentences $T_1$ and $T_2$. We identify the two lexical translations $t_1$ and $t_2$ of the focus word s in $T_1$ and $T_2$. Assuming that the translations are correct and literal, the senses of s in $S_1$ and $t_1$ in $T_1$ will be synonymous, as will the senses of s in $S_2$ and $t_2$ in $T_2$. If $t_1$ and $t_2$ have the same POS but different lemmas, we can replace $t_1$ with $t_2$ in $T_1$ to produce a sentence $T'_1$ which differs from $T_1$ in a single word. The application of the substitution test to $(T_1, T'_1)$ returns TRUE if and only if the sense of $t_2$ in $T'_1$ is synonymous with the sense of s in $S_1$, which implies that, in addition to s and $t_1$, the multi-synset containing the sense of s in $S_1$ must also include $t_2$.
Using our running example in Figure 1, $T'_1$ would be created by replacing écarts with différentiel in $T_1$. This produces "les différentiel de taux d'intérêt croissant", which, while not necessarily grammatical, can still be evaluated by the substitution test to decide whether the substitution alters the semantic content of the sentence (or, equivalently, whether écart and différentiel are synonymous in this particular context). We repeat the process with the roles of $T_1$ and $T_2$ reversed. That is, we construct $T'_2$ by replacing $t_2$ with $t_1$ in $T_2$ in order to verify whether the sense of $t_1$ in $T'_2$ is synonymous with the sense of s in $S_2$. If the substitution test returns FALSE for either of the two target sentence pairs, we can conclude that the two multi-synsets that correspond to the senses of s in $S_1$ and $S_2$ must be different. Therefore, this instance of the WiC task is resolved as FALSE. However, if the substitution test returns TRUE for both pairs of sentences, we cannot immediately resolve the instance of the WiC task, because there could exist two (or more) multi-synsets that all contain s, $t_1$, and $t_2$. To complicate matters, this partial solution to the WiC task can only be applied if $t_1$ and $t_2$ have the same POS but different lemmas.
A complete theoretical solution can be obtained by considering translations in multiple languages. If the focus word s is not used in the same sense in $S_1$ and $S_2$, we would expect that in some language, the translations $t_1$ and $t_2$ will be different and not mutually replaceable in both sentences. This expectation is consistent with the speculation of Palmer et al. (2007) that translation into a sufficiently large set of languages will eventually lexicalize every sense distinction. It is also supported by the findings of Bao et al. (2021), who found no evidence for the existence of universal colexifications, that is, pairs of concepts that are expressed by the same word in every natural language.
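The criss-cross procedure described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sentences are lists of lemmas, the two translations are assumed to share a POS, and the substitution test is passed in as a callable. The names `substitute`, `criss_cross`, `resolve_over_languages`, and `make_toy_oracle` are hypothetical, and the toy oracle merely stands in for a native speaker or an embedding-based approximation.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

Sentence = List[str]
# Assumed interface: True iff two sentences differing in one word mean the same.
SubstitutionTest = Callable[[Sentence, Sentence], bool]


def substitute(sentence: Sentence, index: int, replacement: str) -> Sentence:
    """Return a copy of `sentence` with the token at `index` replaced."""
    return sentence[:index] + [replacement] + sentence[index + 1:]


def criss_cross(sent1: Sentence, idx1: int, sent2: Sentence, idx2: int,
                same_meaning: SubstitutionTest) -> Optional[bool]:
    """Return False if the WiC instance is resolved as FALSE, else None.

    None means "not resolved by this language": either the translations are
    identical (the partial solution does not apply) or both swaps pass.
    """
    t1, t2 = sent1[idx1], sent2[idx2]
    if t1 == t2:
        return None
    swapped1 = substitute(sent1, idx1, t2)  # t2 placed in the first sentence
    swapped2 = substitute(sent2, idx2, t1)  # t1 placed in the second sentence
    if same_meaning(sent1, swapped1) and same_meaning(sent2, swapped2):
        return None
    return False


def resolve_over_languages(
        instances: Dict[str, Tuple[Sentence, int, Sentence, int]],
        same_meaning: SubstitutionTest) -> Optional[bool]:
    """Resolve as FALSE as soon as any target language separates the senses."""
    for sent1, idx1, sent2, idx2 in instances.values():
        if criss_cross(sent1, idx1, sent2, idx2, same_meaning) is False:
            return False
    return None


def make_toy_oracle(synonym_pairs: Set[FrozenSet[str]]) -> SubstitutionTest:
    """Toy oracle: a swap preserves meaning iff the differing words are listed."""
    def same_meaning(a: Sentence, b: Sentence) -> bool:
        return all(x == y or frozenset((x, y)) in synonym_pairs
                   for x, y in zip(a, b))
    return same_meaning


# Toy target sentences; "w1".."w4" are placeholder context tokens.
fr1 = ["w1", "écart", "w2"]
fr2 = ["w3", "différentiel", "w4"]
strict = make_toy_oracle(set())
print(criss_cross(fr1, 1, fr2, 1, strict))
```

Returning `None` rather than TRUE when both swaps pass mirrors the text: passing the test in one language does not by itself resolve the instance.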

Multi-Synset Intersection
For each language $F_i$ in the set of all natural languages L, let $t^i_1$ and $t^i_2$ be the lexical translations of the focus word s in the first and second input sentences, respectively. Let T be the set consisting of the focus word and all its lexical translations; that is, $T = \{s\} \cup \bigcup_{F_i \in L} \{t^i_1, t^i_2\}$. Assuming access to a perfect universal multi-wordnet, we define C to be the set of multi-synsets that contain all words in T.
The size of C provides clues to the resolution of the WiC task. We need to consider three cases: |C| = 0, |C| = 1, and |C| ≥ 2. With some caveats, these three cases roughly imply the following answers to the WiC task: FALSE, TRUE, and UNKNOWN, respectively. We discuss these three cases in turn.
If |C| = 0, then no single concept can be expressed by s and all its translations in T, according to the multi-wordnet. That is, there exist two translations of the focus word which cannot express the same concept, assuming the completeness of the multi-wordnet. Therefore, the two focus tokens must correspond to distinct multi-synsets, implying FALSE.
If |C| = 1, there exists exactly one multi-synset that contains the focus word and all its translations. Therefore, it is possible, albeit not guaranteed, that the focus word in both source sentences is used in the sense that corresponds to that unique multi-synset. In order to be sure, we could apply the criss-cross method described in Section 3.2.
The case |C| ≥ 2 would imply that there exist two concepts which are colexified (expressed by a single word) in all languages. Following Bao et al. (2021), we assume that universal colexifications are at best extremely rare. Even if they exist, we could still apply the solution described in Section 3.2 to decide the WiC task. Of course, if we consider translations into only a small number of languages, |C| ≥ 2 is much more likely. In fact, we observe |C| = 3 in our running example, because three different BabelNet multi-synsets contain the English focus word and its two French translations.
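The three cases can be illustrated with a small sketch. It assumes a toy multi-wordnet represented as a dictionary from synset identifiers to sets of language-tagged words; the identifiers, the tagging scheme, and the function names are hypothetical, and the toy data are not actual BabelNet content.

```python
from typing import Dict, Iterable, Set

# Assumed toy representation: synset id -> lexicalizations in all languages.
MultiWordnet = Dict[str, Set[str]]


def common_synsets(wordnet: MultiWordnet, words: Iterable[str]) -> Set[str]:
    """Return C: the ids of multi-synsets that contain every word given."""
    required = set(words)
    return {sid for sid, lexicalizations in wordnet.items()
            if required <= lexicalizations}


def resolve_by_cardinality(c: Set[str]) -> str:
    """Map |C| to the rough answer discussed in the text."""
    if len(c) == 0:
        return "FALSE"    # no single concept covers all translations
    if len(c) == 1:
        return "TRUE"     # plausible but not guaranteed; see the caveats
    return "UNKNOWN"      # several candidate concepts remain


# Toy data: three synsets contain all three words, one does not.
toy_wordnet: MultiWordnet = {
    "syn_a": {"en:differential", "fr:écart", "fr:différentiel"},
    "syn_b": {"en:differential", "fr:écart", "fr:différentiel", "en:gap"},
    "syn_c": {"en:differential", "fr:différentiel", "fr:écart"},
    "syn_d": {"en:differential", "fr:différentiel"},
}

words = ["en:differential", "fr:écart", "fr:différentiel"]
c = common_synsets(toy_wordnet, words)
print(len(c), resolve_by_cardinality(c))
```

With the toy data above, three synsets contain all three words, so the instance is left unresolved, in the same way as the running example.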

Methods
In this section we describe four methods based on the theoretical ideas in Section 3. All four methods rely on identifying lexical translations of the focus word in both source sentences. If the lexical translations cannot be recovered from the translated sentences for any of the target languages, all methods use the same backoff approach, which is to return FALSE for that test instance.

IDENT and CVAL
Our two simplest methods are IDENT and CVAL. IDENT is a baseline method which returns TRUE iff the lexical translations $t_1$ and $t_2$ have the same lemma and POS in all applicable target languages. CVAL is a method directly based on the cardinality of the set C as defined in Section 3.3. CVAL returns TRUE iff the translations of the focus word are identical in each language and |C| > 0.
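A minimal sketch of the two decision rules follows. It assumes that translations are given as a dictionary from language code to a pair of (lemma, POS) tuples, and that the multi-wordnet is a dictionary from hypothetical synset identifiers to sets of (language, lemma, POS) triples. Treating an empty dictionary as the backoff case that returns FALSE is one reading of the backoff rule stated at the start of this section.

```python
from typing import Dict, Set, Tuple

Word = Tuple[str, str]                       # (lemma, POS)
Translations = Dict[str, Tuple[Word, Word]]  # language -> (t1, t2)
Lexicalization = Tuple[str, str, str]        # (language, lemma, POS)
MultiWordnet = Dict[str, Set[Lexicalization]]


def ident(translations: Translations) -> bool:
    """TRUE iff t1 and t2 share lemma and POS in every available language."""
    if not translations:
        return False  # backoff: no lexical translations were recovered
    return all(t1 == t2 for t1, t2 in translations.values())


def cval(focus: Word, translations: Translations, wordnet: MultiWordnet,
         source_lang: str = "en") -> bool:
    """TRUE iff IDENT holds and some multi-synset contains all the words."""
    if not ident(translations):
        return False
    required: Set[Lexicalization] = {(source_lang, *focus)}
    for lang, (t1, t2) in translations.items():
        required.add((lang, *t1))
        required.add((lang, *t2))
    # |C| > 0 iff at least one multi-synset covers every required word.
    return any(required <= lexicalizations
               for lexicalizations in wordnet.values())


# Toy data only; the synset id and its contents are made up.
toy_wordnet: MultiWordnet = {
    "syn_1": {("en", "superior", "ADJ"), ("fr", "supérieur", "ADJ")},
}
same = {"fr": (("supérieur", "ADJ"), ("supérieur", "ADJ"))}
print(ident(same), cval(("superior", "ADJ"), same, toy_wordnet))
```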

Synonymy Check
We implement the substitution test as a heuristic synonymy check using dense contextualized embeddings. Such embeddings allow us to construct, for any word token in a given sentence, a vector in a continuous semantic space. The objective in designing such embeddings is that semantically similar tokens should have similar vectors, commonly measured by cosine similarity. Additional technical details of the embeddings are provided in Section 5.
Given a pair of sentences which differ only in the substitution of a single word, we obtain dense contextualized embeddings of the distinguishing word in each sentence. We then calculate the cosine similarity between the two embeddings. If the similarity is greater than a threshold tuned on a development set, this is taken as an indication that replacing one of the distinguishing words with the other does not alter the meaning of the sentence, as the replacement word has the same meaning as the original word. This implementation of the substitution test is used as a subroutine by our remaining two methods.
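The thresholded comparison can be sketched as follows, taking the two embeddings as plain lists of floats. The function names are hypothetical, and the threshold of 0.8 in the example is an arbitrary placeholder rather than the tuned value, which is not stated here.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate zero vectors
    return dot / (norm_u * norm_v)


def synonymy_check(original_embedding: Sequence[float],
                   substitute_embedding: Sequence[float],
                   threshold: float) -> bool:
    """True iff the similarity exceeds the (development-tuned) threshold."""
    similarity = cosine_similarity(original_embedding, substitute_embedding)
    return similarity > threshold


# Placeholder vectors and threshold, for illustration only.
print(synonymy_check([1.0, 0.0], [0.0, 1.0], threshold=0.8))
```

A strict inequality is used because the text asks for a similarity greater than the threshold.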

SUB and CSUB
The SUB method attempts to apply the synonymy check to each pair of translated sentences $T_1$ and $T_2$ in each target language, without referring to the |C| value. If the translations of the focus word in $T_1$ and $T_2$ differ, we create the sentences $T'_1$ and $T'_2$, as described in Section 3.2, and apply the synonymy check to $(T_1, T'_1)$ and $(T_2, T'_2)$. SUB returns TRUE if the synonymy check succeeds for all target languages for which the translations $t_1$ and $t_2$ can be identified. The synonymy check trivially succeeds if $t_1$ and $t_2$ have the same POS and lemma; intuitively, tokens which translate the same way are likely to have similar meanings. If either application of the synonymy check fails, SUB returns FALSE. In summary, this method is similar to the IDENT method, except that the synonymy check is applied if the translations differ.
CSUB combines CVAL with SUB. The only difference from the SUB method is that the synonymy check is not applied when |C| = 0. This is because the lack of any common multi-synset in a complete, perfect multi-wordnet is theoretically sufficient to exclude the possibility of the two source focus tokens having the same sense.
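The control flow of the two methods can be sketched as below. The synonymy check is abstracted as a callable that reports whether both criss-cross applications pass for one language; this interface, the type names, and the precomputed `c_size` argument are assumptions made for illustration. Returning FALSE in CSUB when |C| = 0 follows the rationale given above.

```python
from typing import Callable, Dict, Tuple

Word = Tuple[str, str]                       # (lemma, POS)
Translations = Dict[str, Tuple[Word, Word]]  # language -> (t1, t2)
# Assumed interface: True iff both substituted sentence pairs for this
# language pass the synonymy check.
CrissCrossCheck = Callable[[str, Word, Word], bool]


def sub(translations: Translations, passes_check: CrissCrossCheck) -> bool:
    """SUB: every language must either agree or pass the synonymy check."""
    if not translations:
        return False  # backoff: no lexical translations were recovered
    for lang, (t1, t2) in translations.items():
        if t1 == t2:
            continue  # identical translations trivially succeed
        if not passes_check(lang, t1, t2):
            return False
    return True


def csub(translations: Translations, passes_check: CrissCrossCheck,
         c_size: int) -> bool:
    """CSUB: as SUB, but skip the check and answer FALSE when |C| = 0."""
    if c_size == 0:
        return False
    return sub(translations, passes_check)


# Toy usage with a stand-in check that rejects every differing pair.
differing = {"fr": (("souillé", "ADJ"), ("tachée", "ADJ"))}
print(sub(differing, lambda lang, t1, t2: False))
```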

Experiments
In this section, we describe the application of our methods to the English development and test sets. We begin by specifying various implementation details. Next, we describe our development experiments, including results and error analysis. Finally, we present our results on the test set. While our method is, in theory, applicable to any language, and even to cross-lingual subtasks, we focus exclusively on the English monolingual subtask due to time and resource constraints.

Translation and Lemmatization
We use BabelNet (Navigli and Ponzetto, 2010, 2012) as our multi-wordnet; in particular, we make use of the BabelNet multi-synsets which are linked to Princeton WordNet synsets. This allows us to exclude synsets that refer to named entities, rather than lexicalized concepts, to limit the impact of noise in BabelNet.
For translation, we use Google Translate, as it is fast and publicly available. In our analysis, we found the lexical translations obtained using Google Translate to be of generally high quality, which is important given our method's dependence on machine translation. We use French, Italian, and Russian as our languages of translation. The choice of the translation languages is based on the languages selected for the shared task, and also on the BabelNet coverage. French and Russian are two of the languages covered by the shared task. On the other hand, Italian seems to have the best BabelNet coverage among the non-English languages.
For lemmatization, we use TreeTagger (Schmid, 1999, 2013), with pre-trained lemmatization models for the source and all target languages. We lemmatize the bitexts to improve the quality of the word alignment.

Word Alignment
Following lemmatization, we align each input sentence with its translation in each target language.
To improve the quality of our unsupervised alignment, we obtain a large sentence-aligned parallel corpus (bitext) in the source and target languages. We then append to the bitext all of the lemmatized input sentences, and all of their lemmatized language translations. Finally, we apply an unsupervised knowledge-based alignment algorithm to the augmented bitext, and, for each sentence, identify the word or phrase in the translated sentence corresponding to the source focus word. Once each input sentence is aligned with its translation, we extract the lemmas aligned with each focus word token. These are the lexical translations of the focus word for this language.
To carry out the alignment, we use BabAlign, a state-of-the-art knowledge-based aligner. BabAlign leverages translation information from BabelNet to create synthetic training data and post-process the alignment produced using a base unsupervised alignment method. Specifically, we use FastAlign (Dyer et al., 2013) as the base aligner. When aligning input sentences with translations, we concatenate the sentences and their translations with the OpenSubtitles bitext (Lison and Tiedemann, 2016) for the corresponding language pair. For each language pair, we use the first 1M sentences of the OpenSubtitles bitext.
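As an illustration of the final extraction step, the sketch below reads one alignment line in the `i-j` format written by FastAlign (source token index, then target token index) and returns the target lemmas aligned with the focus word. Whether BabAlign's post-processed output uses the same format is an assumption here, the function names are hypothetical, and the example sentence pair and alignment are made up.

```python
from typing import List, Optional, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse a line such as '0-0 1-2' into (source, target) index pairs."""
    pairs = []
    for item in line.split():
        source, target = item.split("-")
        pairs.append((int(source), int(target)))
    return pairs


def lexical_translation(target_lemmas: List[str], alignment_line: str,
                        focus_index: int) -> Optional[str]:
    """Return the target lemma(s) aligned with the focus word, or None.

    A focus word aligned with several target tokens yields a phrase, with
    the lemmas joined in target-sentence order.
    """
    targets = sorted(t for s, t in parse_alignment(alignment_line)
                     if s == focus_index)
    if not targets:
        return None  # unaligned focus word: the methods fall back to FALSE
    return " ".join(target_lemmas[t] for t in targets)


# Made-up lemmatized sentence pair and alignment, for illustration only.
source_lemmas = ["the", "interest", "rate", "differential"]
target_lemmas = ["le", "écart", "de", "taux", "de", "intérêt"]
alignment = "0-0 3-1 2-3 1-5"
print(lexical_translation(target_lemmas, alignment, focus_index=3))
```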

Contextual Embeddings
To obtain contextual representations for the purposes of the substitution check, we use BERT (Devlin et al., 2019), a deep neural architecture trained with a masked language modeling objective. We chose BERT because it has been shown to capture the semantics of a word in context (Coenen et al., 2019). The context is the sentence containing the focus word. Specifically, we use cased multilingual BERT to generate contextualized embeddings of focus words by summing the last four hidden layers of the BERT model. This choice was based on the results achieved by Devlin et al. (2019) in the named entity recognition task, and by Soler et al. (2019) in the SemDeep-5 WiC shared task. We use cased multilingual BERT embeddings with 768 dimensions, 12 layers, 12 attention heads, and 179M parameters. To implement the substitution check, we generate contextualized embeddings of the translations of the focus tokens, and their substitutes, by summing the last four hidden layers of the BERT model. Since BERT uses sub-tokens to generate embeddings, we analyzed the impact of two different sub-token selection techniques for predicting word similarity: using only the first sub-token, and using the mean over all the sub-tokens. In our development experiments, we found that the former yielded better results. Therefore, only the first sub-token is used to create contextualized embeddings for the substitution method.

Development Results
Table 1 shows the results of our development experiments. The baseline translation identity method IDENT does surprisingly well, outperforming both methods based on intersecting sets of multi-synsets, CVAL and CSUB. Indeed, these methods tend to suffer accuracy degradation as more languages of translation are added. We speculate that this is due to these methods being more vulnerable to noise (errors or omissions) in the multi-wordnet and in the extraction of lexical translations.
However, the best performing method is SUB, which also shows improvement when combining all three languages of translation. Thus, it also shows the most promise for further improvement by adding additional languages.
Our error analysis suggests that there are three principal causes of errors. First, translation may be non-literal. For example, in one instance, the adverb "unevenly" is translated into French as the adjective "inégale" ("unequal"), leading to a false negative. Second, distinct but synonymous translations may lead to false positives. In one instance, the focus word "stain" is translated as "souillé" in one sentence and "tachée" in the other. The focus tokens have distinct meanings, reflected in their distinct translations, "stain on a reputation" versus "stain on a surface". However, the translations pass the BERT-based synonymy check, since they can be synonymous in some contexts. Finally, in some cases, distinct senses of a word may nevertheless translate the same way. For example, in one instance, the focus word "superior" was used in two distinct meanings. Both these meanings can be expressed by the French word "supérieur", and indeed, "superior" was translated as "supérieur" in both sentences, resulting in a false positive.
Table 2 shows our results on the test data. Consistent with our development experiments, the SUB method achieves the best performance with the combination of all three languages. The IDENT method once again performs surprisingly well despite its simplicity, outperforming the more complex CVAL and CSUB methods. Unlike in the development experiments, when only one language of translation is used, Russian yields substantially better performance than French or Italian across all four methods, and Italian likewise yields better performance than French.
Table 3 gives additional details for the results of the SUB method. For each of the three languages, and the combination of all three, we provide the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), as well as the accuracy.
We observe that using multiple languages of translation results in a substantial reduction in false positives, at the possible expense of an increase in false negatives, while maintaining an overall higher accuracy.
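For reference, the accuracy reported alongside the four counts is the share of instances labeled correctly. The sketch below shows the computation; the function name is hypothetical and the counts in the example are placeholders, not values from Table 3.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of instances labeled correctly: (TP + TN) / all instances."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one instance is required")
    return (tp + tn) / total


# Placeholder counts, for illustration only.
print(accuracy(tp=40, tn=30, fp=20, fn=10))
```

Under this definition, trading some false positives for false negatives raises accuracy only if the total number of errors decreases.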

Conclusion
Overall, our results provide a solid proof-of-concept for the utility of multilingual translation for the WiC task. While not competitive with state-of-the-art supervised methods, our results empirically verify the hypothesis that translations convey semantic information, and that this phenomenon has applications in lexical semantics. The IDENT and SUB methods consistently benefit from translation into multiple languages, and this result generalizes to unseen test data.