AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples

Capturing word meaning in context and distinguishing between correspondences and variations across languages is key to building successful multilingual and cross-lingual text representation models. However, existing multilingual evaluation datasets that evaluate lexical semantics “in-context” have various limitations. In particular, 1) their language coverage is restricted to high-resource languages and skewed in favor of only a few language families and areas; 2) their design makes the task solvable via superficial cues, which results in artificially inflated (and sometimes super-human) performance of pretrained encoders; and 3) they offer no support for cross-lingual evaluation. In order to address these gaps, we present AM2iCo (Adversarial and Multilingual Meaning in Context), a wide-coverage cross-lingual and multilingual evaluation set; it aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts for 14 language pairs. We conduct a series of experiments in a wide range of setups and demonstrate the challenging nature of AM2iCo. The results reveal that current SotA pretrained encoders substantially lag behind human performance, with the largest gaps observed for low-resource languages and languages dissimilar to English.

This most recent WiC evaluation approach is particularly attractive as 1) it bypasses the dependence on modeling predefined ontologies (entity linking) and explicit sense inventories (WSD), and 2) it is framed as a simple binary classification task: for a target word $w$ appearing in two different contexts $c_1$ and $c_2$, the system must decide whether $w$ conveys the same meaning in both contexts or not.
However, the current WiC evaluation still leaves ample room for improvement: 1) language coverage is limited and biased towards resource-rich Indo-European languages; 2) coverage of lexical concepts is also limited, owing to their paucity in language-specific WordNets; 3) XL-WiC is a monolingual resource available in different languages, i.e., it does not support cross-lingual assessment. Further, 4) the current WiC datasets offer low human upper bounds and inflated (even super-human) system performance for some languages. This is due to superficial cues: 5) many examples in the current WiC datasets can be resolved relying either on the target word alone, without any context, or on the context alone, which eludes evaluation honing in on the interplay between target words and their corresponding contexts.
In order to address these limitations and provide a more comprehensive evaluation framework, we present AM2iCo (Adversarial and Multilingual Meaning in Context), a novel multilingual and cross-lingual WiC task and resource. It covers a typologically diverse set of 15 languages (see Table 2). Based on Wikipedia in lieu of WordNet, AM2iCo covers a wider set of ambiguous words: it especially complements WiC on the long tail of entity names, and poses generalization challenges over a vocabulary much larger than a restricted set of common words. More importantly, the use of Wikipedia enables WiC evaluation on low-resource languages (e.g., Basque, Georgian, Bengali, Kazakh). We also improve the WiC resource design: AM2iCo 1) includes adversarial examples and careful data extraction procedures that prevent models from backing off to superficial cues, 2) constitutes a more challenging benchmark with wider and more faithful gaps between current SotA pretrained encoders and human capability (see §2.3), and 3) enables cross-lingual evaluation and analysis. The ample and diverse data in AM2iCo enable a wide spectrum of experiments and analyses in different scenarios. We evaluate SotA pretrained encoders, multilingual BERT and XLM-R, both off-the-shelf using a metric-based approach (i.e., without any task adaptation) and after task-specific fine-tuning. With fine-tuned models, we investigate zero-shot cross-lingual transfer as well as transfer from multiple source languages. Across these diverse scenarios, our results firmly indicate a large gap between human and system performance, which is even more prominent for resource-poor languages and languages dissimilar to English, holding promise to guide modeling improvements in the future.
In the hope that AM2iCo will be a challenging and valuable diagnostic and evaluation asset for future work in multilingual and cross-lingual representation learning, we release the data along with the full guidelines at https://github.com/cambridgeltl/AM2iCo.

AM2iCo: Cross-Lingual Word-in-Context Evaluation

Task Definition. AM2iCo is a standard binary classification task over pairs of word-in-context instances. Each pair consists of a target word with its context in English and a target word with its context in a target language. Formally, each AM2iCo dataset for a language pair spans a set of $N$ examples $\hat{x}_i$, $i = 1, \ldots, N$. Each example is a pair of items $\hat{x}_i = (x_{i,src}, x_{i,trg})$, where the item $x_{i,src}$ is provided in the source language $L_{src}$ and the item $x_{i,trg}$ in the target language $L_{trg}$. The item $x_{i,src}$ is in turn another pair $x_{i,src} = (w_{i,src}, c_{i,src})$: it contains a target word $w_{i,src}$ from $L_{src}$ and the (wider) context $c_{i,src}$ (also in $L_{src}$) in which that word appears (see Table 1); the same holds for $x_{i,trg}$. The classification task is then to judge whether the words $w_{i,src}$ and $w_{i,trg}$, occurring in the respective contexts $c_{i,src}$ and $c_{i,trg}$, have the same sense/meaning (i.e., whether they refer to the same entity/concept) or not.
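To make this format concrete, the following minimal Python sketch shows one possible representation of an AM2iCo example; the class and field names are our own illustrative choices, not the released data format, and the Chinese context is abridged.

```python
from dataclasses import dataclass

@dataclass
class Item:
    word: str     # target word w
    context: str  # wider context c in which the word appears

@dataclass
class Example:
    src: Item     # item in the source language (English)
    trg: Item     # item in the target language
    label: str    # 'T' if both words denote the same concept, else 'F'

# Abridged positive pair, after Example 1 in Table 1 (EN-ZH):
ex = Example(
    src=Item("Apollo", "... the six Apollo Moon landings between "
                       "July 1969 and December 1972 ..."),
    trg=Item("阿波罗", "... 阿波罗 ..."),  # abridged Chinese context
    label="T",
)
```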
Final Resource. The full AM2iCo resource comprises datasets for 14 language pairs, where English is paired with 14 target languages. For brevity, in the rest of the paper we refer to the dataset of each language pair simply with the $L_{trg}$ language code (e.g., ZH instead of EN-ZH); languages and codes are provided in Table 2.
As illustrative examples, we show a positive pair (label 'T') and a negative pair (label 'F') from the ZH AM2iCo dataset in Table 1 (Examples 1 and 2). In the positive example, both target words 'Apollo' and '阿波罗' in their contexts refer to the same concept: the Apollo spaceflight program. In the negative example, the Chinese target word '阿波罗' refers to the Apollo spacecraft, but the English target word 'Apollo' now refers to the Greek god.
In what follows, we describe the creation of AM2iCo. We also demonstrate the benefits of AM2iCo and its challenging nature.

Data Creation
Wikipedia is a rich source of disambiguated contexts for multiple languages, and its cross-lingual links provide a direct way to identify cross-lingual concept correspondence. The items $x_{i,src}$ and $x_{i,trg}$ are extracted by taking the surrounding (sentential) context of a hyperlinked word in a Wikipedia article. We balance the context length by (i) discarding items longer than 100 words, and (ii) adding preceding and following sentences to the context for sentences shorter than 30 words. Using the Wikipedia dumps of our 15 languages (see Table 2), we create monolingual items $x$ for each language. We select only ambiguous target words ($w$-s), that is, words that link to at least two different Wikipedia pages; to avoid rare words that are potentially unknown to non-experts, we retain only words that are among the top 200k words by frequency in each respective Wikipedia. For each word, we then create monolingual positive examples by pairing two items (i.e., word-context pairs) $x_i$ and $x_j$ in which the same target word $w$ is linked to the same Wikipedia page, signaling the same meaning. In a similar fashion, monolingual negative examples are created by pairing two items where the same target word $w$ is linked to two different Wikipedia pages. We ensure that there is a roughly equal number of positive and negative examples for each target word. At this point, each monolingual example (i.e., pair of items) $\hat{x}$ contains the same word occurring in two different contexts. In order to create a cross-lingual dataset, we leverage the Wikipedia cross-lingual links: we simply (i) replace one of the two items from each English pair with an item in the target language, and (ii) replace one of the two items from each target-language pair with an English item, where the cross-lingual replacements point to the same Wikipedia page as indicated by the cross-lingual Wiki links. Through this procedure, the final datasets cover a sufficient (and roughly comparable) number of examples containing ambiguous words both in English and in $L_{trg}$. We also rely on data selection heuristics that improve the final data quality, discussed in §2.2 and §2.3.
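The monolingual pairing logic described above can be summarized in a short sketch, assuming mentions have already been extracted from a Wikipedia dump as (word, context, linked page) triples; all names are hypothetical, and details such as frequency filtering and context-length balancing are omitted.

```python
from collections import defaultdict
from itertools import combinations

def build_monolingual_pairs(mentions):
    """mentions: iterable of (word, context, linked_page) triples
    extracted from one language's Wikipedia dump."""
    by_word = defaultdict(list)
    for word, context, page in mentions:
        by_word[word].append((context, page))

    positives, negatives = [], []
    for word, occurrences in by_word.items():
        if len({page for _, page in occurrences}) < 2:
            continue  # keep only ambiguous words (>= 2 distinct pages)
        for (c1, p1), (c2, p2) in combinations(occurrences, 2):
            pair = ((word, c1), (word, c2))
            if p1 == p2:   # same page => same meaning => positive
                positives.append(pair)
            else:          # different pages => different meanings => negative
                negatives.append(pair)
    # Balancing positives vs. negatives per word and the cross-lingual
    # item replacement step are omitted from this sketch.
    return positives, negatives
```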
Finally, in each cross-lingual dataset we reserve 1,000 examples for testing and 500 examples as dev data; the rest is used for training. The exceptions are 4 resource-poor languages, for which all data examples are divided between dev and test. All data portions in all datasets are balanced, and we ensure zero overlap between the train, dev, and test portions. The final AM2iCo statistics are given in Table 2.
Human Validation. We employ human annotators to assess the quality of AM2iCo. For each dataset, we recruit two annotators who each validate a random sample of 100 examples, where 50 examples are shared between the two samples and are used to compute inter-rater agreement. The annotators were recruited via two crowdsourcing platforms, Prolific and Proz, depending on target language coverage; they were native speakers of the target language, fluent in English, and held an undergraduate degree.

Data Selection Heuristics
One critical requirement for AM2iCo is ensuring a high human upper bound. In the initial data creation phase, we observed several sources of confusion among human raters, typically related to some negative pairs being frequently labeled as positive; we identified two causes of this discrepancy and mitigated them through data selection heuristics.
First, some common monosemous words may still get linked to multiple different Wikipedia pages, thus creating confusing negative pairs. For instance, some pronouns (e.g., 'he', 'it') and common nouns (e.g., 'daughter', 'son') may link to different entities as a result of coreference resolution. Truly ambiguous words, however, are typically directly defined in Wikipedia Disambiguation pages; we thus keep only the negative pairs that link to separate entries found in the Wikipedia Disambiguation pages. The second issue concerns concept granularity, as Wikipedia sometimes makes too fine-grained distinctions between concepts, e.g., by setting up separate pages for a country's name in different time periods (for instance, 'China' can be linked both to the page 'Republic of China (1912-1949)' and to the page 'Empire of China (1915-1916)'). We mitigate this issue by requiring that the negative pairs do not share common or parent Wikipedia categories.
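Both heuristics amount to simple filters over candidate negative pairs. The sketch below assumes precomputed lookup tables for disambiguation-page entries and (parent) categories; these names are illustrative assumptions, not part of a released pipeline.

```python
def keep_negative_pair(word, page1, page2, disambiguation_entries, categories):
    """Retain a negative pair only if it passes both heuristics.

    disambiguation_entries: dict mapping a word to the set of pages listed
        on its Wikipedia Disambiguation page.
    categories: dict mapping a page to its set of Wikipedia categories
        (including parent categories).
    """
    # Heuristic 1: the word must be genuinely ambiguous, i.e., both pages
    # appear as separate entries on the word's Disambiguation page.
    entries = disambiguation_entries.get(word, set())
    if page1 not in entries or page2 not in entries:
        return False
    # Heuristic 2: discard overly fine-grained distinctions, i.e., pages
    # sharing any common (or parent) Wikipedia category.
    if categories.get(page1, set()) & categories.get(page2, set()):
        return False
    return True
```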
The application of these heuristics during data creation (see §2.1) yields a substantial boost in human performance: e.g., the scores increase from 74% to 88% for ZH, and from 76% to 94% for DE.

Adversarial Examples
Another requirement is assessing to what extent models can grasp the meaning of a target word based on its (complex) interaction with the context. However, it was recently shown that SotA pretrained LMs exploit superficial cues when solving language understanding tasks, due to spurious correlations seeping into the datasets (Gururangan et al., 2018; Niven and Kao, 2019). This hinders generalization beyond the particular datasets and makes the models brittle to minor changes in the input space (Jia and Liang, 2017; Iyyer et al., 2018). As verified later in §4, we found this to be the case also for the existing WiC datasets: just considering the target word and neglecting the context (or vice versa) is sufficient to achieve high performance.
To remedy this issue, we already ensured in §2.1 that models cannot rely solely on target words, by including both positive and negative examples for each ambiguous word in different contexts. We now additionally introduce adversarial negative examples in AM2iCo to penalize models that rely only on the context without considering the target words.
To create such negative examples, we sample a positive pair $\hat{x}_i$ and, instead of the original target word $w_i$, take another related word $\tilde{w}_i$ occurring in the same context $c_i$ as the new target word. We define a related word as a hyperlinked mention sharing the same parent Wiki category as the original target word: e.g., in Table 1 we change the target word '阿波罗' (Apollo) from Example 1 into the related word 'NASA', resulting in Example 3. Both words share a common parent Wiki category, 美国国家航空航天局 (NASA). The contexts of both examples deal with spaceships; hence, only a fine-grained understanding of the lexical differences between the target words warrants the ability to recognize 'Apollo' as identical to '阿波罗' but different from 'NASA'. Overall, adversarial examples amount to roughly one quarter of our dataset.
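The following sketch illustrates, under the same assumptions as the earlier data-creation sketches, how such an adversarial negative could be generated from a positive pair; the helper structures are hypothetical.

```python
import random

def make_adversarial_negative(positive_pair, mentions_in_context, parent_cats):
    """positive_pair: ((w_src, c_src), (w_trg, c_trg)) with label 'T'.
    mentions_in_context: other hyperlinked mentions occurring in c_trg.
    parent_cats: dict mapping each mention to the set of parent Wiki
        categories of its linked page."""
    (w_src, c_src), (w_trg, c_trg) = positive_pair
    # Candidate replacements: mentions in the same context whose linked
    # page shares a parent category with the original target word
    # (e.g., 'NASA' for '阿波罗' under 美国国家航空航天局).
    candidates = [m for m in mentions_in_context
                  if m != w_trg and parent_cats[m] & parent_cats[w_trg]]
    if not candidates:
        return None
    # Same contexts, different target word => the label flips to 'F'.
    return ((w_src, c_src), (random.choice(candidates), c_trg)), "F"
```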

Data Statistics and Language Coverage
We summarize the main properties of AM2iCo and compare them against the previous word-in-context datasets WiC, XL-WiC, and MCL-WiC in Table 3. More detailed per-language scores are listed in Table 4. First, we emphasize the increased reliability of AM2iCo, as both human accuracy and inter-annotator agreement are substantially higher than for WiC and XL-WiC (rising by ~10 points).
Second, for a comparable overall dataset size, we increase the number of examples and word types in resource-poor languages. Considering their medians across languages, AM2iCo has 8,570 examples and 8,520 word types, around four times more than XL-WiC (1,676 and 1,201) and MCL-WiC (2,000 and 2,072). XL-WiC is heavily skewed towards a small number of languages, namely German and French, and provides large datasets only in those languages; MCL-WiC offers training data only for English. In contrast, AM2iCo provides a more balanced representation of its languages. Third, in AM2iCo we deliberately include longer contexts. While the data in WiC, XL-WiC, and MCL-WiC are derived from concise dictionary examples, AM2iCo data reflect natural text where key information may be spread across a much wider context.
Our selection of languages is guided by recent initiatives to cover a typologically diverse language sample. In particular, AM2iCo covers 15 languages, more than XL-WiC (12 languages) and MCL-WiC (5 languages). Diversity can be measured along multiple axes, such as family, geographic area, and script (Ponti et al., 2019). AM2iCo includes 10 language families, namely: Afro-Asiatic (1 language), Austronesian (1), Basque (1), Indo-European (5), Japonic (1), Kartvelian (1), Koreanic (1), Sino-Tibetan (1), Turkic (2), and Uralic (1). This provides a more balanced sample of cross-lingual variation compared to XL-WiC (5 families) and MCL-WiC (3 families). Regarding geography, in addition to the areas covered by XL-WiC and MCL-WiC (mostly Europe and Eastern Asia), we also represent South-East Asia (with ID), the Middle East (TR), the Caucasus (KA), the Indian subcontinent (UR and BN), and Central Asia (KK). Finally, AM2iCo also introduces scripts that were absent in other datasets, namely the Georgian alphabet and the Bengali script (a Northern Indian abugida), for a total of 8 distinct scripts.

Experimental Setup
We now establish a series of baselines on AM2iCo to measure the gap between current SotA models and human performance. As encoders, we use two SotA multilingual models, multilingual BERT (MBERT) (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), both available in the HuggingFace repository (Wolf et al., 2020).
Classification. Given two contextualized representations $e_{i,src}$ and $e_{i,trg}$ for a pair of target words, we consider two prediction setups. The first, metric-based, setup is non-parametric: following Pilehvar and Camacho-Collados (2019), we score the distance $\delta$ between the representations via cosine similarity. A threshold $t$ is tuned on the development set via grid search over the interval $[0, 1]$ in increments of 0.02; a pair is then classified as negative if $\delta(e_{i,src}, e_{i,trg}) \geq t$, and as positive otherwise. The second, fine-tuning, setup is parametric: following Raganato et al. (2020), we train a logistic regression classifier that takes the concatenation of the contextualized representations $[e_{i,src} \oplus e_{i,trg}]$ as input. The entire model (both the encoder and the classifier) is then fine-tuned to minimize the cross-entropy loss on the training examples with Adam (Kingma and Ba, 2015). We perform grid search for the learning rate over {5e-6, 1e-5, 3e-5}, and train for 20 epochs, selecting the checkpoint with the best performance on the dev set.
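As an illustration, the sketch below shows a minimal version of the metric-based setup with MBERT via HuggingFace Transformers (mean-pooling the subword embeddings of the target word's character span, then thresholding cosine distance), together with the kind of classification head used in the fine-tuning setup. The pooling and span-handling details are our own assumptions; the original implementation may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def target_word_embedding(context, char_start, char_end):
    """Mean-pool the subword embeddings overlapping the target word span."""
    enc = tokenizer(context, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]          # (seq_len, 2) char spans
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, dim)
    keep = torch.tensor([s < char_end and e > char_start and e > s
                         for s, e in offsets.tolist()])
    return hidden[keep].mean(dim=0)

def predict_metric_based(pair, threshold):
    """pair: ((context_src, span_src), (context_trg, span_trg)).
    The threshold is tuned on dev data over [0, 1] in steps of 0.02."""
    (c_s, (s0, s1)), (c_t, (t0, t1)) = pair
    e_src = target_word_embedding(c_s, s0, s1)
    e_trg = target_word_embedding(c_t, t0, t1)
    distance = 1 - torch.cosine_similarity(e_src, e_trg, dim=0).item()
    return "F" if distance >= threshold else "T"  # large distance: different sense

class WiCClassifier(torch.nn.Module):
    """Fine-tuning head: logistic regression over [e_src ⊕ e_trg],
    trained jointly with the encoder via cross-entropy and Adam."""
    def __init__(self, dim=768):
        super().__init__()
        self.linear = torch.nn.Linear(2 * dim, 2)

    def forward(self, e_src, e_trg):
        return self.linear(torch.cat([e_src, e_trg], dim=-1))
```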
Cross-lingual Transfer. In addition to supervised learning, we also carry out cross-lingual transfer experiments where the data splits may belong to different language pairs. The goal is to transfer knowledge from a source language pair $\ell_s$ to a target language pair $\ell_t$. To simulate different scenarios of data paucity, in the fine-tuning setup we consider: 1) zero-shot transfer, where the train and development sets belong to $\ell_s$ and the test set to $\ell_t$; 2) zero-shot + TLD transfer, which is identical except that the dev set is given in $\ell_t$ (i.e., target-language development data); 3) few-shot transfer, where, on top of zero-shot + TLD, we provide a small amount of training examples in $\ell_t$.

[Table 4: Accuracy of MBERT and XLM-R on AM2iCo in a supervised learning setting. We report metric-based classification (MTR) results, as well as the scores in the fine-tuning setup (FT). The third group of rows (HM) displays human performance, in terms of both accuracy and inter-rater agreement. Results for the larger test sets for DE and RU are reported in Table 9 in the Appendix.]

Supervised Learning. The main results are summarized in Table 4. The metric-based approach achieves consistent scores across all languages, fluctuating within the range [57.1, 67.1] for MBERT and [55.5, 65.0] for XLM-R. This indicates that the pretrained encoder alone already contains some relevant linguistic knowledge, to a certain degree. In comparison, fine-tuning yields more uneven results, being more data-hungry: it performs worse than the metric-based approach on languages with small training sets (e.g., ID and EU in Table 4), whereas it surpasses the metric-based approach on languages with abundant examples (e.g., DE, RU).

XLM-R vs MBERT. Table 4 also reveals that XLM-R is more sensitive to training data size than MBERT, often falling behind in both the metric-based and fine-tuning setups, especially for resource-poorer languages. These findings are in line with those reported for Multi-SimLex, which is similarly grounded in lexical semantics. However, they contradict the received wisdom from experiments in other multilingual sentence-level tasks (e.g., Conneau et al., 2020), where XLM-R outperforms MBERT in cross-lingual transfer. While the exact causes go beyond the scope of this work, we speculate that the two encoders excel at separate aspects of semantics: the lexical and the sentence level.

Effect of Data Size on Fine-Tuning. To further investigate the effect of training data size on fine-tuning, we perform an in-depth analysis on selected languages (DE, RU, and JA); note that we use the larger dev and test sets for DE and RU in this experiment. We study how performance changes as we vary the number of training examples from 500 to the full set. The results in Figure 1 indicate that, while fine-tuning starts below the metric-based baseline, it grows steadily and begins to take the lead from around 2,500 training examples.

Zero-shot Transfer. The results are summarized in Table 5. We select the training data of each of the five languages with most data (DE, RU, JA, ZH, AR) in turn for source-language fine-tuning, and report the average prediction performance across all remaining 9 target languages. First, we note that the TLD variant for hyperparameter selection does not yield gains. Second, the best choice of a source language appears to be German across the board, achieving an average score of 71.2 with MBERT and 72.0 with XLM-R. Nevertheless, this is simply due to its ample number of examples (50k): when controlling for this variable by equalizing the total size of each train split to 10k (see the bottom half of Table 5), German's advantage largely disappears. A breakdown over individual languages in Table 6 (top section), however, reveals an even more intricate picture. In particular, the best source language for KK is RU, and for JA it is ZH, rather than DE. This can be explained by the fact that these pairs share their scripts, Cyrillic and Kanji / Hanzi respectively, at least in part. This indicates that a resource-leaner but related language might sometimes be a more effective option as a source language than a resource-rich one. It is also noteworthy that zero-shot transfer from DE outperforms supervised learning in most languages, except for those that are both resource-rich and distant (JA, ZH, and AR).
Few-shot Transfer. To study the differences between training on $\ell_s$ and $\ell_t$ with controlled training data size, we plot the model performance on two target languages (RU and JA) as a function of the number of available examples across different transfer conditions in Figure 2. Comparing supervised learning (based on target-language data) with zero-shot learning (based on DE data), the former is always superior given the same number of examples. However, zero-shot learning may eventually surpass the peak performance of supervised learning by taking advantage of a larger pool of examples: this is the case for RU, but not for JA. This illustrates a trade-off between quality (in-domain but possibly scarce data) and quantity (abundant but possibly out-of-domain data). Few-shot learning combines the desirable properties of both approaches: after pre-training a model on DE, it can be adapted on a small amount of target-language examples. Performance continues to grow with more shots; with as few as 1k JA examples it is comparable to supervised learning on 15k examples. Few-shot learning thus not only achieves the highest scores, but also leverages costly target-language data in a sample-efficient fashion.

Transfer from Multiple Source Languages. The results are reported in Table 6. We observe a substantial boost in performance across all the languages compared to both zero-shot transfer from any individual language and supervised learning (cf. Table 4).

Adversarial Baselines. In previous datasets, at least one of the adversarial baselines reaches performance close to the FULL model: in WiC (EN), CTX has a gap of only 2 points; in XL-WiC (DE), TGT is only 1 point away from FULL; in MCL-WiC (EN), the gap between CTX and FULL is even below 1 point. This would also be the case in AM2iCo were it not for the extra adversarial examples (rows +A): by virtue of this change, the distance between FULL and the best adversarial baseline is 6.4 points in DE and 5.0 in ZH. Therefore, it is safe to conclude that a higher score on AM2iCo better reflects a deep semantic understanding by the model. Moreover, the last column of Table 7 [...]

Related Work
Cross-Lingual Evaluation of Word Meaning in Context. Going beyond the readily available sense inventories required for WSD-style evaluations, comprehensive benchmarks for evaluating word meaning in context cross-lingually are still few and far between. XL-WiC (Raganato et al., 2020) extends the original English WiC framework of Pilehvar and Camacho-Collados (2019) to 12 other languages, but it supports only monolingual evaluation and suffers from issues such as small gaps between human and system performance. The SemEval-2021 shared task MCL-WiC does focus on cross-lingual WiC, but covers only five high-resource languages from three language families (English, French, Chinese, Arabic, Russian). Both XL-WiC and MCL-WiC focus mainly on common words and do not include less frequent concepts (e.g., named entities). Further, their language coverage and data availability are heavily skewed towards Indo-European languages.
There are several other 'non-WiC' datasets designed to evaluate cross-lingual context-aware lexical representations. Bilingual Contextual Word Similarity (BCWS) (Chi and Chen, 2018) challenges a model to predict the graded similarity of cross-lingual word pairs given sentential contexts, one in each language. In the Bilingual Token-level Sense Retrieval (BTSR) task (Liu et al., 2019), given a query word in a source-language context, a system must retrieve a meaning-equivalent target-language word within a target-language context; BTSR can thus be seen as a contextualized version of the standard bilingual lexicon induction task (Mikolov et al., 2013; Søgaard et al., 2018; Ruder et al., 2019, inter alia). However, both BCWS and BTSR are again very restricted in terms of language coverage: BCWS covers only one language pair (EN-ZH), while BTSR contains two pairs (EN-ZH/ES). Further, they provide only test data: as such, they can merely be used as general intrinsic probes for pretrained models, but cannot support fine-tuning experiments and cannot fully expose the relevance of information available in pretrained models for downstream applications. This is problematic as intrinsic tasks in general do not necessarily correlate well with downstream performance (Chiu et al., 2016; Glavaš et al., 2019).
AM2iCo vs. Entity Linking. Our work is related to the entity linking (EL) task (Rao et al., 2013; Cornolti et al., 2013; Shen et al., 2014) similarly to how the original WiC (based on WordNet knowledge) is related to WSD. EL systems must map entities in context to a predefined knowledge base (KB). While WSD relies on the WordNet sense inventory, the EL task focuses on KBs such as Wikipedia and DBpedia. When each entity mention is mapped to a unique Wiki page, this procedure is termed wikification (Mihalcea and Csomai, 2007). The cross-lingual wikification task (Ji et al., 2015; Tsai and Roth, 2016) grounds multilingual mentions to English Wikipedia pages. Similarly to WSD, EL evaluation is tied to a specific KB; it thus faces limitations analogous to those of WSD in restricting meanings and their distinctions to those predefined in the inventory. In comparison, AM2iCo leverages Wikipedia only as a convenient resource for extracting the examples, similar to how the original WiC work leverages WordNet. AM2iCo itself is then framed on natural text, without requiring the modeling of KBs. Also, in comparison with EL, AM2iCo provides higher data quality and a more challenging evaluation of complex word-context interactions, achieved by a carefully designed data extraction and filtering procedure.

Conclusion
We presented AM2iCo, a large-scale and challenging multilingual benchmark for evaluating word meaning in context (WiC) across languages. AM2iCo is constructed by leveraging multilingual Wikipedias and subsequently validated by humans. It covers 15 typologically diverse languages and a vocabulary substantially larger than those of all previous WiC datasets. As such, it provides more comprehensive and reliable quality estimates for multilingual encoders. Moreover, AM2iCo includes adversarial examples: resolving them requires genuine lexical understanding, as opposed to relying on spurious correlations from partial input. Finally, AM2iCo enables cross-lingual evaluation, pairing contexts across different languages. We established a series of baselines on AM2iCo based on SotA multilingual models, revealing that the task is far from being 'solved' even with abundant training data; all models struggle especially when transferring to distant and resource-lean target languages. We also explored the impact of language relatedness on model performance by transferring knowledge from multiple source languages. We hope that AM2iCo will guide and foster further research on effective representation learning across different languages.