Cambridge at SemEval-2021 Task 2: Neural WiC-Model with Data Augmentation and Exploration of Representation

This paper describes the system of the Cambridge team submitted to the SemEval-2021 shared task on Multilingual and Cross-lingual Word-in-Context Disambiguation. Building on top of a pre-trained masked language model, our system is first pre-trained on out-of-domain data, and then fine-tuned on in-domain data. We demonstrate the effectiveness of the proposed two-step training strategy and the benefits of data augmentation from both existing examples and new resources. We further investigate different representations and show that the addition of distance-based features is helpful in the word-in-context disambiguation task. Our system yields highly competitive results in the cross-lingual track without training on any cross-lingual data; and achieves state-of-the-art results in the multilingual track, ranking first in two languages (Arabic and Russian) and second in French out of 171 submitted systems.


Introduction
Polysemy still poses a great challenge to natural language processing (NLP) applications. Depending on its context, an ambiguous word can refer to multiple, potentially unrelated, meanings. Recently, as an application of Word Sense Disambiguation (WSD) (Navigli, 2009, 2012), Word-in-Context (WiC) disambiguation has been framed as a binary classification task to identify if the occurrences of a target word in two contexts correspond to the same meaning or not. The release of the WiC dataset (Pilehvar and Camacho-Collados, 2019), followed by the Multilingual Word-in-Context (XL-WiC) dataset (Raganato et al., 2020), has helped provide a common ground for evaluating and comparing systems while encouraging research in WSD and context-sensitive word embeddings.
In this paper, we describe our submission to the SemEval-2021 shared task on Multilingual and Cross-lingual Word-in-Context (MCL-WiC) Disambiguation (Martelli et al., 2021), which involves determining whether a word shared by two sentences in the same language (multilingual track) or across different languages (cross-lingual track) has the same meaning in both contexts. Compared to the previous WiC and XL-WiC benchmarks, two new languages are introduced as well as a cross-lingual track where systems are evaluated under a 'zero-shot' setting.
The MCL-WiC task directly classifies pairs of sentences with regard to the meaning of the shared word. By turning WSD into a binary comparison task, MCL-WiC avoids the need for the sense tags of previous WSD shared tasks (Manandhar et al., 2010; Navigli et al., 2013; Moro and Navigli, 2015). It also more closely resembles the Word Sense Alignment (WSA) task (Ahmadi et al., 2020), in which definitions from different dictionaries have to be aligned. Contextualised word embeddings and pre-trained Transformer-based (Vaswani et al., 2017) language models have been increasingly applied to these tasks and state-of-the-art results have been reported (Hadiwinoto et al., 2019; Vial et al., 2019; Levine et al., 2020; Raganato et al., 2020; Pais et al., 2020; Manna et al., 2020; Lenka and Seung-Bin, 2020).
In line with previous research, we develop a neural system based on the pre-trained multilingual masked language model XLM-R (Conneau et al., 2020). Additionally, we introduce three distance-based features to be used together with the widely used sequence and token representations for MCL-WiC disambiguation. To further improve system performance, we apply automatic data augmentation and extract examples from multiple external resources. A two-step training strategy is then employed to make use of both in-domain and out-of-domain data.

Table 2: Examples of new instances constructed by pairing the second sentences of existing instances that share the same target word and first sentence. In (a), both existing instances are positive, so the new instance is positive; in (b), one existing instance is positive and the other negative, so the new instance is negative.

(a)
Existing: There is never any point in trying to make oneself heard over noise. / We have formulated a programme to address the traffic noise impact of existing roads. [T]
Existing: There is never any point in trying to make oneself heard over noise. / He went to bed but could not fall asleep because of the noise. [T]
New: We have formulated a programme to address the traffic noise impact of existing roads. / He went to bed but could not fall asleep because of the noise. [T]

(b)
Existing: Wages have declined sharply as a proportion of the subsistence minimum. / Agriculture, largely of a subsistence nature, is the main economic activity. [T]
Existing: Wages have declined sharply as a proportion of the subsistence minimum. / The third member of the Committee is paid a daily fee for each working day plus subsistence allowance. [F]
New: Agriculture, largely of a subsistence nature, is the main economic activity. / The third member of the Committee is paid a daily fee for each working day plus subsistence allowance. [F]
In the remainder of the paper, we present the MCL-WiC disambiguation shared task in Section 2 and our approach in Section 3. In Section 4, we describe the experiments and present results on the development set. Section 5 summarises the official evaluation results. Finally, we provide an analysis of our system in Section 6 and conclude in Section 7.

Task Description
The MCL-WiC dataset used in the shared task consists of sentence pairs sharing the same target word in the same language or across different languages. The task considers five languages: Arabic (AR), Chinese (ZH), English (EN), French (FR) and Russian (RU); and contains five multilingual (EN-EN, AR-AR, FR-FR, RU-RU, ZH-ZH) and four cross-lingual (EN-AR, EN-FR, EN-RU, EN-ZH) sub-tracks. Training data is available for the multilingual EN-EN sub-track only, and development data is available for all five multilingual sub-tracks. No cross-lingual training or development data is provided. Statistics of the MCL-WiC dataset are presented in Table 1.
Results are computed using the accuracy measure, i.e. the ratio of correctly predicted instances (true positives or true negatives) to the total number of instances.
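Equivalently, writing TP, TN, FP and FN for the counts of true and false positives and negatives, the metric can be stated as:

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```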

Data augmentation
Each instance in the (*)WiC datasets (i.e. WiC, XL-WiC and MCL-WiC) is composed of a target word and two sentences in which the target word occurs. We notice that there are cases where the same sentence appears in multiple instances. As shown in Table 2, two existing instances, which share the same target word, contain the same first sentence (s1) but different second sentences (s2 and s3). We therefore construct new instances by pairing the second sentences from these existing instances and assign labels based on the original labels (a code sketch of this procedure follows the list):

• If both existing instances are positive ('T', i.e. the target word is used in the same meaning), the resulting instance is positive ('T') as well (see Example (a) in Table 2):

M(w_s1) = M(w_s2) ∧ M(w_s1) = M(w_s3) ⇒ M(w_s2) = M(w_s3)

where M(w_sn) indicates the meaning of the target word w used in sentence sn.

• If one of them is positive ('T') and the other is negative ('F', i.e. the target word is used in a different meaning), the new instance is negative ('F') (see Example (b) in Table 2):

M(w_s1) = M(w_s2) ∧ M(w_s1) ≠ M(w_s3) ⇒ M(w_s2) ≠ M(w_s3)
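Below is a minimal sketch of this augmentation step, assuming instances are stored as (target_word, sentence1, sentence2, label) tuples; the field layout and the helper name are illustrative, not the shared-task data format.

```python
from collections import defaultdict
from itertools import combinations

def augment(instances):
    """Create new instances by pairing the second sentences of existing
    instances that share the same target word and first sentence.

    Each instance is a tuple (target_word, sentence1, sentence2, label),
    where label is 'T' (same meaning) or 'F' (different meaning).
    """
    groups = defaultdict(list)
    for word, s1, s2, label in instances:
        groups[(word, s1)].append((s2, label))

    new_instances = []
    for (word, _), pairs in groups.items():
        for (s2a, la), (s2b, lb) in combinations(pairs, 2):
            if la == 'T' and lb == 'T':
                new_instances.append((word, s2a, s2b, 'T'))   # T + T -> T
            elif {la, lb} == {'T', 'F'}:
                new_instances.append((word, s2a, s2b, 'F'))   # T + F -> F
            # F + F pairs are skipped: the new label cannot be inferred
    return new_instances
```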

Model
Following Raganato et al. (2020), we use pre-trained XLM-R as the underlying language model, which is a Transformer-based multilingual masked language model that has been trained on one hundred languages (Conneau et al., 2020). Unlike previous WiC and XL-WiC models employing a logistic regression classifier (Wang et al., 2019; Raganato et al., 2020), we add two additional layers on top of the Transformer-based model to perform classification: a linear layer with tanh activation, followed by another linear layer with sigmoid activation.
The model takes as input the two sentences in each instance. For the representation to be fed into the linear layers, we concatenate the representation corresponding to the first special token ([s]) of the input sequence,3 the vector representations of the target word in the first ([w_s1]) and second ([w_s2]) sentences, as well as the element-wise absolute difference, cosine similarity (cos) and Euclidean distance (dist) between these two vectors:

h = [ [s]; [w_s1]; [w_s2]; |[w_s1] − [w_s2]|; cos([w_s1], [w_s2]); dist([w_s1], [w_s2]) ]

where ; denotes concatenation. For those cases where the target word is split into multiple sub-tokens, we take the averaged representation rather than the representation of its first sub-token, which has been used in previous work (Wang et al., 2019; Raganato et al., 2020).4
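A minimal PyTorch-style sketch of this classification head, assuming the three vectors have already been pooled from XLM-R; the hidden size (1024, as in XLM-R large) and module names are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WiCClassificationHead(nn.Module):
    """Two linear layers (tanh, then sigmoid) over the concatenated
    sequence, token and distance-based features described above."""

    def __init__(self, hidden_size: int = 1024):  # 1024 assumes XLM-R large
        super().__init__()
        # [s] + target in s1 + target in s2 + |difference| -> 4 * hidden_size
        # cosine similarity + Euclidean distance           -> 2 extra scalars
        self.linear1 = nn.Linear(4 * hidden_size + 2, hidden_size)
        self.linear2 = nn.Linear(hidden_size, 1)

    def forward(self, h_s, h_w1, h_w2):
        cos = F.cosine_similarity(h_w1, h_w2, dim=-1).unsqueeze(-1)
        dist = torch.norm(h_w1 - h_w2, p=2, dim=-1, keepdim=True)
        features = torch.cat([h_s, h_w1, h_w2, (h_w1 - h_w2).abs(), cos, dist], dim=-1)
        hidden = torch.tanh(self.linear1(features))
        return torch.sigmoid(self.linear2(hidden))  # probability that the two uses share a meaning
```

For a target word split into several sub-tokens, h_w1 and h_w2 would be the mean of the corresponding sub-token vectors.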

Training strategy
Inspired by the success of multi-stage training for tasks like grammatical error correction (Kiyono et al., 2019; Omelianchuk et al., 2020; Yuan and Bryant, 2021) and machine translation (Zhang et al., 2018), we employ a two-step training strategy: 1) pre-training on out-of-domain data; and 2) fine-tuning with in-domain MCL-WiC data.
3 The [s] token in XLM-R is equivalent to the [CLS] token in BERT (Devlin et al., 2019).
4 Our preliminary experiments show that using the averaged representation is more effective than using the representation of the first sub-token.
Table 3: An example negative EN-EN instance constructed from CALD, with word complexity information.
Sentence #1: I went outside to get some fresh air. (A2)
Sentence #2: He has an air of authority.

Data
In addition to the MCL-WiC dataset provided by the shared task organisers, we introduce two types of out-of-domain data: 1) (*)WiC datasets: WiC and XL-WiC; and 2) sentence pairs constructed with examples extracted from datasets that have been annotated with both complexity and sense information: the Cambridge Advanced Learner's Dictionary (CALD) and SeCoDa (Strohmaier et al., 2020).
WiC The English WiC dataset was created using example sentences from WordNet (Fellbaum, 1998), VerbNet (Kipper-Schuler, 2005), and Wiktionary. We extract 6,066 labelled instances (by removing those without gold labels) and use them for the shared task.

XL-WiC The XL-WiC dataset extends WiC to a number of additional languages, including French (FR) and Chinese (ZH), which are also covered by MCL-WiC.

CALD Example sentences in CALD are annotated with both sense and word complexity information. We pair example sentences that illustrate different senses of the same target word to construct negative instances; in this way, sentence pairs are encoded with additional word complexity information. In total, we generate 34,205 negative EN-EN instances.7 An example is given in Table 3.

7 Due to time limitations, we have not used any positive instances from CALD and leave it for future work.

SeCoDa SeCoDa is an English language corpus of words annotated with both complexity and word senses. The original data comes from three sources: professional news, WikiNews and Wikipedia articles. The senses are drawn from CALD and come at two levels of granularity. To use this dataset for the MCL-WiC task, sentences sharing an annotated word are paired: if the word is used in the same sense, the pair of sentences is labelled 'T'; otherwise, it is labelled 'F'. We use the finer level of granularity for this assignment. Overall, we extract 10,712 labelled pairs (9,015 positive and 1,697 negative).

All the data introduced in this section is regarded as out-of-domain data and is therefore used in the pre-training stage, while the in-domain MCL-WiC training data is used in the fine-tuning stage. For development, we use only the EN-EN MCL-WiC development set (MCL-WiC-dev(EN); see Table 4). A single model is developed to target all multilingual and cross-lingual tracks. It is worth noting that neither the multilingual AR-AR and RU-RU data nor the cross-lingual data is used for training, i.e. zero-shot.
The data augmentation method proposed in Section 3.1 is applied to the (*)WiC datasets, but not to CALD or SeCoDa. Detailed statistics of the corpora used in our experiments are presented in Table 4.
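As a rough illustration of the sense-based pairing used for CALD and SeCoDa, consider the sketch below; the record format (word, sense_id, sentence) and the helper name are assumptions for illustration, not the corpora's actual schema.

```python
from collections import defaultdict
from itertools import combinations

def pair_by_sense(records, positive=True, negative=True):
    """Pair sentences that share an annotated word; label the pair 'T' if
    the word carries the same sense in both sentences, 'F' otherwise.

    Each record is a tuple (word, sense_id, sentence).
    """
    by_word = defaultdict(list)
    for word, sense, sentence in records:
        by_word[word].append((sense, sentence))

    pairs = []
    for word, occurrences in by_word.items():
        for (sense_a, sent_a), (sense_b, sent_b) in combinations(occurrences, 2):
            label = 'T' if sense_a == sense_b else 'F'
            if (label == 'T' and positive) or (label == 'F' and negative):
                pairs.append((word, sent_a, sent_b, label))
    return pairs
```

Setting positive=False mirrors the CALD setting described above, where only negative instances are kept.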

Training details
In our experiments, models are trained by minimising the binary cross-entropy loss between their prediction and the gold label. We use the AdamW optimiser (Loshchilov and Hutter, 2019) with a fixed learning rate of 1e-5 for all models. We use a dropout layer with a dropout probability of 0.2. The input texts are processed in batches of 8 and are padded or truncated to a length of 182. We select the model with the best validation accuracy on MCL-WiC-dev(EN). Each model is trained on one NVIDIA Tesla P100 GPU.
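A condensed sketch of one training stage under these hyperparameters, using PyTorch; the model interface and data loader are placeholders rather than the authors' released code. The same routine would be run for both stages, first on the out-of-domain pre-training set and then on the MCL-WiC fine-tuning set, keeping the checkpoint with the best accuracy on MCL-WiC-dev(EN).

```python
import torch
from torch.optim import AdamW

def run_training_stage(model, train_loader, device="cuda"):
    """One epoch of one training stage (pre-training or fine-tuning)."""
    model.to(device).train()
    optimiser = AdamW(model.parameters(), lr=1e-5)   # fixed learning rate of 1e-5
    criterion = torch.nn.BCELoss()                   # binary cross-entropy on sigmoid outputs

    for batch in train_loader:                       # batch size 8; inputs padded/truncated to 182 tokens
        optimiser.zero_grad()
        inputs = {k: v.to(device) for k, v in batch["inputs"].items()}
        probs = model(**inputs).squeeze(-1)          # probability that the two uses share a meaning
        loss = criterion(probs, batch["labels"].float().to(device))
        loss.backward()
        optimiser.step()
    return model
```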

Results
We construct three pre-training sets using different combinations of the out-of-domain data and report results on the development sets in Table 5. Adding further multilingual data from XL-WiC improves performance in all languages (Table 5 #2). Interestingly, Raganato et al. (2020) also reported performance gains in all languages after adding multilingual data. Examples from different languages can still help models better generalise across languages. The addition of English data from CALD and SeCoDa is also beneficial, yielding further improvements in all languages except for ZH-ZH (#3). Finally, the predictions from all three systems are combined in an ensemble: for each instance, the majority vote (MV) of the three predictions, i.e. the label predicted most often, is chosen as the final prediction. The ensemble yields the highest averaged score, as well as the best results in the EN-EN and ZH-ZH sub-tracks (#4), suggesting that the three systems (#1, #2 and #3) are complementary.
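A minimal illustration of majority voting over the per-instance 'T'/'F' predictions of the three systems (the variable names are illustrative):

```python
from collections import Counter

def majority_vote(predictions_per_system):
    """Combine per-instance 'T'/'F' predictions from several systems by
    taking, for each instance, the label predicted most often."""
    final = []
    for votes in zip(*predictions_per_system):   # one tuple of votes per instance
        final.append(Counter(votes).most_common(1)[0][0])
    return final

# Example: three systems, four instances
system1 = ['T', 'F', 'T', 'T']
system2 = ['T', 'T', 'F', 'T']
system3 = ['F', 'F', 'F', 'T']
print(majority_vote([system1, system2, system3]))  # ['T', 'F', 'F', 'T']
```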

Official evaluation results
We submit our systems to all multilingual and cross-lingual tracks. The official results of our best submission for each track are reported in Table 6 and Table 7. Our ensemble system (System #4) achieves state of the art in the multilingual track, ranking first in both the AR-AR and RU-RU sub-tracks without any AR or RU training data, and second in FR-FR out of 171 submitted systems. For the cross-lingual track, our 'zero-shot' system (System #1) is consistently within the top ten ranks out of 171 total submissions.

Analysis

Effect of two-step training
We propose a two-step training strategy to make use of both in-domain and out-of-domain data. To investigate the effectiveness of both training steps, we undertake an ablation study in which we remove one training step at a time. Table 8 presents the ablation results for the system pre-trained on WiC_aug + XL-WiC(FR&ZH)_aug and fine-tuned on MCL-WiC-train_aug (i.e. System #1 in Table 5).
The results of the ablation study demonstrate the effectiveness of the two-step training strategy, and show that it is crucial to have both pre-training and fine-tuning stages. Performance drops in all multilingual sub-tracks when removing either step, except for removing the pre-training step in FR-FR (+0.70). This is interesting as the model is pre-trained on data covering only three sub-tracks (EN-EN, FR-FR and ZH-ZH), of which FR-FR is one. For the other two languages, we observe the biggest performance decreases: ZH-ZH (−2.80) and EN-EN (−2.30). Overall, the fine-tuning stage seems more effective than the pre-training stage, even though more data is used in the latter (10,798 instances for fine-tuning vs. 66,417 for pre-training), demonstrating the importance of having high-quality in-domain data.

Comparison of representations
In our system, the representation extracted from the underlying pre-trained language model is a combination of three vector representations (of the first special token and of the target word in both sentences) and three distance-based features: the element-wise absolute difference, cosine similarity and Euclidean distance between the target word representations in the two sentences (see Section 3.2). We further experiment with different representations and present our results in Table 9.
We can see that our proposed representation yields the overall best performance across different languages, suggesting that the addition of all three distance-based features is indeed helpful.

Conclusion
In this paper, we presented the contribution of the Cambridge University team to the SemEval-2021 shared task on MCL-WiC Disambiguation. Using XLM-R, a pre-trained multilingual Transformer-based language model, as a starting point, we investigated automatic data augmentation, the use of multiple external datasets, multi-stage training strategies, and the representation of tokens and their distances. Our detailed analysis demonstrated the effectiveness of the two-step training strategy for making use of both in-domain and out-of-domain data, as well as the benefits of adding distance-based features to the representation for WiC disambiguation. Our best system yields highly competitive results in the cross-lingual track and achieves state-of-the-art results in the multilingual track, ranking first in two languages (AR and RU) and second in FR out of 171 total submissions.