PAW at SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation : Exploring Cross Lingual Transfer, Augmentations and Adversarial Training

We experiment with XLM RoBERTa for Word-in-Context Disambiguation in the multilingual and cross-lingual settings, with the aim of developing a single model that has knowledge of both. We treat the task as a binary classification problem and experiment with data augmentation and adversarial training techniques. In addition, we experiment with a two-stage training technique. Our approaches prove beneficial for both performance and robustness.


Introduction
Language is complex even for human beings, let alone for computers. The same word serves different purposes in different scenarios, which increases the complexity of Word Sense Disambiguation (WSD). For example, in English the word "bank" can refer to a financial institution or to the land alongside a river. Much of the work on WSD relies on explicit word sense inventories such as WordNet and BabelNet. With the advent of advanced deep learning models, it is desirable to develop systems that understand languages well without such gold-standard sense inventories. Learning without them can help the model acquire better latent representations of words in different contexts.
In this paper, we aim to develop a single system that has knowledge of both multilingual and cross-lingual word sense disambiguation by training models on the combined data for both settings. We present our approaches for WSD in the multilingual and cross-lingual domains. The task is treated as a binary classification problem: whether the target words have the same sense in the given pair of sentences. We experiment with XLM-RoBERTa (Conneau et al., 2019), which is based on the Transformer architecture (Vaswani et al., 2017), as the backbone of our architectures in both settings. In addition, we leverage external data, different training techniques and data augmentation.
The rest of the paper is organized as follows: related work is discussed in Section 2, followed by a brief description of the shared task dataset in Section 3. The system overview and experimental settings are covered in Sections 4 and 5. Section 6 contains the results. Section 7 concludes the paper and outlines the scope for future work.

Related Work
Silberer and Ponzetto (2010) make use of graph algorithms for the word sense disambiguation task. They build a multilingual co-occurrence graph in which the multilingual nodes are connected with translation edges and labelled with the target word's translations as obtained from the corresponding contexts.
Banea and Mihalcea (2011) use a multilingual vector space, obtained by expanding monolingual features engineered from more than one language, in order to generate a more effective and robust vector representation. These engineered features are then used for WSD.
Languages like Arabic have fewer resources available than more common languages like English. To tackle this issue for the Persian language, Lefever and Hoste (2011) follow a two-phase approach: in the first phase, they use an English word sense disambiguation system to assign "sense tags" to words appearing in English sentences, and in the second phase, they transfer the senses obtained in the first phase to the corresponding Persian words.
In the SemEval-2013 WSD task (Navigli et al., 2013), Rudnick et al. (2013) take a classification-based approach to the cross-lingual WSD task. They build the HLTDI system, in which they perform word alignment on the Europarl corpus. This helps them find samples in the training data that have ambiguous focus words. The paper describes three variants of the classifier: the first is trained over local features, the second is trained over data in which the translations of the focus word in the four target languages are added to the feature vector, and the third builds a Markov network of the first classifier in order to find the best translation.

Dataset
The dataset (Martelli et al., 2021) 3 provided by the shared task organizers consists of both multilingual and cross-lingual data in English (EN), Arabic (AR), French (FR), Russian (RU) and Chinese (ZH). Each instance consists of two sentences, the word in each sentence that needs disambiguation, and the corresponding label.

System Overview
Our experiments revolve around Facebook's XLM RoBERTa model, which was an update to their XLM-100 Language Model. XLM RoBERTa is based on the Transformer architecture, which uses multi-head attention to apply a sequence-to-sequence transformation to the input text sequence. The training procedure is inspired by RoBERTa (Liu et al., 2019), i.e. only the Masked Language Model objective is used. XLM RoBERTa is scaled up to 100 languages, which makes it a good choice for multilingual datasets.
Another motivation for experimenting with XLM RoBERTa is its capacity for "Cross Lingual Transfer", which can help when the data is unbalanced across languages. When the model is trained for a particular task using data from only one language, the knowledge is transferred to the other languages. This saves the effort of gathering more data to balance the data distribution.
3 https://github.com/SapienzaNLP/mcl-wic

Problem Formulation
We keep the model architecture constant across all experiments. The model accepts both sentences concatenated together. The input to the model is formulated as: word_1 + </s> + sentence_1 + </s> + word_2 + </s> + sentence_2, where </s> is the separator token in the XLM RoBERTa vocabulary.
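The input formulation above can be sketched as a small string-building helper. This is only an illustration: the function name `build_input` and the whitespace placed around the separator are assumptions, and in practice the tokenizer may add its own special tokens around the sequence.

```python
SEP = "</s>"  # separator token in the XLM RoBERTa vocabulary, as stated in the text


def build_input(word_1: str, sentence_1: str, word_2: str, sentence_2: str,
                sep: str = SEP) -> str:
    """Join target words and sentences in the order word_1, sentence_1, word_2, sentence_2.

    The single space on each side of the separator is an assumption made
    for readability; the paper does not specify the exact spacing.
    """
    return f" {sep} ".join([word_1, sentence_1, word_2, sentence_2])
```

For example, `build_input("bank", "I sat on the bank.", "bank", "The bank was closed.")` produces one string in which the four parts are separated by `</s>`.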
Dropout with a probability of 0.3 is applied to the pooled encoding output from the model. The result is then passed through a linear layer, which produces the logits for the 2 classes.

Data Augmentation
Data augmentation is an important technique for avoiding overfitting in neural networks, thus making them generalise better. Since our model architecture accepts both sentences together, there is room to apply a simple data augmentation during training. Let $t_1$ and $t_2$ be the 2 sentences of a particular data instance. The training data is augmented as $t_1 \oplus t_2$ and $t_2 \oplus t_1$, where $\oplus$ represents concatenation. We apply the augmentation taking care that no data leakage takes place in the validation data.
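A minimal sketch of this order-swapping augmentation is given below. The row layout (keys `word_1`, `sentence_1`, `word_2`, `sentence_2`, `label`) and the function name are hypothetical, and the sketch assumes that each sentence travels together with its target word when the order is swapped.

```python
from typing import Dict, List


def augment_with_swaps(train_rows: List[Dict]) -> List[Dict]:
    """Return the training rows plus a copy of each row with the two sides swapped.

    The label is kept unchanged, since "same sense" does not depend on the
    order of the two sentences.
    """
    augmented = []
    for row in train_rows:
        augmented.append(dict(row))  # original order
        swapped = dict(row)
        swapped["word_1"], swapped["word_2"] = row["word_2"], row["word_1"]
        swapped["sentence_1"], swapped["sentence_2"] = row["sentence_2"], row["sentence_1"]
        augmented.append(swapped)  # swapped order
    return augmented
```

To avoid leakage, a function like this would be called only on the training portion after the train/validation split, so that the swapped copy of a validation instance never appears in the training data.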

Two Stage Training
To leverage the property of Cross Lingual Transfer, we first train the model on the WiC dataset (Pilehvar and Camacho-Collados, 2018), which consists only of English data. We then continue training the same model on the MCL WiC dataset. The first stage instills some knowledge about WSD, which cross-lingual transfer makes available to the other languages, and the second stage builds on that knowledge using the shared task dataset.

Adversarial Training (AT)
Adversarial training is another technique used to increase the robustness of models, which also helps them generalise better. In Computer Vision, adversarial training is done by directly perturbing the input images. However, text data is discrete and cannot be perturbed directly in the same way, so many approaches for adversarial training in NLP have been developed. We experiment with the approach of Miyato et al. (2016), with a small modification. In their approach, the word embeddings are normalized first, and the required perturbations are created using the gradients obtained via backpropagation. Let the sequence of (normalized) word embedding vectors of a text be $t$, let the model parameters be $\theta$, and let $p(y \mid t; \theta)$ be the probability of the text belonging to class $y$. Following Miyato et al. (2016), the adversarial perturbations $z_{adv}$ are computed as follows:

$$z_{adv} = -\epsilon \frac{g}{\lVert g \rVert_2}, \qquad g = \nabla_{t} \log p(y \mid t; \theta)$$

where $\epsilon$ is a hyper-parameter controlling the size of the perturbations. The adversarial loss is defined as:

$$L_{adv}(\theta) = -\frac{1}{N} \sum_{n=1}^{N} \log p(y_n \mid t_n + z_{adv,n}; \theta)$$

where $N$ is the number of labeled examples. The gradients calculated from this loss are used to update the weights of the model (it is the non-perturbed word embeddings of the model that are updated). Our experiments deviate from the above method in that we do not normalize the pretrained word embeddings of the model, since doing so might change their semantic meaning. We perform adversarial training of the XLM RoBERTa model using $\epsilon = 1$.
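The normalized-gradient perturbation of Miyato et al. (2016) can be illustrated on a toy problem. The sketch below is not the system described here: it replaces XLM RoBERTa with a hypothetical logistic model $p(y=1 \mid t; w) = \sigma(w \cdot t)$ so that the gradient with respect to the embedding $t$ can be written analytically, and all function names are assumptions made for the example.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def log_prob(y: int, t: List[float], w: List[float]) -> float:
    """log p(y | t; w) for a toy logistic model standing in for the real classifier."""
    p1 = sigmoid(sum(wi * ti for wi, ti in zip(w, t)))
    return math.log(p1 if y == 1 else 1.0 - p1)


def adversarial_perturbation(y: int, t: List[float], w: List[float],
                             epsilon: float = 1.0) -> List[float]:
    """z_adv = -epsilon * g / ||g||_2, with g the gradient of log p(y | t; w) w.r.t. t."""
    p1 = sigmoid(sum(wi * ti for wi, ti in zip(w, t)))
    # For logistic regression the gradient w.r.t. t is (y - p1) * w.
    g = [(y - p1) * wi for wi in w]
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return [0.0] * len(t)  # no direction to perturb along
    return [-epsilon * gi / norm for gi in g]


def adversarial_loss(y: int, t: List[float], w: List[float],
                     epsilon: float = 1.0) -> float:
    """Negative log-likelihood of the true label at the perturbed input t + z_adv."""
    z = adversarial_perturbation(y, t, w, epsilon)
    perturbed = [ti + zi for ti, zi in zip(t, z)]
    return -log_prob(y, perturbed, w)
```

In this toy setting the perturbation always has L2 norm `epsilon` and moves the input in the direction that lowers the probability of the true label, which is why the adversarial loss is at least as large as the clean loss.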

Test Time Augmentation (TTA)
The training data augmentation can be extended to test time as well. For a given data instance with sentences $t_1$ and $t_2$, the model predictions for $t_1 \oplus t_2$ and $t_2 \oplus t_1$, where $\oplus$ represents concatenation, are combined by simple averaging of the probabilities. This simple augmentation can help boost the performance of the model.
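The averaging step can be sketched as follows. Here `predict_proba` is a hypothetical callable that returns the model's probability of the "same sense" class for one ordering of the inputs; it stands in for a forward pass through the fine-tuned model.

```python
from typing import Callable

# Hypothetical signature: (word_1, sentence_1, word_2, sentence_2) -> probability
PredictFn = Callable[[str, str, str, str], float]


def predict_with_tta(predict_proba: PredictFn, word_1: str, sentence_1: str,
                     word_2: str, sentence_2: str) -> float:
    """Average the predicted probabilities for both orderings of the sentence pair."""
    p_forward = predict_proba(word_1, sentence_1, word_2, sentence_2)
    p_swapped = predict_proba(word_2, sentence_2, word_1, sentence_1)
    return 0.5 * (p_forward + p_swapped)
```

Because the average is symmetric in the two orderings, the final prediction no longer depends on which sentence happens to come first.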

Experimental Setup
We combine the training and validation data provided by the shared task organizers and perform stratified 5-fold cross-validation on the combined data. In all our experiments, we fine-tune the entire model. Each fold is trained for up to 20 epochs, using early stopping with a patience of 6 and a tolerance of 1e-3. The models are optimised using AdamW (Loshchilov and Hutter, 2017) with a learning rate of 5e-6 and a batch size of 16. Inputs are limited to a maximum sequence length of 172. The models have been implemented using PyTorch (Paszke et al., 2019) and Hugging Face's Transformers (Wolf et al., 2019) library.
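The early stopping rule can be sketched as a small helper. This is one plausible reading rather than the exact implementation: it assumes the monitored quantity is a validation score where higher is better, and that an epoch counts as an improvement only if it beats the best score so far by more than the tolerance. The class name is hypothetical.

```python
class EarlyStopping:
    """Stop when the validation score has not improved by more than
    `tolerance` for `patience` consecutive epochs."""

    def __init__(self, patience: int = 6, tolerance: float = 1e-3):
        self.patience = patience
        self.tolerance = tolerance
        self.best_score = float("-inf")
        self.bad_epochs = 0

    def step(self, score: float) -> bool:
        """Record one epoch's validation score; return True if training should stop."""
        if score > self.best_score + self.tolerance:
            self.best_score = score
            self.bad_epochs = 0  # improvement resets the counter
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

Within each fold, `step` would be called once per epoch for at most 20 epochs, and training would end as soon as it returns `True`.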

Results
Accuracy is the official evaluation metric for the shared task. The test predictions are obtained by averaging the predictions of all 5 fold models. Table 1 lists the cross-validation accuracy scores of all the experiments. The test scores are categorised as cross-lingual and multilingual and are presented in Tables 2 and 3 respectively. For benchmarking purposes, we also report the best performances achieved by participants of the shared task.
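The fold averaging and the accuracy computation can be sketched as below. The sketch assumes that the averaged "predictions" are per-example probabilities of the positive class and that a threshold of 0.5 converts them to labels; neither detail is stated explicitly, and the function names are hypothetical.

```python
from typing import List, Sequence


def average_fold_probabilities(fold_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average per-example probabilities across fold models.

    fold_probs[k][i] is the probability from fold model k for test example i.
    """
    n_folds = len(fold_probs)
    return [sum(per_example) / n_folds for per_example in zip(*fold_probs)]


def accuracy(probs: Sequence[float], labels: Sequence[int],
             threshold: float = 0.5) -> float:
    """Fraction of examples whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in probs]
    correct = sum(int(p == y) for p, y in zip(preds, labels))
    return correct / len(labels)
```

Averaging probabilities before thresholding lets a confident fold model outweigh an uncertain one, which a majority vote over hard labels would not do.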
A few observations can be made by looking at the results:

Conclusion and Future Work
We explore the performance of XLM RoBERTa on Word-in-Context Disambiguation in both the multilingual and cross-lingual settings. We also explore different training techniques, such as two-stage training and adversarial training, along with some simple augmentations to make our models more robust and generalised. Test Time Augmentation, based on the training augmentation, turns out to be useful. For future work, we can explore ensembling different kinds of models, trained with and without adversarial training, so as to produce more robust results. It will also be interesting to experiment with larger backbone models in the current architecture.