TransWiC at SemEval-2021 Task 2: Transformer-based Multilingual and Cross-lingual Word-in-Context Disambiguation

Identifying whether a word carries the same or a different meaning in two contexts is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. Most previous work in this area relies on language-specific resources, making it difficult to generalise across languages. Considering this limitation, our approach to SemEval-2021 Task 2 is based only on pretrained transformer models and does not use any language-specific processing or resources. Despite that, our best model achieves 0.90 accuracy for the English-English subtask, which is very competitive with the best result of the subtask, 0.93 accuracy. Our approach also achieves satisfactory results in the other monolingual and cross-lingual language pairs.


Introduction
Words' semantics have a dynamic nature which depends on the surrounding context (Pilehvar and Camacho-Collados, 2019). Therefore, the majority of words tend to be polysemous (i.e. have multiple senses); examples include "cell", "bank" and "report". Because of this property of natural language, it is important to focus on the word-in-context sense when extracting the meaning of a word that appears in a text segment. This is also a critical requirement for many applications such as question answering, document summarisation, information retrieval and information extraction.
Word Sense Disambiguation (WSD)-based approaches were widely used by previous research to tackle this problem (Loureiro and Jorge, 2019; Scarlini et al., 2020). WSD associates a word in a text with its correct meaning from a predefined sense inventory (Navigli, 2009). WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2012) were commonly used as such inventories. However, these approaches fail to generalise to different languages, as these inventories are often limited to high-resource languages. Targeting this gap, SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation is designed to capture the word sense without relying on fixed sense inventories in both monolingual and cross-lingual settings. In summary, the task is designed as a binary classification problem which predicts whether the target word has the same meaning or a different meaning in two contexts of the same language (monolingual setting) or of different languages (cross-lingual setting).
This paper describes our submission to SemEval-2021 Task 2 (Martelli et al., 2021). Our approach is mainly focused on transformer-based models with different text pair classification architectures. We remodel the default text pair classification architecture and introduce several strategies that outperform it for this task. For effortless generalisation across languages, we do not use any language-specific processing or resources. In the subtasks where only a few training instances were available, we use few-shot learning, and in the subtasks where no training instances were available, we use zero-shot learning, taking advantage of the cross-lingual nature of multilingual transformer models.
The remainder of this paper is organised as follows. Section 2 describes the related work done in the field of word-in-context disambiguation. Details of the task data sets are provided in Section 3. Section 4 describes the proposed architecture and Section 5 provides the experimental setup details. Following them, Section 6 demonstrates the obtained results and Section 7 concludes the paper with final remarks and future research directions.

Related Work
Unsupervised systems The majority of unsupervised WSD systems use external knowledge bases like WordNet (Miller, 1995) and BabelNet (Navigli and Ponzetto, 2012). For each input word, its correct meaning according to the context can be found in those external knowledge bases using graph-based techniques. However, these approaches are limited to the languages supported by the knowledge bases used. More recent works such as Hettiarachchi and Ranasinghe (2020a) and Ranasinghe et al. (2019a) propose using stacked word embeddings (Akbik et al., 2018) obtained from general purpose pretrained contextualised word embedding models such as BERT (Devlin et al., 2019) and Flair (Akbik et al., 2019) for unsupervised WSD. Despite their ability to scale over different languages, unsupervised approaches fall behind supervised systems in terms of accuracy.
Supervised systems Supervised systems rely on semantically-annotated corpora for training (Raganato et al., 2017; Bevilacqua and Navigli, 2019). Early approaches were based on traditional machine learning algorithms like support vector machines (Iacobacci et al., 2016). As word embedding-based approaches became popular in natural language processing tasks, more recent approaches to WSD were based on neural network architectures (Melamud et al., 2016; Raganato et al., 2017). However, they rely on large manually-curated training data to train the machine learning models, which in turn hinders their ability to scale over unseen words and new languages. More recently, contextual representations of words have been used in WSD, where they have been employed to create sense embeddings (Peters et al., 2018). However, these approaches also rely on sense-annotated corpora to gather contextual information for each sense, and hence are limited to languages for which gold annotations are available. A very recent approach, SensEmBERT (Scarlini et al., 2020), performs WSD by leveraging the mapping between senses and Wikipedia pages, the relations among BabelNet synsets and the expressiveness of contextualised embeddings, removing the need for manual annotations. However, SensEmBERT only supports five languages, making it difficult to use with other languages.
Considering the limitations of the above methods, in this paper we propose an approach which is based on general purpose transformer models and does not rely on external knowledge bases. Our approach also shows strong few-shot/zero-shot learning performance, removing the hurdle of requiring manually-curated training data for each language pair.

Data
The data set used for SemEval-2021 Task 2 is designed for a binary classification problem, following Pilehvar and Camacho-Collados (2019). To preserve the multilinguality and cross-linguality of the task, five different languages (English, Arabic, French, Russian and Chinese) were considered when preparing the data set. In the monolingual setting, each instance provides a sentence pair written in the same language together with a target lemma, and the task is to predict whether the lemma has the same meaning (True) or different meanings (False) in the two sentences. In the cross-lingual setting, each sentence pair is written in two different languages, with the same prediction requirement. A few samples from the monolingual and cross-lingual data sets are shown in Table 1.
The monolingual data set covers the language pairs en-en, ar-ar, fr-fr, ru-ru and zh-zh. For each language, an 8-instance trial data set with labels was provided to give an insight into the task. As training data, 8,000 labelled instances were provided for English only, and as dev data, 1,000 labelled instances were provided for each language. For the final evaluation, a 1,000-instance test data set was provided for each language.
The cross-lingual data set covers the language pairs en-ar, en-fr, en-ru and en-zh. Similar to the monolingual data set, an 8-instance trial data set with labels was provided for each language pair. However, no training or dev data sets were provided for the cross-lingual setting. For the final evaluation, a 1,000-instance test data set was provided for each language pair.
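To make the data description concrete, the following is a minimal Python sketch of one instance and of the split sizes listed above. The class name `WiCInstance`, its field names, the helper `split_sizes` and the example sentences are hypothetical choices made for illustration; they are not the official file format of the task.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MONOLINGUAL_PAIRS = ("en-en", "ar-ar", "fr-fr", "ru-ru", "zh-zh")
CROSS_LINGUAL_PAIRS = ("en-ar", "en-fr", "en-ru", "en-zh")


@dataclass(frozen=True)
class WiCInstance:
    """One word-in-context instance (field names are illustrative)."""
    lemma: str
    sentence1: str
    sentence2: str
    pair: str  # language pair, e.g. "en-en" or "en-fr"
    same_meaning: Optional[bool] = None  # None when no gold label is given

    @property
    def is_cross_lingual(self) -> bool:
        first, second = self.pair.split("-")
        return first != second


def split_sizes(pair: str) -> Dict[str, int]:
    """Number of instances per split, as described in this section."""
    if pair not in MONOLINGUAL_PAIRS + CROSS_LINGUAL_PAIRS:
        raise ValueError(f"unknown language pair: {pair}")
    return {
        "trial": 8,
        "train": 8000 if pair == "en-en" else 0,
        "dev": 1000 if pair in MONOLINGUAL_PAIRS else 0,
        "test": 1000,
    }


# A made-up monolingual example: "bank" has different meanings (False).
example = WiCInstance(
    lemma="bank",
    sentence1="She sat on the bank of the river.",
    sentence2="He deposited the money at the bank.",
    pair="en-en",
    same_meaning=False,
)
```

Calling `split_sizes("en-zh")` makes the zero-shot situation explicit: the cross-lingual pairs have test data but neither training nor dev data.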

TransWiC Architecture
The main motivation behind the TransWiC architecture is the success that transformer-based architectures have had in various natural language processing tasks like offensive language identification (Ranasinghe and Hettiarachchi, 2020; Ranasinghe et al., 2019c; Pitenis et al., 2020), offensive spans identification (Ranasinghe and Zampieri, 2021a; Ranasinghe et al., 2021), language detection (Jauhiainen et al.,  Zampieri, 2020, 2021b; Ranasinghe et al., 2020a). Therefore, we took general purpose transformers like BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020) and reworked their sentence pair classification architecture with the strategies described below to perform well in the word-in-context disambiguation task.
Preprocessing As a preprocessing step we add two tokens to the transformer model's vocabulary: <B> and <E>. We place them around the target word in both sentences. For example, the sentence "la souris mange le fromage" with the target word "souris" will be changed to "la <B> souris <E> mange le fromage".
i Default Strategy -In the default sentence pair classification architecture, the output of the [CLS] token is fed into a softmax layer to predict the labels (Figure 1).
ii <B> Strategy -We concatenate the output of the two <B> tokens of the two sentences and feed it into a softmax layer to predict the labels (Figure 2a).
iii <B> + [CLS] Strategy -We concatenate the output of two <B> tokens of the two sentences with the [CLS] token and feed it into a softmax layer to predict the labels (Figure 2b).
iv <E> Strategy -We concatenate the output of the two <E> tokens of the two sentences and feed it into a softmax layer to predict the labels (Figure 2c).
v <E> + [CLS] Strategy -We concatenate the output of two <E> tokens of the two sentences with the [CLS] token and feed it into a softmax layer to predict the labels (Figure 2d).
vi Entity Pool Strategy -To effectively deal with rare words, transformer models use sub-word units or WordPiece tokens as the input to build the models (Devlin et al., 2019). Therefore, there is a possibility that one target word can be separated into several sub-words. In this strategy, we generate separate fixed-length embeddings for each target word by passing its sub-word outputs through a pooling layer. The pooled outputs are concatenated and fed into a softmax layer to predict the labels (Figure 2e).
vii Entity First Strategy -This strategy is similar to the previous one, but instead of using all the sub-words of the target word, we only use the output of the first sub-word. We feed the concatenation of these outputs into a softmax layer to predict the labels (Figure 2f).
viii Entity Last Strategy -This strategy is similar to the Entity First Strategy, but instead of the first sub-word, we use the last sub-word to represent the target word. We feed the concatenation of these outputs into a softmax layer to predict the labels (Figure 2g).
ix [CLS] + Entity Pool Strategy -We concatenate the pooled outputs generated by the Entity Pool Strategy with the [CLS] token and feed it into a softmax layer to predict the labels (Figure 2h).
x [CLS] + Entity First Strategy -We concatenate the first sub-word output of the target words with the [CLS] token and feed it into a softmax layer to predict the labels (Figure 2i).
xi [CLS] + Entity Last Strategy -In this strategy, we concatenate the last sub-word output of the target words with the [CLS] token and feed it into a softmax layer to predict the labels (Figure 2j).
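The preprocessing step and the strategies above differ mainly in which transformer outputs are concatenated before the softmax layer. The sketch below illustrates this in plain Python, with output vectors represented as lists of floats. It is not the TransWiC implementation: the function names, the dictionary keys (`"B"`, `"E"`, `"subwords"`), the strategy identifiers, the order of concatenation and the use of mean pooling for the Entity Pool Strategy are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mark_target(sentence: str, start: int, end: int) -> str:
    """Wrap the target word sentence[start:end] in <B> and <E> markers."""
    return f"{sentence[:start]}<B> {sentence[start:end]} <E>{sentence[end:]}"


def mean_pool(vectors: Sequence[Vector]) -> Vector:
    """Average sub-word vectors into one fixed-length vector.
    Mean pooling is an assumption; the pooling type is not specified."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


# How each per-sentence component is read from that sentence's outputs.
COMPONENTS = {
    "b": lambda outputs: outputs["B"],
    "e": lambda outputs: outputs["E"],
    "entity_pool": lambda outputs: mean_pool(outputs["subwords"]),
    "entity_first": lambda outputs: outputs["subwords"][0],
    "entity_last": lambda outputs: outputs["subwords"][-1],
}


def build_features(strategy: str, cls: Vector, first: Dict, second: Dict) -> Vector:
    """Concatenate the outputs named in `strategy`, e.g. "cls", "b",
    "b+cls" or "cls+entity_pool". The result is the classifier input."""
    features: Vector = []
    for name in strategy.split("+"):
        if name == "cls":
            features.extend(cls)  # one [CLS] output per sentence pair
        else:
            extract = COMPONENTS[name]
            features.extend(extract(first))   # target in sentence 1
            features.extend(extract(second))  # target in sentence 2
    return features


marked = mark_target("la souris mange le fromage", 3, 9)
```

With a hidden size of h, the single-component strategies produce a 2h-dimensional input to the softmax layer and the combinations with [CLS] produce a 3h-dimensional one, while the default strategy uses only the h-dimensional [CLS] output.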

Experimental Setup
This section describes the training data and hyperparameter configurations used during the experiments.

Training Configurations
English-English For the English-English subtask, we performed training on the English-English training data for each strategy mentioned above. During the training process, the parameters of the transformer model, as well as the parameters of the subsequent layers, were updated. We used the saved model from a particular strategy to get predictions for the English-English test set for that particular strategy.
Other Monolingual Since little training data was available for the non-English monolingual datasets, we followed the few-shot learning approach described in Ranasinghe et al. (2020c,b). When starting the training for non-English monolingual language pairs, rather than training a model from scratch, we initialised the model with the weights saved from the English-English experiment. Then we performed training on the dev data for each language pair separately. Similar to the English-English experiments, during the training process, the parameters of the transformer model, as well as the parameters of the subsequent layers, were updated.
Cross-lingual Since no training data was available for the cross-lingual datasets, we followed a zero-shot approach for them. Multilingual and cross-lingual transformer models like multilingual BERT and XLM-R show strong cross-lingual transfer learning performance. They can be trained on one language, typically a resource-rich language, and then be used to perform inference on another language. The cross-lingual nature of these transformer models makes this possible (Ranasinghe et al., 2020c). Therefore, we used the models trained on the English-English dataset to get predictions for the cross-lingual datasets.
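The three training configurations can be summarised as a simple lookup from language pair to training regime. The sketch below only restates the description above; the names `TrainingPlan` and `plan_for` and the descriptive strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingPlan:
    regime: str                  # "supervised", "few-shot" or "zero-shot"
    initial_weights: str         # where the model weights come from
    fine_tune_on: Optional[str]  # data used for training, None if no training


def plan_for(pair: str) -> TrainingPlan:
    """Map a language pair such as "en-en" or "en-fr" to its regime."""
    first, second = pair.split("-")
    if pair == "en-en":
        # Full fine-tuning on the English-English training data.
        return TrainingPlan("supervised", "pretrained transformer",
                            "en-en training set")
    if first == second:
        # Start from the English-English model, then train on the dev data.
        return TrainingPlan("few-shot", "en-en fine-tuned model",
                            f"{pair} dev set")
    # No training data: use the English-English model directly.
    return TrainingPlan("zero-shot", "en-en fine-tuned model", None)
```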

Hyperparameter Configurations
We used an Nvidia Tesla K80 GPU to train the models. We divided the input dataset into a training set and a validation set using a 0.8:0.2 split. We tuned mainly the learning rate and the number of epochs of the classification model, manually, to obtain the best results on the validation set. We obtained 1e-5 as the best value for the learning rate and 3 as the best value for the number of epochs. We performed early stopping if the validation loss did not improve over 10 evaluation steps. The rest of the hyperparameters, which we kept constant, are listed in the Appendix. When performing training, we trained five models with different random seeds and used the majority-class self ensemble described in Hettiarachchi and Ranasinghe (2020b) to get the final predictions.
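The 0.8:0.2 split and the majority-class self ensemble over five seeds can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names are hypothetical, and the shuffling before the split is an assumption, since the text does not say how the split was drawn.

```python
import random
from collections import Counter
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    items: Sequence[T], seed: int, train_fraction: float = 0.8
) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed, then split into training and validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority_vote(predictions_per_model: Sequence[Sequence[str]]) -> List[str]:
    """For each instance, return the label predicted by most models.
    With five models and two labels a tie cannot occur."""
    ensembled = []
    for votes in zip(*predictions_per_model):
        label, _count = Counter(votes).most_common(1)[0]
        ensembled.append(label)
    return ensembled


# Predictions of five models (one list per random seed) for three instances.
per_seed = [
    ["T", "F", "T"],
    ["T", "F", "F"],
    ["F", "F", "T"],
    ["T", "T", "F"],
    ["T", "F", "F"],
]
final = majority_vote(per_seed)
```

Using an odd number of models keeps the vote unambiguous for a binary task, which is one practical reason to ensemble five runs rather than four.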

Results and Evaluation
The organisers used accuracy as the evaluation metric, as shown in Equation 1, where TP is True Positive, TN is True Negative, FP is False Positive and FN is False Negative.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
Since little or no training data was available for the other monolingual and cross-lingual settings, we trained and evaluated models for each of our strategies using the English-English training and dev sets. The best models were then picked for use with the few-shot and zero-shot learning approaches. We report the results obtained in the English-English evaluation in Table 2. In the BERT column, we report the results of the bert-large-cased model, while in the XLM-R column, we report the results of the xlm-r-large model.
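The accuracy metric defined by TP, TN, FP and FN can be computed as in the short sketch below. The function names are illustrative and the boolean labels stand for True (same meaning) and False (different meanings).

```python
from typing import Sequence


def accuracy_from_counts(tp: int, tn: int, fp: int, fn: int) -> float:
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no instances to evaluate")
    return (tp + tn) / total


def accuracy(gold: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Count the four outcome types and apply the accuracy formula."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    tn = sum(1 for g, p in pairs if not g and not p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    return accuracy_from_counts(tp, tn, fp, fn)
```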
As shown in Table 2, some strategies outperformed the default sentence pair classification architecture. Among all the strategies we experimented with, the <B> + [CLS] strategy performed best. Usually, multilingual transformer models like XLM-R do not outperform language-specific transformer models. Surprisingly, in this task the XLM-R models outperform the bert-large models. We selected the three best performing models, all based on XLM-R, for the submission. Since the multilingual models provided the best results for the English-English dataset, they offer the additional advantage that they can be used directly in other language pairs too, as mentioned in Section 5. For the other language pairs, we did not perform any evaluation due to the lack of available data. We trusted the cross-lingual performance of XLM-R and used the best three models of the English-English experiment. For the rest of the monolingual pairs, we used the few-shot learning approach with the given dev sets, and for the cross-lingual pairs, we used the zero-shot learning approach mentioned in Section 5.
We report the results we obtained on the test set in Table 3. According to the results, the <B> + [CLS] strategy performs best in all the language pairs except Ar-Ar, where the <B> strategy outperforms the <B> + [CLS] strategy. When compared to the best models submitted for each language pair, our approach shows very competitive results in the majority of the monolingual language pairs. However, we believe that the cross-lingual performance of our methodology should be improved. Nonetheless, we believe that for a methodology that did not use any language-specific resources and did not see any language-specific data, the results are at a satisfactory level.

Conclusions
In this paper, we presented our approach for tackling SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation. We use pretrained transformer models and remodel the sentence pair classification architecture for this task with several strategies. Our best strategies outperform the default sentence pair classification setting for English-English. For other monolingual language pairs, we use the few-shot learning approach, while for cross-lingual language pairs we use the zero-shot approach. Our results are competitive with the best systems submitted for each language pair and are at a satisfactory level given that we did not use any language-specific processing or resources.
Table 3: Row I shows the accuracy scores for the test set with the strategies submitted. Best results for each language pair with our strategies are in bold. Row II shows the accuracy scores for the test set with the best system submitted for each language pair.
As future work, we would like to improve our results further with new strategies. We would like to experiment with whether adding language-specific processing and resources would improve the results. We are keen to add different neural network architectures like Siamese transformer networks (Reimers and Gurevych, 2019), which perform well in sentence pair classification tasks (Ranasinghe et al., 2019b; Mueller and Thyagarajan, 2016), to the TransWiC framework. Furthermore, we hope to work in a multi-task environment and experiment with whether transfer learning from a similar task like semantic textual similarity (Cer et al., 2017) would improve the results for this task.