GX at SemEval-2021 Task 2: BERT with Lemma Information for MCL-WiC Task

This paper presents the GX system for the Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC) task. The MCL-WiC task aims to capture the polysemous nature of words, without relying on a fixed sense inventory, in multilingual and cross-lingual settings. To address this, we use context-specific word embeddings from BERT to resolve the ambiguity between words in different contexts. For languages without an available training corpus, such as Chinese, we use a neural machine translation model to translate the English data released by the organizers and obtain usable pseudo-data. We apply our system to the English and Chinese multilingual settings, and the experimental results show that our method performs competitively.


Introduction
In recent years, contextual embeddings have drawn much attention. Approaches to computing contextual embeddings include multi-prototype, sense-based, and contextualized embeddings (Camacho-Collados and Pilehvar, 2018). However, it is not easy to evaluate such diverse embedding methods in one framework. Pilehvar and Camacho-Collados (2019) present a large-scale Word-in-Context dataset that focuses on the dynamic semantics of words. Following and extending this work, the MCL-WiC task (Martelli et al., 2021) is a binary classification task: decide whether a target word is used with the same or a different meaning in the same language (multilingual dataset) or across different languages (cross-lingual dataset). It is also the first SemEval task for Word-in-Context disambiguation (Martelli et al., 2021). Our code is available at https://github.com/yingwaner/bert4wic. A typical solution to this problem is to obtain context-specific word embeddings, as in Context2vec (Melamud et al., 2016) and Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2019). BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on left and right context in all layers. Due to its strong performance and easy deployment, we use BERT as our base system and fine-tune it on the training data released by the organizers to obtain context-specific word embeddings.
In this paper, we participate in the multilingual sub-task for English and Chinese. The organizers only provide English training data, on which we fine-tune a pre-trained English BERT model. For the Chinese task, where no training set is available, we train a neural machine translation (NMT) model of satisfactory quality to translate the English training set into Chinese, and then fine-tune a Chinese BERT model on the resulting pseudo-data. The experimental results show that our method achieves 82.7% accuracy in the English multilingual setting and 76.7% in the Chinese multilingual setting.

Background
In this section, we briefly introduce the Word-in-Context task and the structure of BERT for the sentence-pair classification task.

Word-in-Context
The MCL-WiC task (Martelli et al., 2021) extends the Word-in-Context (WiC) task (Pilehvar and Camacho-Collados, 2019) to multilingual and cross-lingual settings. In WiC, each instance consists of a target word lemma and two contexts that contain it; each context triggers a specific meaning of the lemma. The task is to identify whether the lemma corresponds to the same meaning in the two contexts, a question that has been widely investigated in recent years. Wiedemann et al. (2019) apply word sense disambiguation models with contextualized representations. Hu et al. (2019) show that the supervised derivation of time-specific sense representations is useful. Giulianelli et al. (2020) present an unsupervised approach to lexical-semantic change that makes use of contextualized word representations. Loureiro and Jorge (2019) compute sense embeddings and their relations in a lexical knowledge base. Scarlini et al. (2020) drop the need for sense-annotated corpora in order to collect contextual information for the senses in WordNet.

Figure 1: Overall fine-tuning procedure for our system. The token Lemma is the target word for which the system must judge whether it has the same meaning in the two sentences.

BERT
Neural contextualized lexical representations have been widely used in natural language processing, benefiting from deep learning models that optimize tasks while learning usage-dependent representations, such as ULMFiT (Howard and Ruder, 2018), ELMo (Peters et al., 2018), GPT (Radford et al., 2018, 2019), and BERT (Devlin et al., 2019). BERT is pre-trained with two unsupervised tasks: the masked LM task, which masks some percentage of the input tokens at random and then predicts those masked tokens, and the next sentence prediction task, which predicts whether the second sentence in a sentence pair actually follows the first. In the fine-tuning phase, task-specific inputs and outputs are plugged into BERT and all parameters are fine-tuned end-to-end.
The architecture of BERT is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017). The BERT model stacks several encoder layers; each Transformer encoder layer consists of a multi-head self-attention sub-layer and a position-wise feed-forward network. There are specialized input and output formats for different downstream tasks. For the sentence-pair classification task, the input format is [CLS] + Sentence 1 + [SEP] + Sentence 2 + [SEP]. At the output layer, the [CLS] representation is fed into a classification layer, which handles tasks such as entailment, sentiment analysis, and the Word-in-Context disambiguation task.
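As an illustration, the sentence-pair input described above can be sketched as follows. This is a minimal sketch with our own toy tokens and function names, not the actual BERT tokenizer:

```python
def build_pair_input(tokens_a, tokens_b):
    """Assemble BERT's sentence-pair input: [CLS] S1 [SEP] S2 [SEP]."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS] + sentence 1 + its [SEP]; segment 1 covers the rest.
    segments = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segments

tokens, segments = build_pair_input(["he", "ran"], ["she", "walked", "home"])
# tokens   → ['[CLS]', 'he', 'ran', '[SEP]', 'she', 'walked', 'home', '[SEP]']
# segments → [0, 0, 0, 0, 1, 1, 1, 1]
```

In the real model the tokens are WordPiece ids and the segment ids select BERT's segment (token-type) embeddings.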

System Overview
The systems we propose for the English and Chinese multilingual settings are both based on the BERT model (Devlin et al., 2019) with task-specific input modifications. We participate in the multilingual setting and divide the system into two parts by language: the English setting and the Chinese setting.

English Setting
Following Devlin et al. (2019), we initialize our model with a pre-trained model that has been trained on a large-scale dataset and has acquired general knowledge. We then fine-tune the model on the English sentence pairs released by the organizers.

Model Architecture
The model architecture in the fine-tuning stage is shown in Figure 1. On the basis of the original BERT input, a lemma token is added: the target word for which the system must judge whether it has the same meaning in the sentence pair. For instance, given the sentence pair 'They opposite to the policies designed to tackle inflation.' and 'I tackled him about his heresies.', the input format is:

[CLS] They opposite to the policies designed to tackle inflation . [SEP] I tackled him about his heresies . [SEP] tackle [SEP]

where tackle is the lemma token, i.e., the word for which the system must judge whether it has the same meaning in the two sentences. In this way, we emphasize the target word so that the output T_[CLS] of the output layer can express whether the lemma token is synonymous in the two sentences.

Figure 2: Our system input representation. The input embeddings are the sum of the token embeddings, the segment embeddings, the position embeddings, and the lemma embeddings. The lemma embedding is E_T only at the positions of the lemma tokens. In this example, we assume that Tok N in sentence 1 and Tok 1 in sentence 2 are also lemma tokens.
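A minimal sketch of this modified input construction (function name and toy tokenization are ours; the real system operates on WordPiece tokens):

```python
def build_wic_input(tokens_a, tokens_b, lemma):
    """Standard BERT pair input, extended with a trailing lemma segment:
    [CLS] S1 [SEP] S2 [SEP] lemma [SEP]."""
    return (["[CLS]"] + tokens_a + ["[SEP]"]
            + tokens_b + ["[SEP]", lemma, "[SEP]"])

tokens = build_wic_input(
    ["They", "opposite", "to", "the", "policies",
     "designed", "to", "tackle", "inflation", "."],
    ["I", "tackled", "him", "about", "his", "heresies", "."],
    "tackle",
)
# The sequence ends with the emphasized target word: ... [SEP] tackle [SEP]
```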
Input Representation We also modify the input representation, which, following Devlin et al. (2019), is the sum of the corresponding token, segment, and position embeddings. The input representation of our system is shown in Figure 2. We set the segment embedding of the final lemma token to E_C to further emphasize the importance of the target word in the whole sentence pair. Moreover, we introduce lemma embeddings into the input representation. Lemma embeddings are similar to segment embeddings, but whereas segment embeddings distinguish sentence 1 from sentence 2, lemma embeddings distinguish the positions of lemma tokens from those of other tokens. Only the lemma occurrences in sentence 1 and sentence 2 and the final lemma token Tok L are marked E_T; all other positions are marked E_F. That is, for a training example there are three E_T markers in the lemma embeddings. In this way, we strengthen the connection between the lemma tokens and at the same time highlight their positions and importance, so that the final output captures enough lemma-token information.
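The four-way sum of embeddings can be sketched as follows. The tables here are deterministic toy values standing in for learned parameters, and all names and dimensions are ours; the point is only that the lemma table contributes E_T at lemma positions and E_F elsewhere:

```python
HIDDEN = 4

def vec(seed):
    # Deterministic toy embedding vector (stand-in for a learned parameter row).
    return [(seed * 31 + i) % 7 - 3 for i in range(HIDDEN)]

def add(*vectors):
    return [sum(components) for components in zip(*vectors)]

# Toy embedding tables (learned in the real model).
token_emb    = {tok: vec(i) for i, tok in enumerate(["[CLS]", "tackle", "[SEP]"])}
segment_emb  = {seg: vec(100 + seg) for seg in (0, 1, 2)}   # id 2 plays the role of E_C
position_emb = [vec(200 + p) for p in range(16)]
lemma_emb    = {False: vec(300), True: vec(301)}            # True = E_T, False = E_F

def input_representation(tokens, segments, lemma_flags):
    # Per position: token + segment + position + lemma embedding.
    return [add(token_emb[t], segment_emb[s], position_emb[p], lemma_emb[f])
            for p, (t, s, f) in enumerate(zip(tokens, segments, lemma_flags))]

x = input_representation(["[CLS]", "tackle", "[SEP]"], [0, 0, 0], [False, True, False])
```

Only the boolean lemma flags differ from the standard BERT input pipeline; everything else is the usual embedding sum.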

Chinese Setting
The multilingual setting in Chinese is more difficult because no Chinese training data is available, so the pre-trained BERT model cannot be directly fine-tuned. To solve this problem, we introduce a neural machine translation method.
Neural Machine Translation Due to the superior performance of the Transformer, we use it as our neural machine translation model. We first train an English-to-Chinese translation model on an open-source dataset and evaluate its performance to ensure sufficient translation quality. Then, we use this model to translate sentence 1 and sentence 2 of each instance in the English training set released by the organizers into Chinese, and regard the generated sentences as training data for the Chinese MCL-WiC task. Finally, we fine-tune the pre-trained Chinese BERT model on this generated dataset to obtain our final model.
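The pseudo-data construction can be sketched as below. Here `translate` is a placeholder for the English-to-Chinese Transformer model described above, which we do not reproduce; only the label carry-over logic is real:

```python
def translate(sentence_en):
    # Placeholder assumption: in the real system this calls the
    # English-to-Chinese Transformer NMT model.
    return "<zh> " + sentence_en

def build_pseudo_data(english_examples):
    """Translate each sentence of each instance independently;
    the gold same/different-meaning label carries over unchanged."""
    pseudo = []
    for sentence1, sentence2, label in english_examples:
        pseudo.append((translate(sentence1), translate(sentence2), label))
    return pseudo

data = build_pseudo_data(
    [("He tackled inflation.", "She tackled him.", "F")]
)
```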

Model Architecture
The system in the Chinese setting differs from the English one in that there is no lemma token. We use machine translation to convert the English training data into Chinese; because every token in a sentence has context, sentence-to-sentence translation does not change the meaning of the whole sentence much. However, a lemma token has no context, so it is difficult for the translation model to choose which target-language token to translate it into, since it may correspond to multiple meanings. Therefore, the final system submitted to the task has no lemma token, no segment embedding E_C, and no lemma embeddings. Nevertheless, in order to analyze the role of lemma tokens in this multilingual setting, we also report results with the lemma token and segment embedding in Table 1 and Section 5.1. In this case, the lemma token is translated into the most common Chinese word.

Table 1: Main results on the English and Chinese tasks. The measure is accuracy (%). The '+' in a system name indicates a module added on top of the system in the previous row. ∆ is the difference between the result of the current system and that of the previous row.

Experimental Setup
In this section, we will describe the experimental settings for English and Chinese in detail.

English Setting
We take a pre-trained English cased BERT base model 2 with 12 layers, hidden size 768, 12 attention heads, and 110M parameters, and fine-tune it on the English training data for 5 epochs with a batch size of 16 and a maximum sequence length of 128. The dropout rate is 0.2, and other settings follow Devlin et al. (2019).
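For reference, the hyperparameters above collected into a single configuration (a convenience sketch; the key names are ours, not from any particular training framework):

```python
# Fine-tuning hyperparameters for the English setting, as reported in the text.
finetune_config = {
    "model": "BERT base, English cased",  # 12 layers, hidden 768, 12 heads, 110M params
    "epochs": 5,
    "batch_size": 16,
    "max_seq_length": 128,
    "dropout": 0.2,
    # All other settings follow Devlin et al. (2019).
}
```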

Chinese Setting
The fine-tuning setup is the same as for English, except that the pre-trained model is a Chinese BERT base 3 with 12 layers, hidden size 768, 12 heads, and 110M parameters. For the machine translation model, we implement the Transformer base model (Vaswani et al., 2017) using the open-source toolkit Fairseq-py (Ott et al., 2019). The English-Chinese training data comes from UNPC v1.0 and MultiUN v1 in WMT17 4, totaling 30.4M sentence pairs. We train the model with dropout = 0.1, using the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.98, and ε = 10^-9. The translations are detokenized and reach a BLEU score of 35.0, evaluated with 4-gram case-sensitive BLEU (Papineni et al., 2002) using the SacreBLEU tool (Post, 2018). 5 This translation model achieves satisfactory results, which shows that translating the English training set into Chinese is a well-founded and feasible approach.
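For intuition, a minimal single-sentence 4-gram BLEU can be sketched as follows. This is a didactic simplification (no smoothing, whitespace tokenization), not the SacreBLEU implementation behind the reported 35.0:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Case-sensitive 4-gram BLEU for one sentence pair, on a 0-100 scale."""
    hyp, ref = hypothesis.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_ng, ref_ng = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(count, ref_ng[g]) for g, count in hyp_ng.items())
        total = max(sum(hyp_ng.values()), 1)
        if matches == 0:
            return 0.0  # any zero n-gram precision zeroes unsmoothed BLEU
        log_prec += math.log(matches / total)
    # Brevity penalty discourages overly short hypotheses.
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return 100.0 * bp * math.exp(log_prec / max_n)
```

A perfect match scores 100.0; a hypothesis sharing no unigrams with the reference scores 0.0.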

Results
In this section, we first report the main results of the English and Chinese multilingual settings and analyze the importance of each factor in the system. Then, we explore the error rate for each part of speech.

Main Results
We conduct our experiments based on the BERT framework 6. The systems in the main experiment are as follows:

Fine-tuning Following the standard fine-tuning format of Devlin et al. (2019), sentences 1 and 2 are connected by [SEP].

+Lemma Token The lemma token is added on top of the previous system; the training input becomes '[CLS] + Sentence 1 + [SEP] + Sentence 2 + [SEP] + Lemma + [SEP]'.

+Segment E_C On top of the previous system, the segment embedding of the final lemma token is set to E_C.

+Lemma Embeddings Lemma embeddings are added on top of the previous system; the input representation now consists of four parts: token, segment, position, and lemma embeddings.
The main results are shown in Table 1, and the analysis is as follows: English Fine-tuning alone already obtains relatively good performance, and performance improves further as the three new modules are introduced. We add them one by one to check their influence on our method. Adding +Lemma Token brings a significant improvement, while the improvements from the other two modules are slightly smaller, indicating that the presence or absence of the lemma token has the greatest impact.
Chinese Fine-tuning alone achieves the best performance on this task. After introducing +Lemma Token, performance drops sharply, which may be because the translated lemma token is not necessarily an appropriate translation: as mentioned in Section 3.2, fine-grained translation of an individual token without context often fails. However, introducing +Segment E_C on top of that slightly improves the result, which supports our idea. Because it is difficult to locate the exact position of the lemma word after translation, there is no result for +Lemma Embeddings. Based on these results, we use plain Fine-tuning as the final system for the Chinese task; in other words, our final model has no lemma token, no segment embedding E_C, and no lemma embeddings.

Error Analysis
Lemma tokens have different parts of speech, so we consider the relationship between the accuracy of the system's predictions and the part of speech of the lemma token. Based on this, we report the accuracy for each part of speech on the test set, as shown in Figure 2.
The error analysis yields similar findings for the English and Chinese tasks. By data volume, the parts of speech on the test set are ordered NOUN, VERB, ADJ, and ADV. This distribution is consistent with the training set, where the counts per part of speech are 4123, 2269, 1429, and 175, also in descending order and in roughly the same proportions. Parts of speech with more data in the training set tend to obtain better performance on the test set, which indicates the importance of data size: more data enables the model to learn more classification knowledge and behavior, which affects the prediction results.
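The per-part-of-speech accuracy used in this analysis can be computed as in the following sketch (the example triples are toy data, not our actual predictions):

```python
from collections import defaultdict

def per_pos_accuracy(examples):
    """examples: iterable of (pos_tag, gold_label, predicted_label) triples.
    Returns prediction accuracy grouped by part of speech."""
    correct, total = defaultdict(int), defaultdict(int)
    for pos, gold, pred in examples:
        total[pos] += 1
        correct[pos] += int(gold == pred)
    return {pos: correct[pos] / total[pos] for pos in total}

acc = per_pos_accuracy([
    ("NOUN", "T", "T"),
    ("NOUN", "F", "T"),  # one noun instance misclassified
    ("VERB", "T", "T"),
])
```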

Impact of Dropout
To analyze the importance of dropout, we conduct experiments with different dropout rates on both the English and Chinese test sets; the results are shown in Figure 3. The performance of both tasks increases with the dropout rate, reaching its best when the dropout rate equals 0.2. As the dropout rate continues to increase, performance deteriorates, which indicates that dropping too many units may make the model difficult to converge.
Besides, dropout behaves differently depending on data quality. In general, a real corpus (English) should be of better quality than a pseudo corpus (Chinese). Accordingly, performance across different dropout rates is relatively stable on the high-quality real corpus, with a gap of less than 1%, while it fluctuates more on the pseudo corpus, with a gap of 3%.

Conclusion
In this paper, we describe the GX system participating in the MCL-WiC task. To obtain general background knowledge, we use a pre-trained BERT model and fine-tune it on the data released by the organizers. To further emphasize the relationship between the sentence pair and the importance of the lemma, we introduce three new factors: the lemma token, the lemma segment embedding, and lemma embeddings, which yield better results. Our system reaches 82.7% accuracy in the English multilingual setting and 76.7% in the Chinese multilingual setting.