hub at SemEval-2021 Task 2: Word Meaning Similarity Prediction Model Based on RoBERTa and Word Frequency

This paper describes the system of the hub team and presents the related work and experimental results of our participation in SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC). The data for this shared task consists mainly of multilingual and cross-lingual sentence pairs, covering English, Chinese, French, Russian, and Arabic. The goal is to judge whether the same word has the same meaning in both sentences of a pair, which can be treated as binary classification of sentence pairs. Our model mainly combines RoBERTa with the Tf-Idf algorithm. Submissions are evaluated with the F1 score. We participated only in the English task, and the final score of our submitted test set predictions was 84.60.


Introduction and Background
With the continuous development of science and technology, we now live in an era of massive data. Manual methods are no longer practical for processing and retrieving text data, especially when comparing and computing semantic differences at the word level, so automatic processing of text by machines has become the natural choice. Methods for detecting semantic similarity (Resnik, 1995; Miller and Charles, 1991) and for evaluating it (Sánchez et al., 2012) have been widely discussed. They have found concrete applications in natural language processing and information retrieval, such as sentiment analysis (Araque et al., 2019), medical disease similarity queries (Mathur and Dinakarpandian, 2012), and text question answering (Mohler and Mihalcea, 2009).
Like humans, who rely on context to work out what a word means in different sentences, machines and algorithms need to make their predictions based on context. Methods that generate a single fixed vector for each word, such as Word2Vec (Mikolov et al., 2013), are therefore not well suited to such tasks. Because text is sequential, extracting contextual information and using it as model input gives the model richer and more accurate information, for example when dealing with polysemous and synonymous words. ELMo (Peters et al., 2018), which is based on LSTM (Shi et al., 2015), overcomes the difficulty that the model cannot learn the context. ELMo dynamically adjusts word embeddings according to the context, so it can resolve ambiguity. However, using a bidirectional LSTM as the feature extractor makes its training time and feature extraction quality unsatisfactory. In follow-up work, the Transformer (Vaswani et al., 2017) introduced a new and better feature extractor. BERT (Devlin et al., 2019), which is based on the Transformer encoder (Vaswani et al., 2017), achieved the best results in many NLP tasks.
We participated in the English task of SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC). The task is to predict whether a word with the same part of speech has the same meaning in both sentences of a pair (Martelli et al., 2021). Inspired by the work of Chen et al. on predicting the influence of context on word similarity, we use a method based on RoBERTa (Liu et al., 2019) and Tf-Idf (Ramos et al., 2003) to complete the task. We also tried combining ALBERT (Lan et al., 2020) and BERT (Devlin et al., 2019) with Tf-Idf to observe their performance on the English data set. We describe our methods and experiments in detail in Sections 2 and 3. Our model code is publicly available for reference (see footnote 1).

Data and Methods
This section introduces the data used in the task and our models and methods.

Data Description
The task organizers provide each team with training, validation, and test data sets for the "Multilingual and Cross-lingual Word-in-Context Disambiguation" task. Because we only successfully submitted test set predictions for the English task, we discuss only the English data set here. The training and validation data sets each consist of two parts. The first part contains the ID, the lemma of the target word, the part of speech of the target word, the sentence pair, and the position index of the target word in each sentence. The target is usually a single word, and it has the same part of speech in both sentences. The second part is the tag that indicates whether the target word has the same meaning in both sentences: "True" if the meaning is the same, and "False" otherwise. The two sentences of a pair generally differ in length. Unlike the training and validation data sets, the test set contains only the first part, and we use our method to predict whether the target word has the same meaning in both sentences of each test pair. Table 1 shows a sample of the sentence pair data we used in the task.
1 https://github.com/Hub-Lucas/hub-at-task2
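As an illustration of the two-part layout described above, the following sketch models one instance together with its optional label. The class name, the field names, and the example sentences are hypothetical and chosen for readability; they are not the official field names or data of the task files.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class WicInstance:
    """One sentence pair with a shared target word (hypothetical field names)."""
    instance_id: str
    lemma: str                            # lemma of the target word
    pos: str                              # part of speech of the target word
    sentence1: str
    sentence2: str
    span1: Tuple[int, int]                # character span of the target in sentence1
    span2: Tuple[int, int]                # character span of the target in sentence2
    same_meaning: Optional[bool] = None   # None for unlabeled test instances

    def target_forms(self) -> Tuple[str, str]:
        """Return the surface forms of the target word in both sentences."""
        (s1, e1), (s2, e2) = self.span1, self.span2
        return self.sentence1[s1:e1], self.sentence2[s2:e2]


# Invented example, not taken from the task data.
example = WicInstance(
    instance_id="example.0",
    lemma="bank",
    pos="NOUN",
    sentence1="She sat on the bank of the river.",
    sentence2="He deposited cash at the bank.",
    span1=(15, 19),
    span2=(25, 29),
    same_meaning=False,
)
```

A test instance would simply leave `same_meaning` unset, which mirrors the fact that the test set contains only the first part.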
The training set and the validation set contain 8000 and 1000 instances, respectively. Both are balanced, with 50% "True" labels and 50% "False" labels. The test set contains 1000 instances. Because our method uses word frequency information, we visualize the text data in the training set and in the test set with word clouds, which clearly show the word frequency distribution of the text. Figure 1 and Figure 2 show the word frequency information in the training set, validation set, and test set.
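The word frequency information that a word cloud visualizes can be obtained with a simple token count. The sketch below is only illustrative: the regular-expression tokenization, the function name `word_frequencies`, and the two sample sentences are assumptions, not the procedure or data behind Figures 1 and 2.

```python
import re
from collections import Counter


def word_frequencies(sentences):
    """Count lowercase alphabetic tokens across a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z]+", sentence.lower()))
    return counts


# Invented sample sentences, not taken from the task data.
sample = [
    "The bank raised the interest rate.",
    "They walked along the river bank.",
]
freq = word_frequencies(sample)
top_word, top_count = freq.most_common(1)[0]
```

The resulting counts could then be passed to any word cloud tool; the most frequent entries here are function words, which is one motivation for removing stop words before computing Tf-Idf.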

Methods
Based on our analysis of the task description and the task data set, we chose to develop a system based on RoBERTa and Tf-Idf. We also tried combining ALBERT (Lan et al., 2020) and BERT (Devlin et al., 2019) with Tf-Idf to check their performance on the validation set. Thanks to the attention mechanism, the Transformer has achieved good results on many natural language processing tasks, and BERT, ALBERT, and RoBERTa are all improvements built on the Transformer architecture. Compared with BERT, ALBERT has fewer parameters and shares parameters across layers (Lan et al., 2020; Devlin et al., 2019), so it needs less memory and training time than BERT. Unlike BERT, RoBERTa (Liu et al., 2019) drops the next sentence prediction task during pre-training and uses a dynamic masking mechanism. RoBERTa is also pre-trained for longer, with a larger batch size and a larger pre-training corpus (Liu et al., 2019).
Our system works in six steps. First, the pre-processed data is fed to both RoBERTa and Tf-Idf. Second, we take the output of the last layer of RoBERTa (RoBERTa output) and the output of Tf-Idf (Tf-Idf output). Third, we use the Tf-Idf output to weight the RoBERTa output, which gives what we call the RoBERTa weighted output. Fourth, we concatenate the RoBERTa output and the RoBERTa weighted output. Fifth, the concatenated result is passed to a classifier, which outputs the model's predictions. Finally, the predictions are converted into the format required by the task organizers.
The RoBERTa output has shape [batch size, max sequence length, hidden size], and the Tf-Idf output has shape [batch size, max sequence length]. Equations 1-3 describe the weighting operation. Here [RoBERTa Output]_i is the i-th item in the batch of the RoBERTa output, and it is multiplied by the corresponding Tf-Idf output to give [RoBERTa Weighted Output]_i. In Equation 3, i is an integer ranging from 0 to the batch size. Computing [RoBERTa Weighted Output]_i for every i yields [RoBERTa Weighted Output], whose shape is the same as that of the RoBERTa output. Figure 3 shows the model structure and data flow of RoBERTa combined with Tf-Idf.
Table 2: F1 scores obtained on the validation set using different models. The validation set is provided by the task organizers.
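One plausible reading of the weighting and concatenation steps can be sketched with NumPy. Only the tensor shapes are stated in the text, so the element-wise scaling of each token vector by its Tf-Idf score and the concatenation along the hidden dimension are assumptions, as are the function name `tfidf_weighted_concat` and the toy sizes.

```python
import numpy as np


def tfidf_weighted_concat(roberta_output, tfidf_output):
    """Weight token vectors by Tf-Idf scores and concatenate with the originals.

    roberta_output: [batch_size, max_seq_len, hidden_size]
    tfidf_output:   [batch_size, max_seq_len]
    returns:        [batch_size, max_seq_len, 2 * hidden_size]
    """
    # Assumed weighting: scale each token vector by its Tf-Idf score.
    # The weighted tensor keeps the shape of roberta_output.
    weighted = roberta_output * tfidf_output[:, :, np.newaxis]
    # Assumed concatenation axis: the hidden dimension.
    return np.concatenate([roberta_output, weighted], axis=-1)


rng = np.random.default_rng(0)
roberta_output = rng.normal(size=(2, 4, 3))   # toy sizes, not the real model
tfidf_output = rng.uniform(size=(2, 4))
features = tfidf_weighted_concat(roberta_output, tfidf_output)
print(features.shape)  # (2, 4, 6)
```

Under this reading, padded positions with a Tf-Idf score of zero contribute a zero vector to the weighted half of the features, while the unweighted half is passed through unchanged.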

Experiment and Results
This section introduces the data preprocessing methods and experimental settings we used in the task, followed by the final results.

Data Preprocessing
Following our analysis in the data description section, we remove stop words from the sentence pairs, using the stop word list of the stopwords package provided by NLTK. To weight the RoBERTa output with Tf-Idf, the text encoding produced by the Tf-Idf algorithm must have the same shape as the RoBERTa output. We therefore truncate encodings that exceed the maximum sentence length and zero-pad encodings that are shorter than it. The Tf-Idf encoding is obtained with the toolkit provided by gensim (Řehůřek and Sojka, 2010).
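The preprocessing described above can be sketched in plain Python. This is a simplified stand-in, not the actual pipeline: the tiny `STOP_WORDS` set replaces the NLTK list, the hand-written `tfidf_sequences` function (term frequency times log inverse document frequency) replaces the gensim toolkit, and the sample sentences and `max_len=5` are invented for the demonstration.

```python
import math
from collections import Counter

# Tiny stand-in for the NLTK English stop word list (assumption for the demo).
STOP_WORDS = {"the", "a", "an", "of", "on", "at", "in", "and"}


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop word set."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def tfidf_sequences(documents):
    """Return one Tf-Idf score per token position for each tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    sequences = []
    for doc in documents:
        term_freq = Counter(doc)
        sequences.append(
            [term_freq[t] / len(doc) * math.log(n_docs / doc_freq[t]) for t in doc]
        )
    return sequences


def pad_or_truncate(scores, max_len):
    """Cut scores beyond max_len and zero-pad shorter sequences."""
    return scores[:max_len] + [0.0] * max(0, max_len - len(scores))


docs = [
    remove_stop_words("She sat on the bank of the river".split()),
    remove_stop_words("He deposited cash at the bank".split()),
]
encoded = [pad_or_truncate(s, max_len=5) for s in tfidf_sequences(docs)]
```

After this step every sequence has the same length, so the scores can be stacked into a [batch size, max sequence length] array that lines up with the token dimension of the RoBERTa output.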
For the model input, we use the [SEP] symbol to separate the two sentences of a pair, and then use the [SEP] symbol again to append the lemma that appears in each sentence. Note that the three models used in our experiments, BERT, ALBERT, and RoBERTa, use different special symbols. For convenience of description, we write [CLS] and [SEP] for all of them.
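A minimal sketch of this input layout is given below. The exact ordering of the segments is an assumption (sentence 1, sentence 2, then the target word from each sentence), and the function name `build_model_input` and the sample sentences are hypothetical.

```python
def build_model_input(sentence1, sentence2, target1, target2,
                      cls_token="[CLS]", sep_token="[SEP]"):
    """Join a sentence pair and its target words with separator tokens."""
    parts = [sentence1, sentence2, target1, target2]
    return f"{cls_token} " + f" {sep_token} ".join(parts) + f" {sep_token}"


# Invented example, not taken from the task data.
text = build_model_input(
    "She sat on the bank of the river.",
    "He deposited cash at the bank.",
    "bank",
    "bank",
)
```

In practice a model's tokenizer usually inserts its own special tokens. RoBERTa, for instance, uses `<s>` and `</s>` in place of [CLS] and [SEP], which can be mimicked here by passing `cls_token="<s>"` and `sep_token="</s>"`.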

Experiment Setting
As introduced in the previous section, we evaluate four different models on the validation set of this task. We tune the parameters of each model to get the best result we can, so different models use different parameter settings.
• ALBERT+Tf-Idf: The number of epochs, batch size, maximum sequence length, and learning rate are 6, 32, 150, and 3e-5, respectively.
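For illustration, the ALBERT+Tf-Idf setting above can be written as a small configuration object. The class name `TrainingConfig` and its field names are hypothetical; only the four values come from the text. The step counts are simple arithmetic on the 8000 training instances reported in the data description.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameters of one model (hypothetical container)."""
    epochs: int
    batch_size: int
    max_seq_length: int
    learning_rate: float


# Values reported above for ALBERT+Tf-Idf.
albert_tfidf = TrainingConfig(
    epochs=6, batch_size=32, max_seq_length=150, learning_rate=3e-5
)

# Number of optimizer steps implied by 8000 training instances.
TRAIN_SIZE = 8000
steps_per_epoch = math.ceil(TRAIN_SIZE / albert_tfidf.batch_size)
total_steps = steps_per_epoch * albert_tfidf.epochs
```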

Results
The final evaluation metric is the F1 score, so we also use the F1 score to compare the models in the experimental phase. All models are evaluated on the same validation set. Comparing the combinations of ALBERT and BERT with Tf-Idf against the combination of RoBERTa with Tf-Idf shows that the RoBERTa combination achieves a better F1 score. RoBERTa+Tf-Idf also achieves a better F1 score than RoBERTa alone, which supports the feasibility and effectiveness of our method. The results are listed in Table 2.
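As a reminder of how the metric is computed, the sketch below calculates a binary F1 score from gold and predicted labels. It assumes that "True" is the positive class; the text does not state whether the reported F1 is binary or averaged over classes, and the labels in the example are invented.

```python
def f1_score(gold, predicted, positive=True):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented labels: 2 true positives, 1 false positive, 1 false negative.
gold = [True, True, False, False, True]
pred = [True, False, False, True, True]
score = f1_score(gold, pred)  # precision = recall = 2/3, so F1 = 2/3
```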
The predictions we finally submitted for the English test set were produced by RoBERTa+Tf-Idf. There is still a gap between our F1 score and those of the top three teams on the English data. Our F1 score is in the middle of all submitted results, and our final rank is 49th. The results are listed in Table 3.

Conclusion
This paper proposes a model that combines RoBERTa and Tf-Idf to determine whether the target words in English sentence pairs have the same meaning. We presented our analysis of the data, the methods used, and the experimental results in Sections 2 and 3. We compared ALBERT, BERT, and RoBERTa in combination with Tf-Idf, and the experimental results show that RoBERTa+Tf-Idf performs best among the models we tried. In future work, we will improve our method to get better results. For example, other types of word embedding vectors could be introduced into the model, and the weighting and vector fusion methods could also be improved.