RG PA at SemEval-2021 Task 1: A Contextual Attention-based Model with RoBERTa for Lexical Complexity Prediction

In this paper we propose a contextual attention-based model with two-stage fine-tuning of RoBERTa. In the first stage, we fine-tune RoBERTa on the task corpus so that the model learns some prior domain knowledge. We then obtain a contextual embedding of the context words from the token-level embeddings of the fine-tuned model. We use K-fold cross-validation to train K models and ensemble them to obtain the final result. Our system attains 2nd place in the final evaluation phase of sub-task 2 with a Pearson correlation of 0.8575.


Introduction
Lexical Complexity Prediction (LCP) is an augmented version of Complex Word Identification (CWI) (Shardlow et al., 2020): the task is to predict a complexity score for each target word in a sentence. The dataset is a multi-domain English dataset annotated with a 5-point Likert scale (1-5). The annotation model in CompLex treats complexity as a continuum rather than a binary feature. Previous studies of CWI treat the task as binary classification, predicting a complexity label (complex vs. non-complex) for a set of target words in a sentence.
In this paper, we carried out exploratory data analysis (EDA) on the dataset and found that the distributions of the single task and multi task datasets differ considerably, so building a separate model for each dataset works better than building one model on the merged data of the two tasks.
The key techniques are as follows:
• Train a RoBERTa-based corpus classifier by fine-tuning. It uses the data of all single and multi *
• At each layer, compute the attention between the target vector and the context token embeddings; the layer context vector is the average of the context token embeddings under soft alignment.
• Weight the context vectors and target vectors of the last 12 RoBERTa layers. Both use the same weights, which sum to 1.
• A degenerate form of gradual unfreezing (Howard and Ruder, 2018): in the first epoch, freeze the pretrained model parameters and learn only the head layer parameters, then unfreeze all model parameters.
• Multi-Sample Dropout at the last layer (Inoue, 2019).

Background
Previous approaches to CWI typically treat the task as binary identification of complex words. Two shared tasks on CWI have been organized so far: SemEval-2016 Task 11 (Paetzold and Specia, 2016) and the BEA workshop 2018 shared task (Yimam et al., 2018). Participants in the two tasks applied a range of classification models, from traditional machine learning classifiers such as support vector machines (SVM), decision trees, random forests, and maximum entropy classifiers to deep learning classifiers such as recurrent neural networks. A wide range of features was used, such as word embeddings, word and character n-grams, word frequency, Zipfian frequency distribution, word length, and morphological, syntactic, semantic, and psycholinguistic features. BERT, which stands for Bidirectional Encoder Representations from Transformers (Devlin et al., 2018), is a language representation model. Since BERT appeared, pre-trained models have been fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks. A number of Transformer-based models have been proposed, such as GPT-2, RoBERTa (Liu et al., 2019), XLM, DistilBERT, and XLNet. In this paper we focus on using RoBERTa to solve the tasks.

System overview
We propose a RoBERTa model with attention to solve the LCP task; Figure 1 outlines the proposed framework. First, we tokenize the input sentence with Byte-Pair Encoding (BPE), an effective subword technique that relieves the out-of-vocabulary (OOV) problem. For every RoBERTa hidden layer, we apply token pooling, averaging the embeddings of the target word's tokens to obtain the target vector. We then compute the attention between the target vector and the context token vectors, which are the sentence tokens with the target tokens masked out. The context token embeddings are multiplied by the attention weights to give the context vector, which is concatenated with the target vector. Second, we pool the context vectors and target vectors of the last 12 RoBERTa layers. Finally, an MLP layer predicts the LCP score.

Pooling
The input sentence is tokenized into $n$ tokens $t_i$, $i = 1, 2, \ldots, n$. The target token indices are $l_t = \{k, k+1, \ldots, m\}$ and the context token indices are $l_c = \{1, \ldots, k-1, m+1, \ldots, n\}$, which exclude the target tokens. $E^j_i$ denotes the embedding of the $i$-th token at hidden layer $j$, and $T^j$ denotes the target vector of hidden layer $j$.
The attention weight between the context token embeddings and the target vector of layer $j$ is computed by soft alignment.
We then compute the weighted sum of the context token embeddings to obtain the layer context vector $c^j$. Finally, we calculate the pooled target vector $t$ and context vector $c$ as weighted sums of the target vectors and context vectors of the last $N$ RoBERTa layers; the weights $w_1, w_2, \ldots, w_N$ are model parameters and sum to 1.
We use 5-fold cross-validation. We first generate a new feature, the score bin, by binning the LCP score by quantile. A dev dataset is commonly used to search for optimal hyperparameters; in this experiment we would use it only to find the best epoch, to prevent overfitting by early stopping. However, we found that the pretrained model converges on the task dataset after only 5-6 epochs, so there is no need to set aside a separate dev dataset, and we simply let the dev dataset be the same as the test dataset. The train and test datasets are split with stratified K-fold on the domain corpus and score bin features.
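To make the pooling steps of this section concrete, the following is a minimal sketch in plain Python rather than the authors' implementation. It assumes dot-product scoring followed by a softmax for the attention weights (the scoring function is not given in the text), and it assumes that the layer weights are obtained by a softmax over free parameters so that they sum to 1. The helper names `layer_pooling` and `combine_layers` are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_pooling(embeddings, target_idx):
    """Return (target vector T, context vector c) for one hidden layer.

    embeddings: list of token embeddings E_i of this layer
    target_idx: set of positions l_t occupied by the target tokens
    """
    # Token pooling: average the embeddings of the target tokens.
    target = mean_vector([embeddings[i] for i in sorted(target_idx)])
    # Context tokens are all tokens outside the target span (l_c).
    context = [e for i, e in enumerate(embeddings) if i not in target_idx]
    # ASSUMPTION: dot-product scores; the paper's score function is not shown.
    scores = [sum(a * b for a, b in zip(e, target)) for e in context]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the context embeddings.
    dim = len(target)
    ctx = [sum(w * e[d] for w, e in zip(weights, context)) for d in range(dim)]
    return target, ctx

def combine_layers(layer_vectors, layer_logits):
    """Weighted sum of per-layer vectors with weights that sum to 1."""
    # ASSUMPTION: a softmax over learnable logits enforces the sum-to-1 rule.
    w = softmax(layer_logits)
    dim = len(layer_vectors[0])
    return [sum(wj * v[d] for wj, v in zip(w, layer_vectors)) for d in range(dim)]
```

In the model, `layer_pooling` would be applied to each of the last N hidden layers, and `combine_layers` would be called once for the target vectors and once for the context vectors with the same logits, since both use the same weights.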

Corpus information
The dataset gives the domain corpus of each sentence, but how should this information be used? We first built a multi-task learning model whose auxiliary task is corpus classification, using the average CLS token embedding of the last 12 layers. However, the auxiliary task did not improve the LCP task, and the accuracy of the corpus classifier was quite low. This is inconsistent with what we would expect, because the sentences come from the Bible, Europarl, and Biomedical corpora, which are very easy to distinguish.
We therefore built a separate corpus classification model by fine-tuning RoBERTa (Sun et al., 2019). Because the corpora are easy to classify, the model needs only 1 epoch of training to reach 0.99 accuracy. We merge the trial, train, and test datasets of the single and multi tasks into a new training dataset, so that the model sees all the data, including the test dataset. After training, RoBERTa has learned the domain knowledge and has learned part of the test dataset in advance.
We then export the RoBERTa model as the pretrained model for the LCP task.

Single LCP Task
We first merge the single train and trial datasets and apply the stratified 5-fold split, then compare the original pretrained model (RoBERTa-large) with the model fine-tuned on corpus classification (pre-RoBERTa-large).
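The stratified 5-fold split, which stratifies on the domain corpus and on a quantile bin of the complexity score, could look like the following minimal sketch. It is an illustration under assumptions, not the authors' code: the number of bins (`n_bins=5`) is not stated in the paper, the binning is rank-based (so tied scores may fall into adjacent bins), and the function names are hypothetical.

```python
import random
from collections import defaultdict

def quantile_bins(scores, n_bins=5):
    """Assign each score to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [0] * len(scores)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(scores)
    return bins

def stratified_folds(corpora, scores, k=5, n_bins=5, seed=0):
    """Return one fold id (0..k-1) per example, stratified on (corpus, score bin)."""
    bins = quantile_bins(scores, n_bins)  # ASSUMPTION: n_bins is illustrative
    groups = defaultdict(list)
    for i, key in enumerate(zip(corpora, bins)):
        groups[key].append(i)
    rng = random.Random(seed)
    folds = [0] * len(scores)
    offset = 0
    for key in sorted(groups):
        members = groups[key]
        rng.shuffle(members)
        # Deal the members of each stratum round-robin across the folds;
        # carrying the offset over keeps the overall fold sizes balanced.
        for j, i in enumerate(members):
            folds[i] = (offset + j) % k
        offset += len(members)
    return folds
```

For fold `f`, the examples with `folds[i] == f` form the held-out part and the remaining examples form the training part.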
For training, we use the mean squared error (MSE) loss function and the Adam optimizer (Kingma and Ba, 2014). In the first epoch we freeze the RoBERTa parameters and train only the head layers. We apply a linear learning rate schedule with warmup (lr = 2e-5, warmup steps = 200) and use early stopping. Table 2 shows the single task results; the metric is Pearson correlation (R). The fold-x column is the metric of the CV model evaluated on the fold-x data, and the mean column is the average of the fold-* columns. RoBERTa-large fine-tuned on corpus classification slightly outperforms the original RoBERTa-large. The single model result is the average of the predictions of all 5 fold models on the single task test data, and the model result column is the metric of that result. The task final result is the average of all model results, and the final result column is the metric of that result. The two models achieve 0.7586 and 0.7618, while a simple average ensemble reaches 0.7629, so the ensemble is effective.
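The training schedule described above can be sketched as follows. This is a hedged illustration: the learning rate and warmup steps come from the text, but the total number of steps is an assumption that depends on dataset size and batch size, the decay-to-zero shape follows the common "linear schedule with warmup" convention (which the paper does not spell out), and both function names are hypothetical.

```python
def linear_warmup_lr(step, total_steps, base_lr=2e-5, warmup_steps=200):
    """Learning rate at a given optimizer step (0-based).

    Rises linearly from 0 to base_lr during warmup, then decays
    linearly to 0 at total_steps.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

def backbone_trainable(epoch):
    """Freeze the RoBERTa backbone in the first epoch (epoch 0) only.

    Returns True when the backbone parameters should receive gradients;
    the head layers are assumed to be trainable in every epoch.
    """
    return epoch >= 1
```

In a training loop, `backbone_trainable(epoch)` would decide whether the backbone parameters are updated, while `linear_warmup_lr` would set the optimizer's learning rate before each step.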

Multi LCP Task
The merged multi train and trial dataset has only 1616 examples. In the single task, pre-RoBERTa-large outperforms the original RoBERTa-large, so we use it to augment the multi task examples. First, we merge all single and multi train and trial datasets and apply 5-fold cross-validation, splitting the data with stratified K-fold as in the single task. We then train pre-RoBERTa-large on the LCP task. After that, we infer the vector $h = [c, t]$ for all merged data; the final $h$ is the average over the 5 fold models. Finally, we use this vector to compute the cosine similarity between the multi dataset and the single dataset, and recall the single examples that pass a similarity threshold, adding them to the multi training examples. Here we use a similarity threshold of 0.75, which recalls 2707 single examples.
The dataset split and training strategy are then the same as in the single task. The results are in Table 3. gen-RoBERTa-large is the original RoBERTa model with data augmentation, and pre-gen-RoBERTa-large is the RoBERTa model fine-tuned on corpus classification with data augmentation. The results show that the models fine-tuned on corpus classification outperform the original models. The final result of 0.8575 is obtained by averaging the CV results of the four models, and ranks 2nd in the test phase.
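The recall step can be illustrated with a small sketch. The text does not say how the similarity between the two datasets is aggregated, so this example assumes that a single example is recalled when its highest cosine similarity to any multi example reaches the threshold; that rule and the function names are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_single_examples(multi_vecs, single_vecs, threshold=0.75):
    """Indices of single examples similar enough to some multi example.

    multi_vecs / single_vecs: lists of h = [c, t] vectors per example.
    """
    picked = []
    for idx, s in enumerate(single_vecs):
        # ASSUMPTION: compare against the best-matching multi example.
        best = max(cosine(s, m) for m in multi_vecs)
        if best >= threshold:
            picked.append(idx)
    return picked
```

The returned indices select the single examples that would be appended to the multi training set.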
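The averaging ensemble and the evaluation metric can be sketched in a few lines of plain Python. This is an illustrative sketch with hypothetical function names and toy inputs; it does not reproduce the reported scores.

```python
import math

def average_predictions(model_preds):
    """Average the predictions of several models, example by example.

    model_preds: one list of predicted scores per model, all the same length.
    """
    return [sum(col) / len(col) for col in zip(*model_preds)]

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

With per-model test predictions in `preds` and gold scores in `gold`, the ensemble metric would be `pearson(average_predictions(preds), gold)`.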

Conclusion
This paper presents a method for predicting lexical complexity that uses RoBERTa-large as the backbone language model. We first fine-tune the backbone model on corpus classification, then build a model with an attention-based context representation, and use vector recall for data augmentation in the multi task. Finally, we apply a multi-model average ensemble strategy to improve performance. In the future, we will explore better models for text representation and apply data augmentation to all tasks.