DeepBlueAI at SemEval-2021 Task 1: Lexical Complexity Prediction with A Deep Ensemble Approach

Lexical complexity plays an important role in reading comprehension. Lexical complexity prediction (LCP) can be used not only as a component of lexical simplification systems, but also as a stand-alone application that helps people read better. This paper presents the winning system we submitted to the LCP shared task of SemEval-2021, which is capable of dealing with both subtasks. We first fine-tune a number of pre-trained language models (PLMs) with various hyperparameters and different training strategies such as pseudo-labelling and data augmentation. We then apply an effective stacking mechanism on top of the fine-tuned PLMs to obtain the final predictions. Experimental results on the CompLex dataset show the validity of our method: we rank first on subtask 2 and second on subtask 1.


Introduction
Lexical complexity is one of the main causes of overall text complexity and thus of poor reading comprehension (DuBay, 2004). Different from the Complex Word Identification (CWI) task (Shardlow, 2014), which aims to predict whether a given word is complex or not, the goal of lexical complexity prediction (LCP) is to predict the complexity value of given parts of a context, as shown in Figure 1. The underlined parts of each sentence are the words whose complexity needs to be predicted, and the same word in different contexts may have different complexity scores. LCP plays an important role in the usual Lexical Simplification (LS) pipeline (Bott et al., 2012), since it can help simplifiers find challenging words and replace them with appropriate alternatives that are easy to understand. Both LCP and CWI can be used not only as components of LS systems but also as stand-alone applications within intelligent tutoring systems for second language learners or in reading devices for people with low literacy skills (Gooding and Kochmar, 2018).

Multi-word expression
Context 1: SEM confirmed many of the observations made by confocal microscopy. Complexity score: 0.64473
Context 2: SJ and SVJ carried out confocal microscopy on whole-mounts of stria vascularis. Complexity score: 0.7750

Single word
Context 1: They shall be to you for a refuge from the avenger of blood. Complexity score: 0.3475
Context 2: There will be a pavilion for a shade in the daytime from the heat, and for a refuge and for a shelter from storm and from rain. Complexity score: 0.075

Figure 1: Examples of LCP for single words and multi-word expressions. The complexity score is the score of the underlined words.
In this paper, we introduce our system for the lexical complexity prediction task of SemEval-2021 (Shardlow et al., 2021). We fulfill this task by leveraging multiple pre-trained language models (PLMs) with different training strategies. There are two main steps in our system: (i) fine-tuning a number of heterogeneous PLMs, including BERT (Devlin et al., 2019), ALBERT (Lan et al., 2019), RoBERTa (Liu et al., 2019) and ERNIE (Zhang et al., 2019), with various hyperparameters and training strategies to obtain diverse models; (ii) applying an effective stacking mechanism on top of these PLMs to predict the final complexity scores.
Our experiments on the merged PLMs indicate that our method successfully exploits weaker PLMs as well as high-performing ones. As a result, our system ranks second on subtask 1 and first on subtask 2 of SemEval-2021 Task 1 (LCP).

Lexical Complexity Prediction
There has been some work on the creation and evaluation of automatically graded vocabulary lists.

Complex Word Identification
An area closely related to LCP is CWI. Early studies on CWI either attempt to simplify all words (Thomas and Anderson, 2012) or set a frequency-based threshold (Biran et al., 2011). Shardlow (2013) indicates that a classification-based approach to CWI is the most promising one. Most of the teams participating in the two CWI shared tasks also use classification approaches with extensive feature engineering.
In CWI 2016 (Paetzold and Specia, 2016a), complexity was defined as whether or not a word is difficult to understand for non-native English speakers, and the words in the dataset were tagged as complex or non-complex by 400 non-native English speakers. The results highlight the effectiveness of decision trees (Quijada and Medero, 2016; Mukherjee et al., 2016) and ensemble methods (Paetzold and Specia, 2016b; Malmasi et al., 2016) for the task.
In CWI 2018 (Yimam et al., 2018), a multilingual dataset covering English, German, Spanish and French was provided, and there were two subtasks: binary classification and probabilistic classification. The submitted systems mainly used traditional machine learning classifiers (e.g. SVM, random forests) with hand-crafted features (Butnaru and Ionescu, 2018; Kajiwara and Komachi, 2018), deep learning methods (Hartmann and Dos Santos, 2018; De Hertog and Tack, 2018) and ensemble methods (Gooding and Kochmar, 2018; Aroyehun et al., 2018). More recently, Gooding and Kochmar (2019) proposed a new perspective by treating CWI as a sequence labeling task that can detect both complex words and phrases. All these methods differ from ours, which utilizes heterogeneous PLMs with various training strategies.

Background
Task Definition There are two subtasks in the LCP task. For subtask 1, the goal is to predict the complexity score of a single word in a given context. In the example shown in Figure 1, 'refuge' is the word whose complexity needs to be predicted; since its meaning is harder to infer in the first context, its complexity score there is much higher. For subtask 2, the goal is to predict the complexity score of a multi-word expression in a given context; an example is also shown in Figure 1.

Dataset Annotators rated each target on a 5-point Likert scale: one for very easy, two for easy, three for neutral, four for difficult, and five for very difficult. The numerical labels were then transformed to the 0-1 range shown in Figure 1. To add further variation to the data, three corpora were selected: the Bible, Europarl (Koehn, 2005) and a biomedical corpus (Bada et al., 2012). Each corpus has its own unique language features and style. In addition to single words, multi-word expressions were also selected for annotation. In the end, there were 9476 annotated contexts with 5166 unique words.
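The paper does not spell out the exact transformation from Likert points to scores; a linear rescaling of the 1-5 range onto [0, 1] is the natural assumption and is consistent with the example scores in Figure 1. As a minimal sketch:

```python
def likert_to_score(mean_rating: float) -> float:
    """Map a mean 5-point Likert rating (1 = very easy, 5 = very
    difficult) linearly onto the [0, 1] complexity range.

    NOTE: this linear mapping is an assumption; the shared task
    paper defines the exact transformation used for CompLex.
    """
    return (mean_rating - 1.0) / 4.0

# e.g. a word rated "easy" (2) on average maps to 0.25
assert likert_to_score(2.0) == 0.25
```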

PLMs-based Method
PLMs such as BERT (Bidirectional Encoder Representations from Transformers) use the encoder of the Transformer (Vaswani et al., 2017) for deep self-supervised learning and require task-specific fine-tuning. In this paper, the downstream task is to predict the complexity score, a real value in the range [0, 1], of given words. Our method is capable of dealing with both subtask 1 and subtask 2. Figure 2 shows the main architecture of our BERT-based model for predicting complexity scores.
Since PLMs can process multiple input sentences, we add a query sentence before the context to emphasize the word (e.g. 'river') that needs to be predicted and the corpus (e.g. Bible) it comes from. We add the special tokens [CLS] and [SEP] to separate the query and the context, as shown in Figure 2. BERT first tokenizes the input and then generates contextualized vector representations for each token in multiple hidden layers. We focus on the output at the first position, to which the special [CLS] token is passed. The last k hidden layers are selected, and the final representation of the [CLS] token is computed through the weighted sum

h_{[CLS]} = \sum_{i=1}^{k} W_i h_i,

where h_i is the [CLS] output of hidden layer i and W_i is a learnable weight for that layer. The calculated representation is then fed into a dense layer, and multi-sample dropout (Inoue, 2019) is utilized to accelerate training and finally obtain the predicted complexity score. The loss function can be chosen among several options, including mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE).
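A minimal PyTorch sketch of this regression head is given below. The layer count k, dropout settings, softmax normalization of the layer weights, and the final sigmoid squashing to [0, 1] are illustrative assumptions, not the exact configuration from Table 1:

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class WeightedCLSRegressor(nn.Module):
    """Encoder with a weighted sum of the [CLS] vectors from the last
    k hidden layers, multi-sample dropout, and a dense regression head
    (sketch; hyperparameters are illustrative)."""

    def __init__(self, model_name="bert-large-uncased", k=4, n_dropout=5, p=0.3):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name, output_hidden_states=True)
        self.k = k
        # One learnable weight W_i per selected hidden layer.
        self.layer_weights = nn.Parameter(torch.ones(k) / k)
        # Multi-sample dropout: several dropout masks share one dense head.
        self.dropouts = nn.ModuleList([nn.Dropout(p) for _ in range(n_dropout)])
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states is a tuple of [B, T, H] tensors; take the [CLS]
        # position (index 0) from each of the last k layers.
        cls_per_layer = torch.stack(
            [h[:, 0] for h in out.hidden_states[-self.k:]], dim=0)   # [k, B, H]
        weights = torch.softmax(self.layer_weights, dim=0)
        cls = (weights[:, None, None] * cls_per_layer).sum(dim=0)    # [B, H]
        # Average the head's predictions over the dropout samples.
        preds = torch.stack([self.head(d(cls)) for d in self.dropouts], dim=0).mean(dim=0)
        return torch.sigmoid(preds).squeeze(-1)  # complexity score in [0, 1]
```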

Training strategies
In order to further improve the diversity of the trained models, we incorporate two training strategies, described below.
Pseudo-labelling Pseudo-labelling is the process of using a model trained on labeled data to predict labels for unlabeled data. We predict labels for the unlabeled test set and mix these pseudo-labeled samples with the training set to train a new model, as sketched below.
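In sketch form, one round of this procedure looks as follows; `train_model` and `predict` are hypothetical helpers standing in for the fine-tuning and inference code:

```python
import numpy as np

def pseudo_label_round(train_X, train_y, test_X, train_model, predict):
    """One round of pseudo-labelling (sketch): train on the labeled
    data, predict soft labels for the unlabeled test set, then retrain
    on the union of real and pseudo-labeled examples."""
    base = train_model(train_X, train_y)
    pseudo_y = predict(base, test_X)          # soft labels in [0, 1]
    mixed_X = np.concatenate([train_X, test_X])
    mixed_y = np.concatenate([train_y, pseudo_y])
    return train_model(mixed_X, mixed_y)      # model trained on the mixture
```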
Data augmentation Data augmentation increases the amount of training data by adding slightly modified copies of existing data or synthetic data newly created from it. It acts as a regularizer and helps reduce overfitting. In this paper, data augmentation consists of two parts. First, we add the dataset released by CWI 2018 to the training set; moreover, for subtask 2, whose training set is small (about one thousand samples), we also add the subtask 1 data to train the subtask 2 model. Second, for a given sentence in the training set, we perform the four operations introduced by Wei and Zou (2019): synonym replacement, random insertion, random swap, and random deletion. A sketch of these operations follows.
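The four operations can be sketched as below; for simplicity the synonym source is a user-supplied dictionary rather than the WordNet lookup used in the original EDA paper, and the input is assumed to be a non-empty token list:

```python
import random

def eda_augment(tokens, synonyms, n_ops=1):
    """Apply one of the four EDA operations (Wei and Zou, 2019) to a
    tokenized sentence: synonym replacement, random insertion, random
    swap, or random deletion. `synonyms` maps a word to alternatives;
    a full system would back this with WordNet."""
    tokens = list(tokens)
    for _ in range(n_ops):
        op = random.choice(["replace", "insert", "swap", "delete"])
        i = random.randrange(len(tokens))
        if op == "replace" and tokens[i] in synonyms:
            tokens[i] = random.choice(synonyms[tokens[i]])
        elif op == "insert" and tokens[i] in synonyms:
            tokens.insert(random.randrange(len(tokens) + 1),
                          random.choice(synonyms[tokens[i]]))
        elif op == "swap" and len(tokens) > 1:
            j = random.randrange(len(tokens))
            tokens[i], tokens[j] = tokens[j], tokens[i]
        elif op == "delete" and len(tokens) > 1:
            tokens.pop(i)
    return tokens

# e.g. eda_augment("a refuge from the storm".split(),
#                  {"storm": ["tempest"], "refuge": ["shelter"]})
```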

Stacking Trained Models
Model stacking is an effective ensemble method for improving model accuracy. The main procedure of stacking trained models in our method includes five steps. First, we use heterogeneous PLMs, including BERT, RoBERTa, ALBERT, and ERNIE, as base models. Second, we generate multiple hyperparameter sets by setting different dropout values, selecting different numbers of last hidden layers, and using different loss functions. Since our purpose here is not only to find the best hyperparameter set but also to collect diverse sets with reasonable performance, we keep all the training results from the different sets. Third, we perform 7-fold cross-validation throughout training to avoid overfitting and selection bias. Fourth, we adopt the training strategies of pseudo-labelling (Iscen et al., 2019) and data augmentation to further improve the diversity of the trained models. Finally, we train a simple linear regression model as the final estimator. Suppose the complexity score predicted by base model j with one hyperparameter set is ŷ_j; the final complexity score is then calculated as

\hat{y} = \sum_{j=1}^{N} W_j \hat{y}_j,

where N is the total number of fine-tuned PLMs with different hyperparameter sets and W_j is the weight for each predicted score, learned by the linear regression model.
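With scikit-learn, the final estimator can be sketched as below. Fitting the regression on out-of-fold predictions from the 7-fold cross-validation is our assumption about how the training-set weights are obtained; the paper only states that the LR weights are learned on the training set:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def stack_predictions(oof_preds, train_y, test_preds):
    """Stack N base models with linear regression (sketch).

    oof_preds:  [n_train, N] out-of-fold predictions ŷ_j from each
                fine-tuned PLM / hyperparameter set.
    test_preds: [n_test, N] test-set predictions of the same models.
    Returns the final weighted scores sum_j W_j * ŷ_j.
    """
    lr = LinearRegression()
    lr.fit(oof_preds, train_y)   # learns one weight W_j per base model
    return lr.predict(test_preds)
```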

Evaluation Metrics
As specified in the official evaluation procedure of LCP 2021, several evaluation metrics are used: Pearson correlation (R), Spearman correlation (Rho), mean absolute error (MAE), mean squared error (MSE), and R-squared (R2). The final results are ranked by Pearson correlation.
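All five metrics are standard and available off the shelf; a minimal sketch using scipy and scikit-learn:

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def lcp_metrics(y_true, y_pred):
    """Compute the five official LCP 2021 metrics; systems are
    ranked by Pearson correlation (R)."""
    return {
        "R":   pearsonr(y_true, y_pred)[0],
        "Rho": spearmanr(y_true, y_pred)[0],
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mean_squared_error(y_true, y_pred),
        "R2":  r2_score(y_true, y_pred),
    }
```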

Parameter settings
All models are implemented with the open-source Transformers library from Hugging Face (Wolf et al., 2020), which provides thousands of pre-trained models that can be quickly downloaded and fine-tuned on specific tasks. Table 1 shows the four employed PLMs and the different parameters we set for each PLM: the numbers of last hidden layers, the dropout pairs, and the loss functions.
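Loading a checkpoint and building the query-plus-context input with this library looks roughly as follows; the query wording is illustrative (the paper only describes its role, not its exact phrasing):

```python
from transformers import AutoTokenizer, AutoModel

# Any of the four PLM checkpoints can be swapped in here.
tokenizer = AutoTokenizer.from_pretrained("roberta-large")
model = AutoModel.from_pretrained("roberta-large", output_hidden_states=True)

# Query sentence emphasizing the target word and its source corpus
# (wording is an assumption), paired with the context as segment two.
query = "What is the complexity of 'refuge' in the bible corpus?"
context = "They shall be to you for a refuge from the avenger of blood."
inputs = tokenizer(query, context, return_tensors="pt", truncation=True)
outputs = model(**inputs)  # hidden states feed the regression head
```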

Ablation Study
PLMs with Training Strategies For subtask 1, we use different PLMs, namely ERNIE-LARGE, ALBERT-XXLARGE, BERT-LARGE, and RoBERTa-LARGE, as shown in Table 2. The results are the average scores of 7-fold cross-validation on the training dataset. Since RoBERTa-LARGE performs best on this task, we further combine it with the training strategies of pseudo-labelling (PL) and data augmentation (DA). However, on the training dataset, we find that adding these training strategies decreases the results slightly.
For subtask 2, we use two PLMs, RoBERTa-LARGE and ALBERT-XXLARGE. The results shown in Table 3 are likewise averages of 7-fold cross-validation scores on the training dataset. Since we added the subtask 1 data to subtask 2, Table 3 also shows the results of this strategy, and we find that it is very effective, improving the score by 0.02 over the base models.
Stacking trained models We use a linear regression (LR) model to stack the different fine-tuned models. We learn the weight of each model in the LR on the training set and then use the learned weights to predict the final scores on the test set. Figure 3 compares the Pearson correlation values when stacking different models for subtask 1. The blue columns are the values obtained by averaging the predicted scores of the different models, while the orange columns are the values produced by the LR function. We can clearly observe that the LR-based ensemble outperforms the mean-based method, which verifies the validity of the LR mechanism. Moreover, although adding training strategies to the base models decreases their individual performance according to Table 2, performance improves when stacking them all, which indicates the positive effect of increasing model diversity.

Official Ranking
For both subtask 1 and subtask 2, among all our pre-submission experiments, the scores obtained by stacking all the models are the best. The official ranking, presented in Table 4, shows that our system ranks first on subtask 2 and second on subtask 1.

Conclusion
In this paper, we propose a top-performing system for the task of lexical complexity prediction. We fine-tune several pre-trained language models, including BERT, ALBERT, RoBERTa, and ERNIE, with different training strategies such as pseudo-labelling and data augmentation, and stack them with a simple linear regression model. Experimental results show the effectiveness of this ensemble method: we win first place on subtask 2 and second place on subtask 1.