JUST-BLUE at SemEval-2021 Task 1: Predicting Lexical Complexity using BERT and RoBERTa Pre-trained Language Models

Predicting the complexity level of a word or a phrase is a challenging task, and it is a crucial step in numerous NLP applications, such as text rearrangement and text simplification. Early research treated the task as binary classification, in which systems predicted whether a word is complex or not. Other studies assessed the level of word complexity using regression models or multi-label classification models. With the rise of transfer learning and pre-trained language models, deep learning models show a significant improvement over traditional machine learning models. This paper presents our approach, which ranked first in SemEval-2021 Task 1 (sub-task 1). We predict the degree of complexity of a word within a text on a scale from 0 to 1. Using the pre-trained language models BERT and RoBERTa, our system achieved a Pearson correlation score of 0.788.


Introduction
Lexical complexity plays a significant role in readability and comprehension. Accurate prediction of lexical complexity can help systems direct the user to an acceptably simple text or modify the text to be more fluid (Brothers and Traxler, 2016). Predicting the complexity of words is a subjective, somewhat conjectural, and challenging problem. Yet mapping words to their complexity is an essential task for understanding natural language. Numerous factors can influence the prediction of lexical complexity. Several approaches have been proposed to address this problem using machine learning and deep learning methods (Sengupta et al., 2020; Gooding and Kochmar, 2019; Bahja, 2020).
This paper describes the JUST-BLUE team's model that participated in SemEval-2021 Task 1, Lexical Complexity Prediction (LCP) (Shardlow et al., 2021). The task provides participants with an augmented version of CompLex, a multi-domain English dataset with sentences annotated using a 5-point Likert scale (1-5), from very easy to very difficult (Shardlow et al., 2020). The task is to predict the complexity value of words in context. Our model, JUST-BLUE, ranked first in this task. We used the pre-trained language models BERT and RoBERTa, which have proven their effectiveness in this area (Liu et al., 2019), along with an ensembling method (weighted averaging) to achieve the highest Pearson correlation score of 0.788.
The rest of this paper is organized as follows: Section 2 sheds light on related work. Section 3 describes the methodology proposed in this research. Section 4 discusses the experimental setup and evaluation results. Section 5 concludes this research.

Related work
One of the most prominent challenges in the current era is the prediction of lexical complexity. In machine learning, the prediction of word complexity can be binary: the word is either complex or not complex. It can also be non-binary, as a probabilistic prediction that measures complexity on a particular scale (for example, a probability of 0.6 that the word is complex). SemEval 2016 introduced the first shared task on predicting word complexity, limited to classifying words as complex or non-complex (binary prediction) (Paetzold and Specia, 2016). Decision Tree classifiers achieved the best results (Zampieri et al., 2017). It has been noted that word length is a good indicator of word complexity (De Hertog and Tack, 2018).
The authors in (Shardlow, 2013) discussed the importance of the frequency and length of words. They used the Keras deep learning library to predict whether an English or Spanish word is complex or not. They used character embeddings, word length, frequency counts, word embeddings, and psychological measures as features to predict complex words and achieved an F1-score of 0.872. The authors in (Yimam et al., 2018) worked on various languages, such as English, Spanish, French, and German. They worked on two different methods for predicting complex words. The first method determines whether a word is complex or simple. The second estimates the probability that the word is complex. The complexity levels depended on the average of the annotators' answers. For example, if 6 out of 10 annotators judged the word to be complex, then the probability is 0.6. It has been argued that this annotation method is impractical, since a word with a probability of 0.5 can be considered neither complex nor non-complex. The authors in (Shardlow et al., 2020) therefore suggested a 5-point Likert scale. They asserted that this scale is more accurate than labeling a word as complex or non-complex. A word can be rated as very easy, easy, neutral, difficult, or very difficult. This scale is beneficial to our work.
The deep learning pre-trained language models BERT and RoBERTa are considered state-of-the-art for NLP. Teams in the previous shared tasks of SemEval 2020 used these models to obtain the best results for different NLP tasks (Al-Khdour et al., 2020; Shatnawi et al., 2020; Jurkiewicz et al., 2020). Our approach experimented with these models using different hyperparameters and weighted averaging methods, which led to the best result in the competition for predicting lexical complexity.

Methodology
This section describes the methodology of our approach and proceeds as follows: First, we describe the task's dataset. Then, we describe the preprocessing step. Finally, we describe the JUST-BLUE approach to predicting the word's complexity.

Data
The SemEval-2021 Task 1 competition provided the contestants with three files (trial, train, and test data). The files contain the following columns:
• id: the identification number of each entry.
• corpus: the source from which the entry was collected. The data was extracted from three sources: the Bible, biomedical text, and the European Parliament.
• sentence: the sentence that contains the token and provides its context.
• token: the single word whose complexity is to be measured.
• complexity: the degree of complexity of the word, ranging from 0 to 1.

Pre-Processing Step
First, we cleaned the data by manually removing all single and double quotation marks. This step helped to separate some rows that had been merged. Next, we deleted every row in which any column contains a NaN value, because such rows cannot be used in the training process.
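The two cleaning steps can be sketched in code. This is a minimal illustration using only the Python standard library, not the procedure we actually ran (the quotation marks were removed manually). It assumes the data is tab-separated and that missing values appear as empty or absent fields; the function name `clean_rows` and the sample rows are hypothetical.

```python
import csv
import io

# Column names as listed in the Data section.
COLUMNS = ["id", "corpus", "sentence", "token", "complexity"]


def clean_rows(raw_text):
    """Strip quotation marks, parse tab-separated rows, and drop incomplete rows."""
    # Step 1: remove single and double quotation marks, which can merge rows.
    text = raw_text.replace('"', "").replace("'", "")
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    rows = []
    for row in reader:
        values = [row.get(col) for col in COLUMNS]
        # Step 2: skip rows with a missing or empty field (the NaN rows).
        if any(value is None or value.strip() == "" for value in values):
            continue
        cleaned = {col: row[col] for col in COLUMNS}
        cleaned["complexity"] = float(cleaned["complexity"])
        rows.append(cleaned)
    return rows


# Hypothetical sample: the second row has an empty token and is dropped.
sample = (
    "id\tcorpus\tsentence\ttoken\tcomplexity\n"
    "1\tbible\tThey went over the \"sea\" to Capernaum.\tsea\t0.01\n"
    "2\tbiomed\tA sentence with no token.\t\t0.25\n"
)
cleaned = clean_rows(sample)
```

Note that stripping every single quotation mark also removes apostrophes inside words, which follows the step as described above.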

JUST-BLUE Architecture
We used the pre-trained language models BERT and RoBERTa. We imported the BERT model using the bert-sklearn library, as it also includes the SciBERT and BioBERT models for the scientific and biomedical fields. We used the classification module of the Simple Transformers library to import the RoBERTa model. As mentioned earlier, the goal of the task is to determine the complexity of a word. Because a word's complexity changes slightly with the complexity of its sentence, we used both the token (word) and the sentence to predict the word's complexity. As a first strategy (model 1), we trained BERT and RoBERTa on the 'token' column with the 'complexity' label. As a second strategy (model 2), we trained both models on the 'sentence' column with the 'complexity' label. For each language model, the two predictions were combined using an ensembling method, Weighted Averaging. Our experiments show that an 80:20 weight ratio achieves the best results. The higher weight goes to the token model (model 1), since we need to calculate the degree of complexity of a single word, while the sentence model (model 2) receives the lower weight. Simple Averaging was then used as the ensembling technique to merge the BERT and RoBERTa results. Figure 1 illustrates the methodology used.
To clarify, suppose we want to calculate the complexity of the word 'sea', which occurs in the sentence "and they entered into the boat, and were going over the sea to Capernaum." First, we feed the word 'sea' to RoBERTa model 1. We also feed the sentence that contains the word to RoBERTa model 2. Then, we combine the two results using Weighted Averaging. Suppose that RoBERTa model 1 outputs 0.01 (the word 'sea' has a complexity degree of 0.01) and RoBERTa model 2 outputs 0.13 (the sentence has a complexity degree of 0.13). The combined RoBERTa result is 0.01 × 0.8 + 0.13 × 0.2 = 0.034. We repeat these steps for the BERT models. If the combined BERT result is 0.052, then the final step is to average the RoBERTa and BERT results. The complexity is (0.034 + 0.052) / 2 = 0.043, as shown in Figure 2.
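The arithmetic of this example can be reproduced with a short sketch. The function names `weighted_average` and `simple_average` are illustrative and not taken from our code; the numbers are those of the 'sea' example.

```python
def weighted_average(token_score, sentence_score, token_weight=0.8):
    """Combine the token-model and sentence-model predictions (80:20 by default)."""
    return token_weight * token_score + (1.0 - token_weight) * sentence_score


def simple_average(first_score, second_score):
    """Merge the combined RoBERTa and BERT predictions with equal weights."""
    return (first_score + second_score) / 2.0


# Values from the 'sea' example: RoBERTa model 1 gives 0.01, model 2 gives 0.13.
roberta_combined = weighted_average(0.01, 0.13)
# The combined BERT result is given directly in the example.
bert_combined = 0.052
final_complexity = simple_average(roberta_combined, bert_combined)

print(round(roberta_combined, 3), round(final_complexity, 3))
```

Rounded to three decimals, the two printed values match the 0.034 and 0.043 of the example.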

Results and Discussion
We used Python 3.6 in the Colab environment to run our code. We experimented with several models to determine which are suitable for this task, including the BERT and RoBERTa pre-trained models. We also examined the SVM and Random Forest machine learning models. Table 1 shows the results obtained throughout our experiments.
The challenging step was to find the best weights for combining the models trained on tokens (single words) and on sentences (Table 2). As mentioned earlier, some words have a different degree of complexity depending on their position in the sentence. Therefore, it was necessary to feed the sentence to the models as well. Table 2 shows the best weighting, which is 80% for words and 20% for sentences. The next step was to explore the best hyperparameters for BERT and RoBERTa, such as the learning rate, batch size, number of epochs, and maximum sequence length. Table 3 describes these hyperparameters, and Table 4 shows example results of fine-tuning the JUST-BLUE hyperparameters.
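A weight search of this kind can be sketched as a small grid search that scores each candidate weight by the Pearson correlation between the blended predictions and the gold labels. This is an illustration under assumptions, not our experimental code: the function names, the candidate grid, and the data are hypothetical, and the gold values below are synthetic, constructed as an exact 80:20 blend so that 0.8 is the best weight by design.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_token_weight(token_preds, sentence_preds, gold, candidates):
    """Return the token weight (and its score) with the highest Pearson correlation."""
    best_weight, best_score = None, -math.inf
    for weight in candidates:
        blended = [
            weight * t + (1.0 - weight) * s
            for t, s in zip(token_preds, sentence_preds)
        ]
        score = pearson(blended, gold)
        if score > best_score:
            best_weight, best_score = weight, score
    return best_weight, best_score


# Synthetic data, not results from the paper.
token_preds = [0.10, 0.30, 0.20, 0.60]
sentence_preds = [0.40, 0.10, 0.35, 0.20]
gold = [0.8 * t + 0.2 * s for t, s in zip(token_preds, sentence_preds)]

weight, score = best_token_weight(
    token_preds, sentence_preds, gold, candidates=[0.5, 0.6, 0.7, 0.8, 0.9]
)
```

In practice the gold labels would come from held-out annotated data rather than being constructed from the predictions.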
Finally, we examined the effect of the base and large variants of BERT and RoBERTa on performance. Our experiments show that the large variants decreased performance.
In the testing phase, we noticed that the words (tokens) in the test file were new. Therefore, we decided to limit the number of arguments we changed, to avoid overfitting. We only set "num-train-epochs" = 3 in the BERT and RoBERTa models and left the other arguments at their default values. We used three different models. The first was the BERT model, the second was the RoBERTa model, and the third combined BERT and RoBERTa as described in the Methodology section. Table 5 shows the results obtained from the different models.
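For the RoBERTa side, such a configuration could look as follows with the Simple Transformers library. This is a sketch under assumptions rather than our exact setup: only the number of training epochs is stated above, while the `regression` flag, the `roberta-base` checkpoint, and the helper names are assumptions made for illustration.

```python
def build_regression_args(num_train_epochs=3):
    """Return training arguments; keys not listed here keep the library defaults."""
    return {
        # The only argument the text says was changed.
        "num_train_epochs": num_train_epochs,
        # Assumption: complexity is predicted as a single real value.
        "regression": True,
    }


def build_roberta_regressor(model_name="roberta-base", use_cuda=False):
    """Create a RoBERTa regressor (requires the simpletransformers package)."""
    # Imported lazily so the arguments can be inspected without the library.
    from simpletransformers.classification import ClassificationModel

    return ClassificationModel(
        "roberta",
        model_name,  # assumption: the exact checkpoint is not named in the text
        num_labels=1,
        args=build_regression_args(),
        use_cuda=use_cuda,
    )


args = build_regression_args()
```

A model built this way is trained with `train_model` on a DataFrame that has `text` and `labels` columns, so the 'token' or 'sentence' column would be renamed to `text` and 'complexity' to `labels`.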
The JUST-BLUE approach achieved the best result using the RoBERTa and BERT models, with a Pearson correlation score of 0.788. We also achieved the lowest Mean Absolute Error (MAE), 0.0609. Our model ranked first in LCP sub-task 1 (single words). The Spearman's Rho (Rho) and R-squared (R2) scores are 0.7369 and 0.6172, respectively. A total of 54 teams participated in the Lexical Complexity Prediction (LCP) shared task. This shared task can be considered a more advanced successor to CWI 2016 and CWI 2018, with a larger number of words from various sources.

Conclusion
Predicting the complexity of words is one of the most prominent tasks that the NLP research community strives to solve. In 2016 and 2018, two shared tasks were organized to determine whether a word is complex or not. SemEval 2021 introduced Task 1, Lexical Complexity Prediction (LCP), which aims to predict a word's complexity on a scale from 0 to 1. This paper described the top-ranked team's model, JUST-BLUE. The JUST-BLUE model obtained the highest Pearson correlation score of 0.788 using the pre-trained language models BERT and RoBERTa. Our strategy relies on two ensembling methods, Simple Averaging and Weighted Averaging.