CS-UM6P at SemEval-2021 Task 1: A Deep Learning Model-based Pre-trained Transformer Encoder for Lexical Complexity

Lexical Complexity Prediction (LCP) involves assigning a difficulty score to a particular word or expression in a text intended for a target audience. In this paper, we introduce a new deep learning-based system for this challenging task. The proposed system consists of a deep learning model, based on a pre-trained transformer encoder, for word and Multi-Word Expression (MWE) complexity prediction. First, our model applies an attention layer, on top of the encoder's contextualized word embeddings, to the input context and the complex word or MWE. Then, the attention output is concatenated with the pooled output of the encoder and passed to a regression module. We investigate both single-task and joint training on the data of both sub-tasks using multiple pre-trained transformer-based encoders. The obtained results are very promising and show the effectiveness of fine-tuning pre-trained transformers for the LCP task.


Introduction
Text Simplification (TS) is a fundamental task for improving text readability, and presents a wide variety of use cases, including assisting children with reading difficulties and native speakers with low literacy levels (De Belder and Moens, 2010; Aluísio and Gasperin, 2010), increasing accessibility for people with intellectual disabilities (Saggion, 2017), and facilitating certain aspects of language for language learners (Paetzold and Specia, 2016). TS may involve modifications to the syntactic structure of a sentence, its lexical units, or both (Shardlow, 2014).
Lexical Simplification (LS), as a sub-task of TS, focuses on simplifying the complex words of an input sentence. It first identifies complex words in a sentence, a step known as Complex Word Identification (CWI) or Lexical Complexity Prediction (LCP). Then, it replaces them with alternatives of equivalent meaning. These substitutions should be simpler while preserving the semantics and the grammatical structure of the input sentence (Paetzold and Specia, 2017; Qiang et al., 2020).
Most of the previous research has modeled LCP as a binary classification task (Paetzold and Specia, 2017; Zampieri et al., 2016; Ronzano et al., 2016). A recent research study introduced a multi-domain dataset annotated using a 5-point Likert scale scheme (Shardlow et al., 2020). The aim is to label the complexity of a word or a Multi-Word Expression (MWE) in a more fine-grained manner, from very easy to very difficult. Hence, the lexical complexity of words is expressed on a continuous scale.
In this paper, we introduce the system we submitted to Sub-Tasks 1 and 2 of the SemEval-2021 LCP shared task (Shardlow et al., 2021). The proposed system consists of a deep learning model for word and MWE complexity prediction. Our model employs a residual attention block and a regression module on top of a pre-trained transformer encoder, as follows:
• The encoder is fed with the concatenation of the context and the complex word or MWE, joined using the separator token of the encoder's tokenizer.
• The residual attention block is a layer on top of the encoder's Contextualized Word Embedding (CWE) of the input context (sentence) and the complex word or MWE. The aim is to leverage the encoder's CWE to extract the relevant features of the inputs.
• The attention layer output is concatenated with the pooled output of the encoder and passed to the regression module for complexity prediction, as sketched in the example below.
The proposed model is trained to minimize both the Root Mean Square Error (RMSE) and an auxiliary loss associated with the negative Pearson correlation. The two losses are combined using uncertainty loss weighting (Kendall et al., 2017). We investigate two pre-trained transformer networks, namely BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Moreover, we evaluate both single-task and joint training of the word and MWE complexity prediction sub-tasks. The best performance is achieved using the RoBERTa-large encoder with joint training on the data of both sub-tasks. The obtained results are very promising and show the effectiveness of our system, which ranked among the top 10 submitted systems for both Sub-Tasks 1 and 2.
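For concreteness, the following is a minimal sketch of how such a forward pass could be implemented with PyTorch and the HuggingFace transformers library. The class, the simplified (non-residual) attention scoring, and all variable names are our own illustration under these assumptions, not the authors' code.

```python
# Minimal sketch of the described pipeline; names and the simplified
# (non-residual) attention are illustrative, not the authors' implementation.
import torch
import torch.nn as nn
from transformers import AutoModel

class LCPModelSketch(nn.Module):
    def __init__(self, encoder_name="roberta-large", hidden_size=1024):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        # Scores each contextualized word embedding (CWE) for the attention weights.
        self.attention_score = nn.Linear(hidden_size, 1)
        # Regression module: one hidden layer and one output layer.
        self.regressor = nn.Sequential(
            nn.Linear(2 * hidden_size, hidden_size),
            nn.Tanh(),
            nn.Linear(hidden_size, 1),
        )

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cwe = out.last_hidden_state                  # (batch, n, d) contextualized embeddings
        pooled = cwe[:, 0, :]                        # embedding of the [CLS] / <s> token
        scores = self.attention_score(cwe).squeeze(-1)           # (batch, n)
        scores = scores.masked_fill(attention_mask == 0, -1e4)   # mask padding tokens
        alpha = torch.softmax(scores, dim=-1)                    # attention weights
        v = torch.bmm(alpha.unsqueeze(1), cwe).squeeze(1)        # weighted sum of the CWEs
        # Concatenate the pooled output with the attention output and regress.
        return self.regressor(torch.cat([pooled, v], dim=-1)).squeeze(-1)
```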
The rest of this paper is organized as follows. Section 2 describes the dataset and the sub-tasks of SemEval-2021 Task 1. In Section 3, we present our system overview. Section 4 summarizes and discusses the obtained results for both Sub-Task 1 and Sub-Task 2. Finally, Section 5 concludes the paper.

Dataset Description
The dataset of the Lexical Complexity Prediction shared task (Shardlow et al., 2021) is an augmented version of the CompLex dataset (Shardlow et al., 2020). In addition to complex word annotations, the data also include MWEs along with their context sentences and complexity scores. The dataset is annotated using a 5-point (1-5) Likert scale scheme and covers sentences from three domains: Bible, Europarl, and biomedical texts. The dataset is labeled by a group of annotators from English-speaking countries, and it is compiled from sentences with at least four valid annotations. The annotations are aggregated so that the normalized complexity lies in the interval [0, 1]. The five points of the Likert scale correspond to five levels of complexity, ranging from "Very Easy" to "Very Difficult" (Shardlow et al., 2020).
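As an illustration, assuming the 5-point ratings are mapped linearly onto [0, 1] before averaging (our assumption based on the description above, not the organizers' aggregation script), the normalization could look as follows:

```python
def normalize_annotations(likert_ratings):
    """Map 5-point Likert ratings (1-5) onto [0, 1] and average them.

    Assumes a linear mapping: 1 -> 0.0, 2 -> 0.25, 3 -> 0.5, 4 -> 0.75, 5 -> 1.0.
    """
    normalized = [(r - 1) / 4 for r in likert_ratings]
    return sum(normalized) / len(normalized)

# Example: four annotators rate a token as 2, 2, 3, and 1 (easy to neutral).
print(normalize_annotations([2, 2, 3, 1]))  # 0.25
```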

Sub-tasks Description
The LCP shared task consists of two sub-tasks (Shardlow et al., 2021):
• Sub-Task 1: predicting the complexity score of single words.
• Sub-Task 2: predicting the complexity score of multi-word expressions.
The training set consists of 7,662 samples for single-word complexity prediction (Sub-Task 1), while the training set of the MWE sub-task (Sub-Task 2) contains 1,517 samples. Figure 1a presents the number of samples per domain. The dataset is almost balanced across the three domains in the two LCP sub-tasks. Figures 1b and 1c show the complexity distribution of single words and MWEs, respectively. They illustrate that most single words have a complexity score below 0.5, whereas most MWE complexity scores lie between 0.25 and 0.75.

System Overview
The proposed system uses a residual attention block and a regression module on top of the pre-trained transformer encoder network. In the following, we describe each component of our system.

Transformer Encoder
In order to encode the input context and the complex word or MWE, we employ two state-of-the-art pre-trained transformer encoder networks, namely BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). First, the context (sentence) and the complex word or MWE are concatenated using the special separator token of the encoder's tokenizer ([SEP] for BERT, </s> for RoBERTa). Then, the tokenizer of the encoder splits the input into wordpieces $[T_1, T_2, \ldots, T_n]$ and encodes them using its vocabulary. The transformer encoder is fed with these encoded inputs and outputs:
• the pooled embedding $h_{pooled} \in \mathbb{R}^{1 \times d}$ (the embedding of the [CLS] and <s> tokens for the BERT and RoBERTa encoders, respectively);
• the Contextualized Word Embeddings (CWEs) $H \in \mathbb{R}^{n \times d}$ of the $n$ input tokens.
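As a hedged illustration of this input formatting, the snippet below uses the HuggingFace tokenizers to build the sentence pair; the example sentence and target MWE are made up, and the exact formatting in the authors' pipeline may differ.

```python
from transformers import AutoTokenizer

sentence = "The coronary arteries supply blood to the heart muscle."  # made-up context
target = "coronary arteries"                                          # made-up target MWE

for name in ("bert-base-uncased", "roberta-base"):
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Passing two texts builds a sentence pair with the model's special tokens,
    # e.g. [CLS] context [SEP] target [SEP] for BERT.
    encoded = tokenizer(sentence, target, return_tensors="pt")
    print(name, tokenizer.decode(encoded["input_ids"][0]))
```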

Attention Block
Our model applies an attention layer on top of the CWEs output by the encoder (Bahdanau et al., 2015; Yang et al., 2016). The aim is to weight the CWEs according to their relevance to the complexity prediction task. The attention weights $\alpha \in [0, 1]^n$ are computed from the CWEs using the learnable parameters $W_a \in \mathbb{R}^{d \times 1}$ and $W_\alpha \in \mathbb{R}^{n \times n}$ of the attention layer and its context vector $C \in \mathbb{R}^{n \times 1}$. The attention layer then extracts a feature vector $v$ as the weighted sum of the $H$ vectors:
$$v = \sum_{i=1}^{n} \alpha_i h_i$$
where $h_i$ is the CWE of the $i$-th input token.
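Since the exact scoring function is not reproduced here, the following sketch shows one plausible way to combine parameters with the stated shapes ($W_a$, $W_\alpha$, $C$) into weights $\alpha$ and the feature vector $v$; the composition itself is our assumption, not the authors' equation, and padding masking is omitted for brevity.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One plausible attention layer with the parameter shapes described above.

    W_a (d x 1), W_alpha (n x n), and the context vector C (n x 1) are combined
    into weights alpha in [0, 1]^n; the output v is the weighted sum of the CWEs.
    """

    def __init__(self, d, n):
        super().__init__()
        self.W_a = nn.Parameter(torch.randn(d, 1) * 0.02)
        self.W_alpha = nn.Parameter(torch.randn(n, n) * 0.02)
        self.C = nn.Parameter(torch.zeros(n, 1))

    def forward(self, H):                        # H: (batch, n, d) contextualized embeddings
        scores = torch.tanh(H @ self.W_a)        # (batch, n, 1)
        scores = self.W_alpha @ scores + self.C  # (batch, n, 1)
        alpha = torch.softmax(scores, dim=1)     # (batch, n, 1), weights in [0, 1]
        v = (alpha * H).sum(dim=1)               # (batch, d): weighted sum of H vectors
        return v, alpha.squeeze(-1)
```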

Regression Module
The regression module $F$ consists of one hidden layer and one output layer. $F$ is fed with the concatenation of the encoder's pooled output $h_{pooled}$ and the output $v$ of the attention block, and outputs the predicted complexity $\hat{y}$:
$$\hat{y} = F([h_{pooled}; v])$$
The proposed system is trained to minimize both the Root Mean Square Error (RMSE) and the auxiliary loss associated with the negative Pearson correlation:
$$L_{rmse} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}$$
$$L_{aux} = -\frac{\sum_{i=1}^{N}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2 \sum_{i=1}^{N}(\hat{y}_i - \bar{\hat{y}})^2}}$$
where $N$ is the number of samples, $y$ is the ground-truth complexity, $\hat{y}$ is the predicted complexity, and $\bar{y}$ (resp. $\bar{\hat{y}}$) is the mean of $y$ (resp. $\hat{y}$). In order to combine $L_{rmse}$ and $L_{aux}$, we use uncertainty loss weighting (Kendall et al., 2017), which combines multiple losses according to their uncertainty and avoids manual tuning of the loss weights. Our model is trained to minimize the total loss, which, following Kendall et al. (2017), takes the form:
$$L_{total} = \frac{1}{2\sigma_1^2} L_{rmse} + \frac{1}{2\sigma_2^2} L_{aux} + \log \sigma_1 + \log \sigma_2$$
where $\sigma_1$ and $\sigma_2$ are two learnable parameters controlling the relative weights of $L_{rmse}$ and $L_{aux}$.
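A minimal PyTorch sketch of these losses and of the standard uncertainty weighting of Kendall et al. (2017) is given below; `log_sigma1` and `log_sigma2` stand for the learnable weighting parameters, and the exact parameterization may differ from the authors' implementation.

```python
import torch

def rmse_loss(y, y_hat):
    """Root Mean Square Error between gold and predicted complexities."""
    return torch.sqrt(torch.mean((y - y_hat) ** 2))

def neg_pearson_loss(y, y_hat, eps=1e-8):
    """Auxiliary loss: negative Pearson correlation between y and y_hat."""
    yc, pc = y - y.mean(), y_hat - y_hat.mean()
    r = (yc * pc).sum() / (torch.sqrt((yc ** 2).sum() * (pc ** 2).sum()) + eps)
    return -r

def total_loss(l_rmse, l_aux, log_sigma1, log_sigma2):
    """Uncertainty-weighted combination of the two losses (Kendall et al., 2017).

    log_sigma1 and log_sigma2 are learnable scalars (e.g. nn.Parameter);
    exp(-2 * log_sigma) plays the role of 1 / sigma^2.
    """
    return (0.5 * torch.exp(-2 * log_sigma1) * l_rmse + log_sigma1 +
            0.5 * torch.exp(-2 * log_sigma2) * l_aux + log_sigma2)
```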

Results
This section describes the experiment settings and the obtained results.

Experiment Setting
We investigate the performance of our system using both the base and the large variants of the BERT and RoBERTa encoders:
• BERT-base: 12 transformer blocks, d = 768, 12 attention heads, and 110M parameters.
• BERT-large: 24 transformer blocks, d = 1024, 16 attention heads, and 340M parameters.
• RoBERTa-base: 12 transformer blocks, d = 768, 12 attention heads, and 125M parameters.
• RoBERTa-large: 24 transformer blocks, d = 1024, 16 attention heads, and 355M parameters.
Table 1: The obtained results using single-task and joint training for both Sub-Tasks 1 and 2. The best performances are highlighted in bold. The superscript ‡ denotes the results of our two official submissions to Sub-Tasks 1 and 2 (TEST).

We implement a simple text preprocessing pipeline that normalizes contractions. All models are trained using the Adam optimizer (Kingma and Ba, 2015) with a learning rate of $1 \times 10^{-5}$. The batch size and the number of epochs are fixed to 16 and 5, respectively. We investigate both single-task training and joint training of Sub-Task 1 and Sub-Task 2 (training a single model on the data of both sub-tasks). All models are trained on the full training sets, validated on the trial sets, and evaluated on the test set of each sub-task. For evaluation purposes, we use the shared task's evaluation metrics, namely the Pearson correlation, the Spearman correlation, the Mean Absolute Error (MAE), the Mean Squared Error (MSE), and the coefficient of determination (R2).
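For reference, the listed metrics can be computed with standard scipy and scikit-learn implementations, as in the sketch below; the official scoring script may differ in detail, and the toy scores are made up.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred):
    """Compute the shared task's evaluation metrics for a set of predictions."""
    return {
        "pearson": pearsonr(y_true, y_pred)[0],
        "spearman": spearmanr(y_true, y_pred)[0],
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mean_squared_error(y_true, y_pred),
        "r2": r2_score(y_true, y_pred),
    }

# Toy example with made-up complexity scores.
y_true = np.array([0.20, 0.35, 0.50, 0.10, 0.65])
y_pred = np.array([0.25, 0.30, 0.55, 0.15, 0.60])
print(evaluate(y_true, y_pred))
```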

Experiment Results
Table 1 presents the results obtained by our model for both single-task and joint training, using the four transformer-based encoders. The overall results show that models jointly trained on both sub-tasks outperform their single-task counterparts. Using deeper (large) encoders in our model yields better correlation performance. The best results on the correlation metrics are obtained using joint training and RoBERTa-large. For Sub-Task 1, the best MAE, MSE, and R2 are achieved using single-task training with the RoBERTa-large encoder. For Sub-Task 2, the best performance on all evaluation measures is obtained using joint training. As in Sub-Task 1, the best correlation performance is attained using the RoBERTa-large encoder; the best R2, however, is achieved using BERT-large, while the best MAE and MSE are obtained using RoBERTa-base.
To sum up, the best performance is obtained by joint training of our model on top of a deep encoder. These results can be explained by the fact that deep encoders yield better input representations for both sub-tasks, while joint training helps the model leverage the training signal of both sub-tasks.

Ablation Experiment
In order to assess the effectiveness of each component of our model, we perform an ablation study using joint training and the RoBERTa-large encoder. Table 2 reports the results of this ablation study. Overall, the components of our model contribute to the system's performance. The auxiliary loss improves the correlation measures, although it degrades MAE, MSE, and R2. Combining the RMSE and auxiliary losses using uncertainty loss weighting yields a further slight improvement of the correlation measures.

Conclusion
In this paper, we have presented our submitted system to SemEval-2021 Task 1. The proposed system consists of a deep learning model for word and MWE complexity prediction. Our model employs a residual attention block and a regression module on top of a pre-trained transformer encoder. We have trained the model to minimize the uncertainty-weighted combination of the RMSE loss and the auxiliary loss associated with the negative Pearson correlation. Experiments are performed using the base and the large variants of the pre-trained BERT and RoBERTa encoders. The best performance is obtained using the RoBERTa-large encoder with joint training on the data of both sub-tasks.