CSECU-DSG at SemEval-2021 Task 1: Fusion of Transformer Models for Lexical Complexity Prediction

Lexical complexity prediction (LCP) estimates the complexity level of a token or a set of tokens in a sentence. It plays a vital role in improving various NLP tasks, including lexical simplification, translation, and text generation. However, the multiple meanings a word can take in different contexts, complex grammatical structure, and the mutual dependency of words in a sentence make it difficult to estimate lexical complexity. To address these challenges, SemEval-2021 Task 1 introduced a shared task focusing on LCP, and this paper presents our participation in it. We propose a transformer-based approach with sentence pair regression. We employ two fine-tuned transformer models, BERT and RoBERTa, and fuse their predicted scores into the final complexity estimate. Experimental results demonstrate that our proposed method achieved competitive performance compared to the participants' systems.


Introduction
Lexical complexity prediction (LCP) has become an important task in this age of globalization, especially for second language learners (Przybyła and Shardlow, 2020). LCP is an extension of the complex word identification (CWI) task (Paetzold and Specia, 2016; Štajner et al., 2018): CWI is a binary classification of whether a word is complex or not, whereas LCP assigns a continuous complexity label to a word in a sentence. LCP plays a vital role in many NLP applications such as lexical simplification (Paetzold, 2016; Qiang et al., 2020), text generation, and machine translation (Wang et al., 2016). Besides, it helps people who suffer from reading difficulties.

**The first two authors have equal contributions.
LCP is a very challenging task (Zampieri et al., 2017), especially because different target audiences have distinct needs. For example, speakers of one language are usually less familiar with different subsets of the vocabulary of a second language. Besides, the grammatical shape of a sentence and the ambiguous meaning of a word in different places make this task more challenging and important to explore. A single word may exhibit different lexical complexity because of its usage, position, tense form, and redundancy across sentences or within the same sentence. To estimate multi-word complexity, we also need to consider the dependency between tokens.

[Table 1: example instances with columns Sentence, Token, and Complexity.]

To address the challenges of lexical complexity prediction of words in sentences, Shardlow et al. (2021a) proposed a shared task at SemEval-2021 (Task 1). The task is divided into two sub-tasks. In sub-task 1, a system needs to determine the complexity level of a single word in a sentence, whereas in sub-task 2, a system needs to determine the overall complexity level of multiple words in a sentence. To illustrate both sub-tasks, we articulate a few examples in Table 1.
Taking part in the LCP shared task of SemEval-2021, we exploit the pairwise contextual information of the sentence and the token. In this regard, we propose a combined transformer-based framework with sentence pair regression. We build a pairwise learning framework over sentence-token pairs to train two state-of-the-art transformer models, BERT and RoBERTa.
We organize the rest of the paper as follows: Section 2 presents the details of our proposed framework. In Section 3, we present our experimental settings and analyze the performance of our model under various settings and against related methods. Finally, we conclude in Section 4 with some future directions.

Proposed Lexical Complexity Prediction Framework
In this section, we describe our proposed lexical complexity prediction framework. Our goal is to predict the complexity score of a token or a set of tokens in the given sentence. We depict the overview of our framework in Figure 1.

[Figure 1: Overview of the proposed framework. The single/multi-token and sentence pair is fed to BERT and RoBERTa, and their regression scores are fused into the final complexity score.]

In our framework, we use the sentence pair regression concept of transformer models to perform lexical complexity prediction: the input sentence and target word(s) are packed together into a single sequence. After performing sentence-token pair regression with the BERT and RoBERTa models, we obtain each model's regression score. Subsequently, we fuse the models' predictions by taking the mean of these scores to determine the final complexity score.
Figure 2: Pairwise learning using the BERT model (single/multi-token and sentence pair).

Fine-tuned Transformer Models
We fine-tune the transformer models to perform sentence pair regression for LCP through BERT and RoBERTa. We describe the details in the subsequent sections.

Input Representation
We train on sentence-word pairs to better capture their contextual relation, which in turn helps to estimate the complexity of the target word in the sentence. It is important for an LCP system to predict both single-word and multi-word complexity.
We exploit the Huggingface transformers library (Wolf et al., 2020) with pairwise training, where the input target word(s) and sentence are paired into a single sequence separated by the [SEP] token. We utilize two pre-trained transformer models, RoBERTa and BERT. For LCP training, each sequence begins with the special [CLS] token, whose final-layer representation is responsible for each model's regression score. Within each sequence, we separate the pair with the [SEP] token (as presented in Figure 2), where the target word(s) constitute text a and the sentence constitutes text b. We fine-tune the pre-trained BERT and RoBERTa models to estimate the complexity score.
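The pair layout described above can be sketched in plain Python (a minimal illustration of the [CLS]/[SEP] packing; `build_pair_sequence` is a hypothetical helper, and in practice the Huggingface tokenizers construct this sequence from token ids):

```python
def build_pair_sequence(target, sentence):
    """Pack the target word(s) (text a) and sentence (text b) into one
    BERT-style sequence: [CLS] text_a [SEP] text_b [SEP]."""
    return f"[CLS] {target} [SEP] {sentence} [SEP]"

seq = build_pair_sequence("river", "The river overflowed its banks.")
print(seq)  # [CLS] river [SEP] The river overflowed its banks. [SEP]
```

With the real tokenizers, the same layout is produced by passing the target and sentence as the two text arguments of the tokenizer call.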

BERT
BERT (Devlin et al., 2019), which stands for Bidirectional Encoder Representations from Transformers, is a method of pre-training sentence representations that achieves state-of-the-art results on many NLP tasks, including question answering, text classification, and sentence-pair regression. We use the BERT fast tokenizer and the bert-base-uncased model for sentence-pair regression, where the target word(s) and sentence are paired into a single sequence.

RoBERTa
RoBERTa (Liu et al., 2019) is an extension of the original BERT model, named the robustly optimized BERT pre-training approach. It revisits key hyper-parameter choices and removes the next sentence prediction (NSP) objective. Besides, it is trained with much larger mini-batches and learning rates. We use the RoBERTa fast tokenizer and the roberta-base model for sentence-pair regression to obtain the complexity score, where the target word(s) and sentence are trained as a pair.

Fusion of Transformer Models
To improve on the performance of the individual models, we fuse the predicted complexity scores of the two models to generate a unified score. We use the arithmetic mean of both models' regression scores to determine the final complexity score. The estimation is computed as follows:

z_i = (x_i + y_i) / 2

where x_i and y_i correspond to the BERT and RoBERTa regression scores, respectively, and z_i is the final complexity score of the i-th instance.
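The mean fusion can be sketched as follows (a minimal illustration; `fuse_scores` is a hypothetical helper name, not the authors' code):

```python
def fuse_scores(bert_scores, roberta_scores):
    """Arithmetic-mean fusion of the two models' regression scores:
    z_i = (x_i + y_i) / 2."""
    return [(x + y) / 2 for x, y in zip(bert_scores, roberta_scores)]

# Example values chosen as exact binary fractions so the mean is exact.
print(fuse_scores([0.25, 0.5], [0.75, 0.25]))  # [0.5, 0.375]
```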

Dataset Description
The organizers of the lexical complexity prediction (LCP) task at SemEval-2021 (Shardlow et al., 2021a) provided a multi-domain English benchmark dataset (Shardlow et al., 2021b) to evaluate the performance of the participants' systems. The dataset was collected from three different corpora: the Bible, Europarl, and biomedical articles. The task is divided into two sub-tasks: sub-task 1 focuses on single-word instances, whereas sub-task 2 focuses on multi-word instances.

Experimental Settings
We now describe the parameters used in our proposed lexical complexity prediction model. In our CSECU-DSG system, we fine-tune two state-of-the-art Huggingface transformer models, BERT and RoBERTa. We use the simpletransformers API (Rajapakse, 2019) to implement our system and train it on the provided training data. We trained the BERT and RoBERTa models for 5 epochs with a learning rate of 2.99e-5, save steps = 767, and evaluate-during-training steps = 40. We used a CUDA-enabled GPU and set the manual seed = 4 to generate reproducible results. Default settings were used for the other parameters.
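A configuration along these lines can be expressed with simpletransformers as follows (a sketch under our assumptions: the hyper-parameter values are the ones reported above, the argument names follow the library's conventions, and `train_df` is a hypothetical DataFrame with `text_a`, `text_b`, and `labels` columns):

```python
from simpletransformers.classification import ClassificationModel

args = {
    "regression": True,                    # sentence-pair regression, not classification
    "num_train_epochs": 5,
    "learning_rate": 2.99e-5,
    "save_steps": 767,
    "evaluate_during_training": True,
    "evaluate_during_training_steps": 40,
    "manual_seed": 4,
}

# One model per transformer; num_labels=1 yields a single regression output.
bert = ClassificationModel("bert", "bert-base-uncased", num_labels=1, args=args)
roberta = ClassificationModel("roberta", "roberta-base", num_labels=1, args=args)

# bert.train_model(train_df)
# roberta.train_model(train_df)
```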

Evaluation Measures
To evaluate the performance of participants' lexical complexity prediction systems, the SemEval-2021 Task 1 organizers used different strategies and metrics for sub-task 1 and sub-task 2. For both sub-tasks, standard evaluation metrics including Pearson correlation (R), Spearman correlation (Rho), mean absolute error (MAE), mean squared error (MSE), and R-squared (R^2) were applied. Pearson correlation (R) is the primary evaluation measure for both sub-tasks.
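These standard metrics can be computed from gold and predicted scores as below (a pure-Python sketch for illustration; the Spearman variant assumes no tied values, where the official scorers use library implementations that handle ties):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance over the product of std deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation: Pearson over ranks (assumes no ties)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(xs), ranks(ys))

def mae(gold, pred):
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)

def mse(gold, pred):
    return sum((g - p) ** 2 for g, p in zip(gold, pred)) / len(gold)

def r_squared(gold, pred):
    mg = sum(gold) / len(gold)
    ss_res = sum((g - p) ** 2 for g, p in zip(gold, pred))
    ss_tot = sum((g - mg) ** 2 for g in gold)
    return 1 - ss_res / ss_tot
```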

Results and Analysis
The comparative results of our proposed CSECU-DSG system along with the top-5 performing systems (Shardlow et al., 2021a) in sub-task 1 and sub-task 2 are presented in Table 2 and Table 3, respectively. Following the benchmark of SemEval-2021 Task 1, participants' systems are ranked by the primary evaluation metric, the Pearson correlation (R) score. We first present the performance of our proposed method, followed by the performance of the top-5 ranked participating systems and the LCP baselines. Our proposed method obtained competitive performance against the other top-performing systems. Compared to the other participants' methods, our system demonstrated a similar level of performance on both sub-tasks, suggesting its applicability and generalizability for the complexity estimation of both single-word and multi-word targets.

Discussion
To estimate the effect of each component of our CSECU-DSG model, we evaluated the performance of each individual model. The summarized experimental results for sub-task 1 and sub-task 2 are presented in Table 4.
From the results, it can be observed that the RoBERTa-based model performed better than the BERT-based model when considering the individual models' performances. However, combining the two models' regression scores using the mean increased the Pearson correlation score by more than 1% on both sub-tasks. This demonstrates the importance of our model fusion.
All three models performed better for multi-word complexity estimation than for single-word complexity. We see a similar trend in the other models' performances reported in Table 2 and Table 3. This demonstrates that estimating single-word complexity is more challenging than estimating the complexity of multi-word expressions. A multi-word expression contains more words and therefore more contextual information, which helps the model in complexity estimation.


Conclusion and Future Directions
In this paper, we presented our approach to the lexical complexity prediction task. We tackled the problem by performing sentence pair regression using two state-of-the-art transformer models, BERT and RoBERTa, in a unified architecture. Using pairwise learning, we exploited the contextual relation between sentence-word pairs to estimate the complexity score. Our method achieved competitive scores compared to the other participants' systems.
In the future, we plan to incorporate various handcrafted features into state-of-the-art neural methods to distill the relationship of sentence-word pairs for complexity estimation.