IAPUCP at SemEval-2021 Task 1: Stacking Fine-Tuned Transformers is Almost All You Need for Lexical Complexity Prediction

This paper describes our submission to SemEval-2021 Task 1: predicting the complexity score for single words. Our model leverages standard morphosyntactic and frequency-based features that proved helpful for Complex Word Identification (a related task), and combines them with predictions made by Transformer-based pre-trained models that were fine-tuned on the Shared Task data. Our submission system stacks all previous models with a LightGBM at the top. One novelty of our approach is the use of multi-task learning for fine-tuning a pre-trained model for both Lexical Complexity Prediction and Word Sense Disambiguation. Our analysis shows that all independent models achieve a good performance in the task, but that stacking them obtains a Pearson correlation of 0.7704, merely 0.018 points behind the winning submission.


Introduction
Complex Word Identification (CWI) consists of determining which words or multi-word expressions (MWE) in a text could be difficult for certain readers to understand. This is one of the first steps in the typical Lexical Simplification pipeline (Shardlow, 2014). CWI has traditionally been treated as either a binary (Paetzold and Specia, 2016) or regression (Štajner et al., 2018) task. For the latter, the complexity of a word/MWE was computed as a percentage of binary complexity ratings. Recently, Shardlow et al. (2020) proposed to move away from the binary definition of CWI, and instead collected complexity ratings using Likert scales. This allows re-defining the task as Lexical Complexity Prediction (LCP). Leveraging this newly collected data, the First LCP Shared Task was organised in SemEval-2021 (Shardlow et al., 2021).
Our team participated in Sub-task 1: predicting the complexity score of single words. Given a sentence and a target word in it, the goal is to predict the complexity score of the target. One particular challenge is that the same target can have different complexity scores depending on the sentence it appears in. Therefore, our proposed approach takes the context of the target into consideration in two ways. First, we use contextualised word representations from pre-trained Transformer-based models, such as RoBERTa (Liu et al., 2019) and XLNet (Yang et al., 2019). In particular, we use the LCP data to fine-tune a RoBERTa model and an XLNet model that receive as input the target and a context window of 1, and a second RoBERTa model whose inputs are the target and a context window of 2. Second, we hypothesise that different contexts could evoke different senses of the target word. As such, we exploit data for Word Sense Disambiguation (WSD) through multi-task learning. In particular, we fine-tune a BERT (Devlin et al., 2019) model with two tasks: LCP and WSD, using the Unified Evaluation Framework (Raganato et al., 2017) for the latter. The predictions from all these models are combined with several morphosyntactic and corpus-based features, and used to train a Gradient Boosting Decision Tree with LightGBM (Ke et al., 2017).
On the test set of the Shared Task, our model achieved a Pearson correlation of 0.7704 and ranked 10th, only 0.018 points behind the winner. An ablation study shows that all independent models contributed to the stacked model's performance, with the predictions from the BERT model fine-tuned in a multi-task fashion having the greatest impact in predicting lexical complexity. The code to reproduce our results is available at: https://github.com/kdrivas/lexical_complexity.

Background
The LCP Shared Task on SemEval-2021 asks participants to develop models that predict the complexity of a target word/MWE in a sentence in English (Shardlow et al., 2021). This Shared Task builds on previous editions that focused on Complex Word Identification (Paetzold and Specia, 2016; Štajner et al., 2018), with a key difference: complexity ratings are continuous scores instead of binary. Furthermore, the same target word/MWE can appear in more than one sentence but with different complexity scores. Table 1 presents an example from the data.

Table 1
Sentence with Target | Complexity
His left hand is under my head. | 0.125
Do therefore according to your wisdom, and don't let his gray head go down to Sheol in peace. | 0.383

The data for the Shared Task is an extension of CompLex (Shardlow et al., 2020), a dataset with complexity ratings for target words/MWE in sentences in English in three domains: Bible, Europarl and Biomed. The dataset is split into two subtasks: LCP for single words and LCP for MWEs.

System Description
This section details our stacking approach to the LCP Shared Task Sub-task 1. An overview of our system can be seen in Figure 1.

Features
After joining all the data from both subtasks (single word and MWE), we extracted features presented by Yimam et al. (2018) and Finnimore et al. (2019), together with custom ones, such as (1) the complexity of the target words in the lexicon proposed by Maddela and Xu (2018), (2) the predictions from four fine-tuned Transformer-based models, and (3) the number of senses and dependencies of the target word/MWE.

Morphosyntactic and Lexical Features
First, we computed the number of characters and the number of words surrounding the target word/MWE. In addition, we obtained the part-of-speech of the first token and the syntactic dependencies of the whole target using the spaCy library (https://spacy.io/). We also counted the number of possible part-of-speech tags for the token using the Brown dictionary in NLTK. Then, using the whole sentence, we counted the number of prepositions, verbs, nouns and adverbs, and computed the ratio between the number of nouns and verbs. Finally, we calculated the total number of syllables and morphemes.
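A few of these counts can be illustrated with a minimal sketch. It assumes the sentence has already been tokenised and tagged with coarse Universal POS labels (for instance by spaCy); the function name, the feature names and the choice of ADP as a stand-in for prepositions are all hypothetical and are not taken from our code.

```python
def basic_features(tokens, pos_tags, target_index):
    """Compute a few simple counts for the target at `target_index`.

    `tokens` and `pos_tags` are parallel lists. The tags are assumed to be
    coarse Universal POS labels such as "NOUN", "VERB", "ADP" and "ADV".
    """
    n_nouns = pos_tags.count("NOUN")
    n_verbs = pos_tags.count("VERB")
    return {
        "target_num_chars": len(tokens[target_index]),
        # Assumption: sentence size is the sum of token lengths (no spaces).
        "sentence_num_chars": sum(len(token) for token in tokens),
        # Assumption: every other token counts as a surrounding word.
        "num_context_words": len(tokens) - 1,
        "target_pos": pos_tags[target_index],
        "num_adpositions": pos_tags.count("ADP"),
        "num_verbs": n_verbs,
        "num_nouns": n_nouns,
        "num_adverbs": pos_tags.count("ADV"),
        # The ratio is undefined when the sentence has no VERB tag.
        "noun_verb_ratio": n_nouns / n_verbs if n_verbs else None,
    }


tokens = ["His", "left", "hand", "is", "under", "my", "head", "."]
pos_tags = ["PRON", "ADJ", "NOUN", "AUX", "ADP", "PRON", "NOUN", "PUNCT"]
features = basic_features(tokens, pos_tags, target_index=6)
```

Returning `None` for an undefined ratio keeps missing values explicit, which a gradient boosting model can typically handle as a null entry.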

N-gram Features
We formed n-grams considering one and two tokens surrounding the target word/MWE. Then, we computed their frequency in the Children's Book Test (Hill et al., 2015) and Simple Wikipedia (Kauchak, 2013). In addition, we computed the frequency of the target tokens using the previous corpora, the Lang-8 corpus (Mizumoto et al., 2011) and the Tatoeba corpus.
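The n-gram lookup can be sketched as follows. This is an illustration under assumptions, not our implementation: it treats "one and two tokens surrounding the target" as a symmetric window around the target position, uses a tiny toy corpus in place of the real ones, and the helper names are hypothetical.

```python
from collections import Counter


def context_ngram(tokens, target_index, window):
    """Return the target plus up to `window` tokens on each side."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    return tuple(tokens[start:end])


def count_ngrams(corpus_sentences, n):
    """Count every contiguous n-gram in a list of tokenised sentences."""
    counts = Counter()
    for sentence in corpus_sentences:
        for i in range(len(sentence) - n + 1):
            counts[tuple(sentence[i:i + n])] += 1
    return counts


# Toy reference corpus standing in for a large corpus.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "cat", "sat", "down"],
]
sentence = ["the", "cat", "sat", "quietly"]

ngram = context_ngram(sentence, target_index=1, window=1)
ngram_frequency = count_ngrams(corpus, len(ngram))[ngram]
target_frequency = count_ngrams(corpus, 1)[("cat",)]
```

Near a sentence boundary the window is truncated, so the resulting n-gram can be shorter than `2 * window + 1` tokens.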

Word Complexity Lexicon
The lexicon created by Maddela and Xu (2018) contains complexity scores for more than 15,000 words. After lower-casing the words in the lexicon and in the datasets from the Shared Task, we assigned the complexity from the lexicon to the words in the LCP data. If a word does not appear in the lexicon, we assigned a null value.
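The lookup described above amounts to a case-insensitive dictionary access with a null fallback. A minimal sketch, in which the lexicon entries and scores are made up for illustration and the function name is hypothetical:

```python
def lookup_complexity(word, lexicon):
    """Return the lexicon score for `word`, or None if it is not covered."""
    return lexicon.get(word.lower())


# Made-up entries; the real lexicon has its own words and score range.
raw_lexicon = {"Head": 2.1, "sheol": 5.4}

# Lower-case the lexicon keys once so lookups are case-insensitive.
lexicon = {word.lower(): score for word, score in raw_lexicon.items()}

known_score = lookup_complexity("HEAD", lexicon)
missing_score = lookup_complexity("wisdom", lexicon)
```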

Transformer-based Model Predictions
The last set of features is composed of the predictions of four pre-trained language models fine-tuned on the training data of both subtasks. The first three were a RoBERTa (Liu et al., 2019) model and an XLNet (Yang et al., 2019) model that received as input the target word/MWE and a context window of 1, and a RoBERTa model with the target and a context window of 2. The last model was a BERT model fine-tuned in a multi-task fashion with two tasks: LCP and Word Sense Disambiguation (WSD). For the former task, we only used the data generated with a window size of 1 and, for the latter, the Unified Evaluation Framework (Raganato et al., 2017).
Multi-Task Model. Given a sentence S from the Shared Task dataset and a complex word w in position a whose part of speech is p, we obtain a subsequence with a context window of 1, $sub = \langle w_{a-1}, w_a, w_{a+1} \rangle$; then:

$$CLS = \mathrm{BERT}(sub)$$

where CLS is the CLS token of BERT, which represents the sentence. This representation is concatenated with the embedding of p:

$$h = [CLS ; \mathrm{Emb}(p)]$$

The concatenated vector is then used as input to a dropout layer and a linear layer:

$$out_1 = \mathrm{Linear}(\mathrm{Dropout}(h))$$

Using $out_1$, we computed the loss $L_1$ using mean squared error. After getting the loss of the first task, we computed the loss for the second one. Given an ambiguous sentence S and an output sequence of sense ids A, we used the BertForTokenClassification implementation in HuggingFace (footnote 4) to obtain the output $out_2$, and then used cross entropy to compute the loss $L_2$. Finally, we multiply each task loss by a weight to get the overall loss:

$$L = \lambda_1 L_1 + \lambda_2 L_2$$

where $\lambda_1$ and $\lambda_2$ are the weights of the LCP and WSD tasks, respectively.

Finally, we perform other experiments

Architecture
Our model architecture is shown in Figure 1. First, we obtained the predictions from the four language models. Then, we concatenated those predictions with the additional features, and stacked a LightGBM model on top that received them as input features.

Footnote 4: https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification
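The concatenation step of the stacking architecture can be sketched as building one input row per instance. The model keys, feature names and values below are illustrative placeholders rather than the ones used in our system, and `build_stacking_row` is a hypothetical helper.

```python
def build_stacking_row(model_predictions, extra_features):
    """Merge per-model predictions and other features into one input row."""
    row = {f"pred_{name}": score for name, score in model_predictions.items()}
    clashes = set(row) & set(extra_features)
    if clashes:
        raise ValueError(f"duplicate feature names: {sorted(clashes)}")
    row.update(extra_features)
    return row


# Illustrative predictions from the four fine-tuned models for one instance.
model_predictions = {
    "roberta_w1": 0.31,
    "xlnet_w1": 0.29,
    "roberta_w2": 0.33,
    "bert_multitask": 0.30,
}
# Illustrative additional features for the same instance.
extra_features = {"target_num_chars": 4, "num_senses": 3, "corpus": "bible"}

row = build_stacking_row(model_predictions, extra_features)
```

Rows of this shape, one per training instance, would then form the feature matrix for the top-level model.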

Experimental Setup
As previously described, we used four different models: RoBERTa, XLNet, BERT and LightGBM. In addition, for training/fine-tuning each model we chose the Mean Absolute Error (MAE) as our validation metric.

RoBERTa and XLNet
We fine-tuned the models for 4 epochs with a batch size of 24. In addition, we used a learning rate of 2e-5 and the Adam optimizer. We used the models for sequence classification provided by HuggingFace.

Multitask BERT
We fine-tuned a BERT model using two tasks: LCP and WSD. We trained the WSD task using the Unified Evaluation Framework (Raganato et al., 2017), but filtered out sentences longer than 22 tokens. For fine-tuning, we used a learning rate of 2e-5 and the Adam optimizer. We fine-tuned the model for 5 epochs with a batch size of 32. We calculated the loss by accumulating the gradients from both tasks. We also experimented with assigning different weights to each task, and found that the best configuration was 0.8 for LCP and 0.2 for WSD.
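The weighted combination of the two task losses can be illustrated in plain Python. This is a sketch of the arithmetic only, assuming a mean squared error for LCP and a token-level cross entropy for WSD as described earlier; the inputs are toy values and the function names are hypothetical.

```python
import math


def mse_loss(predictions, targets):
    """Mean squared error over predicted complexity scores (LCP task)."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def cross_entropy_loss(logits_per_token, gold_ids):
    """Mean token-level cross entropy over sense logits (WSD task)."""
    total = 0.0
    for logits, gold in zip(logits_per_token, gold_ids):
        max_logit = max(logits)  # subtract the max for numerical stability
        log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_norm - logits[gold]
    return total / len(gold_ids)


def multitask_loss(lcp_loss, wsd_loss, lcp_weight=0.8, wsd_weight=0.2):
    """Weighted sum of the task losses, using the best weights we found."""
    return lcp_weight * lcp_loss + wsd_weight * wsd_loss


# Toy values: two LCP predictions and one WSD token with two candidate senses.
loss_lcp = mse_loss([0.2, 0.4], [0.1, 0.5])
loss_wsd = cross_entropy_loss([[0.0, 0.0]], [0])
total_loss = multitask_loss(loss_lcp, loss_wsd)
```

With these weights the LCP error dominates the update, while the WSD term acts as a smaller auxiliary signal.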

LightGBM
At the top of our architecture, we used a LightGBM model. Using Hyperopt, a Bayesian optimization framework, we set a max-depth of 5, num-leaves of 8, min-sum-hessian-in-leaf of 0.9, a bagging-fraction of 0.9, a bagging-freq of 100, a learning-rate of 0.08, and a min-data-per-group of 100. We trained for 500 iterations with early stopping after 90 rounds. We also declared the type of corpus and the part of speech as categorical features.
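For reference, the configuration above maps onto LightGBM's parameter names roughly as follows. This is a hedged sketch: the objective is not stated in the text and is assumed to be regression, the categorical column names are hypothetical, and how the training options are passed depends on the LightGBM version.

```python
# Hyperparameters reported in the text, written with LightGBM parameter names.
lgbm_params = {
    "objective": "regression",  # assumption: not stated in the text
    "max_depth": 5,
    "num_leaves": 8,
    "min_sum_hessian_in_leaf": 0.9,
    "bagging_fraction": 0.9,
    "bagging_freq": 100,
    "learning_rate": 0.08,
    "min_data_per_group": 100,
}

# Training options reported in the text.
training_options = {
    "num_boost_round": 500,  # number of boosting iterations
    "early_stopping_rounds": 90,  # stop after 90 rounds without improvement
    "categorical_feature": ["corpus", "target_pos"],  # hypothetical column names
}
```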

Results
The test set contains more than 1,000 sentences with 573 different target words. Table 2 shows the official evaluation metrics for each domain-corpus in the LCP dataset. Overall, we achieved a Pearson correlation of 0.7704, and finished in 10th place in the Shared Task Sub-task 1, only 0.018 points behind the winning submission.
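To make the evaluation metrics concrete, the sketch below computes the Pearson correlation and the Mean Absolute Error in plain Python on toy values. The numbers are illustrative and not taken from our results; they only show that a set of predictions can correlate perfectly with the gold scores while still carrying a constant absolute error.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_error(predictions, golds):
    """Average absolute difference between predictions and gold scores."""
    return sum(abs(p - g) for p, g in zip(predictions, golds)) / len(golds)


gold = [0.1, 0.3, 0.5, 0.7]
# Predictions that follow the trend exactly but are shifted by a constant.
shifted = [g + 0.2 for g in gold]

correlation = pearson(shifted, gold)
error = mean_absolute_error(shifted, gold)
```

Because the two metrics measure different things (trend versus distance), a ranking by Pearson correlation need not agree with a ranking by MAE.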

[Table 2: official evaluation metrics for each corpus; columns include Corpus and Pearson]

The scores in the validation set (Table 3) behave similarly to those in the test set. For both, the corpus where our model achieves the best Pearson correlation is Biomed. However, looking at other metrics such as MAE, this corpus has the greatest error, with Europarl having the lowest. A possible explanation is that, even though the model may capture the trend of the outputs well, it could be more difficult to predict values in a corpus with a higher variance of complexity scores, as is the case for Biomed (Figure 2).

Ablation Study
Table 4 shows the contribution of each set of features (including predictions of fine-tuned models) to the final score. Although the predictions of the fine-tuned Transformer-based models perform very well independently, the combination of all the predictions and the additional traditional features achieves the best performance in the validation set.
Another way of visualising the importance of each feature is using SHAP values (Lundberg and Lee, 2017). Figure 3 reports the 10 most important features for the LightGBM model, i.e. the impact of each feature in predicting the target complexity score. The X-axis shows the increase or decrease of target complexity, while the red and blue colours indicate the magnitude of the feature value. For example, for the sentence-size feature, a larger number of characters has a positive impact, i.e. the predicted complexity increases. Conversely, a shorter sentence has a negative impact, i.e. the predicted complexity decreases. We can observe that the most important feature is the prediction given by the BERT Multitask model, since it has the greatest impact. This suggests that WSD data could benefit lexical complexity prediction. The predictions of the Transformer-based models are also among the five most important features.
Other features, such as the size of the sentence or the number of word senses, also contribute noticeably.

Conclusion
In this paper, we presented our system for the single word complexity prediction sub-task in the LCP Shared Task. Our approach consisted of combining lexical features and predictions from fine-tuned pre-trained Transformer-based models. We found that each set of features achieved a good performance on their own, and that combining all of them achieved our best result. In particular, we found that fine-tuning a pre-trained Transformer-based model using multi-task learning with data from word sense disambiguation helped the most with learning to predict lexical complexity.

Table 4: Results of each approach on validation data

Considering that there were unseen tokens in the validation and test sets, the task resembles a zero-shot classification problem. Therefore, as future work, semi-supervised learning approaches or data augmentation algorithms could be explored, and other Transformer-based models, such as RoBERTa, could be trained in a multi-task fashion.