Cambridge at SemEval-2021 Task 1: An Ensemble of Feature-Based and Neural Models for Lexical Complexity Prediction

This paper describes our submission to the SemEval-2021 shared task on Lexical Complexity Prediction. We approached it as a regression problem and present an ensemble combining four systems: one feature-based and three neural models exploring fine-tuning, frequency pre-training and multi-task learning. The ensemble achieved Pearson scores of 0.8264 and 0.7556 on the trial and test sets respectively (sub-task 1). We further present our analysis of the results and discuss our findings.


Introduction
Predicting which words are considered hard to understand for a given target population has many applications. For example, it can be used to identify texts appropriate for language learners, or be included in a pipeline for automatic text simplification for people with low literacy skills or reading disabilities (Xia et al., 2016; Shardlow, 2014; Gooding and Kochmar, 2019b). In this paper, we describe our submission to the SemEval-2021 shared task on Lexical Complexity Prediction (LCP) (sub-task 1), where participating teams are expected to predict the complexity score of single words in context (Shardlow et al., 2021). In contrast to previous shared tasks on Complex Word Identification (CWI), which primarily focused on binary classification, with systems expected to identify words as complex or not (Paetzold and Specia, 2016a; Yimam et al., 2018), a new multi-domain English dataset annotated on a 5-point Likert scale was used for this purpose (Shardlow et al., 2020). We approached LCP as a regression problem and proposed a traditional feature-based model, as well as three neural models exploring fine-tuning, frequency pre-training and multi-task learning (MTL).
The remainder of this paper is organised as follows. Section 2 presents related work in the area.
In Section 3, we describe our approach to the task and detail the four models included in our final ensemble system. In Section 4, we turn to the experiments, describing the data and evaluation metrics used, and presenting our results on the shared task trial set. Section 5 presents our official results on the shared task test set, and offers a discussion of the results and the performance of our submitted system. Finally, we conclude the paper and provide an overview of our findings in Section 6.

Related work
The SemEval-2016 shared task on CWI (Paetzold and Specia, 2016a) was framed as a binary classification problem, where complexity was defined as whether or not a word is difficult to understand for non-native English speakers. A set of 400 non-native speakers annotated the data in a binary fashion, and a word was labelled as complex if it was annotated as complex by at least one annotator. The study performed by Zampieri et al. (2017) showed that most systems performed poorly due to the way the data was annotated. They also found that words annotated as complex by the majority of human annotators tend to be easier for systems to identify, arguing that lexical complexity should be seen as a continuum rather than a binary value. The second CWI shared task was organised as part of the BEA-2018 workshop (Yimam et al., 2018). It extended the previous one by introducing a new probabilistic classification sub-task, where participants were asked to assign the probability that an annotator would find a word complex. The continuous complexity value for each word was calculated as the proportion of annotators that found the word complex. The results of the shared task showed that traditional feature-engineering approaches (mostly based on length and frequency features) performed better than neural network and word embedding approaches, including the winning system, Camb-2018, from Gooding and Kochmar (2018). However, this system was subsequently outperformed by a sequence-labelling approach to CWI that incorporated word context (Gooding and Kochmar, 2019a). In both shared tasks, the top-performing systems demonstrated the strength of ensemble models (Paetzold and Specia, 2016b; Gooding and Kochmar, 2018).

Random forest regression
As a baseline, we trained a simple random forest regressor based on 15 manually selected linguistic features. The regressor was trained with 100 trees, and we used mean absolute error (MAE) to measure the quality of each split. Most of our features were inspired by psycholinguistic studies and readability metrics. The full list of features can be found in Table 1.
Frequency Based on the psycholinguistic findings that the frequency of a word is strongly correlated with the speed at which it is processed (Preston, 1935; Monsell et al., 1989; Brysbaert et al., 2011), we introduced six features based on frequencies found in the Simple English Wikipedia (SimpleWiki). 1 We selected SimpleWiki for its standardised form, relatively low frequency of complex words, and coverage of a large number of topics. Two of our frequency-based features were calculated based on the frequency of words that match both the surface form and the syntactic role: this was done as a coarse form of word sense disambiguation, but also to capture syntactic complexity.
Syntax Psycholinguistic studies have shown that syntactic complexity is linked to processing speed (Ferreira, 1991) and working memory limitations (Norman et al., 1992), which may affect participants' perception of lexical complexity. In a similar vein, we added three syntactic features: the number of compounds and modifiers in the phrase containing the target word, and the number of child dependencies linked to the target word.
Readability We included syllable-based 2 and character-based metrics, which were inspired by traditional readability metrics such as the Flesch-Kincaid readability tests (Kincaid et al., 1975) and the Coleman-Liau index (Coleman and Liau, 1975).

Fine-tuning BERT
Fine-tuning pre-trained language models via supervised learning has become the key to achieving state-of-the-art performance in various natural language processing (NLP) tasks. Our approach builds upon this, where we used BERT (Devlin et al., 2019) as the underlying language model and added a linear layer on top that allows for regression.
We treated it as a sequence regression problem and constructed the input by concatenating the target word w_t, the complexity of which was to be determined, and its context sentence:

[CLS] w_t [SEP] context sentence [SEP]

We then fed the [CLS] representation into the output layer for regression.
We used the L1 loss, which measures the MAE of the predictions:

loss = (1/N) * Σ_i |x_i − y_i|

where x and y are respectively the output of the model and the target value, and N is the batch size.
During training, the whole model was optimised in an end-to-end manner.
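The L1 loss above can be illustrated with a small framework-free computation; in a PyTorch setup it corresponds to `torch.nn.L1Loss` applied to a batch of model outputs.

```python
def l1_loss(outputs, targets):
    """Mean absolute error over a batch: (1/N) * sum(|x_i - y_i|)."""
    assert len(outputs) == len(targets)
    n = len(outputs)
    return sum(abs(x - y) for x, y in zip(outputs, targets)) / n

# Toy batch of N=3 predicted and gold complexity scores.
batch_loss = l1_loss([0.30, 0.55, 0.10], [0.25, 0.60, 0.20])
```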

Frequency pre-training
We proposed an extension to the fine-tuned BERT system by introducing a pre-training step. We constructed a new pre-training set of 20,000 sentences extracted from SimpleWiki, filtering for whole sentences by detecting the presence of verbs, and removing sentences longer than 256 words, the length of the longest sentence in the training data.
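The sentence filter described above can be sketched as follows. The paper detects verbs via POS tags; here a caller-supplied `has_verb` flag stands in for the real tagger, purely for illustration.

```python
MAX_LEN = 256  # length of the longest sentence in the training data

def keep_sentence(sentence, has_verb):
    """Keep a candidate pre-training sentence only if a verb was
    detected (a proxy for being a whole sentence) and it is not
    longer than MAX_LEN words."""
    return has_verb and len(sentence.split()) <= MAX_LEN

kept = keep_sentence("The cat sat on the mat.", has_verb=True)
dropped = keep_sentence("List of rivers of Spain", has_verb=False)
```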
The frequency of each word and part-of-speech (POS) combination in SimpleWiki was counted and converted into a value between 0 and 1:

log(f) / log(h)

where f is the original frequency value and h is the highest frequency found (excluding stop words). This conversion makes use of the Zipfian distribution observed in natural language (Zipf, 1935), allowing the model to be pre-trained on output values that match the range in the shared task dataset (see Section 4.1 for more details). We chose this particular frequency feature because, among the 15 features used in the random forest regressor, it is the most strongly correlated with the complexity values in the training data (see Table 1, #12).
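A hedged sketch of the conversion, where log(f)/log(h) is taken as the Zipf-motivated mapping of (word, POS) frequencies into [0, 1]; the counts and stop-word list are toy stand-ins for the SimpleWiki statistics.

```python
import math
from collections import Counter

# Toy (word, POS) frequency table; in the paper these counts come
# from SimpleWiki.
counts = Counter({("cat", "NOUN"): 120,
                  ("purr", "VERB"): 3,
                  ("the", "DET"): 9000})
STOP_WORDS = {"the"}

# h: highest frequency, excluding stop words.
h = max(f for (w, _), f in counts.items() if w not in STOP_WORDS)

def scaled_frequency(word, pos):
    """Map a raw frequency f into [0, 1] via log(f) / log(h)."""
    f = counts.get((word, pos), 1)  # back off to 1 for unseen pairs
    return math.log(f) / math.log(h)

freq_cat = scaled_frequency("cat", "NOUN")   # the most frequent content word
```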

Neural multi-task learning
MTL allows models to use information from related tasks and learn from multiple objectives, which can lead to performance improvements on the individual tasks (Rei and Yannakoudakis, 2017; Yuan et al., 2019; Taslimipoor et al., 2020; Andersen et al., 2021). Instead of only predicting the complexity value of a word in context, we extended the model to incorporate auxiliary objectives. We used a joint learning approach trained on in-domain data only and experimented with three related tasks to boost model performance:
• POS tagging
• Grammatical Relations (GR) prediction: We included as an auxiliary objective the prediction of the GR type of a dependent with its head.
• Genre classification: A classification task was introduced to predict the genre of the text.
Model weights were shared between the main and auxiliary training objectives. We used pre-trained DistilBERT (Sanh et al., 2019) for language representation as the basis for our neural network and added additional layers on top of the Transformer (Vaswani et al., 2017) architecture for fine-tuning.
The final layer for the LCP objective is a fully connected layer that performs regression. Unlike the first two neural systems, we treated this as a token regression problem: we input only the context sentence and fed the vector representation of the target word w_t into the output layer for regression. For those cases where the target word was split into multiple sub-tokens, we took the averaged representation.
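The sub-token averaging step can be sketched as an element-wise mean over the word-piece vectors; the 4-dimensional vectors here are toy stand-ins for DistilBERT hidden states.

```python
def average_subtokens(subtoken_vectors):
    """Element-wise mean of the sub-token representations of one word."""
    n = len(subtoken_vectors)
    dim = len(subtoken_vectors[0])
    return [sum(v[i] for v in subtoken_vectors) / n for i in range(dim)]

# e.g. a target word split by the tokenizer into ["un", "##common"]
target_vec = average_subtokens([[0.2, 0.4, 0.0, 1.0],
                                [0.4, 0.0, 0.2, 1.0]])
```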
Additionally, a new output layer was introduced to perform each auxiliary task. For the two token-level auxiliary tasks (POS and GR), the token representations were fed into the output layer. The model only predicted labels for auxiliary objectives on the first token of a word, in an identical fashion to Devlin et al. (2019). For genre classification, we used the [CLS] representation. The overall loss function is a weighted sum of the main LCP loss (measured as MAE) and the auxiliary loss (measured as cross-entropy).

Experiments

Dataset and evaluation
The data used in this shared task is an augmented version of CompLex (Shardlow et al., 2020), a multi-domain English dataset with sentences annotated on a 5-point Likert scale, with 1 being very easy and 5 very difficult. The final complexity labels were normalised to the range [0, 1]. The dataset contains texts from three genres (Bible, Biomedical and Europarl) and covers both single words (sub-task 1) and multi-word expressions (sub-task 2). Since we focused on sub-task 1, we used only single-word instances in our experiments. Corpus statistics are given in Table 2. Systems were evaluated using Pearson correlation. We also report scores for the following metrics: Spearman correlation, MAE, mean squared error (MSE) and R-squared (R2).
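The evaluation metrics listed above can be computed with scipy and scikit-learn; the gold and predicted scores below are toy values for illustration.

```python
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

gold = [0.20, 0.35, 0.50, 0.65, 0.80]   # normalised complexity labels
pred = [0.25, 0.30, 0.55, 0.60, 0.85]   # toy system predictions

pearson = pearsonr(gold, pred)[0]       # primary shared-task metric
spearman = spearmanr(gold, pred)[0]
mae = mean_absolute_error(gold, pred)
mse = mean_squared_error(gold, pred)
r2 = r2_score(gold, pred)
```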

Training details
We used spaCy 3 to preprocess the data and to automatically generate the lemma, POS and GR labels used in our experiments.
For the feature-based system, we used the random forest regressor in the scikit-learn library. 4 For the neural systems, we used pre-trained language models provided by huggingface. 5 All neural systems were trained using the AdamW variant (Loshchilov and Hutter, 2019) of the Adam stochastic optimisation algorithm (Kingma and Ba, 2015). Detailed hyperparameters are listed in Table 3. Each neural model was trained on one NVIDIA Tesla P100 GPU.

Individual system performance
Individual system performance on the trial set is reported in Table 4, where RandomForest refers to the feature-based random forest regression system, BERT refers to the fine-tuned BERT system, BERT freq. refers to the fine-tuned BERT system with frequency pre-training, and MTL X refers to the MTL system, with the subscript X representing the auxiliary task (POS, GR, or genre). We also report results from Camb-2018, the winning system in the BEA-2018 CWI shared task: a feature-based, context-independent linear regression model.
We can see that our feature-based RandomForest system achieved comparable performance to the heavily feature-engineered Camb-2018 system, despite using only 15 features. This may be due to the fact that linguistic features are often highly interdependent and capture very similar information.
We also notice that all our neural systems outperformed both feature-based systems by large margins (+0.1 Pearson). This contradicts the findings from the BEA-2018 CWI shared task, where traditional feature-based approaches performed better than neural network and word embedding approaches. This could possibly be explained by the use of pre-trained Transformer-based language models in our neural systems, as well as by the different annotation scheme employed when constructing the CompLex dataset used for this shared task. Nevertheless, our findings appear to match the general trend in NLP, where neural systems are overtaking feature-based models as the state of the art. All our neural systems produced comparable results: BERT freq. yielded the best Pearson, MAE, MSE and R2 scores, while BERT yielded the best Spearman score.

Table 5: Performance of ensemble systems on the trial set (sub-task 1). The best results are marked in bold.

Ensemble performance
We further averaged the outputs from individual systems to obtain an ensemble. Table 5 shows results for different system combinations. Overall, the best system consists of all the individual systems proposed in Section 3, including the feature-based RandomForest system, and achieved the best Pearson score of 0.8264, Spearman of 0.7676, MSE of 0.0063, and R2 of 0.6688. The ensemble of all neural systems yielded the best MAE of 0.0621.
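Since the ensemble is a plain average of system outputs, it can be sketched in a few lines; the per-system predictions below are toy values.

```python
def ensemble(predictions_per_system):
    """Average the predictions of several systems, instance by instance."""
    n_systems = len(predictions_per_system)
    return [sum(preds) / n_systems for preds in zip(*predictions_per_system)]

combined = ensemble([
    [0.30, 0.60],   # e.g. RandomForest
    [0.34, 0.58],   # e.g. BERT
    [0.32, 0.62],   # e.g. BERT with frequency pre-training
])
```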

Official results and discussion
Our submission to the LCP shared task (sub-task 1) is the output of our best system (in terms of Pearson), an ensemble of three neural systems and one feature-based system: MTL All + BERT + BERT freq. + RandomForest. The official results are reported in Table 6. Our final system achieved a Pearson score of 0.7556.

Per-genre performance
Using the Pearson correlation metric, the highest performance is obtained on the Biomedical data, followed by the Bible and Europarl data. On the MAE metric, however, the worst performance is found for the Biomedical data (see Table 6). We hypothesise that this might result from differences in the distribution of the lexical complexity scores. In particular, the scores for the Biomedical data appear to have a slightly larger interquartile range (see Appendix A, Figure A.1c).

Individual system contribution
To measure the contribution of each individual system to the overall performance, a number of ablation tests were performed, where one system was removed at a time. Results in Table 6 suggest that all neural systems have positive effects on the overall performance. Among them, MTL All is the most effective one, whose absence is responsible for a 0.02 decrease in Pearson, followed by BERT freq. and BERT. Interestingly, removing RandomForest yielded a better Pearson score of 0.7560, indicating that it brought overall performance down. This is inconsistent with our results on the trial set (see Table 5), where all systems contributed to the final system.

Table 6: Official results of our submitted system on the test set (sub-task 1). Per-genre performance and ablation test results are included.

Analysis of RandomForest
To understand why the feature-based regressor performed worse on the test data, we examined the correlation between each feature and the complexity scores in the training (train), trial, and test sets. Results in Table 1 show that several linguistic features (particularly #3, #4, and #10) are more strongly correlated with scores in the trial data than in the test data, which may explain the discrepancy in our results. Although most features appear to have a small but significant correlation with complexity in the training data, many are not significant in the test data, likely due to the smaller sample size. This suggests that, while there may be some weak overall correlation between these features and complexity, there is sufficient noise in the data that the relationship is negligible and unreliable when used to predict the complexity of a given word. Additionally, we investigated the importance of each feature in the random forest regressor, as measured by the mean permutation importance (Breiman, 2001); see Table 1. Our analysis reveals that the frequency of the target lemma (#8) is the most important one, followed by the frequency of the target word itself (#7). Both of these features are more strongly correlated with complexity in the trial data than in either the training or test data, which also contributes to the inconsistency described above.
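The permutation-importance analysis can be reproduced with scikit-learn's `permutation_importance`, which shuffles one feature at a time and measures the resulting drop in score. The data and feature count here are toy stand-ins, with feature 0 constructed to drive the target.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.random((300, 3))
y = 2.0 * X[:, 0] + 0.1 * rng.random(300)   # feature 0 drives the target

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Shuffle each feature 5 times and average the score drop.
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
most_important = int(np.argmax(result.importances_mean))
```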

Conclusion
This paper presents our contribution to the SemEval-2021 shared task on LCP. We competed in sub-task 1 (single words) with an ensemble system combining three neural models and one feature-based model. Our analysis reveals that even though all three neural systems perform comparably, the MTL system contributed the most to the ensemble. Adding the feature-based model improved performance on the trial data, but brought performance down on the test data. In addition to the mismatch between the trial and test data, the noise in the data further contributed to this inconsistency. The comparatively lower performance of the feature-based system is especially interesting given that such systems were competitive for CWI until relatively recently (Gooding and Kochmar, 2018). When looking at different genres, our submitted system yielded the highest Pearson but the worst MAE in the Biomedical domain, compared to the other genres. We hypothesise that this is due to differences in data distribution between genres.