UTFPR at SemEval-2021 Task 1: Complexity Prediction by Combining BERT Vectors and Classic Features

We describe the UTFPR systems submitted to the Lexical Complexity Prediction shared task of SemEval 2021. They perform complexity prediction by combining classic features, such as word frequency, n-gram frequency, word length, and number of senses, with BERT vectors. We test numerous feature combinations and machine learning models in our experiments and find that BERT vectors, even if not optimized for the task at hand, are a great complement to classic features. We also find that employing the principle of compositionality can potentially help in phrase complexity prediction. Our systems place 45th out of 55 for single words and 29th out of 38 for phrases.


Introduction
Accurately measuring the complexity of words can be useful in many ways. It facilitates the creation of text simplification technologies that could, for example, help in identifying and adapting challenging excerpts of literary pieces targeting specific groups, such as children (De Belder and Moens, 2010) and second language learners (Paetzold and Specia, 2016e), and make news articles and official documents more accessible to the general population (Paetzold and Specia, 2016a).
This task has received a considerable amount of attention in recent years, especially due to the popularity of the Complex Word Identification (CWI) shared tasks of 2016 (Paetzold and Specia, 2016c) and 2018 (Yimam et al., 2018), where dozens of teams were challenged to judge the complexity of words in context. While the CWI 2016 task used a simple binary complex/not-complex classification setup for English only, the CWI 2018 task explored both binary classification and regression setups across multiple languages. The majority of the most successful systems submitted to these shared tasks combined ensemble methods, such as Random Forests (Ho, 1995) and AdaBoost (Freund and Schapire, 1997), with numerous linguistic features, including word frequencies, n-gram frequencies, word length, number of senses, number of syllables, psycholinguistic metrics, and word embeddings (Konkol, 2016; Malmasi et al., 2016; Paetzold and Specia, 2016d; Gooding and Kochmar, 2018; Hartmann and Dos Santos, 2018). However, because these tasks were held before the rise of transformer-based masked language models, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), we could not find any systems that exploited the power of the features produced by them.
In this paper, we describe the UTFPR systems for the Lexical Complexity Prediction shared task of SemEval 2021 (LCP 2021), which combine classic complexity prediction features with contextual word and phrase representations extracted from transformer-based models. In our experiments, we explore the efficacy of a number of different machine learning models, feature combinations, and corpora sources for our features. In what follows, we present the task being addressed (Section 2), our approach (Section 3), some preliminary experiments (Section 4), our final shared task results (Section 5), and our conclusions (Section 6).

Task Description
We address the LCP 2021 shared task (Shardlow et al., 2021), held at SemEval 2021. The shared task is split into two sub-tasks: predicting the in-context lexical complexity of single words and of phrases for the English language. Participants could choose to submit systems to either or both sub-tasks.
The organizers provided training, trial, and test sets for both sub-tasks. Each instance of these datasets is composed of an ID, a source identifier, a sentence, a target word or phrase within the sentence, and a complexity score calculated from judgments made by 20 English speakers from the USA, UK, and Australia. The source identifier describes where the sentence came from: the Bible, biomedical documents, or the Europarl corpus. The task's dataset is an extended version of the CompLex dataset (Shardlow et al., 2020).
The training, trial, and test sets for single words have 7662, 421, and 917 instances, respectively. The training, trial and test sets for phrases have 1517, 99, and 184 instances, respectively. Participants were allowed and encouraged to use any external resources they saw fit.

Approach
Our approach consists of using modern ensemble models to learn from a combination of commonly used complexity estimation features, such as word frequencies, word length, and number of senses, with contextual representations extracted from large pre-trained BERT-like models, which have been widely used to create state-of-the-art solutions to numerous tasks. While it has been observed that word frequencies (especially those extracted from spoken text) tend to drive the performance of effective complexity prediction systems (Paetzold and Specia, 2016c), we hypothesize that the wealth of knowledge present in transformer-based models such as BERT can help in extracting complementary contextual complexity clues.

Features
We explore a set of 779 features in total. They are:
• Frequency: We use not only word/phrase frequency, but also n-gram frequencies.
• Length: We use the number of characters that compose the word/phrase. For phrases, instead of using its overall length, we use the average number of characters of all individual words. We motivate this decision in the experiments of Section 4.2.
• Number of senses: We use the word/phrase's number of senses catalogued in the WordNet database (Miller et al., 1990). In line with our setup for word length, for phrases, we use the average number of senses of all individual words.
• BERT vector: We use the 768-dimensional numerical representation produced by the pre-trained BERT model (Devlin et al., 2019). For phrases and out-of-vocabulary words that were fragmented during tokenization, we average the representations produced for all fragments. More specifically, we used the bert-base-uncased model from the Hugging Face transformers library (Wolf et al., 2020).
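The fragment-averaging step above can be sketched as follows. This is a minimal illustration assuming the per-fragment vectors have already been extracted with the transformers library; the random vectors below stand in for real BERT outputs.

```python
import numpy as np

def average_fragments(fragment_vectors):
    """Average the BERT vectors of all WordPiece fragments of a
    word/phrase into a single 768-dimensional representation."""
    return np.mean(np.stack(fragment_vectors), axis=0)

# Illustrative example: an out-of-vocabulary word split into three
# WordPiece fragments, each with a (here random) 768-d vector.
rng = np.random.default_rng(0)
fragments = [rng.standard_normal(768) for _ in range(3)]
vector = average_fragments(fragments)
assert vector.shape == (768,)
```

The same function covers phrases, since a phrase is simply tokenized into more fragments.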
In the experiments of Section 4.3, we conduct an ablation study that reveals the performance impact of adding/removing some of these features from our models.

Preliminary Experiments
In this section, we describe the preliminary experiments we conducted in an effort to engineer our final systems for the LCP 2021 shared task.
In these experiments, all machine learning models were trained and optimized on the training set and tested on the trial set provided by the organizers. All models were implemented using the Scikit-Learn library (Pedregosa et al., 2011) and optimized using grid search and 5-fold cross validation.
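The model selection loop described above can be sketched as follows. The feature matrix, score vector, and parameter grid are illustrative stand-ins, not the ones actually used for the shared task.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Hypothetical feature matrix and complexity scores; in the actual
# systems these come from the LCP training set features of Section 3.1.
rng = np.random.default_rng(42)
X = rng.standard_normal((100, 4))     # e.g. frequency, length, senses, ...
y = rng.uniform(0.0, 1.0, size=100)   # gold complexity scores in [0, 1]

# Grid search with 5-fold cross-validation, as used for all models.
grid = {"C": [0.1, 1.0, 10.0], "epsilon": [0.01, 0.1]}
search = GridSearchCV(SVR(kernel="rbf"), grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

The best estimator found on the training set is then evaluated against the trial set.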

Corpora Analysis
Arguably the most important features we use are frequencies. These must be calculated from a language model trained on a specific corpus, so, as a first step in our engineering process, we conducted an experiment to choose a corpus for the shared task at hand. As evidenced and discussed by Brysbaert et al. (2012) and Paetzold and Specia (2016b), frequencies extracted from spoken text corpora tend to correlate better with word complexity, so we chose the SubIMDB corpus (Paetzold and Specia, 2016b) for our experiment. SubIMDB is a structured corpus extracted from 38,102 subtitles of children, family, and comedy movies and series. We created 12 SubIMDB splits for this experiment: Children movies (Chi-M), children series (Chi-S), children movies and series (Chi-MS), family movies (Fam-M), family series (Fam-S), family movies and series (Fam-MS), comedy movies (Com-M), comedy series (Com-S), comedy movies and series (Com-MS), all movies (Movies), all series (Series), and the entire corpus (All). We calculate the Pearson correlation between the trial set complexity scores and n-gram frequencies for all n-gram configurations described in Section 3.1. To do so, we trained 5-gram language models over these splits using KenLM (Heafield, 2011).
The results illustrated in Table 1 are absolute correlation scores for the trial set of the single words sub-track (original values were negative, given that complexity inversely correlates with word frequency). We chose absolute scores to make the table more compact. It can be observed that the (0, 0) configuration (no context) yields the best correlations in every scenario. It can also be noted that, while the family movies split (Fam-M) is best for (0, 0), the remaining configurations tend to benefit from larger splits. Based on that observation, in the experiments that follow, we use family movies to calculate frequencies for single words/phrases and the whole SubIMDB corpus for the remaining n-grams.
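The selection criterion above reduces to an absolute Pearson correlation between frequencies and gold complexity scores; a minimal sketch follows, with illustrative toy values in place of the real trial-set data.

```python
import numpy as np

def abs_pearson(freqs, complexities):
    """Absolute Pearson correlation between (log-)frequencies and gold
    complexity scores; absolute because frequency and complexity
    correlate negatively."""
    r = np.corrcoef(freqs, complexities)[0, 1]
    return abs(r)

# Illustrative values: more frequent words tend to be rated simpler,
# so the raw correlation is negative and its magnitude is what matters.
log_freqs    = np.array([-2.0, -3.5, -5.0, -6.2, -8.0])
complexities = np.array([0.10, 0.20, 0.35, 0.50, 0.70])
print(abs_pearson(log_freqs, complexities))  # close to 1 for this toy data
```

The split whose frequencies maximize this score is the one selected for each n-gram configuration.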

Phrase Compositionality
The next step in our engineering process was to optimize the performance of our submission for the phrases sub-track. For that, we tested the hypothesis that the complexity of a phrase can be more reliably modelled if addressed as a product of the complexity of its words. To do so, we first calculated 3 features from our feature set using 4 different composition functions, then calculated the Pearson correlation between them and the reference complexity scores from the trial set.
The features calculated are: Phrase/word frequency, length, and number of senses. The composition functions are: None (addressing the phrase as a single word), averaging, maximum, and minimum. Frequencies were calculated using a 5-gram language model trained over the entire SubIMDB corpus.
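The three non-trivial composition functions can be sketched as below (the "None" option simply treats the phrase as one token and needs no combination step). The example phrase is hypothetical.

```python
def compose(values, how):
    """Combine per-word feature values (e.g. lengths, sense counts,
    log-frequencies) into a single phrase-level feature."""
    if how == "average":
        return sum(values) / len(values)
    if how == "maximum":
        return max(values)
    if how == "minimum":
        return min(values)
    raise ValueError(f"unknown composition function: {how}")

# Illustrative phrase: per-word character lengths of "critical appraisal".
lengths = [len(w) for w in "critical appraisal".split()]  # [8, 9]
print(compose(lengths, "average"))  # 8.5
```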
The results in Table 2 show that, overall, employing the principle of compositionality in feature calculation for phrases increases the correlation between classic complexity features and human complexity scores. This is especially true for word senses, given that WordNet has very few phrases catalogued.
In the subsequent experiments, we employ averaging as the compositionality function in feature calculation for phrases.

Feature Selection
The last step in engineering our submissions was to select a set of features and a machine learning model from the ones described in Section 3. To do so, we conducted a thorough ablation analysis with all models and multiple feature subsets.
Each feature subset is identified by a set of IDs. Each ID describes a feature or group of features. The identifiers are:

• Word/phrase frequency (F)

• N-gram frequencies (N)

• Length (L)

• Number of senses (S)

• BERT vector (V)

The F identifier represents the (0, 0) configuration described in Section 3.1, while the N identifier represents all others. For example, the subset FNLSV contains all features, while the subset FNS contains neither length nor the BERT vector.
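Assembling a feature matrix from an ID string can be sketched as below. The block widths are our own inference, not stated explicitly: 768 for the BERT vector and one column each for F, L, and S, which leaves 8 columns for the remaining n-gram configurations if the total is to match the 779 features of Section 3.1.

```python
import numpy as np

# Hypothetical per-block widths keyed by the ablation identifiers.
BLOCKS = {
    "F": 1,    # word/phrase frequency, the (0, 0) configuration
    "N": 8,    # remaining n-gram frequency configurations (inferred)
    "L": 1,    # length
    "S": 1,    # number of senses
    "V": 768,  # BERT vector
}

def build_matrix(features, subset):
    """Concatenate the feature blocks named by an ID string like 'FLSV'.

    features: dict mapping each ID to an (n_instances, width) array.
    """
    return np.hstack([features[fid] for fid in subset])

n = 5
feats = {fid: np.zeros((n, w)) for fid, w in BLOCKS.items()}
assert build_matrix(feats, "FNLSV").shape == (5, 779)  # all 779 features
assert build_matrix(feats, "FLSV").shape == (5, 771)   # no n-gram freqs
```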
Table 3 shows the results for the feature configurations that we feel were the most relevant to our engineering process. It can be seen that the best performing variant for both single words and phrases is an SVM trained over all features except n-gram frequencies. Models tend to benefit from the inclusion of word length, number of senses, and especially the BERT vector in the feature set. Interestingly, discarding n-gram frequencies tends to improve the models' performance, especially for single words. This was observed not only in the results of Table 3, but also in many other comparisons we tested, such as FNLSV versus FLSV and FNLS versus FLS.

Task Results
We based the creation of the final UTFPR systems on the experiments of the previous section. Our final systems are SVMs trained with word/phrase frequencies, word/phrase length, number of senses, and the BERT vector (no n-gram frequencies). Compositionality in phrases was handled through averaging. Frequencies were calculated using a 5-gram language model trained over family movies from SubIMDB. Due to time constraints, the BERT model was used in its original pre-trained form and was not optimized for the task at hand. Table 4 showcases our shared task performance in comparison to the top 3 and bottom 3 systems with respect to Pearson correlation. Our systems for single words and phrases placed 45th out of 55 and 29th out of 38, respectively. Inspecting the instances featuring the largest discrepancies between gold labels and predictions, we found that our systems had a tendency both to underestimate the complexity of some of the most complex words and phrases (above 0.7 complexity) and to overestimate the complexity of the simplest ones (below 0.2). The conservative nature of their predictions seems to be the main reason why our systems did not place higher.

Conclusions
We presented the UTFPR systems submitted to the Lexical Complexity Prediction shared task of SemEval 2021. Although the placement of our systems was not impressive, we do showcase through our preliminary experiments that employing compositionality can potentially improve the predictions for phrases. We also show that adding word length, number of senses, and non-optimized BERT vectors to complexity prediction models can noticeably improve their predictions for both words and phrases. In the future, we intend to test the efficacy of adding BERT vectors optimized for the task at hand to the pool of features of our models.

Table 4: Pearson correlation obtained by the UTFPR systems on the shared task compared to the top 3 and bottom 3 systems of each sub-task.