Alejandro Mosquera at SemEval-2021 Task 1: Exploring Sentence and Word Features for Lexical Complexity Prediction

This paper revisits feature engineering approaches for predicting the complexity level of English words in a particular context using regression techniques. Our best submission to the Lexical Complexity Prediction (LCP) shared task was ranked 3rd out of 48 systems for sub-task 1 and achieved Pearson correlation coefficients of 0.779 and 0.809 for single words and multi-word expressions respectively. We conclude that a combination of lexical, contextual and semantic features can still produce strong baselines when compared against human judgement.


Introduction
Lexical complexity is a factor usually linked to poor reading comprehension (Dubay, 2004) and the development of language barriers for target reader groups such as second language learners (Saquete et al., 2013) or native speakers with low literacy levels, effectively making texts less accessible (Rello et al., 2013). For this reason, complex word identification (CWI) is often an important sub-task in several human language technologies such as text simplification (Siddharthan, 2004) or readability assessment (Collins-Thompson, 2014).
The Lexical Complexity Prediction (LCP) shared task of SemEval-2021 (Shardlow et al., 2021) proposes the evaluation of CWI systems by predicting the complexity value of English words in context. LCP is divided into two sub-tasks: Sub-task 1, predicting the complexity score of single words, and Sub-task 2, predicting the complexity score of multi-word expressions. In our participation in both sub-tasks, we treat the identification of complex words as a regression problem, where each word is assigned a score between 1 and 5 based on the sentence in which it occurs. To this end, we have evaluated subsets of word and sentence features against different machine learning models.
Our best submissions achieved Pearson correlation coefficients of 0.779 and 0.809 for single words and multi-word expressions respectively.
In Section 2 we review related work for this task. Sections 3 and 4 introduce the data and feature engineering approaches respectively. In Section 5 the performance of different machine learning models is analysed. In Section 6 we present the obtained results. Finally, in Section 7 we draw our conclusions and outline future work.

Related Work
Previous CWI studies applied to the English language have relied mostly on word frequencies, psycholinguistic information (Devlin and Tait, 1998), lexicons and other word-based features such as character length or syllable counts (Shardlow, 2013), which in most cases considered the target word in isolation. To address the limitations of word-level approaches, more recent work has made use of contextual and sentence information, for example by measuring the complexity of word n-grams (Ligozat et al., 2012), applying language models (Maddela and Xu, 2018) or treating the problem as a sequence labelling task (Gooding and Kochmar, 2019).
In this paper, we not only evaluate many of the traditional word-based features found in the literature but also pay attention to the context surrounding the target word by generating additional bigram and sentence features. In the end, we demonstrate that a careful selection of simple features remains competitive against more novel approaches to this task.

Datasets
CompLex (Shardlow et al., 2020), which was the official dataset provided by the organizers, contains complexity annotations using a 5-point Likert scale for 7,662 words and 1,517 multi-word expressions (MWE) from three domains: the Bible, Europarl, and biomedical texts.
External datasets and models have historically been allowed and used in SemEval as a way of complementing the original training set. Moreover, based on previous experience, external resources can in certain scenarios correlate better with the evaluation labels than the official task resources (Mosquera, 2020). For this reason, related datasets from previous CWI shared tasks such as CWI 2016 (Paetzold and Specia, 2016) and CWI 2018 (Štajner et al., 2018) were considered and evaluated both as extra training data and for deriving additional features. However, the performance of our models during the validation step not only did not improve but actually worsened when we attempted to use them.

Feature Engineering
The 51 features used to detect the complexity of single words and of each MWE component are as follows:
Word length (word len): The length in characters of the target word.
Morpheme length (morpheme len): Number of morphemes for the target word.
Google frequency (google freq): The frequency of the target word based on a subset of the Google n-gram corpus 1.
Wikipedia word frequency (wiki freq1): The frequency of the target word based on Wikipedia 2 .
Wikipedia document frequency (wiki freq2): The number of documents in Wikipedia where the target word appears.
Complexity score (comp lex): Complexity score for the target word from a complexity lexicon (Maddela and Xu, 2018).
Zipf frequency (zip freq): The frequency of the target word in Zipf-scale as provided by the wordfreq (Speer et al., 2018) Python library.
Is stopword (stop): True if the target word is a stopword.
Is acronym (acro): Heuristic that is set to True if the target word is a potential acronym based on simple casing rules.
Average age of acquisition (age): At what age the target word is most likely to enter someone's vocabulary (Kuperman et al., 2012).
Average concreteness (concrete): Concreteness rating for the target word.
Lemma length (lemma len): Lemma length of the target word.
Word frequency (COCA) (word freq): Frequency of the target word based on the COCA corpus (Davies, 2008).
Lemma frequency (COCA) (lemma freq): Frequency of the lemmatized target word based on the COCA corpus (Davies, 2008).
Consonant frequency (consonant freq): Frequency of consonants in the target word.
Number word senses (wn senses): Number of senses of the target word extracted from WordNet (Fellbaum, 2010).
Number of synonyms (synonyms): Number of synonyms of the target word from WordNet.
Number of hypernyms (hypernyms): Number of hypernyms of the target word from WordNet.
Number of hyponyms (hyponyms): Number of hyponyms of the target word from WordNet.
WordNet min-depth (wn mindepth): Minimum distance to the root hypernym in WordNet for the target word.
WordNet max-depth (wn maxdepth): Maximum distance to the root hypernym in WordNet for the target word.
Greek or Latin affixes (greek or latin affix): True if the target word contains Greek or Latin affixes 3.
Bing frequency (bing counts): The frequency of the target word based on the Bing n-gram corpus (Wang et al., 2010).
Bi-gram frequency (ph mc2): Bi-gram frequency for the target and its preceding word in Google Books Ngram Dataset obtained via the phrasefinder API 4 .
Volume count (ph vc2): The number of books where the target and its preceding word appeared in the Google Books Ngram Dataset obtained via the phrasefinder API.
Year of appearance (ph fy2): The first year where the target and its preceding word appeared in the Google Books Ngram Dataset obtained via the phrasefinder API.
Kincaid grade level (sentence Kincaid): Kincaid grade level of the whole sentence.
ARI score (sentence ARI): Automated readability index (Senter and Smith, 1967) of the whole sentence.
All the readability features were calculated using the readability Python library 5 .
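A few of the surface features above can be sketched with simple string operations. The sketch below is illustrative only: the function name and the exact casing heuristic for acronyms are our assumptions, since the paper does not specify them.

```python
def surface_features(word: str) -> dict:
    """Illustrative sketch of some word-level surface features:
    character length, a simple casing-based acronym heuristic,
    and a consonant count (hypothetical implementations)."""
    vowels = set("aeiou")
    return {
        "word_len": len(word),
        # Heuristic: all-uppercase multi-letter tokens are treated as acronyms.
        "acro": word.isupper() and len(word) > 1,
        # Count alphabetic characters that are not vowels.
        "consonant_freq": sum(
            1 for c in word if c.isalpha() and c.lower() not in vowels
        ),
    }
```

In practice, frequency-based features such as zip freq would come from external resources (e.g. the wordfreq library mentioned above) rather than from the word string itself.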

Machine Learning Approach
Since the labels in the training dataset were continuous, we modelled both sub-tasks as regression problems. For sub-task 1, we made use of the LightGBM (LGB) (Ke et al., 2017) implementation of gradient tree boosting. Minimal hyper-parameter optimization was performed against our development set, using a learning rate of 0.01 and limiting the number of leaves of each tree to 30 over 500 boosting iterations.
For sub-task 2, the complexity score of each MWE component was obtained using a linear regression (LR) model, and the component scores were averaged with equal weights.
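The equal-weight averaging step for sub-task 2 amounts to a simple arithmetic mean over the per-component predictions; the function below is a minimal sketch (the name is ours).

```python
def mwe_score(component_scores: list[float]) -> float:
    """Combine per-component complexity predictions for an MWE
    by averaging them with equal weights, as described above."""
    return sum(component_scores) / len(component_scores)
```

For a two-word expression, this is just the mean of the two single-word predictions produced by the LR model.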
By examining the feature importance for both the LGB model in Figure 2 and the LR model in Figure 3, we can observe several sentence readability features being identified as top contributors. While some degree of correlation between the complexity of the sentence and that of the target word was expected a priori, a machine learning model can also use sentence-level complexity as a predictor of formality and genre (Mosquera and Moreda, 2011), and thus differentiate between the different sub-corpora present in the training data, as seen in Figure 1.

Results
For sub-task 1, we evaluated the performance of both linear models and tree ensembles using the provided trial set and a randomly selected hold-out of 30% of the training data as a development set. The best-performing model was gradient boosting (see Table 1).
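The official LCP evaluation metric is the Pearson correlation coefficient between predicted and gold complexity scores. A self-contained implementation of the metric (written from the standard definition, not taken from the task code) looks like this:

```python
import math

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two score lists,
    the metric used to rank LCP submissions."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In practice one would use scipy.stats.pearsonr, which additionally reports a p-value.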

Conclusion and Future Work
In this paper, we present the system developed for the Lexical Complexity Prediction task of SemEval 2021. Even though most of the features we made use of are relatively common in previous works, we demonstrate that a careful selection of lexical, contextual and semantic features at both target word and sentence level can still produce competitive results for this task. In a future work we would like to explore different neural network architectures and automated machine learning (AutoML) approaches.