RS_GV at SemEval-2021 Task 1: Sense Relative Lexical Complexity Prediction

We present the technical report of the system called RS_GV at SemEval-2021 Task 1 on lexical complexity prediction of English words. RS_GV is a neural network using hand-crafted linguistic features in combination with character and word embeddings to predict target words’ complexity. For the generation of the hand-crafted features, we set the target words in relation to their senses. RS_GV predicts the complexity of biomedical terms well, but it struggles to predict the complexity of very complex and very simple target words.


Introduction
Text simplification is the process of modifying a text so that it becomes easy for the reader to understand the meaning of the text without any loss of information. A main part of text simplification is lexical simplification. In lexical simplification, complex words are replaced with easier or more frequent synonyms. Following Shardlow (2014), the process of lexical simplification can be split as follows: I.) identification of complex words in a given text, II.) substitution generation, III.) word sense disambiguation, IV.) synonym ranking, V.) substitution of the complex word with the best synonym in the correct morphological form.
Following Shardlow (2014), the most common errors in lexical simplification are that complex words are not identified as complex or that words are incorrectly identified as complex. One reason might be the way complex words are predicted. So far, in the task called complex word identification (CWI), a word in a sentence was labeled as either complex or simple without any range in between. Shardlow et al. (2020) criticize this approach because there is no clear threshold at which a word starts to be complex. Hence, they propose a new task called lexical complexity prediction (LCP). The aim of LCP is to predict the complexity of a single word or a multi-word expression on a scale of 0 to 1. This paper proposes RS_GV, a model for LCP in the context of SemEval-2021 Task 1 (Shardlow et al., 2021a). RS_GV uses hand-crafted features normalized relative to the target words' WordNet senses, Flair embeddings, and a neural regressor in a cross-domain and a within-domain setting.

Related Work
Lexical complexity prediction is a new sub-task of lexical text simplification. The aim is to predict the complexity of a single word or a multi-word expression on a scale of 0 to 1. The most similar task is CWI. In contrast to LCP, CWI aims at binary classification that determines whether a word is complex or not. As LCP has been mentioned for the first time in the context of this shared task (Shardlow et al., 2020, 2021a), no other related work exists yet. Hence, we outline the state of the art in CWI.
SemEval-2016 Task 11: CWI. Paetzold and Specia (2016) collated 9200 sentences from the CW Corpus (Shardlow, 2013), the LexMTurk Corpus (Horn et al., 2014), and the Simple Wikipedia corpus (Kauchak, 2013). All these corpora were based on the Simple English Wikipedia (SEW). CWI was treated as a binary classification task, wherein 400 non-native speakers annotated content words in English text. It was observed from the annotations that complex words were shorter, less ambiguous and had a low occurrence in SEW. F-score and G-score were used as the evaluation metrics. The features incorporated by the submitted systems can be seen in Figure 1.
Figure 1 shows that word frequency, lexical, semantic and morphological features play a dominant role in CWI. Besides these, a few systems also experimented with n-gram features. Word embeddings were not used extensively.
CWI Shared Task 2018. Another shared task on complex word identification was organized in 2018 (Yimam et al., 2018). Yimam et al. (2018) collected data from three sources, i.e., professionally written news, WikiNews and Wikipedia, and in four languages, i.e., English, German, French, and Spanish. The shared task was composed of two sub-tasks. Sub-task 1 approached the problem as a binary classification problem and sub-task 2 treated it as a probabilistic classification problem, wherein the score between 0 and 1 indicated the proportion of annotators who considered a word complex. Native as well as non-native readers annotated the dataset created by Yimam et al. (2017). A word was deemed to be complex if at least one out of twenty annotators labeled it as complex. Based on the annotations, it was observed that the systems might perform better when trained on domain-specific data. It was also found that traditional feature engineering-based approaches performed better than neural network and word embedding based approaches. The features incorporated by the submitted systems of 11 teams can be seen in Figure 1. The graph reaffirms that frequency, lexical, semantic and morphological features play a key role in CWI. However, compared to 2016, word embeddings were more commonly used in 2018.

Data
The corpus (Shardlow et al., 2020, 2021b) contains 9,476 annotated instances in three new CWI/LCP domains, i.e., bible, political and biomedical texts. For every instance, one target word, its target complexity value and its containing sentence are given. The complexity value is based on crowd-sourced human ratings of at least 4 and at most 20 persons with residence in the UK, USA, or Australia. Each instance was rated on a 5-point Likert scale from 1 (very easy) to 5 (very difficult). Afterwards, the ratings were averaged and normalized on a continuous scale between 0 and 1, where 0 is easy and 1 is complex.
Each target word occurs in multiple instances and may capture different senses, so that a word can be assigned different complexity values in different instances. For example, vision occurs in all sub-domains with different meanings, e.g., ability to see, supernatural experience, and foresight.
Following the corpus description (Shardlow et al., 2020), a target word should occur only in different sentences and never twice in the same sentence. Unfortunately, in our corpus analysis, we found a few duplicated instances with varying complexity values. For example, body is rated twice within the same sentence in the biomedical part of the set, with complexity values of 0.05 and 0.32 (see Appendix C, Table 9). This variation underlines that LCP is a subjective task and, hence, a difficult NLP task (see section 5.3).
More details regarding the data, including the data split into training, trial, and test sets, can be found in the shared task paper (Shardlow et al., 2021a).
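The aggregation of ratings described above can be illustrated with a minimal sketch. It assumes that the mean of the 1-5 Likert ratings is mapped linearly to [0, 1] via (mean - 1) / 4; the exact procedure is defined by the corpus authors (Shardlow et al., 2020), and the function name and the example ratings are hypothetical.

```python
from statistics import mean


def likert_to_complexity(ratings, low=1, high=5):
    """Average Likert ratings and rescale the mean linearly to [0, 1].

    Assumption: a linear mapping of the scale end points (1 -> 0, 5 -> 1).
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < low or r > high for r in ratings):
        raise ValueError("ratings must lie on the Likert scale")
    return (mean(ratings) - low) / (high - low)


# Hypothetical ratings of four annotators for one instance.
print(likert_to_complexity([1, 2, 2, 3]))  # 0.25
```

Under this assumption, an instance rated "very easy" by all annotators receives 0 and one rated "very difficult" by all annotators receives 1.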
As a preprocessing step we tokenized the sentences and annotated the tokens with their lemma, part-of-speech, and morphological information using spaCy (Honnibal and Montani, 2017). This linguistic information is the basis of our features.

Evaluation
Following the shared task instructions (Shardlow et al., 2021a), the lexical complexity prediction is evaluated with, among other metrics, Pearson's correlation (r, mainly reported here) and Mean Absolute Error (MAE).
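For reference, the two metrics named above can be computed as in the following sketch. It is a plain-Python illustration with hypothetical gold and predicted values, not the official evaluation script of the shared task.

```python
from math import sqrt


def pearson_r(gold, pred):
    """Pearson correlation between gold and predicted complexity values."""
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("need two equally long lists with at least two values")
    mean_g = sum(gold) / len(gold)
    mean_p = sum(pred) / len(pred)
    cov = sum((g - mean_g) * (p - mean_p) for g, p in zip(gold, pred))
    var_g = sum((g - mean_g) ** 2 for g in gold)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_g * var_p)


def mean_absolute_error(gold, pred):
    """Average absolute difference between gold and predicted values."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two equally long, non-empty lists")
    return sum(abs(g - p) for g, p in zip(gold, pred)) / len(gold)


# Hypothetical values: predictions shifted by a constant offset of 0.1.
gold = [0.1, 0.3, 0.5]
pred = [0.2, 0.4, 0.6]
print(round(pearson_r(gold, pred), 4), round(mean_absolute_error(gold, pred), 4))
```

The example shows why both metrics are useful: a constant offset leaves the correlation at its maximum but is still penalized by the MAE.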

Baselines
We use the baseline results reported by the organizers as comparative results. They use linear regression models with the following features: complexity-average, word length, log word frequency from SUBTLEX, and log word frequency combined with word length.

System Description
Our system's main characteristics are a combination of hand-crafted features, contextualized character embeddings (see subsection 4.1), a sense relative normalization (see subsection 4.2), and a neural network for regression (see subsection 4.4).

Features
Based on the survey of features previously used for complexity estimation of words (see section 2), we decided to combine hand-crafted features and contextualized embeddings. A list of all language resources used for feature generation is provided in Appendix B (see Table 7).
We hypothesize that contextualized embeddings perform better on LCP because the context of the target word and the meaning of the sentence are important for a word's complexity. To the best of our knowledge, contextualized character embeddings have not been used for CWI or LCP before.
The embeddings are extracted using FLAIR (Akbik et al., 2019a). Details regarding the settings of the word and character embeddings are provided in Appendix B (see Table 8).

Hand-crafted Features
An overview of all hand-crafted features used is visualized in Table 1.
Readability Assessment Features. We use the sentence's readability as a feature because we assume that a token would be perceived as more complex if the entire sentence is complex. We implemented the readability using readability scores that are mainly applicable to whole texts, such as those of Kincaid et al. (1975), Gunning (1952), Coleman and Liau (1975), Dale and Chall (1948) and Senter and Smith (1967). We do not consider readability scores that are applicable to sentences, as we could not reproduce certain sentence-level readability methods.
Lexical Features. Word length, word frequency and number of syllables are included in the set of lexical features following the methodology explained in Shardlow et al. (2020). The word frequency values are obtained from Sharoff (2006) and the GoogleWeb1T resource (Brants and Franz, 2006). Besides these, we also calculate the numbers of consonants and vowels.
WordNet Features. Paetzold and Specia (2016) use the number of senses, synonyms, hypernyms and hyponyms among other features to identify complex words. In our study, the number of hypernyms, hyponyms and senses are retrieved from the English WordNet (Fellbaum, 1998).
Psycholinguistic Features. Similar to Davoodi and Kosseim (2016), we generate psycholinguistic features, e.g., word familiarity and age of acquisition, using the Medical Research Council (MRC) Psycholinguistic Database version 2.0 (Wilson, 1988).
Morphological Features. As seen in the survey of the CWI shared tasks, morphological features are often used for this task. Hence, we also use a few morphological features derived from the morphological database MorphoLex-EN (Sánchez-Gutiérrez et al., 2018), e.g., number of prefixes, morphemes, and suffixes. We assume that the more morphologically rich a word is, the more complex it is.
Lexicon-based Features. As proposed, for example, in AbuRa'ed and Saggion (2018) and Wani et al. (2018), we check if the target word is contained in the Oxford 3000 word list (Dictionaries, 2021) of commonly used words. We assume that the more common a word is, the simpler it is.
Other Features. Since it is expected that the corpus contains a lot of named entities, such as person names in the bible subcorpus, we check if a target word is a named entity, as also suggested in Gooding and Kochmar (2018). The last feature is the position of the target word in the sentence. If a target word occurs more than once in a sentence, we consider the word's last occurrence. In contrast to AbuRa'ed and Saggion (2018), who normalize the word position by the sentence length, we use the absolute word position because we normalize all features afterwards.
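Two of the simpler feature groups described above, the letter-based lexical counts and the absolute position of the target word, can be sketched as follows. This is an illustration under stated assumptions, not the system's actual code: the function names are hypothetical, "y" is counted as a consonant, matching is case-insensitive, positions are 0-based, and whitespace splitting stands in for the spaCy tokenization used in the paper.

```python
VOWELS = set("aeiou")  # assumption: "y" is treated as a consonant


def lexical_features(word):
    """Word length plus vowel and consonant counts over alphabetic characters."""
    letters = [c for c in word.lower() if c.isalpha()]
    n_vowels = sum(c in VOWELS for c in letters)
    return {
        "length": len(word),
        "vowels": n_vowels,
        "consonants": len(letters) - n_vowels,
    }


def last_position(tokens, target):
    """0-based index of the LAST occurrence of target in tokens, or -1."""
    target = target.lower()
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].lower() == target:
            return i
    return -1


print(lexical_features("vision"))
# {'length': 6, 'vowels': 3, 'consonants': 3}

tokens = "The body protects the body".split()
print(last_position(tokens, "body"))  # 4
```

Returning the raw index keeps the feature on an absolute scale, which matches the decision above to leave all rescaling to the later normalization step.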

Normalization
The hand-crafted features described above range over different scales; hence, normalization is required. The normalization is performed as follows: I.) the synsets of the target word are identified, II.) the feature values for every word in each synset are calculated, III.) the values are normalized using min-max normalization. This is done to compare words that are related to each other, rather than comparing, for instance, frequencies of unrelated words (glee and joyous as opposed to glee and table). In this manner, all values are normalized to a range of 0-1, but each word is compared only with the related words of the synset in which it occurs. For words that appear in multiple synsets, we take the average of the normalized values.
As not all features could be normalized relative to their sense (see Table 1), e.g., readability features, we normalized them using scikit-learn's MinMaxScaler (Pedregosa et al., 2011).
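The three normalization steps described above can be sketched as follows. The sketch is an interpretation of the description, not the released implementation: the function name is hypothetical, the synsets are passed in as plain lists of lemmas (toy data, not actual WordNet output), word length stands in for any hand-crafted feature, and the handling of a synset whose values are all equal (normalized to 0.0) is an assumption, since the text does not specify it.

```python
def sense_relative_value(target, synsets, feature):
    """Min-max normalize feature(target) within each synset, then average.

    synsets: list of lemma lists, each containing the target word.
    feature: function mapping a word to a numeric feature value.
    """
    scores = []
    for lemmas in synsets:
        values = [feature(word) for word in lemmas]
        lo, hi = min(values), max(values)
        if hi == lo:
            # Assumption: a synset without any spread contributes 0.0.
            scores.append(0.0)
        else:
            scores.append((feature(target) - lo) / (hi - lo))
    # Words without any synset cannot be normalized relative to a sense.
    return sum(scores) / len(scores) if scores else None


# Toy synsets (hypothetical lemma lists) with word length as the feature.
toy_synsets = [
    ["glee", "joy", "gladness"],    # lengths 4, 3, 8 -> (4 - 3) / 5 = 0.2
    ["glee", "hilarity", "mirth"],  # lengths 4, 8, 5 -> (4 - 4) / 4 = 0.0
]
print(round(sense_relative_value("glee", toy_synsets, len), 3))  # 0.1
```

In this toy example the target is compared only with its near-synonyms, so its normalized value reflects how it ranks among related words rather than across the whole vocabulary.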

Feature Sets
We create different feature sets considering the normalizing strategies in combination with all character and word embeddings. For the hand-crafted features, we either used the 14 sense relative features, all 34 minmax normalized features, or the 14 sense relative features combined with the missing 20 features minmax normalized (both). All feature sets are listed in Table 2.

Model
The structure of RS_GV is a simpler version of the structure proposed in De Hertog and Tack (2018), containing linear layers instead of convolutional layers. Our model is a simple feed-forward neural network with two input layers (one for the hand-crafted features and one for the embedding features), each followed by a linear hidden layer. The outputs of both are concatenated and fed into another linear hidden layer, which is finally followed by a linear output layer activated with the rectified linear unit function (ReLU). We use stochastic gradient descent (SGD) as the optimization function. L1Loss as implemented in PyTorch (Paszke et al., 2019), or another mean absolute error loss function, seems best for our purpose of predicting continuous labels in a regression task. Following early stopping, we chose 250 epochs for our model. All hyperparameters with which our model performs best are listed in Appendix A (see Table 5).
RS_GV can be trained either across all domains at once (cross-domain) or on each domain separately (within-domain).
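The architecture described above can be sketched in PyTorch as follows. This is a minimal reading of the textual description, not the released code: the class name, the hidden size, the embedding dimension, the learning rate and the random batch are placeholders (the actual hyperparameters are those of Table 5), and no activation is applied between the hidden layers because the text mentions none.

```python
import torch
from torch import nn


class TwoBranchRegressor(nn.Module):
    """Two linear input branches, a joint linear layer, and a ReLU output."""

    def __init__(self, n_handcrafted, n_embedding, hidden=64):
        super().__init__()
        self.handcrafted = nn.Linear(n_handcrafted, hidden)
        self.embedding = nn.Linear(n_embedding, hidden)
        self.joint = nn.Linear(2 * hidden, hidden)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x_hand, x_emb):
        # Concatenate the two branch outputs along the feature dimension.
        h = torch.cat([self.handcrafted(x_hand), self.embedding(x_emb)], dim=1)
        h = self.joint(h)
        # ReLU keeps the predicted complexity non-negative.
        return torch.relu(self.out(h)).squeeze(1)


# Placeholder sizes: 34 hand-crafted features, hypothetical 100-dim embedding.
model = TwoBranchRegressor(n_handcrafted=34, n_embedding=100)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is an assumption
loss_fn = nn.L1Loss()  # mean absolute error

# One training step on a random batch of 8 instances (illustrative data only).
x_hand, x_emb, y = torch.rand(8, 34), torch.rand(8, 100), torch.rand(8)
optimizer.zero_grad()
loss = loss_fn(model(x_hand, x_emb), y)
loss.backward()
optimizer.step()
```

For the cross-domain setting, one such model would be trained on the instances of all domains; for the within-domain setting, one model per domain.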

Implementation
The system is implemented in Python 3.8 and PyTorch 1.6 (Paszke et al., 2019) using the packages listed in Appendix B (see Table 6). The code of the system is available in our GitHub repository: https://github.com/gayatrivenugopal/SharedTask-LPC2021.

Ablation Tests / Error Analysis
In this section, we report on the different approaches explored while developing RS_GV. We compare the results on the trial data using the different feature sets, and a within-domain and a cross-domain approach. In the following, we report the average Pearson correlation over 10 system runs.

Feature Sets
The system's performance with all different feature sets is summarized in Table 2.

Table 2: Results of all feature sets reporting Pearson correlation r (average of 10 runs) on the trial data set. The standard deviation is provided in the last column.

Hand-crafted Feature Sets. Across all embedding feature sets (see Table 2), RS_GV often performs best, and with a comparatively low standard deviation, with the hand-crafted feature set both (e.g., r_flair=0.8027, ±0.0051) compared to sense relative (e.g., r_flair=0.8002, ±0.0056) and minmax (e.g., r_flair=0.8007, ±0.0039). Hence, in the following, we report the results only on the hand-crafted feature set both.
As a compromise between contextualized and non-contextualized as well as character and word embeddings, we use stacked Flair embeddings. They combine the forward and backward versions of Flair contextualized character embeddings with GloVe non-contextualized word embeddings.

Cross-domain vs. within-domain
In contrast to the findings of Yimam et al. (2018), RS_GV performs on average better using the cross-domain approach (r=0.8027, ±0.0051) than the within-domain approach (r=0.7823, ±0.0235). The higher standard deviation of the within-domain approach suggests that it is less robust than the cross-domain approach. Roughly 3000 instances per domain might be too few to train a robust LCP model with a neural network.

Deep Learning vs. Machine Learning
We compare the results of our deep learning approach of RS_GV with a machine learning regression, i.e., the linear regression of scikit-learn. The neural network with Flair (r=0.8027, ±0.0051) considerably improves LCP compared to the machine learning regression (r=0.6945) using only hand-crafted features. Hence, we can confirm the results of the CWI shared task 2018: character embeddings and neural networks do improve LCP.

Submitted Results
Following the previously described ablation tests, we chose to submit the results of the cross-domain approach and the within-domain approach. Both use a deep learning regressor and stacked Flair embeddings in combination with the hand-crafted feature set both. This section presents the official results of our system RS_GV on the test set at SemEval-2021 Task 1 sub-task 1 (see also Table 3).
With a Pearson correlation coefficient of r=0.7478, the within-domain approach of our system outperforms the cross-domain approach on the test data (r=0.7316). Officially, RS_GV ranks 34th of 54. The best system, proposed by the team JUST BLUE, achieved r=0.7886.
When our submitted results are compared with the results averaged over 10 runs (see Table 3), the cross-domain approach can outperform the within-domain approach on the test and trial data.
Overall, both approaches achieve better results than each of the baselines.

Table 3: Results using the trial (3rd) and test dataset (4th column) using Pearson correlation r for evaluation. The first block contains our submitted and averaged results of 10 runs using Flair and both. The second block reports the results of the baselines and the third block the results of the best performing system.

Error Analysis
The submitted results reveal that RS_GV cannot keep up with the shared task's best performing systems. This section presents insights regarding the problems and strengths of RS_GV on the test data.
Domain-specific Results. The subcorpora differ regarding their lexical complexity: The biomed subcorpus has the highest average lexical complexity in the single word dataset (0.325) and the europarl subset the lowest average (0.286). When we train and predict the lexical complexity per domain, we can observe the same ranking of the complexity prediction per domain as in Shardlow et al. (2020): The lexical complexity of the europarl domain is the most difficult to predict for RS_GV, whereas the biomedical subcorpus is the easiest (see Table 4).
Domain     Feature set    r (n=10)   SD
all        Flair + both   0.7823     0.0235
bible      Flair + both   0.7177     0.0182
biomed     Flair + both   0.8585     0.0042
europarl   Flair + both   0.7444     0.0089

Table 4: Results of within-domain approach and results per domain using the trial dataset for evaluation. The Pearson correlation r is an average of 10 system runs. The standard deviation is provided in the last column.
High vs. Low Complexity. It seems that our system can predict complex words better than easy words. However, when splitting the test dataset by complexity value and not by domain, RS_GV performs poorly on very complex words (complexity value > 0.666, r=0.0125, ±0.0542, n=12), which might again be due to too few training samples (n=105) for the neural network. Furthermore, the system performs poorly on very easy words (complexity value < 0.2, r=0.0873, ±0.0272, n=188), although roughly 20% of the training samples (n=1600) fall into this complexity range. We have not found a reason for this yet.
Homonym-specific Results. This SemEval task aims at predicting the complexity of tokens in different contexts, including different meanings. Looking more closely at homonyms, on the one hand, different complexity values are assigned to different meanings of a homonym, e.g., vision, but on the other hand, similar complexity values are assigned to the meanings of other homonyms, e.g., resolution. Hence, there is no clear interpretation of how to predict their complexity. This problem is reflected in RS_GV: our system predicts only slightly different complexity values per homonym. It seems that RS_GV can to some extent differentiate the meanings of the words, but overall it does not differentiate them well enough to perform well in their lexical complexity prediction.
The examples also show the importance of the multi-word LCP task, since "account" is part of light verb constructions such as "to give account" and "to take into account".
Context-specific Results. A few samples contain the same token in (nearly) the same sentence, but their complexity values vary (see Appendix C, Table 9). After removing these 6 out of overall 917 samples from the test data, the system's Pearson correlation already improves from 0.7316 to 0.7334. This underlines that LCP is a subjective task and, hence, difficult to predict for machines.
Linearity. We tested the data for linearity in order to justify the usage of linear regression. We could not find any linear relationship between the individual features and the complexity value. The missing linearity might be a reason why RS_GV could not keep up with other systems of the shared task.

Discussion and Conclusion
We described our model named RS_GV, which was submitted to SemEval-2021 Task 1 on lexical complexity prediction. We propose a neural network with a combination of hand-crafted features and word/character embeddings to approach the task. Our analysis shows that combining the normalization of hand-crafted features relative to WordNet senses with min-max normalization achieves slightly better results than using only min-max normalization. Furthermore, we found that RS_GV predicts lexical complexity best using a combination of non-contextualized word embeddings and contextualized character embeddings.
In contrast to the results of other shared tasks, our cross-domain approach achieves better results than the domain-specific approach. A domain-specific approach may need more data to perform reliably. Furthermore, our neural regressor seems problematic, since its results vary across runs and the current dataset might be too small for regression with neural networks.

Future Work
In future work, we plan to improve the character and word embeddings. We could fine-tune the embeddings on our data or use domain-specific pre-trained embeddings that fit the datasets' domains, e.g., BioFlair (Sharma and Jr, 2019).
Furthermore, we could calculate more handcrafted features or edit the current ones. For example, the implementation of sentence readability formulas seems more promising than the misuse of text readability formulas on sentences.
The current neural network contains only a few linear layers; an extension using, e.g., convolutional layers for feature selection seems promising.

Table 6: Python packages used for the implementation of the proposed system.

ID                               Token      Complexity   Predicted
3LCXHSGDLT6CT5B6A4WGQ3SQJNDSES   wormwood   0.7321       0.4117
  Sentence: Therefore thus says Yahweh of Armies concerning the prophets: Behold, I will feed them with wormwood, and make them drink the water of gall; for from the prophets of Jerusalem is ungodliness gone forth into all the land.
3DWNFENNE3V120VNY4BPPGPCAHX4JD   wormwood   0.4843       0.4170
  Sentence: therefore thus says Yahweh of Armies, the God of Israel, Behold, I will feed them, even this people, with wormwood, and give them water of gall to drink.

Table 9: Samples of the test set with the same token in the same sentence but different complexity values. The last column contains the predicted values of RS_GV.