LangResearchLab NC at SemEval-2021 Task 1: Linguistic Feature Based Modelling for Lexical Complexity

The present work aims at assigning a complexity score between 0 and 1 to a target word or phrase in a given sentence. For each Single Word Target, a Random Forest Regressor is trained on a feature set consisting of lexical, semantic, and syntactic information about the target. For each Multiword Target, the individual word features are combined with the single-word complexity predictions in the feature space. The system yielded Pearson correlations of 0.7402 and 0.8244 on the test set for the Single Word and Multiword Targets, respectively.


Introduction
The presence of complex words can lead to poor comprehension of a text. Identification of such complex words in a given text is a core component in the task of Automatic Simplification and Evaluation (Shardlow, 2013). The Lexical Complexity Prediction task of SemEval-2021 (Shardlow et al., 2021) aims at the development of systems that predict a complexity score for a target word/phrase in a given sentence. In the literature, the binary classification of target words in a text into complex or non-complex is referred to as Complex Word Identification (CWI) (Paetzold and Specia, 2016; Zampieri et al., 2017; Gooding and Kochmar, 2018; AbuRa'ed and Saggion, 2018; Yimam et al., 2018). Unlike previous work, the present task assigns a continuous complexity score to the target word, which is referred to as Lexical Complexity Prediction (LCP) (Shardlow et al., 2020). For the present work, regression is performed for LCP on a set of linguistic features covering semantic, syntactic, and contextual aspects of the target word, as described in Section 3. Additionally, various lexicon-based features are used to indicate the rarity of target words. The system achieves a Pearson correlation of 0.8194 for the Single Word Target and 0.7482 for the Multiword Target on the trial set.

Task Setup
The task is divided into two subtasks, namely Single Word Target and Multiword Target based on the length of the target. The dataset and evaluation metrics are described below.
• Dataset: The dataset is an augmented version of CompLex (Shardlow et al., 2020). It comprises sentences from three corpora, viz. the World English Bible translation, the English portion of the European Parliament proceedings, and articles from the CRAFT corpus belonging to the biomedical domain. It is split into three subsets: Train, Trial, and Test.
• Evaluation Metrics: The systems are evaluated using the Pearson correlation coefficient (P), Spearman rank correlation coefficient (S), mean absolute error (MAE), and coefficient of determination (R^2).
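All four evaluation metrics can be computed without external dependencies; the sketch below implements them in plain Python (the Spearman helper breaks rank ties arbitrarily, whereas standard implementations average tied ranks):

```python
import math

def pearson(x, y):
    # Pearson correlation coefficient (P)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    # Spearman rank correlation (S): Pearson computed on the ranks.
    # Ties are broken arbitrarily here; SciPy averages tied ranks.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

def mae(y_true, y_pred):
    # Mean absolute error (MAE)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    # Coefficient of determination (R^2)
    my = sum(y_true) / len(y_true)
    ss_res = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    ss_tot = sum((a - my) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot
```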

Features
In this section we present the details of the feature space used in the present work.

Corpus Features
A feature, named Corpus, is used to indicate to which of the 3 corpora the input sentence belongs.

Shallow Features
The word-level shallow features used in the present work are the number of letters (Nlet), syllables (Nsyl), and vowels (Nvow), the percentage of uppercase letters (PerUp), the simple universal part-of-speech tag (POS), and the detailed Penn Treebank part-of-speech tag (Tag) of the target word, extracted using spaCy.
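As a rough illustration, the dependency-free sketch below approximates the count-based shallow features. The syllable count uses a vowel-group heuristic rather than spaCy's pipeline, and the POS/Tag features are omitted since they require a tagger:

```python
import re

VOWELS = set("aeiouAEIOU")

def shallow_features(word):
    """Approximate the count-based shallow features without spaCy.

    Nlet, Nvow, and PerUp are exact counts; Nsyl is approximated by
    counting vowel groups (the paper's exact syllabifier is not
    specified, so this is a heuristic stand-in).
    """
    letters = [c for c in word if c.isalpha()]
    nlet = len(letters)
    nvow = sum(1 for c in letters if c in VOWELS)
    # Vowel-group heuristic: each maximal run of vowels (incl. y) ~ one syllable
    nsyl = len(re.findall(r"[aeiouy]+", word.lower()))
    perup = 100.0 * sum(1 for c in letters if c.isupper()) / nlet if nlet else 0.0
    return {"Nlet": nlet, "Nsyl": nsyl, "Nvow": nvow, "PerUp": perup}
```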

NLTK WordNet Features
The number of hypernyms (Nhyper) and the number of morphemes (Nmorph) of the target word, considering its POS tag in the given sentence, are also used as features.

Exquisite Corpus (EC) Features
Exquisite Corpus (https://pypi.org/project/wordfreq/) compiles texts from seven domains, namely Wikipedia, Subtitles, News, Books, Web, Twitter, and Reddit. We use the frequency of the target word in EC (WordFreq) and its Zipf frequency (ZipfFreq) as features (van Heuven et al., 2014).
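The Zipf frequency of van Heuven et al. (2014) is the base-10 logarithm of a word's frequency per billion tokens, which compresses raw counts onto a roughly 1-to-7 scale; a minimal sketch:

```python
import math

def zipf_frequency(count, corpus_size):
    """Zipf frequency as defined by van Heuven et al. (2014):
    log10 of the word's frequency per billion tokens.
    A word occurring once per million tokens has Zipf = 3.
    """
    per_billion = count / corpus_size * 1e9
    return math.log10(per_billion)
```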

SUBTLEX Features
The frequency of the target word (SubtFreq) extracted from SUBTLEXus (https://github.com/Wonderlic-AI/wonderlic_nlp) and its Contextual Diversity (ConDiversity), i.e., the percentage of films in which the word appears, are used as features.

Language Model (LM) Features
Given an input sentence S = w_1 w_2 … w_N and a target word w_t, where t ∈ {1, 2, …, N}, features are extracted from a trigram language model trained on the Gigaword corpus (lm_giga_64k_nvp_3gram.zip).

Character Language Model (CharLM) Feature
The probability of the target word (Prob3c), calculated using a trigram character language model, is considered as a feature. The trigram probabilities are calculated using letter counts from the Google Web Trillion Word Corpus (http://norvig.com/ngrams/count_3l.txt). Suppose a word W consists of N letters, W = w_1 … w_N; the corresponding feature value is then computed from the probabilities of its letter trigrams.
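The exact formula for Prob3c is not reproduced above. One plausible reading, sketched below under the assumption of a simple maximum-likelihood product over the word's letter trigrams, is:

```python
def char_trigram_prob(word, trigram_counts):
    """Hypothetical sketch of a Prob3c-style feature: the product of
    maximum-likelihood probabilities of each letter trigram in the
    word, estimated from corpus counts (e.g. Norvig's count_3l.txt).
    The paper's exact formula is not given here, so this is one
    plausible reading, not the authors' definitive implementation.
    """
    total = sum(trigram_counts.values())
    prob = 1.0
    for i in range(len(word) - 2):
        tri = word[i:i + 3]
        # Unseen trigrams get probability 0 (no smoothing in this sketch)
        prob *= trigram_counts.get(tri, 0) / total
    return prob  # words shorter than 3 letters yield 1.0 by convention
```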

Kucera and Francis (KF) Features
The features derived by Kučera and Francis (1967), namely target word's written frequency of occurrence (KFFreq) and the number of categories of text in which the target word was found (KFNcats) are used.

Ogden Feature
A binary feature (IsOgden) is used to indicate the presence of the target word in the list of 1,000 words included in Ogden's Basic English.

Inquirer Tag Features
The General Inquirer classifies about 7,500 words into 182 General Inquirer categories developed for social science content analysis (Stone et al., 1966). A binary feature is created for each category to indicate whether it applies to the target word. The POS tag of the target is matched against the 'OthTags' category to filter out incompatible categories; Table 1 lists the compatible OthTags for each POS of the target.

Single Word Target
In the Single Word Target task, a complexity score between 0 and 1 needs to be assigned to the target word of the input sentence. Various regression models are trained on the optimal set of features using scikit-learn. The results are presented in Table 2. For both the Decision Tree and Extra Tree Regressors, the maximum depth (maxdepth) is tuned over the range 1 to 20, and the optimal maxdepth is found to be 6 and 8, respectively. The Random Forest Regressor with the default settings produced the best results on the trial dataset. Using this model, our submission to the shared task achieved a Pearson correlation of 0.7402 on the test set.
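A minimal sketch of the depth tuning and the default Random Forest baseline, using synthetic stand-in data (in the paper, the feature matrix and scores come from the CompLex dataset, and model selection is done on the trial set):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: 5 features, scores in roughly [0, 1]
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X[:, 0] * 0.6 + X[:, 1] * 0.3 + rng.normal(0, 0.05, 200)

# Held-out split standing in for the trial set
X_tr, X_val, y_tr, y_val = X[:150], X[150:], y[:150], y[150:]

# Tune max_depth over 1..20 for the decision tree
best_depth, best_mae = None, float("inf")
for depth in range(1, 21):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    val_mae = np.abs(model.predict(X_val) - y_val).mean()
    if val_mae < best_mae:
        best_depth, best_mae = depth, val_mae

# Random Forest with default settings (the paper's final choice)
rf = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
rf_mae = np.abs(rf.predict(X_val) - y_val).mean()
```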

Feature Importance
The Gini importances of the top 5 features are reported in Table 3. The Gini importance of a feature is computed as the (normalized) total reduction of the mean squared error brought by that feature. Feature importance is also analyzed by removing one set of features at a time and training a Random Forest Regressor on the reduced feature space. Each feature set in the optimal feature space has a positive effect on the performance of the system, as indicated by this ablation.

Inquirer Tags Importance
The inclusion of the Inquirer Tags in the feature space has a positive effect, although its magnitude is small. This may be due to the low coverage of these features, as reported in Table 5. Coverage is defined as the percentage of target words having at least one Inquirer Tag.

Additional Features
The following features, when included in the feature space, led to a decrease in performance for the present task on the trial set.
• Etymological Feature: the ISO code of the target word's origin language.

Error Analysis
Error analysis indicates that the absolute error for 87% of the test samples was less than 0.10. Samples belonging to the biomedical (CRAFT) corpus had the highest errors. Some predictions of the proposed model are presented in Table 8. The correlation between the actual and predicted complexity for similar targets in dissimilar contexts is high. However, the difference in complexity of proper-noun targets in distinct contexts could not be captured effectively through the present set of linguistic features.

Multiword Target
In the present task, the Multiword Targets are pairs of two adjacent words. We experimented with two approaches for predicting complexity scores for Multiword Targets, as described in Sections 5.1 and 5.2.

Single Word Combination
In this approach, each word of a Multiword Target is treated as an individual Single Word Target, and its complexity score is predicted using the Single Word Target model (the Random Forest Regressor without the additional features). The individual word scores are combined using the Average, Maximum, and Minimum. Additionally, the Algebraic Sum (a + b − ab) and Product (ab) of the individual scores are considered; these are taken from the fuzzy s-norm and t-norm families (Klir and Yuan, 1995). The results are indicated in Table 9. For both the trial and test sets, the maximum of the two word scores gives the lowest MAE and the highest R^2 value. However, on the trial set the highest P is obtained with the algebraic sum of the individual complexity scores, and the highest S with their product. For the test set, the algebraic sum gives the highest P and S.
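The five combination strategies can be sketched as follows, where a and b are the two single-word complexity scores, each in [0, 1]:

```python
def combine_scores(a, b, strategy):
    """Combine the two single-word complexity scores of a multiword
    target using the strategies described in the paper. The algebraic
    sum and product are the fuzzy s-norm and t-norm, respectively
    (Klir and Yuan, 1995).
    """
    strategies = {
        "average": (a + b) / 2,
        "maximum": max(a, b),
        "minimum": min(a, b),
        "algebraic_sum": a + b - a * b,  # fuzzy s-norm
        "product": a * b,                # fuzzy t-norm
    }
    return strategies[strategy]
```

Note that the algebraic sum never exceeds 1 for inputs in [0, 1], so the combined score stays a valid complexity value.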

Feature Combination
In this approach, the features corresponding to the individual words are concatenated, and a regression model is then trained on the enlarged feature space for complexity prediction. The complexity value predicted for each individual word by the Single Word Target model is also included as a feature.
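A minimal sketch of the expanded feature vector (the concatenation order is illustrative; the paper does not specify one):

```python
def multiword_features(feats_w1, feats_w2, score_w1, score_w2):
    """Build the expanded feature vector for a two-word target:
    the two words' individual feature vectors concatenated with the
    single-word complexity predictions for each word. The ordering
    here is an illustrative assumption, not the paper's exact layout.
    """
    return list(feats_w1) + list(feats_w2) + [score_w1, score_w2]
```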
The results are presented in Table 10.

Conclusion
Two approaches were explored for the Multiword Target subtask. In the first, complexity scores predicted by the Single Word model were combined using different strategies, while in the second, the feature space was expanded to accommodate the features and complexity scores corresponding to the individual target words. The latter yielded the best results. Our system achieved the 36th and 17th rank on the two subtasks, respectively. The difference in the correlation value from the top performer is less than 0.05 for the Single Word Target and 0.04 for the Multiword Target.