JCT at SemEval-2021 Task 1: Context-aware Representation for Lexical Complexity Prediction

In this paper, we present our contribution to SemEval-2021 Task 1: Lexical Complexity Prediction, where we integrate linguistic, statistical, and semantic properties of the target word and its context as features within a Machine Learning (ML) framework for predicting lexical complexity. In particular, we use BERT contextualized word embeddings to represent the semantic meaning of the target word and its context. We participated in the sub-task of predicting the complexity score of single words.


Introduction
Over the last decade, automated methods for detecting complex words have been developed. Initially, most of these methods assumed that lexical complexity is binary: words are either "difficult" or "not difficult". Thus, the first Complex Word Identification (CWI) shared task addressed binary identification of complex words (Zampieri et al., 2017). The main limitation of this assumption is that a word close to the decision boundary is considered to be as complex as one far from it. Therefore, three years ago, the CWI shared task included an additional probabilistic classification task, where participants were asked to estimate the probability that a given target word in a particular context is complex (Štajner et al., 2018).
Recently, CompLex, a new English corpus for lexical complexity prediction, was introduced (Shardlow et al., 2020). The corpus is annotated using a 5-point Likert scale (1-5) (corresponding to very easy, easy, neutral, difficult, and very difficult), and covers three genres: Bible translations, European Parliament proceedings, and biomedical articles. The SemEval-2021 shared task on Lexical Complexity Prediction (LCP, Task 1) (Shardlow et al., 2021a,b) provided participants with CompLex and defined two sub-tasks: predicting the complexity score of single words, and predicting the complexity score of multi-word expressions. We present our system for the first sub-task, predicting the complexity score of single words. Our system incorporates linguistic, statistical, and semantic properties of the target word and its context as features within a Machine Learning (ML) framework for predicting lexical complexity.
This paper is organized as follows: First, in Section 2, we describe features from previous works that we have adopted. Then, in Section 3, we describe our feature sets, the feature selection process, and the results on the trial data. Finally, our system results on the test data are detailed in Section 4, followed by conclusions in Section 5.

Related work
In this section, we briefly describe linguistic, statistical, and semantic properties that were encoded as features in previous complexity prediction tasks and were integrated into our system.
Linguistic features, such as Part-Of-Speech (POS) tags, dependency parsing relations, and syllable counts, as well as statistical features, such as word length and word frequency, have been widely used for predicting lexical complexity (Mukherjee et al., 2016; Ronzano et al., 2016; Alfter and Pilán, 2018; Gooding and Kochmar, 2018; Hartmann and Dos Santos, 2018; Kajiwara and Komachi, 2018; Wani et al., 2018). Some of these works found WordNet (Miller, 1998) to be a valuable source of lexical features. The main extracted feature is the number of synsets, but information on hypernyms, hyponyms, holonyms, and meronyms is also useful (Gooding and Kochmar, 2018; Hartmann and Dos Santos, 2018; Wani et al., 2018).
Semantic features were commonly encoded using word-embedding representations of the meaning of words (Kuru, 2016; AbuRa'ed and Saggion, 2018). These word embeddings were generated using Word2Vec context-independent models (Mikolov et al., 2013). Word2Vec models combine the different senses of a word into one single vector. However, there has recently been growing interest in contextualized word representations, such as BERT (Devlin et al., 2018). The BERT model generates context-dependent embeddings that allow a word to have several vector representations depending on the context in which it is used. In contrast to previous works that only use context-independent embeddings, our system uses BERT-based context-dependent embeddings.

System Description
We adopt a supervised Machine Learning (ML) approach for lexical complexity prediction. The first step in training a classifier is to determine which text characteristics are relevant and how those characteristics are encoded as features.

Feature Sets
We next detail how the semantic properties of the sentence, as well as the linguistic and statistical properties found useful in prior work, are encoded as features. Then, in Section 3.2, we describe our feature analysis procedure and the supervised ML model. The features in our model are divided into three sets: linguistic, statistical, and semantic.

Linguistic features
Our dataset contains three corpora, Bible, Europarl, and Biomedical, which were chosen to add variation. Since each corpus has its own unique linguistic characteristics, we first encode the text source by three binary features.
Most of our linguistic features are based on information extracted from a POS tagger. Our linguistic properties include two families of properties: morphological and syntactical.
First, we encode the target word's POS. The POS is extracted by spaCy's statistical POS tagger. Each possible POS tag is represented as a binary feature. We use the following 12 tags from the Universal POS tag set: ADJ, ADP, ADV, CONJ, DET, NOUN, NUM, PRT, PRON, VERB, and X (other). As an additional feature, we encode the number of syllables in the target word. Then, we calculate the number of punctuation marks and the number of stopwords in the sentence (two features).
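For illustration, a minimal sketch of these morphological features is given below. It assumes spaCy for tagging and textstat for syllable counting; the helper name and the exact tool choices are illustrative rather than a description of our actual implementation.

```python
# Illustrative sketch of the morphological features (assumed tools, not the exact system code).
# pip install spacy textstat; python -m spacy download en_core_web_sm
import spacy
import textstat

nlp = spacy.load("en_core_web_sm")

# Universal POS tags used as binary features (note: newer spaCy versions use a
# slightly different tag inventory, e.g. CCONJ/PART instead of CONJ/PRT).
POS_TAGS = ["ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
            "NUM", "PRT", "PRON", "VERB", "X"]

def morphological_features(sentence: str, target: str) -> dict:
    doc = nlp(sentence)
    token = next((t for t in doc if t.text.lower() == target.lower()), None)
    pos = token.pos_ if token is not None else "X"
    feats = {f"pos_{t}": int(pos == t) for t in POS_TAGS}   # one-hot POS tag
    feats["n_syllables"] = textstat.syllable_count(target)  # syllables in the target word
    feats["n_punct"] = sum(t.is_punct for t in doc)         # punctuation marks in the sentence
    feats["n_stop"] = sum(t.is_stop for t in doc)           # stopwords in the sentence
    return feats
```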
Next, we represent syntactic context by POS patterns. The POS pattern covers seven words: the target word and the three words before and after it. Each of these words is encoded by 12 binary features, resulting in 84 features.
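Continuing the sketch above, the POS pattern can be encoded as one one-hot block per position in the seven-word window (positions falling outside the sentence are all zeros); with the 12-tag set this yields the 84 binary features described above.

```python
# Sketch of the POS-pattern features (continues the previous sketch: `doc` is a
# spaCy Doc and `tags` is the POS tag list).
def pos_pattern_features(doc, target_index: int, tags) -> list:
    pattern = []
    for offset in range(-3, 4):                        # three words before ... three after
        i = target_index + offset
        pos = doc[i].pos_ if 0 <= i < len(doc) else None
        pattern.extend(int(pos == t) for t in tags)    # one one-hot block per position
    return pattern
```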
We also measure the polysemy degree of the target word using its number of senses in WordNet. We obtain two lexical features: the number of synsets for the target word, and the number of synsets for the target word given its POS.
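These two counts can be obtained directly from a WordNet interface; the sketch below uses NLTK's, which is an assumption rather than the exact library we used.

```python
# Sketch of the WordNet polysemy features (NLTK WordNet interface assumed).
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# Map the universal POS tags that WordNet covers to WordNet POS codes.
WN_POS = {"NOUN": wn.NOUN, "VERB": wn.VERB, "ADJ": wn.ADJ, "ADV": wn.ADV}

def wordnet_features(target: str, pos: str) -> dict:
    n_synsets = len(wn.synsets(target))                       # senses over all POS
    n_synsets_pos = (len(wn.synsets(target, pos=WN_POS[pos]))
                     if pos in WN_POS else 0)                  # senses for the given POS
    return {"n_synsets": n_synsets, "n_synsets_pos": n_synsets_pos}
```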

Statistical features
We define several statistical features. First, we calculate the target word length and the sentence length. Then, we extract the target word frequency from the Google N-gram word frequencies, and encode the logarithm of this frequency as a feature to speed up the convergence of the ML algorithms (three features in total).
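A minimal sketch of these three features is given below; the `ngram_frequency` lookup stands in for the Google N-gram counts and is an assumption, not the actual frequency source code.

```python
# Sketch of the statistical features: lengths and log-frequency.
import math

def statistical_features(sentence: str, target: str, ngram_frequency) -> dict:
    freq = ngram_frequency(target.lower())      # hypothetical lookup into Google N-gram counts
    return {
        "word_length": len(target),
        "sentence_length": len(sentence.split()),
        "log_frequency": math.log(freq + 1),    # log compresses the heavy-tailed counts
    }
```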

Semantic features
We represent the meaning of the surrounding context of the target word by vectors in the same semantic space; we use the BERT semantic space. BERT is a bidirectional transformer pre-trained on a large corpus containing the Toronto Book Corpus and Wikipedia using a combination of a masked language modeling objective and next sentence prediction. The BERT contextualized vectors are used to represent the semantic meaning of the sentence by averaging the BERT vectors of seven words: the target word and the three words before and after it. Thus, our semantic representation adds 768 features (the size of the BERT output layer).
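A sketch of this averaging with the HuggingFace Transformers library and the bert-base-uncased checkpoint (our assumed setup, not necessarily the exact configuration) is shown below.

```python
# Sketch of the semantic features: average the contextualized BERT vectors of the
# target word and the three words before and after it (768 dimensions).
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def bert_window_features(sentence: str, target_index: int) -> torch.Tensor:
    words = sentence.split()
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]           # (n_subtokens, 768)
    window = set(range(max(0, target_index - 3),
                       min(len(words), target_index + 4)))   # seven-word window
    word_ids = enc.word_ids(0)                               # sub-token -> word index
    idx = [i for i, w in enumerate(word_ids) if w in window]
    return hidden[idx].mean(dim=0)                           # 768-dimensional feature vector
```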
To extract additional features, we use two machine learning algorithms: K-Means and k-Nearest Neighbors (KNN). K-Means is an unsupervised learning algorithm used for clustering: it takes the unlabeled dataset and groups the instances into k clusters. We encode the K-Means results by four binary features, one feature per cluster (k=4). The results of the KNN algorithm are encoded similarly. However, KNN is a supervised learning algorithm used for classification: it takes the labeled dataset and uses it to learn how to label other sentences. KNN classifies an unseen sentence by a vote among its k nearest neighbors. We use four complexity classes: 0-0.25, 0.26-0.5, 0.51-0.75, and 0.76-1.
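The sketch below illustrates these eight binary features with scikit-learn; the default hyper-parameters and the binning via np.digitize are our own assumptions.

```python
# Sketch of the K-Means and KNN features: one binary feature per cluster / per class.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def cluster_features(X_train, X_new, train_scores, k=4):
    # Unsupervised: group the training vectors into k clusters, then assign new instances.
    kmeans = KMeans(n_clusters=k, random_state=0).fit(X_train)
    km_cluster = kmeans.predict(X_new)

    # Supervised: discretize the gold scores into the four complexity classes and fit KNN.
    classes = np.digitize(train_scores, [0.25, 0.5, 0.75])   # 0-0.25, 0.26-0.5, 0.51-0.75, 0.76-1
    knn = KNeighborsClassifier().fit(X_train, classes)
    knn_class = knn.predict(X_new)

    one_hot = lambda idx: np.eye(k)[idx]                     # one binary feature per cluster/class
    return np.hstack([one_hot(km_cluster), one_hot(knn_class)])
```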

Feature Selection
For each of the above feature sets, we tried to filter out irrelevant features using several approaches.
First, we discarded features that decreased the system performance on the training set, namely the POS pattern features, the WordNet features, and the K-Means and KNN features. We were left with 794 features. These features were selected using the Linear Regression algorithm, which was also chosen as the baseline algorithm by the task organizers. To further improve the performance of our systems, we used additional ML algorithms, such as SVM and XGBoost (see more details in Section 3.3).
Next, since correlated features do not carry unique information and may interfere with learning, we tried to discard highly correlated features. We implemented this approach using the following iterative process, whose input is the desired final number of features. First, we define an initial correlation threshold (0.9). Then, we calculate the features' pairwise correlations, and features with a correlation above the threshold are removed. Next, if we still have more features than desired, we lower the correlation threshold by 10% and repeat the process. This approach improved the performance of the SVM and Linear Regression models (selecting 97 features), but did not increase the performance of the XGBoost method.
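A minimal sketch of this iterative filter (our reconstruction of the procedure described above, using pandas for the pairwise correlations) is given below.

```python
# Sketch of the iterative correlation filter: drop one feature from every pair whose
# absolute pairwise correlation exceeds the threshold, lowering the threshold by 10%
# until at most `n_desired` features remain.
import pandas as pd

def correlation_filter(features: pd.DataFrame, n_desired: int, threshold: float = 0.9):
    selected = features.copy()
    while selected.shape[1] > n_desired:
        corr = selected.corr().abs()
        cols = list(corr.columns)
        to_drop = set()
        for i in range(len(cols)):
            for j in range(i + 1, len(cols)):
                if corr.iloc[i, j] > threshold:
                    to_drop.add(cols[j])          # keep the first feature of each correlated pair
        selected = selected.drop(columns=list(to_drop))
        threshold *= 0.9                          # lower the threshold by 10% and repeat
    return selected
```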
We note that we also tried to filter out features using principal component analysis (PCA) (Song et al., 2010). PCA projects the features onto a smaller set of components that retain as much of the information present in the full data as possible. PCA was performed both on the full feature list and on specific feature subsets, such as the BERT features, but it did not improve performance.
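For completeness, the PCA variant can be sketched as follows (scikit-learn, with standardization as an assumed preprocessing step):

```python
# Sketch of the PCA experiment: project the (standardized) feature matrix onto the
# components that retain most of the variance.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_reduce(X, n_components=100):
    X_std = StandardScaler().fit_transform(X)          # zero mean, unit variance per feature
    return PCA(n_components=n_components).fit_transform(X_std)
```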
Some of the classification models performed poorly with such a large number of features (794). Therefore, we further filtered the features by calculating their correlation with the complexity score and discarding features with low correlation (less than 0.072). This resulted in a final list of 101 features. It is interesting to note that, even though there are 12 POS tags, only 2 are informative for the complexity prediction task. Among the three source text indicators, the Bible indicator is not useful. Out of the 768 BERT features, only 94 remained (12.2% of the vector).
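The univariate filter used to reach the 101 features can be sketched as follows (again with pandas; the implementation details are an assumption):

```python
# Sketch of the final filter: keep only features whose absolute correlation with the
# gold complexity score is at least the 0.072 cut-off.
import pandas as pd

def target_correlation_filter(features: pd.DataFrame, scores, min_corr: float = 0.072):
    corr = features.corrwith(pd.Series(scores, index=features.index)).abs()
    return features.loc[:, corr >= min_corr]
```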
The BERT representation of the sentence is generated by a pre-trained language representation model. Such models can be trained on datasets from various domains. Since one of our corpora is from the biomedical domain, we examined the system performance using the domain-specific BioBERT (Lee et al., 2020). Figure 1 compares the error rate of our system using the original BERT (left) and BioBERT (right). The columns show the error rate for the different text sources (from left to right: Bible, Biomedical, and Europarl), and the red line is the average error rate. Surprisingly, the error rate of BioBERT on the biomedical domain is higher than that of the original BERT. However, the average error for both is the same (∼0.69).

Application of five Machine Learning methods
We combined the features in a supervised learning framework using five ML methods: Linear Regression, Support Vector Machine (SVM), XGBoost (XGB), KNN, and Stacking (Stack). We trained the ML methods on the train set and evaluated their performance on the trial set. We ran these ML methods with the scikit-learn open-source machine learning package in Python (Pedregosa et al., 2011), using the default parameters. Table 1 shows the performance of the different ML methods on the feature set of 101 features described above. The MAE is omitted from the table because it is similar for all the ML algorithms (0.01). The performance differences between the algorithms were not substantial. Therefore, we next report the performance of all these methods on the test set.
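A sketch of this setup is given below. Since the task is to predict a continuous complexity score, we show the regression variants of the five methods with their default parameters; the composition of the stacking ensemble is an assumption.

```python
# Sketch of the five ML methods (scikit-learn defaults; the xgboost package for XGB).
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.metrics import mean_absolute_error
from xgboost import XGBRegressor

base_models = {
    "LinearRegression": LinearRegression(),
    "SVM": SVR(),
    "XGB": XGBRegressor(),
    "KNN": KNeighborsRegressor(),
}
models = dict(base_models)
models["Stack"] = StackingRegressor(estimators=list(base_models.items()),
                                    final_estimator=LinearRegression())

def evaluate(X_train, y_train, X_trial, y_trial):
    # Train each method on the train set and report its MAE on the trial set.
    return {name: mean_absolute_error(y_trial,
                                      model.fit(X_train, y_train).predict(X_trial))
            for name, model in models.items()}
```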

Results
To increase the size of our train set for the test phase of the task, we used both the train and trial sets to train the final model. To analyze our results, we converted the complexity scores to labels following the Shardlow et al. (2020) descriptors. In Figure 2, we present the classification confusion matrix of the XGBoost algorithm. Each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class. Most of the classification errors (18.54%) were due to incorrect classification of very easy words as easy. There were also errors in the opposite direction (4.36%). Most of the remaining classification errors were between neutral and easy, in both directions (7.42% + 6.43% = 13.85%). We note that the fifth class, very difficult, does not appear in the confusion matrix, since there are no very difficult words in the test set and the system did not classify any of the words as very difficult.
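For reference, the label conversion and confusion matrix can be sketched as follows; the bin edges used to map scores back onto the five descriptors are our assumption of the Shardlow et al. (2020) mapping, not its published definition.

```python
# Sketch of the error analysis: map gold and predicted scores onto the five CompLex
# descriptors and build the confusion matrix (rows = actual, columns = predicted).
import numpy as np
from sklearn.metrics import confusion_matrix

LABELS = ["very easy", "easy", "neutral", "difficult", "very difficult"]

def to_labels(scores, edges=(0.2, 0.4, 0.6, 0.8)):   # assumed bin edges
    return np.digitize(scores, edges)                # 0..4, indexing into LABELS

def label_confusion(y_true, y_pred):
    return confusion_matrix(to_labels(y_true), to_labels(y_pred),
                            labels=list(range(len(LABELS))))
```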

Conclusions and Future Work
We have implemented a system that incorporates linguistic, statistical, and semantic features to predict the lexical complexity of a target word in context. The BERT semantic space was used to represent the word and its context. We investigated several feature selection approaches and used various supervised algorithms. Even though our system was not highly ranked, we believe that some of the presented ideas can be useful for future research on lexical complexity prediction. In particular, we think that BERT is a powerful model that should be further explored. Perhaps fine-tuning BERT for the complexity prediction task would increase the system's performance.