BigGreen at SemEval-2021 Task 1: Lexical Complexity Prediction with Assembly Models

This paper describes a system submitted by team BigGreen to LCP 2021 for predicting the lexical complexity of English words in a given context. We assemble a feature engineering-based model with a deep neural network model founded on BERT. While BERT itself performs competitively, our feature engineering-based model helps in extreme cases, e.g., separating instances of easy and neutral difficulty. Our handcrafted features comprise a breadth of lexical, semantic, syntactic, and novel phonological measures. Visualizations of BERT attention maps offer insight into potential features that Transformer models may learn when fine-tuned for lexical complexity prediction. Our ensembled predictions score reasonably well on the single word subtask, and we demonstrate how they can be harnessed to perform well on the multi-word expression subtask too.


Introduction
Lexical simplification (LS) is the task of replacing difficult words in text with simpler alternatives. It is relevant in reading comprehension, where early studies have shown that infrequent words lead to more time spent by a reader fixated on them, and that ambiguity in a word's meaning adds to comprehension time (Rayner and Duffy, 1986). Complex word identification (CWI) is believed to be a fundamental step in the automation of lexical simplification (Shardlow, 2014). Early techniques for conducting CWI suffer from a lack of robustness, ranging from simplifying all words and then studying the efficacy of each simplification (Devlin, 1998), to applying thresholds on features like word frequency (Zeng et al., 2005).
This year's Lexical Complexity Prediction (LCP) shared task (Shardlow et al., 2021) forgoes the treatment of word difficulty as a binary classification task (Paetzold and Specia, 2016a; Yimam et al., 2018) and instead measures degree of complexity on a continuous scale. This choice is intriguing, as it mitigates a dilemma with previous approaches: words extremely close to a decision boundary (say, a threshold deemed to separate easy from difficult words) had to be treated identically to those far away from it, i.e., the extremely easy or the extremely difficult.
Teams are asked to submit predictions on unlabeled test sets for two subtasks: predicting the complexity of English single words and of multi-word expressions (MWEs). For each subtask, BigGreen presents a machine learning-based approach that fuses the predictions of a feature engineering-based regressor with those of a feature learning-based deep neural network founded on BERT (Devlin et al., 2018). Our code is made available on GitHub.1

Related Work
Previous studies have estimated the readability of a given text at the sentence level.
McLaughlin (1969) regresses the number of polysyllabic words in a given lesson against the mean score of students quizzed on that lesson, yielding the SMOG Readability Formula. Dale and Chall (1948) offer a list of 768 (later updated to 3,000) words familiar to grade-school students in reading, which they find correlates with passage difficulty. An issue with traditional readability metrics is their loss of generality at the word level. Shardlow (2013) tries a brute force approach in which a simplification algorithm is applied to each word of a given text, deeming a word complex only if it gets simplified. However, this rests on the assumption that a non-complex word never requires further simplification. They also try assigning a familiarity score to each word and determining whether the word is complex by applying a threshold. We avoid thresholding our features in this study, finding it unnecessary: raw familiarity scores can serve directly as features in a regression task.
Results from CWI at SemEval-2016 (Zampieri et al., 2017) suggest vote ensembling the predictions of the best performing models as an effective strategy, while several top-performing models (Paetzold and Specia, 2016b; Ronzano et al., 2016; Mukherjee et al., 2016) use linguistic information beyond word frequency alone. This inspires our use of ensemble techniques, and a foray into phonological features as a new point of research. Results from CWI at SemEval-2018 show feature engineering-based models outperforming their deep learning-based counterparts, despite the latter's generally better performance since SemEval-2016.

CompLex Dataset
Shardlow et al. (2020) present CompLex, a novel dataset in which each target expression (a single word or two-token MWE) is assigned a continuous label denoting its lexical complexity. Each label lies in the range 0-1, and represents the (normalized) average score given by crowd workers who rated the expression's difficulty on a 5-point Likert scale. We define a sample's class as the bin to which its complexity label belongs, where bins are formed using the following mapping of complexity ranges: class 1: [0, 0.2), class 2: [0.2, 0.4), class 3: [0.4, 0.6), class 4: [0.6, 0.8), class 5: [0.8, 1.0]. Target expressions in CompLex have an average complexity of 0.395 and a standard deviation of 0.115, reflecting an imbalance in favor of class 2 and 3 samples.
Each target expression is accompanied by the sentence it was extracted from, drawn from one of three corpora (Bible, Biomed, and Europarl). A summary of the train, trial, and test set samples is provided in Table 1. (In our study we avoid the trial set, as we find it less representative of the training data, opting instead for cross-validation on the training set, stratified by corpus and complexity label.)

External Datasets
We use four additional corpora from which to extract term frequency-based features:
• English Gigaword Fifth Edition (Gigaword): articles from seven English newswires (Parker et al., 2011).
• Google Books Ngrams, version 2 (GBND): counts occurrences of phrases across a corpus of books, accessed via the PhraseFinder API (Trenkmann).
• British National Corpus, version 3 (BNC): a collection of written and spoken English text (BNC Consortium, 2007).
• SUBTLEXus: American English subtitles, offering a multitude of word frequency lists (Brysbaert and New, 2009).

BigGreen System & Approaches
In this section, we give an overview of the features fed to our feature engineering-based model, as well as the training techniques for our feature learning-based model. We describe our features in detail in Appendix A. Note that the fitted models for the single word subtask are later harnessed for the MWE subtask.

Feature Extraction
We aim to capture a breadth of information pertaining to the target word and its context. Most features follow heavily right-skewed distributions, prompting us to also consider the log-transformed version of each feature (see the sketch below). For the MWE subtask, features are extracted independently for head and tail words.
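As a concrete illustration, the following minimal sketch (our own, not from the released code) appends log-transformed copies of skewed feature columns; log1p is an assumption on our part, chosen so that zero-valued counts map to zero rather than negative infinity.

import numpy as np
import pandas as pd

def add_log_features(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    # Append a log-transformed copy of each right-skewed feature column.
    # log1p is assumed here: it keeps zero counts finite (log1p(0) = 0).
    for col in cols:
        df["log_" + col] = np.log1p(df[col])
    return df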

Lexical Features
These are features based on lexical information about the target word (a sketch combining them follows the list):
• Word length: length of the target word.
• Number of syllables: number of syllables in the target word, via the Syllables library.3
• Is acronym: whether the target word is a sequence of capital letters.
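A minimal sketch of these three features, assuming the Syllables PyPI package as the syllable estimator and an all-capital-letters test for acronyms; the function and key names are our own.

import syllables  # pip install syllables

def lexical_features(word: str) -> dict:
    return {
        "word_length": len(word),                              # character count
        "num_syllables": syllables.estimate(word),             # heuristic syllable estimate
        "is_acronym": int(word.isalpha() and word.isupper()),  # sequence of capital letters
    }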

Semantic Features
These features capture the target word's meaning:
• WordNet features: the number of hyponyms and hypernyms associated with the target word in WordNet (Fellbaum, 2010).
• ELMo word embeddings: we extract for each target word a 1024-dimension contextualized embedding pre-trained on the One Billion Word Benchmark (Peters et al., 2018).
• GloVe context embeddings: we obtain the 300-dimension GloVe word embedding averaged over all tokens in the given sentence (a sketch of this follows).
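A minimal sketch of the context embedding, assuming the Wikipedia + Gigaword GloVe release loaded via gensim; the paper does not specify which 300-d GloVe vectors were used, so this choice is ours.

import numpy as np
import gensim.downloader as api

# Assumption: the exact 300-d GloVe release is not stated in the paper;
# the Wikipedia + Gigaword vectors are one plausible choice.
glove = api.load("glove-wiki-gigaword-300")

def context_embedding(tokens):
    # Average the GloVe vectors of all in-vocabulary tokens in the sentence.
    vecs = [glove[t.lower()] for t in tokens if t.lower() in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(300)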

Phonetic Features
These features compute the likelihood that the soundable portions of the target word would arise in the English language. We estimate ground truth transition probabilities between any two units (phonemes or characters) using Gigaword (see the sketch after this list):
• Phoneme transition probability: we consider the min/max/mean/standard deviation over the set of transition probabilities for the target word's phoneme bigrams.
• Character transition probability: analogous to that above, over character bigrams.
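Below is a sketch of the character-level variant under our own assumptions: the corpus is an iterable of words, and transition probabilities are simple maximum-likelihood bigram estimates. The phoneme variant is analogous but requires a grapheme-to-phoneme step (e.g., via CMUdict), which is not shown.

from collections import Counter
from statistics import mean, pstdev

def fit_char_bigram_model(corpus_words):
    # Estimate P(c2 | c1) from character-bigram counts over a corpus.
    bigrams, firsts = Counter(), Counter()
    for w in corpus_words:
        w = w.lower()
        firsts.update(w[:-1])          # first character of each bigram
        bigrams.update(zip(w, w[1:]))
    return {bg: n / firsts[bg[0]] for bg, n in bigrams.items()}

def transition_features(word, probs):
    # Min/max/mean/std over the word's character-bigram transition probabilities.
    p = [probs.get(bg, 0.0) for bg in zip(word.lower(), word.lower()[1:])]
    if not p:
        return {"min": 0.0, "max": 0.0, "mean": 0.0, "std": 0.0}
    return {"min": min(p), "max": max(p), "mean": mean(p), "std": pstdev(p)}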

Word Frequency & N-gram Features
These features are expressly included due to their expected importance (Zampieri et al., 2017). Gigaword is the main corpus from which we extract word frequency measures (for both lemmatized and unlemmatized versions of the target word), the summed frequency of the target word's byte pair encodings (BPEs), and the summed frequencies of bigrams and trigrams (a sketch of the summed n-gram features follows). We complement these features with their IDF-based analogues. Lastly, we use the GBND, BNC, and SUBTLEXus corpora to extract secondary word frequency, bigram, and trigram measures.
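A sketch of how a summed n-gram frequency might be computed. The corpus_counts Counter and the restriction to n-grams containing the target word are our reading of the feature descriptions in Appendix A, not a confirmed detail.

from collections import Counter

def ngrams(tokens, n):
    return list(zip(*(tokens[i:] for i in range(n))))

def summed_ngram_tf(context_tokens, target, corpus_counts, n):
    # Sum the corpus frequencies of the context's n-grams that contain
    # the target word; corpus_counts is a Counter built over the corpus.
    return sum(corpus_counts[g] for g in ngrams(context_tokens, n) if target in g)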

Syntactic Features
These features assess the syntactic structure of the target word's context. We construct the constituency parse tree for each sentence using a Stanford CoreNLP pipeline (Manning et al., 2014); a sketch follows the list.
• Depth of parse tree: the parse tree's height.
• Depth of target word: distance (in edges) between target word and parse tree's root node.
• Is proper: whether the target word is a proper noun/adjective, detected using capitalization.
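A sketch of the two tree-depth features, assuming a bracketed parse string has already been produced (e.g., by CoreNLP); NLTK's Tree class is used here purely for traversal and is our own choice.

from nltk import Tree

def syntactic_features(parse_str, target):
    # parse_str: bracketed constituency parse, e.g. from CoreNLP.
    tree = Tree.fromstring(parse_str)
    leaf_idx = tree.leaves().index(target)
    # Depth = number of edges from the root down to the target word's leaf.
    target_depth = len(tree.leaf_treeposition(leaf_idx))
    return {"tree_depth": tree.height(), "target_depth": target_depth}

syntactic_features("(ROOT (S (NP (DT the) (NN cat)) (VP (VBD slept))))", "cat")
# -> {'tree_depth': 5, 'target_depth': 4}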

Training
Prior to training, we Z-score standardize all features. For the single word subtask, we fit a suite of regressors, including Linear and Lasso (Tibshirani, 1996) models (see the sketch below). After identifying the best performing model by Pearson correlation, we seek to mitigate the imbalanced nature of the target variable (i.e., the multitude of class 1-3 samples and the lack of class 4 and 5 samples): we devise a sister version of our top-performing model, fit on a reduced training set. For the reduced set, we tune the percentages removed from classes 1-3 by performing cross-validation on the full training set.
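A minimal sketch of the standardize-then-compare loop using scikit-learn, scoring by Pearson correlation. The candidate set, the Lasso alpha, and the synthetic stand-in data are our own; the corpus- and label-stratified cross-validation mentioned above is omitted for brevity.

import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pearson = make_scorer(lambda y_true, y_pred: pearsonr(y_true, y_pred)[0])

# Synthetic stand-ins for the real feature matrix and complexity labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 0.5 * X[:, 0] + rng.normal(scale=0.1, size=200)

for name, model in {"linear": LinearRegression(), "lasso": Lasso(alpha=1e-3)}.items():
    pipe = make_pipeline(StandardScaler(), model)  # Z-score standardize, then fit
    print(name, cross_val_score(pipe, X, y, cv=5, scoring=pearson).mean())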

Approach based on Feature Learning
Our handcrafted feature set relies heavily on target word-specific features. Beyond the N-gram and syntactic features, it offers only a cursory analysis of the context surrounding the target word. We therefore seek an alternative, automated approach using feature learning.

Architecture
LSTM-based approaches have been used to model the contexts of target words in past works (Hartmann and Dos Santos, 2018; De Hertog and Tack, 2018). An issue with a single LSTM is that it reads the tokens of an input sentence sequentially in only a single direction (e.g., left-to-right). This inspires us to try a Transformer-based approach (Vaswani et al., 2017): architectures that process sentences as a whole (instead of word-by-word) by applying attention mechanisms to them. Attention weights are useful as they can be interpreted as learned relationships between words. BERT (Devlin et al., 2018) is one such model, used for a variety of natural language understanding (NLU) tasks.
The Multi-Task Deep Neural Network (MT-DNN) proposed by Liu et al. (2019) offers state-of-the-art results for multiple NLU tasks by incorporating the benefits of both multi-task learning and language model pre-training. We initialize MT-DNN's shared text encoding layers with a pre-trained BERT base model (cased), and fine-tune its later layers for 5 epochs, using a mean squared error loss function and default hyperparameters. These hyperparameter settings are provided in Appendix B. Note that the model is fine-tuned on the CompLex corpus only.

Input Layer
Data is fed to the model's input layer in PremiseAndOneHypothesis format, the premise and hypothesis being the sentence and the target word/MWE, respectively. The data is preprocessed by a BERT tokenizer from Hugging Face (Wolf et al., 2020), as sketched below.
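A sketch of how such a premise/hypothesis pair can be tokenized with the Hugging Face tokenizer; the example sentence, padding strategy, and max_length are illustrative, not the paper's exact settings.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

sentence = "The patient presented with acute myocardial infarction."
target = "infarction"

# Premise (sentence) and hypothesis (target) are packed into one sequence:
# [CLS] sentence tokens [SEP] target tokens [SEP]
encoding = tokenizer(sentence, target, truncation=True,
                     padding="max_length", max_length=128, return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(encoding["input_ids"][0][:16].tolist()))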

Output Layer
Our model's output layer produces the predicted lexical complexity for a given target word/MWE. Additionally, we extract attention maps across each of the model's attention heads, for each test set sample. These will be assessed in Section 6.3.

Ensembling
Our best performing feature engineering-based regression model yields two sets of predictions (from fitting on the full and reduced training sets, respectively). We default to the full predictions, then tune a threshold above which predictions (likely belonging to class 4 and 5 samples) are overwritten with the reduced predictions. We compute a weighted average ensemble of these predictions with those of our MT-DNN model to obtain a final set of predictions for the single word subtask.
For the MWE subtask, the fitted models from the previous subtask are harnessed to predict lexical complexities for the head and tail words. We then compute a weighted average ensemble of these predicted complexities and the predictions of an MT-DNN model trained on MWEs. A sketch of the single-word ensembling step follows.
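A minimal sketch of the overwrite-then-blend step for the single word subtask. The threshold of 0.59 is taken from Appendix B.3; the blend weight w is our own placeholder, as the tuned ensemble weights are not reported.

import numpy as np

def ensemble(xgb_full, xgb_reduced, mtdnn, threshold=0.59, w=0.5):
    # Overwrite full-model predictions above the threshold (likely class 4/5
    # samples) with the reduced-model predictions, then blend with MT-DNN.
    # threshold = 0.59 per Appendix B.3; the blend weight w is an assumption.
    xgb = np.where(xgb_full > threshold, xgb_reduced, xgb_full)
    return w * xgb + (1.0 - w) * mtdnn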

Results
We present the performances of BigGreen's system on each subtask in Tables 2 and 3. For feature selection, we find success in selecting the top-300 features by mutual information and removing quasi-constant features. The pruned feature set is passed to wrapper/embedded methods and a variety of regressors for model comparison. We find an XGBoost regressor (with hyperparameters tuned via grid search) to excel consistently on the single word subtask (a sketch of this pipeline follows). As shown in Table 2, we rank in the top 15% by Pearson correlation.
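A sketch of the selection-plus-regression pipeline in scikit-learn. The variance threshold for "quasi-constant" and the search grid are hypothetical, as the tuned values are not reported in the paper.

from sklearn.feature_selection import SelectKBest, VarianceThreshold, mutual_info_regression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from xgboost import XGBRegressor

pipe = Pipeline([
    ("quasi_constant", VarianceThreshold(threshold=0.01)),  # drop near-constant features
    ("top_k", SelectKBest(mutual_info_regression, k=300)),  # top-300 by mutual information
    ("xgb", XGBRegressor(objective="reg:squarederror")),
])

# Hypothetical grid; the tuned hyperparameter values are not reported.
search = GridSearchCV(pipe, {"xgb__max_depth": [3, 6],
                             "xgb__n_estimators": [200, 500]}, cv=5)
# search.fit(X_train, y_train)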
For the MWE subtask, performances are reported in Table 3. Note that our submitted predictions differ from our post-competition predictions. We previously used a training procedure resembling that of the single word subtask: (1) filter methods for feature selection, (2) XGBoost for regression, (3) ensembling with MT-DNN, passing the entire MWE as input to the XGBoost and MT-DNN models. We hypothesize that the smaller number of training samples available for this subtask contributed to that procedure's lackluster performance. This inspired us to incorporate the predictive capabilities of our fitted single word subtask models by applying them independently to the MWE's constituent head and tail words. This yields a predicted complexity for each of the head and tail words, which, when ensembled with the predictions of our MT-DNN model (which is trained on the entire MWE), gives superior results to those submitted to the competition.

Feature Contribution
In total we consider 110 features, in addition to our multidimensional embedding-based features and log-transformed features. Inspecting the estimated feature importance scores produced by the XGBoost full model, we find that term frequency-based features (e.g., unigrams, bigrams, trigrams) are of overwhelming importance (see Figure 1). This raises the question of whether the MT-DNN model also relies on term frequencies to make its predictions, and if not, which linguistic features it may have learned upon fine-tuning. Of the remaining features with non-zero importance, most are dimensions of target word-based semantic features (i.e., GloVe or ELMo embeddings). Based on the precedence given to term frequency features by the XGBoost full model, we hypothesize that for certain attention heads, the degree to which BPEs attend to one another varies with their word's rarity in the lexicon. This follows the findings of Voita et al. (2019), who identify heads in which less frequent tokens are attended to semi-uniformly by a majority of sentence tokens.

BERT Attention
To test our hypothesis, we estimate for each attention head the Pearson correlation between word frequency and the average attention given to each word in the context (a sketch of this analysis follows).4 As illustrated in Figure 2, we find multiple attention heads that appear to specialize in directing attention towards the most or least frequent words (depending on the sign of the correlation). Vertical stripe patterns like those in Figure 3 emerge as a result of attention originating from a spectrum of tokens. These findings affirm the fundamental relevance of word frequency to lexical complexity prediction, corroborating our intuition.
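A sketch of the per-head correlation computation using a pre-trained BERT via the transformers library. Note the assumptions: the paper analyzes its fine-tuned MT-DNN model, whereas this sketch uses vanilla BERT, and the log_freq mapping from token to (log) corpus frequency is left to the reader to supply.

import numpy as np
import torch
from scipy.stats import pearsonr
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertModel.from_pretrained("bert-base-cased", output_attentions=True)

def head_frequency_correlations(sentence, log_freq):
    # log_freq: assumed mapping from token to (log) corpus frequency.
    enc = tokenizer(sentence, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    with torch.no_grad():
        attentions = model(**enc).attentions  # per layer: (1, heads, seq, seq)
    freqs = np.array([log_freq.get(t, 0.0) for t in tokens])
    corrs = {}
    for layer, att in enumerate(attentions):
        for head in range(att.shape[1]):
            received = att[0, head].mean(dim=0).numpy()  # avg attention each token receives
            corrs[(layer, head)] = pearsonr(freqs, received)[0]
    return corrs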

Conclusion
In this paper, we describe the system submitted by BigGreen to the LCP 2021 shared task, which performs reasonably well on the single word subtask by ensembling feature engineering and feature learning-based models. We see potential in future deep learning approaches, while acknowledging the need for complementary word frequency-based handcrafted features for the time being. We surpass our submitted results for the MWE subtask by utilizing the predictive capabilities of our single word subtask models.
Avenues for improvement include better data aggregation, as the relative lack of class 4 and 5 samples appears to hurt Pearson correlation on extremely complex samples. One approach may involve synthetic data generation using SMOGN (Branco et al., 2017). Shardlow et al. (2020) acknowledge that a reader's familiarity with a genre may affect perceived word complexity. However, the CompLex dataset lacks information on each annotator's expertise or background, which might otherwise offer valuable new insights.

Appendix A: Features

A.4.1 Gigaword-based
tf ngram 2 • Sum of the term frequencies of each bigram in the context containing the target word.
tf ngram 3 • Sum of the term frequencies of each trigram in the context containing the target word.
tfidf ngram 2 • Sum of the term frequency-inverse document frequencies of each bigram in the context containing the target word.
tfidf ngram 3 • Sum of the term frequency-inverse document frequencies of each trigram in the context containing the target word.
A.4.2 Google N-gram-based
google ngram 1 • Term frequency of the target word.
google ngram 2 head • Term frequency of leading bigram in the context containing the target word.
google ngram 2 tail • Term frequency of trailing bigram in the context containing the target word.
google ngram 2 min • Minimum of the set of term frequencies of bigrams in the context containing the target word.
google ngram 2 max • Maximum of the set described above.
google ngram 2 mean • Average of the set described above.
google ngram 2 std • Standard deviation of the set described above.
google ngram 3 head • Term frequency of leading trigram in the context containing the target word.
google ngram 3 mid • Term frequency of middle trigram in the context containing the target word.
google ngram 3 tail • Term frequency of trailing trigram in the context containing the target word.
google ngram 3 min • Minimum of the set of term frequencies of trigrams in the context containing the target word.
google ngram 3 max • Maximum of the set described above.
google ngram 3 mean • Average of the set described above.
google ngram 3 std • Standard deviation of the set described above.

SUBTLEXus-based
FREQcount • Number of times the target word appears in the corpus.
CDcount • Number of films in which the target word appears.
FREQlow • Number of times the lowercased target word appears in the corpus.
CDlow • Number of films in which the lowercased target word appears.
SUBTLWF • Number of times the target word appears per million words.
SUBTLCD • Percent of films in which the target word appears.

Textstat-based
• Algorithms applied using Textstat library implementations, most being readability metrics.

B.2 MT-DNN
MT-DNN uses YAML as its config file format. Below are the contents of our task config file:

data_format: PremiseAndOneHypothesis
enable_san: false
metric_meta:
- Pearson
- Spearman
n_class: 1
loss: MseCriterion
kd_loss: MseCriterion
adv_loss: MseCriterion
task_type: Regression

B.3 Ensemble
Threshold above which a sample is assigned its reduced prediction (i.e., the XGBoost reduced prediction) instead of its full prediction (i.e., the XGBoost full prediction): 0.59. Note that this threshold is used to compute our XGBoost full+reduced predictions.