UPB at SemEval-2021 Task 1: Combining Deep Learning and Hand-Crafted Features for Lexical Complexity Prediction

Reading is a complex process which requires proper understanding of texts in order to create coherent mental representations. However, comprehension problems may arise due to hard-to-understand sections, which can prove troublesome for readers, especially given their specific language skills. As such, steps can be taken towards simplifying these sections by accurately identifying and evaluating difficult structures. In this paper, we describe our approach for the SemEval-2021 Task 1: Lexical Complexity Prediction competition, which combines advanced NLP techniques, namely Transformer-based language models, pre-trained word embeddings, Graph Convolutional Networks, and Capsule Networks, with a series of hand-crafted textual complexity features. Our models are applicable to both subtasks and achieve good results: a MAE below 0.07 and a Pearson correlation of .73 for single word identification, as well as a MAE below 0.08 and a Pearson correlation of .79 for multi-word targets. Our results are just 5.46% and 6.5% lower than the top scores obtained in the competition on the first and second subtasks, respectively.


Introduction
Reading is a complex process due to the mental exercise readers are challenged to perform, since a coherent representation of the text needs to be projected into their mind in order to grasp the underlying content (Van den Broek, 2010). For non-native speakers, the lack of text understanding hinders knowledge assimilation, thus becoming the main obstacle that readers need to overcome. Complex words can impose serious difficulties, considering that their meaning is often dependent on their context and cannot be easily inferred. In order to facilitate text understanding or to perform text simplification, complex tokens first need to be detected. This can be achieved by developing systems capable of identifying them through both individual and contextual analysis.
There are two main approaches to the complexity task. Tokens can be binary classified as complex or non-complex, a procedure that helps users separate problematic words from the others. Words can also be labeled with a probabilistic complexity value, which in turn can be used to simplify the text. Words with lower degrees of complexity can be easily explained, whereas more complex tokens can be replaced with simpler equivalent concepts.
The Lexical Complexity Prediction (LCP) shared task, organized as the SemEval-2021 Task 1 (Shardlow et al., 2021a), challenged the research community to develop robust systems that identify the complexity of a token, given its context. Systems were required to be easily adaptable, considering that the dataset entries originated from multiple domains. At the same time, the target structure evaluated in terms of complexity could contain a single word or multiple words, depending on the subtask.
The current work is structured as follows. The next section presents the state-of-the-art Natural Language Processing (NLP) approaches for LCP (probabilistic) and complex word identification (CWI). The third section outlines our approaches for this challenge, while the fourth section presents the results. Afterwards, the final section draws the conclusions and includes potential solutions that can be used to further improve performance.

Related Work
Probabilistic CWI. Kajiwara and Komachi (2018) adopted a system based on Random Forest regressors for the CWI task, alongside several features, such as the presence of the target word in certain corpora. Moreover, they conducted experiments to determine the best parameters for their regression algorithms.
De Hertog and Tack (2018) introduced a deep learning architecture for probabilistic CWI. Apart from the features extracted by the first layers of the network, the authors also included a series of handcrafted features, such as psychological measures or frequency counts. Their architecture included different Long Short-Term Memory (LSTM) modules (Hochreiter and Schmidhuber, 1997) for the input levels (i.e., word, sentence), as well as the previously mentioned psychological measures and corpus counts.
Sequence labeling CWI. Gooding and Kochmar (2019) introduced a technique based on LSTMs for CWI, which obtained better results on their sequence labeling task than previous approaches based only on feature engineering. The contexts detected by the LSTM offered valuable information, useful for identifying complex tokens placed in sequences.
Changing the focus towards text analysis, Finnimore et al. (2019) extracted a series of relevant features that support the detection of complex words. According to their feature analysis, the greatest influence on overall system performance came from the number of syllables and the number of punctuation marks accompanying the target tokens.
A different approach to CWI was adopted by Zampieri et al. (2017), who employed an ensemble built from the top systems of the SemEval CWI 2016 competition (Paetzold and Specia, 2016). Other experiments performed by the authors also included plurality voting (Polikar, 2006) and a technique named Oracle (Kuncheva et al., 2001), which assigns the correct label whenever at least one classifier detects it. Zaharia et al. (2020) tackled CWI through a cross-lingual approach. Resource scarcity is simulated by training on a small number of examples from one language and testing on different languages, through zero-shot, one-shot, and few-shot scenarios. Transformer-based models (Vaswani et al., 2017) achieved good performance on the target languages, even though the number of training entries was extremely reduced.

Dataset
CompLex (Shardlow et al., 2020, 2021b) is the dataset used for the LCP shared task; it was initially annotated on a 5-point Likert scale. Moreover, the authors mapped the annotations to values between 0 and 1 in order to ensure normalization. The dataset has two categories, one developed for single word complexity score prediction, while the other is centered on groups of words; each category has entries for training, trial, and testing. The single word dataset contains 7,662 entries for training, 421 trial entries, and 917 test entries. The multi-word dataset contains a considerably smaller number of entries for each category, namely 1,517 for training, 99 trial entries, and 184 for testing. All entries from the LCP shared task are part of one of three different English corpora (i.e., Bible - biblical entries, Biomed - biomedical entries, and Europarl - political entries), evenly distributed, each one representing approximately 33% of its corresponding set. As such, the task is even more challenging when considering the vastly different domains of these entries.
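The mapping from the 5-point Likert scale onto [0, 1] can be sketched as a simple linear rescaling. This is an illustrative assumption on our part; the exact aggregation over annotators is described in the CompLex papers.

```python
def likert_to_unit_interval(score: float) -> float:
    """Map a 5-point Likert rating (1..5) linearly onto [0, 1]."""
    if not 1.0 <= score <= 5.0:
        raise ValueError("Likert score must be in [1, 5]")
    return (score - 1.0) / 4.0
```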

Architecture
During our experiments, we combined features obtained from multiple modules described later on in detail, and then applied three regression layers, alongside a Dropout layer, to obtain the complexity score of the input (see Figure 1 for our modular architecture). The permanent components are represented by the target word embeddings and the Transformer features, which are concatenated and then fed into the final linear layers, designated for regression. The other components (i.e., character-level embeddings, GCN, and Capsule) are enabled in particular setups; similarly, the adversarial training component can also be disabled. At the same time, a series of hand-crafted features can be concatenated before the last layer with the aim to further improve the overall performance.
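The final regression stage can be sketched as below, with NumPy standing in for the actual deep learning framework; the feature dimensions, the ReLU activations, and the dropout rate are illustrative assumptions rather than the exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def regression_head(features: np.ndarray, weights: list,
                    dropout_p: float = 0.1, train: bool = False) -> float:
    """Dropout followed by three linear layers producing a scalar complexity score."""
    h = features
    if train and dropout_p > 0:
        mask = rng.random(h.shape) >= dropout_p  # inverted dropout
        h = h * mask / (1.0 - dropout_p)
    for i, (W, b) in enumerate(weights):
        h = h @ W + b
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)  # ReLU between regression layers (assumed)
    return float(h[0])

# Hypothetical sizes: 300-d target word embedding + 768-d Transformer features
word_emb = rng.normal(size=300)
transformer_feats = rng.normal(size=768)
x = np.concatenate([word_emb, transformer_feats])  # the permanent components

dims = [x.size, 128, 64, 1]  # hidden sizes are assumptions
weights = [(rng.normal(scale=0.01, size=(dims[i], dims[i + 1])), np.zeros(dims[i + 1]))
           for i in range(3)]
score = regression_head(x, weights)
```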

Pre-trained Word Embeddings
Pre-trained word embeddings were used as features for the final regression as an initial representation of the input. Throughout our experiments, three types of pre-trained word embeddings were considered, namely GloVe, fastText, and skip-gram. Out of the three, GloVe performed best in our experiments. As such, the results section exclusively reports the performance obtained by our configurations alongside GloVe embeddings for the target word.
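Looking up a pre-trained vector for a target word can be sketched as follows; the toy vectors and the zero-vector fallback for out-of-vocabulary words are illustrative assumptions, but the parsing matches GloVe's plain-text format.

```python
import numpy as np
from io import StringIO

def load_glove(handle):
    """Parse GloVe's text format: one token per line, followed by its vector."""
    emb = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        emb[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return emb

# Tiny hypothetical excerpt (real GloVe vectors are 50-300 dimensional)
toy = StringIO("the 0.1 0.2 0.3\ncomplex 0.4 0.5 0.6\n")
embeddings = load_glove(toy)
unk = np.zeros(3, dtype=np.float32)  # assumed fallback for out-of-vocabulary targets
vec = embeddings.get("complex", unk)
```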

Transformer-based Language Models
Considering that Transformers achieve state-of-the-art performance for most NLP tasks (Wolf et al., 2020), all our setups include a Transformer-based component. However, these models are pre-trained in different manners; thus, we experimented with several variants, as follows:
• BERT (Devlin et al., 2019) - Extensively pre-trained on English, BERT-base represents the baseline of Transformer-based models;
• BioBERT - a BERT model pre-trained on biomedical corpora;
• SciBERT - a BERT model pre-trained on scientific publications;
• RoBERTa (Liu et al., 2019) - RoBERTa improves upon BERT by modifying key hyperparameters and by training with larger mini-batches and learning rates; it usually achieves better performance on downstream tasks.

Adversarial Training
We also aimed to improve the robustness of the main element of our architecture, the Transformer-based component. Therefore, we adopted an adversarial training technique similar to Karimi et al. (2020). The adversarial examples generated during training operate at the embedding level and are based on a technique that uses the gradient of the loss function.
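A minimal sketch of this idea is a Fast Gradient Method style perturbation, which moves the embedding along the normalized gradient of the loss; the toy squared-error loss below is an assumption for illustration, not the actual task loss.

```python
import numpy as np

def fgm_perturb(embedding: np.ndarray, grad: np.ndarray, epsilon: float = 1.0):
    """Move the embedding by epsilon along the normalized gradient of the loss,
    producing an adversarial example of bounded norm."""
    norm = np.linalg.norm(grad)
    if norm == 0:
        return embedding.copy()
    return embedding + epsilon * grad / norm

# Toy squared-error loss L = (w.e - y)^2; its gradient w.r.t. e is 2*(w.e - y)*w
w = np.array([0.5, -0.3, 0.2])
e = np.array([1.0, 2.0, 0.5])
y = 0.1
grad = 2 * (w @ e - y) * w
e_adv = fgm_perturb(e, grad, epsilon=0.5)
```

Training then adds the loss on `e_adv` to the loss on the clean embedding, so the model learns to resist small worst-case perturbations.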

Character-level Embeddings
Alongside the previously mentioned word embeddings for the target word, we also employ character-level embeddings for the same word, such that its internal structure, as well as its universal context, can be properly captured as features in our architecture.

Graph Convolutional Networks
Besides the previous Transformer-based models, we also explored the relations between the dataset entries, as well as the vocabulary words. Therefore, Graph Convolutional Networks (GCN) (Kipf and Welling, 2016) were also considered for determining node embedding vectors, by taking into account the properties of the neighboring nodes. By stacking multiple GCN layers, the information embedded into a node can become broader, inasmuch as it incorporates considerably larger neighborhoods. Similar to Yao et al. (2019), we consider the graph to have a number of nodes equal to the number of entries (documents) in the corpus plus the vocabulary size of the corpus.
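A single GCN layer can be sketched in NumPy as the standard propagation rule of Kipf and Welling (2016); the toy adjacency matrix and feature dimensions are illustrative assumptions.

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W), so every node
    embedding aggregates the features of its neighbors and itself."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # symmetric degree normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy document/word graph with 4 nodes and 2-d initial node features
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = np.eye(4, 2)           # hypothetical initial features
W = np.ones((2, 3)) * 0.5  # hypothetical layer weights
H1 = gcn_layer(A, H, W)
```

Stacking this layer twice lets each node see its two-hop neighborhood, which is how the broader context mentioned above accumulates.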

Capsule Networks
Alongside the relational approach derived from GCN and Transformer embeddings, we intended to further analyze our inputs by passing them through a Capsule Network (Sabour et al., 2017). This approach enables us to obtain features that reflect aspects specific to different levels of the inputs, as Capsule Networks increase the specificity of features as the capsule layers go deeper.
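The characteristic non-linearity of Capsule Networks is the squash function from Sabour et al. (2017), sketched below; it shrinks short capsule vectors toward zero and long ones toward unit length, so vector length can encode the probability that a feature is present.

```python
import numpy as np

def squash(s: np.ndarray, eps: float = 1e-9) -> np.ndarray:
    """squash(s) = (||s||^2 / (1 + ||s||^2)) * s / ||s||"""
    sq_norm = float(s @ s)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

v = squash(np.array([3.0, 4.0]))  # input of length 5 -> output length 25/26
```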
Hand-Crafted Features

Character n-grams - The character n-gram approach consists of two steps: first, a vectorizer is applied on the inputs to select at most the 5,000 most frequent n-grams; second, Tf-Idf scores for these elements are computed. The obtained values are then normalized in the [0, 1] range and used as features.
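A stdlib sketch of this two-step pipeline follows; the n-gram length, the smoothed idf formula, and the max-based normalization are assumptions, not the paper's exact vectorizer settings.

```python
import math
from collections import Counter

def char_ngrams(text: str, n: int = 3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_features(corpus, n=3, max_features=5000):
    """Tf-idf over the most frequent character n-grams, rescaled into [0, 1]."""
    # Step 1: keep the `max_features` most frequent n-grams as the vocabulary
    global_counts = Counter(g for doc in corpus for g in char_ngrams(doc, n))
    vocab = [g for g, _ in global_counts.most_common(max_features)]
    # Step 2: compute tf-idf scores per document
    df = {g: sum(1 for doc in corpus if g in doc) for g in vocab}
    n_docs = len(corpus)
    rows = []
    for doc in corpus:
        tf = Counter(char_ngrams(doc, n))
        rows.append([tf[g] * math.log((1 + n_docs) / (1 + df[g])) for g in vocab])
    # Normalize all scores into the [0, 1] range
    hi = max((v for row in rows for v in row), default=1.0) or 1.0
    return vocab, [[v / hi for v in row] for row in rows]

vocab, feats = tfidf_features(["the complex word", "the simple word"])
```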
ReaderBench indices - The ReaderBench framework (Dascalu et al., 2017) was used to extract additional textual complexity features reflective of writing style. Out of the 1,311 features obtained by running ReaderBench on our inputs, we selected 278. The choice was made by considering only the features with high linguistic coverage (i.e., non-zero for at least 50% of the entries).
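The coverage-based selection can be sketched as follows; the toy feature matrix is illustrative.

```python
def select_by_coverage(feature_matrix, min_coverage=0.5):
    """Keep only feature columns that are non-zero for at least `min_coverage`
    of the entries, mirroring the 278-out-of-1,311 selection described above."""
    n_rows = len(feature_matrix)
    n_cols = len(feature_matrix[0])
    kept = [j for j in range(n_cols)
            if sum(1 for row in feature_matrix if row[j] != 0) / n_rows >= min_coverage]
    return kept, [[row[j] for j in kept] for row in feature_matrix]

# Toy matrix: the middle column is zero for 2 of 3 entries, so it is dropped
X = [[1.0, 0.0, 3.0],
     [2.0, 0.0, 0.0],
     [0.5, 4.0, 1.0]]
kept, X_sel = select_by_coverage(X)  # kept == [0, 2]
```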

Traditional Machine Learning Baseline
We experimented with several machine learning algorithms, such as Logistic Regression, Random Forest regressors, XGBoost regression, and Ridge regression, using the aforementioned hand-crafted features.
We then switched to a Ridge regression approach and trained it with a multitude of features, including Transformer-based embeddings (BERT, BioBERT, SciBERT, RoBERTa), pre-trained word embeddings (GloVe, fastText, skip-gram), and hand-crafted features.
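Ridge regression has a convenient closed form, sketched below on synthetic stand-ins for the concatenated features; the dimensions and the regularization strength are illustrative assumptions.

```python
import numpy as np

def ridge_fit(X: np.ndarray, y: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Closed-form ridge regression: w = (X^T X + alpha*I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))  # stand-in for concatenated embedding features
true_w = np.array([0.3, -0.2, 0.1, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.01, size=200)  # synthetic complexity scores
w = ridge_fit(X, y, alpha=0.1)
pred = X @ w
```

Averaging the predictions of several such instances, each trained on a different feature subset, is one plausible reading of "combining the results obtained from training several instances" in the results below.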

Preprocessing and Experimental Setup
Text preprocessing is minimal and consists of removing unnecessary punctuation, such as quotes. The experimental hyperparameters for all modules are presented in Table 1.

Results

Table 2 introduces the results obtained using our deep learning architecture, while Table 3 focuses on the traditional machine learning baseline. The best results for the deep learning approaches applied on the single target word dataset are obtained using RoBERTa as the Transformer model. The setup that maximizes performance considers RoBERTa, GCN, and Capsule features, obtaining a Pearson correlation of 0.7702 and a mean absolute error (MAE) of 0.0671 on the trial dataset. Moreover, the high performance is maintained on the test dataset, with a Pearson correlation coefficient of 0.7237 and a MAE of 0.0677. BERT, SciBERT, and BioBERT have similar results with marginal differences; GCN, Capsule, and adversarial training improve performance for all models, while character-level embeddings do not provide a boost in performance.

Table 3 presents the results obtained using the features described in Section 3, namely Transformer-based contextualized embeddings (BERT, RoBERTa, BioBERT, SciBERT), pre-trained word embeddings (GloVe, fastText, skip-gram), and hand-crafted features, all combined using various regression algorithms. Logistic Regression, Random Forest, and XGBoost yield lower performance when compared to the previous deep learning approaches. However, we managed to increase the scores on the single target word dataset, with Pearson coefficients of 0.7738 and 0.7340 on the trial and test datasets, respectively, by combining the results obtained from training several instances of Ridge regression.
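The two evaluation metrics used throughout, MAE and the Pearson correlation coefficient, can be computed as below; the gold/predicted values are made up for illustration.

```python
import numpy as np

def mae(y_true, y_pred) -> float:
    """Mean absolute error between gold and predicted complexity scores."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def pearson(y_true, y_pred) -> float:
    """Pearson correlation coefficient, the task's primary metric."""
    a = np.asarray(y_true) - np.mean(y_true)
    b = np.asarray(y_pred) - np.mean(y_pred)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

gold = [0.25, 0.40, 0.10, 0.80]  # hypothetical complexity labels
pred = [0.30, 0.35, 0.15, 0.70]
```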
Nevertheless, the best results for the multiple target word task are still obtained by the deep learning approaches (RoBERTa, GCN, Capsule, adversarial training), which surpass the Ridge regression + pre-trained word embeddings + Transformer embeddings + hand-crafted features approach by a small margin of 0.0074 Pearson on the trial dataset and 0.0033 on the test dataset.

Discussion
The entries that differ most from the gold standard are part of the Biomed category. This discrepancy holds for both subtasks (i.e., single target word and multiple target words). The Biomed entries contain more complex terminology, quantities, or specific scientific names. Therefore, it becomes more difficult for standard pre-trained Transformer systems, such as BERT or RoBERTa, to adapt to the Biomed entries. In contrast, corpora with easier to understand language (i.e., Bible and Europarl) are not properly represented when using BioBERT or SciBERT, considering that these Transformers are mainly pre-trained on scientific or biomedical texts.
Moreover, a considerable part of the Biomed entries contains large amounts of abbreviations, while other entries from the same domain contain specific terms or links, as seen in Table 4. The differences between our predictions and the correct labels reach up to 0.14 in terms of complexity probability.

Conclusions and Future Work
This work proposes a modular architecture, as well as different training techniques, for the Lexical Complexity Prediction shared task. We experimented with different variations of the previously mentioned architecture, as well as textual features alongside machine learning algorithms. First, we used different word embeddings and Transformer-based models as the main feature extractors and, at the same time, we examined a different training technique based on adversarial examples. Second, other models were added, such as character-level embeddings, Graph Convolutional Networks, and Capsule Networks. Third, several hand-crafted features were also considered to create a solid baseline covering both deep learning and traditional machine learning regressors.
For future work, we intend to experiment with altering the modular architecture such that the models are trained in a manner similar to a Generative Adversarial Network (Croce et al., 2020), thus further improving robustness and achieving higher scores in terms of both Pearson correlation coefficient and MAE.