ANDI at SemEval-2021 Task 1: Predicting complexity in context using distributional models, behavioural norms, and lexical resources

In this paper we describe our participation in the Lexical Complexity Prediction (LCP) shared task of SemEval 2021, which involved predicting subjective ratings of complexity for English single words and multi-word expressions, presented in context. Our approach relies on a combination of distributional models, both context-dependent and context-independent, together with behavioural norms and lexical resources.


Introduction
In our day-to-day life, outside the laboratory, we almost never come across single words or pairs of words in isolation. Instead, such verbal stimuli are typically embedded within sentences or phrases, and our understanding of individual words and word pairs is influenced by their linguistic contexts (e.g., the context disambiguates their intended meaning). However, almost all behavioural norms collected so far focus only on single words or word pairs (Johns, Jamieson, & Jones, 2020). Therefore, the Lexical Complexity Prediction (LCP) shared task (Shardlow, Evans, Paetzold, & Zampieri, 2021), hosted at SemEval 2021, constitutes a timely and valuable contribution to the study of context-dependent semantics. The task requires competitors to predict subjective ratings of complexity for words or pairs of words, presented within sentences. As mentioned by the organisers, being able to automatically estimate contextualised complexity ratings would have several practical applications, such as detecting and simplifying portions of text that might be particularly difficult to understand for second language learners and people with low literacy levels (e.g., as a result of suffering from a reading impairment).
In this paper we describe our submission to the competition, based on distributional models, both context-dependent and context-independent, as well as behavioural norms/lexical resources (see https://github.com/armandrotaru/TeamAndi-LCP). The best results are obtained by combining the three classes of predictors. However, the improvement in performance over using just context-independent models is small and, in practice, might be outweighed by the impressive vocabulary size and ease of use of the context-independent models.

General Description
In order to predict word complexity in context, we combined information from three types of sources, namely behavioural norms/lexical resources, and two distinct classes of distributional models:
• context-independent models, which output the same vector representation for a given word, regardless of the context in which the word is encountered;
• context-dependent models, which output a potentially different representation for a given word, as a function of the context in which the word is presented.
Our approach was very similar to that employed in (Rotaru, 2020) for predicting ratings of concreteness in context.

System Description
We tested three groups of predictors, both in isolation and combined. The first group was obtained from comprehensive datasets of subjective ratings (concreteness, age of acquisition, etc.), task performance measures (i.e., response times and accuracies in lexical decision tasks), as well as frequency, contextual diversity, and prevalence counts, plus CEFR word lists (see the references from the beginning of the previous section). In order to extend the coverage of the subjective ratings, we did not use the original data, but instead relied on extrapolated ratings for more than 70,000 words. The extrapolation was based on the Skip-gram, GloVe, and ConceptNet NumberBatch models, using linear regression over the concatenated vector dimensions. For the (already extrapolated) ratings from (Paetzold & Specia, 2016), as well as for the frequency, contextual diversity, and prevalence counts, we employed only the normed values, as they already have very good coverage. We also used only the original lexical decision data, given that response times and accuracies are difficult to extrapolate, and we did not try to extend the CEFR word lists, due to methodological difficulties. For the single word datasets, we employed all the previously mentioned factors, whereas for the multi-word expression datasets, we employed only our own extrapolated factors.
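The extrapolation of ratings can be illustrated with a minimal sketch. It uses random stand-in data rather than the actual embeddings and norms; the function name `extrapolate_ratings`, the toy dimensions, and the use of ordinary least squares with an intercept are assumptions made for illustration only.

```python
import numpy as np

def extrapolate_ratings(embeddings, rated_idx, ratings):
    """Fit a linear regression from embedding dimensions to ratings on the
    rated words, then predict a rating for every word in the vocabulary."""
    # Append an intercept column to the (concatenated) embedding matrix.
    X = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    # Ordinary least squares, using only the words that have a rating.
    coef, *_ = np.linalg.lstsq(X[rated_idx], ratings, rcond=None)
    return X @ coef

rng = np.random.default_rng(0)
# Stand-ins for three embedding models defined over the same 200-word vocabulary.
skipgram = rng.normal(size=(200, 5))
glove = rng.normal(size=(200, 4))
numberbatch = rng.normal(size=(200, 3))
concatenated = np.hstack([skipgram, glove, numberbatch])  # shape (200, 12)

# Toy "ratings": a noiseless linear function of the embeddings (made-up data),
# treated as known for the first 100 words only.
true_weights = rng.normal(size=12)
all_ratings = concatenated @ true_weights + 3.0
rated_idx = np.arange(100)

predicted = extrapolate_ratings(concatenated, rated_idx, all_ratings[rated_idx])
```

In this toy setting the ratings of the 100 unrated words are recovered exactly, because the data are noiseless and linear by construction; real norms would of course be predicted with some error.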
The second group was generated from Skip-gram, GloVe, and ConceptNet NumberBatch embeddings. The vocabulary of the models was that described in the discussion above.
For the first two sources of information, and for each selected variable V (e.g., semantic diversity), we generated either four predictors, in the case of the single word datasets, or nine predictors, in the case of the multi-word expression datasets. The quantities involved are defined as follows:
• V(w) denotes the value of V corresponding to the single word w (e.g., w = "sons"). If w is not present in our norms/models, we set V(w) to the average value of V, computed over the entire vocabulary;
• V(w1) and V(w2) denote the values of V corresponding to the words w1 and w2 (e.g., w1 = "skillful", w2 = "workman") that make up the multi-word expression w1 w2 (i.e., w1 w2 = "skillful workman"). As before, if w1 and/or w2 are not present in our norms/models, we set V(w1) and/or V(w2) to the average value of V, computed over the entire vocabulary;
• V(c) denotes the value of V corresponding to the context c in which the single word w, or the multi-word expression w1 w2, is encountered (e.g., w = "sons", c = "The ____ of Perez: Hezron, and Hamul."; or w1 w2 = "skillful workman", c = "He made it the work of a ____."). Computing this value involves calculating the average $V(c) = \frac{1}{N} \sum_{i=1}^{N} V(c_i)$, where V(c_i) is the value of V corresponding to the i-th context word, calculated as described previously, and N is the number of context words.
These predictors allowed us to include both the individual contributions of the single word w, or the multi-word expression w1 w2, and the context c, as well as certain interactions between the former and the latter.
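The computation of V(w) and V(c), including the fallback to the vocabulary mean, can be sketched as follows. The norm values below are made up, and the helper names, the lower-casing of words, and the tokenisation of the context (punctuation and the target slot removed) are assumptions made for illustration.

```python
def value_or_mean(word, norms, vocab_mean):
    """V(w): the normed value of a word, or the vocabulary mean if unseen."""
    return norms.get(word.lower(), vocab_mean)

def context_value(context_words, norms, vocab_mean):
    """V(c): the mean of V over the N context words."""
    values = [value_or_mean(w, norms, vocab_mean) for w in context_words]
    return sum(values) / len(values)

# Hypothetical toy norms for one variable V (the numbers are invented).
norms = {"the": 1.0, "of": 1.0, "and": 1.0, "sons": 3.0}
vocab_mean = sum(norms.values()) / len(norms)  # (1 + 1 + 1 + 3) / 4 = 1.5

# Context "The ____ of Perez: Hezron, and Hamul." with the target slot removed.
context = ["The", "of", "Perez", "Hezron", "and", "Hamul"]
v_w = value_or_mean("sons", norms, vocab_mean)    # 3.0
v_c = context_value(context, norms, vocab_mean)   # (1 + 1 + 1.5 + 1.5 + 1 + 1.5) / 6 = 1.25
```

Here the three proper names are missing from the toy norms, so each of them contributes the vocabulary mean to the context average.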
The third group was derived from the BERT, RoBERTa, ELECTRA, ALBERT, and DeBERTa models. We used the standard (base) versions of each model (i.e., without task-specific fine-tuning), as described in the original papers, with the exception of ELECTRA, where we employed the small, base, and large versions of the model. The implementations of the models were all obtained from the Hugging Face repository (Wolf et al., 2020). The predictors consisted only of the activations for the single word w, or the multi-word expression w1 w2, averaged over the last four hidden layers.
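The pooling of activations can be sketched on a plain array of hidden states. The array below is a random stand-in; with the Hugging Face implementations, such activations can be requested by passing `output_hidden_states=True` to the model and stacking the returned `hidden_states` tuple. The function name `target_representation`, the token positions, and the averaging over the subword tokens of the target are assumptions made for illustration.

```python
import numpy as np

def target_representation(hidden_states, target_positions, n_layers=4):
    """Average the last `n_layers` hidden layers, then average over the
    token positions that belong to the target word or expression."""
    # hidden_states has shape (num_layers, num_tokens, hidden_size).
    layer_mean = hidden_states[-n_layers:].mean(axis=0)  # (num_tokens, hidden_size)
    return layer_mean[target_positions].mean(axis=0)     # (hidden_size,)

# Random stand-in for the activations of a model with 13 hidden-state layers
# and 768 units per layer, applied to a 10-token sentence.
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(13, 10, 768))

# Assume the target occupies token positions 4 and 5.
vector = target_representation(hidden_states, target_positions=[4, 5])
```

The resulting vector has one entry per hidden unit, and each entry would serve as one predictor in the regression.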
To predict ratings of complexity in context, we first zero-centred all the aforementioned predictors. We then employed ridge regression (lambda = 3000) for the single word dataset, and a combination of ridge regression (lambda = 1250) and gradient-boosted decision trees for the multi-word expression dataset.
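A minimal sketch of ridge regression on zero-centred predictors is given below, using the closed-form solution and random stand-in data. The function names are hypothetical, and centring the target (so that no intercept is penalised) is an assumption. With thousands of predictors, a dedicated library solver would be more practical than forming the full matrix as done here.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression on zero-centred predictors."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Solve (Xc^T Xc + lam * I) w = Xc^T (y - mean(y)).
    gram = Xc.T @ Xc + lam * np.eye(X.shape[1])
    w = np.linalg.solve(gram, Xc.T @ (y - y_mean))
    return w, x_mean, y_mean

def predict_ridge(X, model):
    """Apply the training-set centring and the fitted weights to new data."""
    w, x_mean, y_mean = model
    return (X - x_mean) @ w + y_mean

# Toy data: 50 items, 8 predictors, noiseless linear target (made-up values).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))
y = X @ rng.normal(size=8) + 2.0

weak = fit_ridge(X, y, lam=1e-8)      # almost no regularization
strong = fit_ridge(X, y, lam=3000.0)  # the lambda used for single words
predictions = predict_ridge(X, strong)
```

Larger values of lambda shrink the weights towards zero, so the `strong` model has a smaller weight norm than the `weak` one.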

Results and Discussion
The results for English are shown in Figure 1, for various sets of predictors and regularization strengths. For reasons of space, we only present the results for ridge regression, but note that similar patterns of performance are obtained for gradient-boosted decision trees and other types of models, such as shallow neural networks. Results are averaged over 10 rounds of 10-fold cross-validation, using only the training dataset.
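The evaluation protocol, 10 rounds of 10-fold cross-validation, can be sketched as follows. The data are random stand-ins, an ordinary least-squares model takes the place of the actual regressors, and the use of Pearson correlation as the score is an assumption; all function names are hypothetical.

```python
import numpy as np

def repeated_kfold_score(X, y, fit, predict, score, n_rounds=10, n_folds=10, seed=0):
    """Mean held-out score over `n_rounds` independently shuffled k-fold splits."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_rounds):
        # A fresh random partition of the items into folds for every round.
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        for test_idx in folds:
            train_mask = np.ones(len(y), dtype=bool)
            train_mask[test_idx] = False
            model = fit(X[train_mask], y[train_mask])
            scores.append(score(y[test_idx], predict(model, X[test_idx])))
    return float(np.mean(scores)), len(scores)

def fit_ols(X, y):
    """Least-squares fit with an intercept (stand-in for the real model)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ols(coef, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ coef

def pearson(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Toy data: 200 items, 3 predictors, linear target with a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

mean_score, n_scores = repeated_kfold_score(X, y, fit_ols, predict_ols, pearson)
```

Each round contributes 10 held-out scores, so the reported value is an average over 100 train/test splits.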
The results indicate that context-independent models (Fig. 1b) outperform behavioural norms (Fig. 1a) and context-dependent models (Fig. 1c-f). A likely reason for the superiority of context-independent models over context-dependent models is the fact that the former were trained on huge corpora (i.e., 100-840 billion tokens), while the latter were trained on considerably smaller corpora (i.e., 3-33 billion tokens). However, in spite of this significant training disadvantage, context-dependent models produce competitive levels of performance, a finding which can likely be attributed to several factors, such as the highly non-linear integration of contextual information, the use of self-attention mechanisms, and the use of more sophisticated learning objectives.
Combining the three classes of predictors produces a relatively small improvement in predictive performance, as compared to relying on any single class. This reflects a very high degree of redundancy between the complexity-related information present in the three types of predictors.
Interestingly, even for the largest set of predictors, consisting of 13,400 variables for only 1,517 data points, the degree of regularization does not appear to matter much, indicating little overfitting.
Finally, there is a small, but systematic difference in performance between single words and multi-word expressions, in favour of the latter, even though the training set for single word stimuli is roughly five times larger than that for multi-word stimuli. A potential explanation for this finding might be that the individual variability in meaning for multi-word expressions is smaller than that for single words, given that expressions should be more informative than single words, in virtue of their length (i.e., two words vs one word).

Conclusions
Our results suggest that several approaches can be employed quite successfully to predict ratings of complexity in context, for both single words and multi-word expressions. In terms of performance, the best predictors are those derived from context-independent models (e.g., Skip-gram), but relatively good results can also be obtained by using context-dependent models (e.g., BERT) and behavioural norms (e.g., subjective ratings of familiarity). Moreover, given that their vocabulary covers a remarkable number of words (i.e., more than 500 thousand, for each of the Skip-gram, GloVe, and ConceptNet NumberBatch models), and that they are very easy to use off-the-shelf, context-independent models represent a particularly promising approach to predicting ratings of complexity in context.