OCHADAI-KYODAI at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction

We propose an ensemble model for predicting the lexical complexity of words and multiword expressions (MWEs). The model receives as input a sentence with a target word or MWE and outputs its complexity score. Given that a key challenge with this task is the limited size of annotated data, our model relies on pretrained contextual representations from different state-of-the-art transformer-based language models (i.e., BERT and RoBERTa), as well as on a variety of training methods for further enhancing model generalization and robustness: multi-step fine-tuning, multi-task learning, and adversarial training. Additionally, we propose to enrich contextual representations by adding hand-crafted features during training. Our model achieved competitive results, ranking among the top-10 systems in both subtasks.


Introduction
Predicting the difficulty of a word in a given context is useful in many natural language processing (NLP) applications such as lexical simplification. Previous efforts (Paetzold and Specia, 2016; Yimam et al., 2018; Zampieri et al., 2017) have focused on framing this as a binary classification task, which might not be ideal, since a word close to the decision boundary is assumed to be just as complex as one further away (Shardlow et al., 2020). To alleviate this issue, SemEval-2021 Task 1 (Shardlow et al., 2021) formulates this as a regression task, where a model should predict the complexity value of words (Subtask 1) and MWEs (Subtask 2) in context. This paper describes the system developed by the Ochadai-Kyodai team for SemEval-2021 Task 1. Given that a key challenge in this task is the limited size of annotated data, we follow best practices from recent work on enhancing model generalization and robustness, and propose a model ensemble that leverages pretrained representations (i.e., BERT and RoBERTa), multi-step fine-tuning, multi-task learning, and adversarial training. Additionally, we propose to enrich contextual representations by incorporating hand-crafted features during training. Our model ranked 7th out of 54 participating teams on Subtask 1, and 8th out of 37 teams on Subtask 2, obtaining Pearson correlation scores of 0.7772 and 0.8438, respectively.

Task Description
SemEval-2021 Task 1 provides participants with an augmented version of the CompLex dataset (Shardlow et al., 2020), a multi-domain English dataset with sentences containing words and MWEs annotated on a continuous scale of complexity in the range [0, 1]. Easier words and MWEs are assigned lower complexity scores, while more challenging ones are assigned higher scores. The corpus contains a balanced number of sentences from three different domains: Bible (Christodouloupoulos and Steedman, 2015), Europarl (Koehn, 2005) and Biomedical (Bada et al., 2012). The task is to predict the complexity value of single words (Subtask 1) and MWEs (Subtask 2) in context. The statistics of the corpus are presented in Table 1. Our team participated in both subtasks, and the next section gives an overview of our model.

Training Procedures
Standard fine-tuning: This is the standard fine-tuning procedure, where we fine-tune BERT and RoBERTa on the data of each subtask.

Feature-enriched fine-tuning (FEAT): During training, we enrich the BERT and RoBERTa representations with word frequency information for the target word or MWE. We compute log frequency values using the Wiki40B corpus (Guo et al., 2020); for MWEs, we compute the log of the average frequency of the component words. After applying min-max normalization to this feature, we concatenate it to the [CLS] token vector obtained from the last layer of BERT or RoBERTa.

Multi-step fine-tuning (MSFT): Multi-step fine-tuning performs a second stage of pretraining on data-rich, related supervised tasks. It has been shown to improve model robustness and performance, especially in data-constrained scenarios (Phang et al., 2018; Camburu et al., 2019). Due to the limited size of the data provided for Subtask 2, we first fine-tune BERT and RoBERTa on the Subtask 1 dataset; the model's parameters are then further refined by fine-tuning on the Subtask 2 dataset.

Multi-task learning (MTL): Multi-task learning is an effective training paradigm for promoting model generalization and performance (Caruana, 1997; Liu et al., 2015, 2019a; Ruder, 2017; Collobert et al., 2011). It works by leveraging data from many (related) tasks. In our experiments, we use the MT-DNN framework (Liu et al., 2019a, 2020b), which incorporates BERT and RoBERTa as the text encoding layers shared across all tasks, while the top layers are task-specific. We used the pretrained BERT and RoBERTa models to initialize the shared layers and refined them via MTL on both subtasks (i.e., Subtask 1 and Subtask 2).
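To illustrate the FEAT setup, the sketch below appends a min-max normalized log frequency to a [CLS] vector. The frequency counts and the normalization bounds (`freq`, `lo`, `hi`) are made-up stand-ins, not the actual Wiki40B statistics, and the zero vector stands in for the encoder output.

```python
import numpy as np

# Hypothetical frequency counts (the system uses Wiki40B counts).
freq = {"length": 120000.0, "of": 9000000.0}

def log_freq(target: str) -> float:
    """Log frequency of a single word, or the log of the average
    frequency of the component words for an MWE."""
    words = target.lower().split()
    return float(np.log(np.mean([freq[w] for w in words])))

def min_max(value: float, lo: float, hi: float) -> float:
    """Min-max normalization to [0, 1] using corpus-wide bounds."""
    return (value - lo) / (hi - lo)

# Toy [CLS] vector standing in for the encoder's last-layer output.
cls_vec = np.zeros(768)
f = min_max(log_freq("length of"), lo=np.log(1.0), hi=np.log(1e7))
enriched = np.concatenate([cls_vec, [f]])  # 769-dim input to the regressor
```

The enriched vector is then fed to the task-specific regression head in place of the plain [CLS] vector.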
Adversarial training (ADV): Adversarial training has proven effective at improving model generalization and robustness in computer vision (Madry et al., 2017; Goodfellow et al., 2014) and, more recently, in NLP (Zhu et al., 2019; Liu et al., 2020a; Pereira et al., 2020). It works by augmenting the input with a small perturbation that maximizes the adversarial loss:

min_θ E_{(x,y)∼D} [ max_{‖δ‖≤ε} ℓ(f(x + δ; θ), y) ]    (1)

where the inner maximization can be solved by projected gradient descent (Madry et al., 2017). In our experiments, we use SMART (Jiang et al., 2020), which instead regularizes the standard training objective using virtual adversarial training (Miyato et al., 2018):

min_θ E_{(x,y)∼D} [ ℓ(f(x; θ), y) + α max_{‖δ‖≤ε} ℓ(f(x + δ; θ), f(x; θ)) ]    (2)

Effectively, the adversarial term encourages smoothness in the input neighborhood, and α is a hyperparameter that controls the trade-off between the standard error and the adversarial error.
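The inner maximization of the virtual adversarial term can be sketched as follows, using a toy linear model in place of the BERT/RoBERTa regressor; the squared difference stands in for the loss ℓ, and the random initialization, ascent step, and projection onto the ε-ball mirror the projected gradient procedure described above.

```python
import numpy as np

def f(x, w):
    """Toy linear regressor standing in for encoder + regression head."""
    return float(w @ x)

def virtual_adv_perturbation(x, w, eps=1e-5, step=1e-3, sigma=1e-5, steps=1):
    """One (or more) projected-gradient ascent steps on the virtual
    adversarial loss l(f(x + delta), f(x)) -- a sketch of the inner max."""
    rng = np.random.default_rng(0)
    delta = rng.normal(0.0, sigma, size=x.shape)   # random initialization
    for _ in range(steps):
        # For this linear model and squared loss, the gradient w.r.t.
        # delta is 2 * (f(x + delta) - f(x)) * w.
        grad = 2.0 * (f(x + delta, w) - f(x, w)) * w
        delta = delta + step * np.sign(grad)       # ascent step
        norm = np.linalg.norm(delta)
        if norm > eps:                             # project onto the eps-ball
            delta = delta * (eps / norm)
    return delta

x = np.ones(4)
w = np.array([0.5, -1.0, 2.0, 0.1])
delta = virtual_adv_perturbation(x, w)
adv_loss = (f(x + delta, w) - f(x, w)) ** 2  # alpha-weighted in the objective
```

In the actual system the perturbation is applied to the word embeddings rather than to raw inputs, and the smoothness term is added to the standard training loss.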

Ensemble Model
Ensembling deep learning models has proven effective at improving test accuracy (Allen-Zhu and Li, 2020). We built different ensemble models by taking an unweighted average of the outputs of a few independently trained models. Each single model was trained with standard fine-tuning, multi-step fine-tuning, multi-task learning, or adversarial training, using different text encoders (i.e., BERT or RoBERTa).
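The unweighted averaging is simple enough to state in a few lines; the model names and prediction values below are illustrative only.

```python
import numpy as np

# Complexity predictions from independently trained single models
# (toy values; names echo the configurations described above).
preds = {
    "roberta_large_adv": np.array([0.31, 0.55, 0.12]),
    "roberta_base_feat": np.array([0.29, 0.60, 0.10]),
    "roberta_large_mtl": np.array([0.33, 0.52, 0.14]),
}

# Ensemble prediction: unweighted average of the single models' outputs.
ensemble = np.mean(list(preds.values()), axis=0)
```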

Implementation Details
Our model implementation is based on the MT-DNN framework (Liu et al., 2019a, 2020b). We use BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019b) as the text encoders. We used Adam (Kingma and Ba, 2015) as our optimizer, with a learning rate in {8 × 10^-6, 9 × 10^-6, 1 × 10^-5} and a batch size in {8, 16, 32}. The maximum number of epochs was set to 10. A linear learning rate decay schedule with a warmup ratio of 0.1 was used, unless stated otherwise. To avoid exploding gradients, we clipped the gradient norm to 1. All texts were tokenized into wordpieces and truncated to spans no longer than 512 tokens. During adversarial training, we follow the SMART settings and set the perturbation size to 1 × 10^-5, the step size to 1 × 10^-3, and the variance for initializing the perturbation to 1 × 10^-5. The number of projected gradient steps and the α parameter (Equation 2) were both set to 1.
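For reference, the linear warmup-then-decay schedule described above behaves as in this simplified sketch (a stand-in for the actual MT-DNN scheduler, not its implementation):

```python
def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 1e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup from 0 to peak_lr over the first warmup_ratio of
    training, then linear decay back to 0 at the final step."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    return peak_lr * (total_steps - step) / max(1, total_steps - warmup_steps)
```

With `total_steps=1000` and `warmup_ratio=0.1`, the rate rises linearly for the first 100 steps, peaks at `peak_lr`, and reaches zero at step 1000.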
Following Devlin et al. (2019), we set the first token to the special [CLS] token when encoding the input. For Subtask 1, we separate the input sentence and the target token with the special token [SEP], e.g., "[CLS] This was the length of Sarah's life [SEP] length [SEP]". For Subtask 2, such an encoding led to lower performance, so we consider only the target MWE when encoding the input.
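The two input formats can be written as simple string templates; the Subtask 2 format shown here ([CLS] followed by the MWE alone) is our reading of the description above, since the paper does not spell out that example.

```python
def encode_subtask1(sentence: str, target: str) -> str:
    """Subtask 1 input: sentence and target token joined with the
    special tokens used by BERT/RoBERTa-style encoders."""
    return f"[CLS] {sentence} [SEP] {target} [SEP]"

def encode_subtask2(mwe: str) -> str:
    """Subtask 2 input: only the target MWE is encoded (assumed format)."""
    return f"[CLS] {mwe} [SEP]"

s1 = encode_subtask1("This was the length of Sarah's life", "length")
```

In practice the special tokens are added by the tokenizer rather than spliced into the raw string, but the resulting sequence is the same.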
For each subtask, we used the trial dataset released by the organizers as the development set (see Table 1). We select the best epoch and the best hyper-parameters based on performance (measured by Pearson correlation) on this development set. We also experimented with selecting the best epoch and hyper-parameters separately for each domain (Bible, Biomedical, and Europarl).

Main Results
Submitted systems were evaluated on five metrics: Pearson correlation (R), Spearman correlation (Rho), Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R2). The systems were ranked from highest Pearson correlation score to lowest. We built several models that use different text encoders and different training methods, as described in Section 3. See Table 2 for the results.
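The five metrics can be computed as in the sketch below (NumPy only; the Spearman correlation is computed as the Pearson correlation of the ranks and, for brevity, ignores tie handling):

```python
import numpy as np

def evaluation_metrics(gold, pred):
    """Pearson R, Spearman Rho, MAE, MSE, and R-squared for a set of
    gold and predicted complexity scores."""
    gold, pred = np.asarray(gold, float), np.asarray(pred, float)
    r = np.corrcoef(gold, pred)[0, 1]                      # Pearson R
    ranks = lambda a: np.argsort(np.argsort(a))            # ranks, no ties
    rho = np.corrcoef(ranks(gold), ranks(pred))[0, 1]      # Spearman Rho
    mae = float(np.mean(np.abs(gold - pred)))
    mse = float(np.mean((gold - pred) ** 2))
    r2 = 1.0 - mse / float(np.var(gold))                   # R-squared
    return {"R": r, "Rho": rho, "MAE": mae, "MSE": mse, "R2": r2}

# Toy example: four gold/predicted complexity scores.
m = evaluation_metrics([0.10, 0.35, 0.60, 0.80],
                       [0.12, 0.30, 0.55, 0.85])
```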
First, we observe that ensembling different single models yields better performance on both subtasks. Furthermore, models that use feature-enriched representations, multi-task learning, multi-step fine-tuning, or adversarial training surpass models that use the standard fine-tuning approach. We detail the results for each subtask next.
For Subtask 1, the single models that used RoBERTa with adversarial training, multi-task learning, or feature-enriched representations performed best on the development set. Moreover, selecting the best epoch and hyper-parameters separately for each domain performed better than selecting them without domain distinction. Among the single models, the best on the development set used RoBERTa-LARGE with adversarial training (the RoBERTa-LARGE (ADV) domain model, with a Pearson score of 0.8441), followed by RoBERTa-BASE with feature-enriched contextual representations (the RoBERTa-BASE (FEAT) domain model, 0.8391) and RoBERTa-LARGE with multi-task learning (the RoBERTa-LARGE (MTL) domain model, 0.8371). We therefore combined these three single models in different ways for our submissions, and the ensemble that performed best on the test set (Ensemble 2 (single word)) was built from these models.

For Subtask 2, the single models that used BERT-BASE outperformed those that used RoBERTa on the development set. Moreover, we noted that using the Subtask 1 dataset as an auxiliary dataset, via multi-step fine-tuning or multi-task learning, greatly improved performance. For instance, BERT-BASE (MSFT) outperformed the BERT-BASE model by 0.0405 Pearson correlation points (0.8370 vs. 0.7965). The ensemble model that performed best on the test set (Ensemble 1 (MWE)) combined multi-step fine-tuning and multi-task learning using BERT (the BERT-BASE (MSFT) and BERT-BASE (MTL) models, respectively) with multi-task learning using RoBERTa (the RoBERTa-LARGE (MTL) model). This ensemble obtained development and test set Pearson scores of 0.8461 and 0.8438, respectively.
Unlike in Subtask 1, we observe that selecting the best epoch and hyper-parameters separately for each domain performed worse on the development set than selecting them without domain distinction. We hypothesize that, due to the small size of the data provided for Subtask 2, selecting the best epoch and hyper-parameters without domain distinction might help avoid overfitting.

Analysis
We briefly analyse our best models' results on the test set for each subtask. Figure 1 (top) compares our best ensemble model's predictions for Subtask 1 (Ensemble 2 (single word)) with the gold answers. We observe that our model often fails to predict correctly for samples with a complexity score below 0.2. We hypothesize this might be due to the skewed distribution of the gold complexity scores for each domain, as shown in Table 4. A possible solution might be to build domain-specific models, and we plan to explore this in future work.
Figure 1 (bottom) compares the best ensemble model's predictions for Subtask 2 (Ensemble 1 (MWE)) with the gold answers. Compared to Subtask 1, the data distributions of the development and test sets of Subtask 2 look more similar, which may explain why the development and test set scores were closer than in Subtask 1 (the best ensemble models obtained development and test set Pearson scores of 0.8570 and 0.7772, respectively, in Subtask 1, and 0.8461 and 0.8438, respectively, in Subtask 2). Table 3 shows examples of successful and poor predictions made by the Ensemble 2 (single word) and Ensemble 1 (MWE) models. Table 4 shows how the performance of these models varies across domains: the Biomedical domain obtained the highest Pearson correlation scores on both subtasks.

Figure 1: Comparison between the predictions of the Ensemble 2 (single word) and Ensemble 1 (MWE) models submitted for Subtask 1 (top) and Subtask 2 (bottom), respectively, and the gold answers. Left: the distribution of the gold complexity scores and of our submission. Right: a scatter plot where the x-axis corresponds to our model's predictions and the y-axis to the gold answers.

Table 3: Examples of successful and poor predictions on the test set by the best ensemble models submitted for each subtask (the Ensemble 2 (single word) and Ensemble 1 (MWE) models). Successful predictions are highlighted in bold.

Conclusion
In this paper, we have presented the Ochadai-Kyodai system submitted to SemEval-2021 Task 1. Our model ranked 7th out of 54 participating teams on Subtask 1, and 8th out of 37 teams on Subtask 2. We proposed an ensemble model that leverages pretrained representations (i.e., BERT and RoBERTa), multi-step fine-tuning, multi-task learning, and adversarial training, and we further proposed to enrich contextual representations by incorporating hand-crafted features during training. In future work, we plan to further improve our model to better handle data-constrained and domain-shift scenarios.