Self-Supervised Curriculum Learning for Spelling Error Correction

Spelling Error Correction (SEC), which requires high-level language understanding, is a challenging but useful task. Current SEC approaches normally follow a pre-training then fine-tuning procedure that treats all data equally. By contrast, Curriculum Learning (CL) utilizes training data differently during training and has shown its effectiveness in improving both performance and training efficiency in many other NLP tasks. In NMT, a model's performance has been shown to be sensitive to the difficulty of training examples, and CL has proven effective in addressing this. In SEC, the data from different language learners are naturally distributed at different difficulty levels (some errors made by beginners are obvious to correct, while some made by fluent speakers are hard), and we expect that designing a corresponding curriculum for model learning may likewise help training and bring about better performance. In this paper, we study how to further improve the performance of the state-of-the-art SEC method with CL, and propose a Self-Supervised Curriculum Learning (SSCL) approach. Specifically, we directly use the cross-entropy loss as the criterion for: 1) scoring the difficulty of training data, and 2) evaluating the competence of the model. In our approach, CL improves the model training, which in return improves the CL measurement. In experiments on the SIGHAN 2015 Chinese spelling check task, we show that SSCL is superior to previous norm-based and uncertainty-aware approaches, and establish a new state of the art (74.38% F1).


Introduction
Spelling Error Correction (SEC) aims to automatically correct spelling errors in written text at either the word level or the character level (Yu and Li, 2014; Zhang et al., 2015; Wang et al., 2018; Hong et al., 2019; Wang et al., 2019a).
Although it is a valuable natural language application, SEC is a challenging task that requires high-level language understanding.
Curriculum Learning (CL) (Bengio et al., 2009) facilitates model training in an easy-to-hard order. Previous studies (Kocmi and Bojar, 2017; Platanios et al., 2019) use sentence length or word rarity for CL, but merely consider surface features of sentences, which cannot fully reflect how challenging the data are for a model. SEC data difficulty is influenced by many factors, such as sentence length, word rarity, and a great diversity of errors. In addition, previous CL approaches require careful design of data difficulty measures and training curricula. Ruiter et al. (2020) show that self-supervised learning is a curriculum learner, which might be useful to avoid such efforts. In this paper, we propose a novel Self-Supervised CL (SSCL) approach that evaluates data difficulty from the model's perspective and automatically arranges curricula for the model. Specifically, we use the training loss as the measurement of data difficulty (i.e., data with higher loss are harder to learn), and evaluate model competence based on the loss reduction during training (i.e., a model checkpoint with lower loss has higher performance). We expect CL to improve the model training, which in return improves the CL measurements in a virtuous circle.
Our main contributions are as follows:
• We propose a novel SSCL approach which avoids human design of CL measurements to improve the SOTA SEC model;
• We empirically show that our SSCL approach is better than the previous norm-based and uncertainty-aware CL approaches, and establish a new SOTA (74.38% F1) on the SIGHAN 2015 spelling check task.

5: Generate training subset D_t = {⟨x^n, y^n⟩ | d̂(⟨x^n, y^n⟩) ≤ ĉ(t), ⟨x^n, y^n⟩ ∈ D}.
6: Compute instance-level data weights W_d = {w_d(⟨x^n, y^n⟩, t) | ⟨x^n, y^n⟩ ∈ D_t}.
7: Compute token-level data weights W_t = {w_t(x^n_i, y^n_i, t) | ⟨x^n_i, y^n_i⟩ ∈ ⟨x^n, y^n⟩, ⟨x^n, y^n⟩ ∈ D_t}.
8: Update θ with the loss of examples E_{⟨x^n, y^n⟩ ∼ D_t} calculated with W_d, W_t and Eq. 6.
9: end while
10: return θ

Self-Supervised Curriculum Learning

Curriculum learning requires evaluating data difficulty and model competence during training, so as to selectively feed the model data whose difficulty matches its current ability. The procedure is shown in Algorithm 1. We use the SEC model trained on the 5M synthetic data for one epoch to compute the data difficulty. For every epoch, we first compute the model competence, and then select the instances whose data difficulty is no more than the model competence to train the model. At every training step, we compute data weights for backpropagation.
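To make the per-epoch selection concrete, the following self-contained sketch walks through Algorithm 1's outer loop. Per-example difficulties are given directly, and competence follows a hypothetical linear schedule over epochs, standing in for the loss-based competence of Eq. 7; the function name is ours, not the paper's.

```python
import numpy as np

def curriculum_schedule(difficulties, num_epochs):
    """Yield, per epoch, the indices of the examples admitted into training
    (step 5 of Algorithm 1): those whose difficulty does not exceed the
    current competence. Competence here grows linearly from 1/E to 1 as a
    stand-in for the paper's loss-based schedule."""
    d = np.asarray(difficulties, dtype=float)
    for epoch in range(num_epochs):
        competence = (epoch + 1) / num_epochs   # hypothetical schedule
        yield np.flatnonzero(d <= competence)   # training subset D_t
```

For example, over three epochs with difficulties [0.2, 0.5, 0.9], the admitted subsets grow from {0} to {0, 1} to the full set, realizing the easy-to-hard order.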

Data Difficulty
We use the training loss of each data instance as the measurement of data difficulty. Intuitively, data with a lower loss are easier for the model. For a dataset with N instances ⟨X, Y⟩ = {⟨x^n, y^n⟩}_{n=1}^{N}, where x^n and y^n are the input and the reference respectively, SSCL measures the data difficulty by the training loss.
Following previous work, we use the Cumulative Density Function (CDF) to map the distribution of data difficulty into (0, 1]: the score of more difficult data tends toward 1, while that of easier data tends toward 0.
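As a concrete illustration, the CDF-based difficulty score can be computed from per-instance losses as follows (a minimal sketch; the helper name is ours, and ties are broken by position):

```python
import numpy as np

def difficulty_scores(losses):
    """Map per-instance training losses into (0, 1] via the empirical CDF:
    the hardest (highest-loss) instance scores 1, the easiest scores 1/N."""
    losses = np.asarray(losses, dtype=float)
    ranks = np.argsort(np.argsort(losses))  # 0-based rank of each loss
    return (ranks + 1) / len(losses)
```

For losses [0.1, 0.5, 0.3] this yields scores [1/3, 1, 2/3], so a competence threshold directly corresponds to the fraction of the data admitted into training.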
Rather than directly using the randomly initialized model for the data difficulty evaluation, the SEC model is first pre-trained for one epoch on the full synthetic training set to ensure the evaluation quality at the starting point.
Compared to previous approaches, SSCL has the following advantages:
• It does not require manually designed data difficulty evaluation metrics;
• The evaluation quality of data difficulty improves together with the training of the model.

Data Weight
In the training process of competence-based CL (Platanios et al., 2019), the model treats all the selected data equally, which may overuse the easy data with low difficulty. This is counterintuitive and wastes computational resources. To address this issue, we additionally introduce a weight into the loss function at the instance level, the token level, or both.
Following previous work, the instance-level weight w_d(⟨x^n, y^n⟩, t) is defined in terms of λ_w, a scaling hyperparameter smoothing the data weight, d̂(⟨x^n, y^n⟩), the loss-based data difficulty, and ĉ(t), the model competence at training step t (described in Section 2.3). The weighted training loss of the instance (Eq. 4) encourages the training to pay more attention to more difficult data, which receive higher weights than easier data. Inspired by token-level confidence, we also weigh the tokens of a data instance differently, and define the token-level weight based on the squared token-level cross-entropy loss normalized at the sentence level (Eq. 5), where l(x^n_i, y^n_i, t) stands for the cross-entropy loss of the i-th token of the example ⟨x^n, y^n⟩ at the t-th training step. We ensure all weights are larger than 1 to preserve the gradient norm during backpropagation (Gu et al., 2020).
The token-level weight unties tokens from training instances and encourages the model to pay more attention to the difficult tokens in the sentence.
We consider the combination of the instance-level and token-level weights (Eq. 6).

Model Competence

The model competence ĉ(t) is defined by the loss reduction during training (Eq. 7), where c_0 = 0.01 is the initial competence, l_t denotes the loss reduction during training, l_0 is the total initial loss, and λ_s is a task-independent hyperparameter that controls the length of the curriculum. As l_t increases, the model's training gradually includes increasingly more difficult training data.
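A plausible instantiation of the loss-based competence, in the spirit of the square-root schedule of Platanios et al. (2019), with training progress measured by the relative loss reduction l_t / (λ_s · l_0); the function name and exact form are assumptions rather than the paper's Eq. 7 itself.

```python
import math

C0 = 0.01  # initial competence c_0, as stated in the text

def competence(loss_reduction, initial_loss, lambda_s=0.9):
    """Loss-based model competence in (0, 1]: starts at c_0 and reaches 1
    once the loss reduction l_t covers a lambda_s fraction of the total
    initial loss l_0 (a hypothetical instantiation of Eq. 7)."""
    progress = min(1.0, loss_reduction / (lambda_s * initial_loss))
    return min(1.0, math.sqrt(progress * (1.0 - C0 ** 2) + C0 ** 2))
```

At the start of training (no loss reduction) the competence equals c_0 = 0.01, so only the very easiest data are admitted; a smaller λ_s shortens the curriculum by letting competence reach 1 sooner.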

Settings
We apply CL approaches to the SOTA Soft-Masked BERT model to test their effectiveness.
Soft-Masked BERT is a model architecture for SEC. It employs a Bi-GRU as the detection network and pre-trained BERT (Devlin et al., 2019) as the correction network. The detection network predicts the probability of an error at each position, the correction network predicts the probabilities of error corrections, and the former passes its prediction results to the latter.
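The "soft-masking" connection between the two networks can be sketched as follows: each token embedding is interpolated with the [MASK] embedding according to the detection network's error probability (a minimal NumPy sketch; the helper name is ours).

```python
import numpy as np

def soft_mask(token_embeds, mask_embed, error_probs):
    """Soft-masked input to the correction network: a token the detector is
    sure is erroneous (p=1) becomes the [MASK] embedding, a token judged
    correct (p=0) keeps its own embedding, and intermediate probabilities
    interpolate between the two."""
    p = np.asarray(error_probs, dtype=float)[:, None]
    return p * np.asarray(mask_embed) + (1.0 - p) * np.asarray(token_embeds)
```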
Experiments were conducted on the SIGHAN 2015 Chinese spelling check task, and we followed previous work for the experimental settings. Models were first pre-trained on 5M synthetic data, and then fine-tuned on the SIGHAN data. Parameters were initialized under the Lipschitz constraint (Xu et al., 2020).
We also compared our SSCL approach with the Norm-Based CL (NBCL) (Liu et al., 2020) and Uncertainty-Aware CL (UACL) approaches. NBCL uses the norm of word embeddings to measure the difficulty of a sentence, the competence of the model, and the weight of the sentence. UACL uses the average cross-entropy of the words in an example as its data difficulty, and exploits the variance of the output distributions over Monte Carlo Dropout (Gal and Ghahramani, 2016) samples of the model's output probabilities to represent model uncertainty.
Performance of different approaches was evaluated by the sentence-level accuracy, precision, recall, and F1 score.

Main Results
The results of our approach and the baselines are shown in Table 1. Our SSCL approach brings about more improvements than both NBCL (+1.17% F1) and UACL (+0.65% F1), indicating that our automatic SSCL is superior to the previous approaches that require careful design of data difficulty and training curricula, and it establishes a new SOTA (74.38% F1).

Effects of Instance-Level Weight and Token-Level Weight
We carried out an ablation study on the instance-level weight and token-level weight mechanisms.
The results are shown in Table 2.
Table 2 shows that the instance-level weight brings more improvement (+0.80% F1) than the token-level weight, but the two are complementary, and their combination leads to the best performance.

Effects of Hyperparameter λ s
We study the effects of the hyperparameter λ s (in Equation 7), and the results are shown in Table 3.
A larger λ_s value means a longer, more elaborate CL process for the model. Table 3 shows that the highest F1 score was obtained with λ_s = 0.90, which indicates that 0.9 is a suitable value for learning with the curriculum.

Related Work
Spelling Error Correction. SEC is helpful for many applications, such as essay scoring (Burstein and Chodorow, 1999), search (Martins and Silva, 2004; Gao et al., 2010), Optical Character Recognition (OCR) (Afli et al., 2016), and machine translation and tagging (Heigold et al., 2018), and many studies have been conducted on the task. Unsupervised approaches using language models and rules (Yu and Li, 2014; Tseng et al., 2015) are widely adopted. In machine learning approaches, SEC is treated as a sequence labeling problem, commonly using conditional random fields or hidden Markov models (Tseng et al., 2015; Zhang et al., 2015).

Curriculum Learning. CL (Bengio et al., 2009) aims to facilitate model training in an easy-to-hard order, which leads to improved model performance (Tsvetkov et al., 2016; Sachan and Xing, 2016; Amiri et al., 2017). Many studies adopt CL in reinforcement learning to optimize model parameters (Saito, 2018; Kumar et al., 2019). CL has also been shown useful for data processing to improve the quality of training data (Huang and Du, 2019). Recently, CL has been widely employed in machine learning for NLP. It improves the performance and training efficiency of NMT models based on linguistic features (Wang et al., 2020a), enhances multi-domain correlation, and addresses the domain imbalance issue (Wang et al., 2020b). It has also been explored in other tasks, such as response generation (Shen and Feng, 2020) and reading comprehension (Tay et al., 2019).
Self-Supervised Learning. The basic idea of self-supervised learning (SSL) is to automatically generate or find supervision signals to solve tasks. For instance, it is used to learn representations from unlabeled data (Raina et al., 2007; Bengio et al., 2013). Tang et al. (2019) use SSL to mine useful attention supervision information from the training corpus to refine attention mechanisms. Kedia and Chinthakindi (2021) combine SSL with pseudo-labels and meta-learning during inference to improve generalization. Ruiter et al. (2019) use an emergent NMT system to simultaneously select training data and learn internal NMT representations in an SSL manner without parallel data. SSL has also been adopted to solve many other problems, such as document-level context modeling and sentence summarization (West et al., 2019; Wang et al., 2019b), dialogue learning (Wu et al., 2019), mitigating data scarcity and labeling costs (Fu et al., 2020; Yuan et al., 2020), and generating meta-learning tasks from unlabeled text (Bansal et al., 2020).
Comparison to Previous Work. Compared to previous CL studies, we apply SSL to CL and propose SSCL that uses the model to measure data difficulty for training instance selection in an easy-to-hard order. Compared to previous SEC approaches, we employ SSCL for the training of SEC, which establishes a new SOTA (74.38% F1) on the SIGHAN 2015 Chinese spelling check task.

Conclusion
In this paper, we applied curriculum learning to spelling error correction and presented a novel Self-Supervised Curriculum Learning (SSCL) method.
We verified the effectiveness of the SSCL approach on the SIGHAN 2015 Chinese spelling check task. Experimental results show that SSCL significantly improves the performance of the state-of-the-art Soft-Masked BERT model and establishes a new state-of-the-art performance (74.38% F1). The fact that SSCL brings about more improvement than the previous norm-based and uncertainty-aware CL approaches further supports its effectiveness as a CL approach.