Learning to Detect Noisy Labels Using Model-Based Features

Label noise is ubiquitous in various machine learning scenarios such as self-labeling with model predictions and erroneous data annotation. Many existing approaches are based on heuristics such as sample losses, which might not be flexible enough to achieve optimal solutions. Meta learning based methods address this issue by learning a data selection function, but can be hard to optimize. In light of these pros and cons, we propose Selection-Enhanced Noisy label Training (SENT) that does not rely on meta learning while having the flexibility of being data-driven. SENT transfers the noise distribution to a clean set and trains a model to distinguish noisy labels from clean ones using model-based features. Empirically, on a wide range of tasks including text classification and speech recognition, SENT improves performance over strong baselines under the settings of self-training and label corruption.


Introduction
State-of-the-art deep neural networks require large amounts of annotated training data. Though the success of large pre-trained models (Devlin et al., 2018) alleviates such requirements, high-quality labeled data are still crucial for obtaining the best performance on downstream tasks. However, it is extremely expensive to acquire large-scale annotated data for every new task.
To overcome the challenge of rigorous data requirements, recent works utilize weak labels for supervision, including heuristic rules (Augenstein et al., 2016; Bach et al., 2019; Awasthi et al., 2020), large-scale datasets with cheap but noisy labels (Li et al., 2017; Lee et al., 2018), and self-training (Park et al., 2020; Wang et al., 2020). In self-training, one trains a model on a labeled dataset and then predicts on a large amount of unlabeled data. Then the clean labeled data and pseudo-labeled data are combined to further train the model.
The aforementioned sources of supervision share two important characteristics: first, they learn from noisy labels; second, the noise depends on data features. Thus, they can be unified into the framework of noisy label learning, for which numerous approaches have been proposed to reduce the negative impact of noise. However, there are three problems in prior work on noisy label learning. First, many existing approaches are based on heuristics such as sample losses, which are not flexible enough (Han et al., 2018; Yu et al., 2019; Zhou et al., 2020; Song et al., 2019). Second, many previous works require prior knowledge of the noise distribution of the dataset to adjust hyperparameters, which is often not available in real-world applications (Song et al., 2019, 2020). Third, meta learning based methods avoid the previous problems but suffer from optimization difficulties (Ren et al., 2018; Zheng et al., 2021) such as longer training time, heavy hyperparameter tuning, and an unstable convergence process. To address the above problems, we propose a simple data-driven approach which does not rely on meta learning while remaining flexible.
Our contributions are summarized as follows:
• We propose a simple yet effective de-noising approach which avoids the optimization difficulty of meta learning while enjoying the flexibility of being data-driven.
• We unify the settings of both self-training and label corruption into a noisy label learning framework and demonstrate the effectiveness of our approach under both settings.
• Our approach improves performance over state-of-the-art baselines on a wide range of datasets, covering text classification and automatic speech recognition (ASR). Last but not least, our approach achieves even larger gains on few-shot learning.
2 Related Work

Self-training
Self-training is a powerful learning method that enables models to learn from huge amounts of unlabeled data by generating weak labels through either teacher model predictions or heuristic rules. Self-training has been shown to be effective in many scenarios, including image classification (Yalniz et al., 2019), text classification (Li et al., 2019), and machine translation (Wu et al., 2019). However, noise contained in weak labels can largely hinder the performance of self-training. Recently, Xie et al. (2020) improved self-training on image classification by injecting noise into the student model, an approach called NoisyStudent, and Park et al. (2020) customized NoisyStudent for automatic speech recognition. One problem with self-training is error propagation (Zou et al., 2019); in other words, pseudo labelling on unlabeled data might bring noise into the training set, which degrades further training. Most previous work simply sets a fixed threshold to filter samples with low confidence (Sohn et al., 2020; Xie et al., 2020). Wang et al. (2020) used meta learning for adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels. Zhang et al. (2021) used a curriculum learning approach to re-weight unlabeled data according to the model's learning status. In our work, we alleviate error propagation from another perspective, by learning a selection model on a clean dataset using model-based features.

Noisy Label Learning
Learning from noisy labels has long been a research area. One of the most classical approaches is to add a noise adaptation layer on top of the main model to learn a label transition matrix for label correction (Goldberger et al., 2017). Bootstrapping (Reed et al., 2014) introduces the notion of perceptual consistency: a model predicts correct labels for noisy samples before overfitting to the noisy labels. Co-Teaching (Han et al., 2018) and Co-Teaching+ (Yu et al., 2019) train two networks, with each network selecting its small-loss samples as clean samples for its peer network. However, the aforementioned approaches only deal with class-dependent noise (CDN) and make the strong assumption that the noise distribution is independent of each instance, which is not flexible enough for many cases. SEAL (Chen et al., 2020) goes beyond previous work to consider instance-dependent noise (IDN), which is more realistic and common than CDN on real-world datasets. SELFIE (Song et al., 2019) uses a hybrid approach that selects refurbishable samples based on the entropy of model predictions and then refurbishes the labels with model predictions. RoCL (Zhou et al., 2020) utilizes curriculum learning that starts with easy and clean samples and gradually moves to data with pseudo labels produced by a time-ensemble. However, both SELFIE and RoCL require prior knowledge of the noise distribution of the dataset and manual hyperparameter adjustment. To avoid such efforts, meta learning has been introduced to learn selection and refurbishment.
Learning to Re-weight (Ren et al., 2018) is a meta learning algorithm that learns to assign weights to training examples based on their gradient directions. Meta-Weight-Net (Shu et al., 2019) parameterizes the reweighting function as a multilayer perceptron. Meta Label Correction (Zheng et al., 2021) trains the target model with corrected labels generated by a label correction model trained on clean validation data; the two models are jointly trained by solving a bi-level optimization problem. These meta learning algorithms afford a large degree of flexibility by directly optimizing a reliable objective. However, meta learning based models are known to be sensitive to hyperparameter tuning and the quality of support data (Agarwal et al., 2021), and they suffer from optimization difficulties because they are trained by propagating second-order gradients (Hospedales et al., 2020).
The major differences between our approach and previous methods are as follows. Compared with meta learning based models, our approach does not suffer from optimization difficulties. Compared with models under CDN assumptions, our approach can handle IDN settings. Compared with other state-of-the-art IDN methods such as SELFIE and RoCL, the selection strategy in our approach is learnable with model-based features, and our approach does not require prior knowledge of the noise distribution. Last but not least, we unify the settings of self-training and label corruption in the framework of noisy label learning and conduct extensive experiments on both settings.

Method
Now we present our approach, SENT (Selection-Enhanced Noisy label Training), which learns to select a subset from a noisy dataset and uses only the selected subset for training to reduce label noise. The core idea is to transfer the noise distribution so that both clean and noisy labels are available on a data subset. A selection model is trained to distinguish clean labels from noisy ones and then applied to select a clean subset from the noisy dataset.
Formally, given a noisy training dataset D_train = {(x_i, y_i) | 1 ≤ i ≤ N} that is corrupted following some unknown noise distribution P, our approach learns a selection strategy f to select a clean subset D'_train = {x_i | y_i = y*_i, i = 1, 2, ..., N} for training. Here (x_i, y_i) is a training sample, y_i is the noisy label, y*_i is the corresponding unknown true label, and N is the size of the training set. Let M be our main model, which is trained on the noisy dataset D_train and performs certain tasks such as text classification or speech recognition. We also have a selection model S trained on a small clean development dataset D_select; the task of S is to learn the selection strategy f. There are two main stages in our approach: noise transfer and selection learning.

Noise Transfer
Now we describe the noise transfer stage. Given the corrupted training set D_train, we learn the unknown noise distribution P and transfer the noise to D_select. We train a model M on D_train until full convergence, so that M predicts a (noisy) label given an input; M is now assumed to have captured the unknown distribution P in its parameters. We then use M to produce noisy labels on D_select by making predictions, and we argue that the noise on D_select approximately follows the noise distribution P. Formally, the development set is now D_select = {(x_i^select, y_i^select, y_i^select*)}, where (x_i^select, y_i^select*) is the original clean development sample and y_i^select is the noisy label predicted by M.
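As a minimal sketch of this step (the `main_model_predict` function and the toy data below are hypothetical stand-ins, not the paper's actual models), noise transfer simply relabels the clean selection set with the converged main model's predictions while keeping the clean labels for later use:

```python
def transfer_noise(main_model_predict, d_select):
    """Relabel the clean selection set D_select with predictions from the
    main model M (trained to convergence on the corrupted D_train), so that
    each sample carries both a noisy and a clean label."""
    transferred = []
    for x, y_clean in d_select:
        y_noisy = main_model_predict(x)  # M's prediction approximates noise P
        transferred.append((x, y_noisy, y_clean))
    return transferred

# Toy stand-in for M: flips the label to 1 for inputs >= 3, mimicking
# instance-dependent noise (the error depends on the input itself).
predict = lambda x: 1 if x >= 3 else 0
noisy_select = transfer_noise(predict, [(1, 0), (2, 0), (4, 0), (5, 1)])
```

After this step, every selection sample carries both labels, which is exactly what the selection-learning stage below needs to build its training targets.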

Selection Learning
We model the selection learning stage as a binary classification task. On D_select, a model S is trained to classify whether a label is clean or noisy. In our approach, S is a multi-layer perceptron with one hidden layer, which takes a pre-calculated 5-dimensional feature vector as input and outputs a binary classification probability. Given a sample (x_i^select, y_i^select, y_i^select*) from the development set, the selection training sample is defined as (sg_i, sr_i), where sg_i is a 5-dimensional feature vector for the i-th sample. We will discuss how to compute the 5-dimensional feature in later sections. Meanwhile, sr_i is the corresponding true selection result, defined as

sr_i = 1 if y_i^select = y_i^select*, and sr_i = 0 otherwise.

In other words, if the i-th sample has a clean label, sr_i = 1, and otherwise zero. Let P_select(sr_i | sg_i) denote the probability given by the selection model S. The loss function for each sample can be written as:

ℓ_i = − log P_select(sr_i | sg_i). (1)
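Concretely, the selection targets and the per-sample loss can be sketched as follows (a minimal illustration with our own function names; the probability `p_clean` would come from the selection model S):

```python
import math

def selection_target(y_noisy, y_clean):
    """sr_i = 1 if the transferred (noisy) label agrees with the clean label."""
    return 1 if y_noisy == y_clean else 0

def selection_loss(p_clean, sr):
    """Binary cross-entropy -log P_select(sr | sg) for one sample, where
    p_clean = P_select(sr = 1 | sg) is the selection model's output."""
    p = p_clean if sr == 1 else 1.0 - p_clean
    return -math.log(p)
```

For example, a sample whose noisy label matches its clean label gets target sr = 1, and a confident selection model (p_clean = 0.9) pays only a small loss on it.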

Model-Based Features
Now we discuss how to compute the 5-dimensional feature sg_i for each selection sample x_i^select. The first feature is called the instant loss. Given a sample (x_i, y_i), let P(y_i | x_i) be the predicted probability from the main model M. The instant loss (IL) for the sample is defined as:

IL_i = − log P(y_i | x_i) (2)

Intuitively, a larger instant loss indicates a possibly noisier sample, because the model M has low confidence in predicting the label. However, as training proceeds, the model will overfit some of the noisy labels. As a result, the instant loss will decrease for noisy samples as well. To address this issue, following Zhou et al. (2020), we additionally use an exponential moving average (EMA) loss to better differentiate noisy and clean samples. In the t-th training epoch, the EMA loss (EMAL) for the i-th sample is defined as:

EMAL_i(t) = γ · EMAL_i(t−1) + (1 − γ) · IL_i(t) (3)

where γ ∈ [0, 1] is a discounting factor. Intuitively, a larger EMAL represents a possibly noisier sample, as the model has lower confidence over the training history. Song et al. (2019) have shown that the entropy of model predictions is a strong indicator for differentiating noisy and clean samples, as noisy samples tend to have larger entropy. Here we adopt two entropy signals as additional features: Instant Entropy (IE) and History Entropy (HE); the calculation follows Song et al. (2019). The IE feature at the t-th epoch is computed as

IE_i(t) = −(1/τ) Σ_y P(y | x_i) log P(y | x_i) (4)

where τ = − log(1/k) is a normalizing factor with k being the number of labels. For HE, let ŷ_i^t be the predicted label of the i-th sample at epoch t and let H_i(t) = {ŷ_i^0, ŷ_i^1, ..., ŷ_i^t} be the prediction history over the first t epochs. We then form an empirical distribution P̂(y | H_i(t)), which equals the ratio of prediction y in the first t epochs. The HE feature at the t-th epoch is computed as

HE_i(t) = −(1/τ) Σ_y P̂(y | H_i(t)) log P̂(y | H_i(t)) (5)

We explore another informative feature inspired by Han et al. (2019), who discovered that distances between low-level features and high-level features in a convolutional model are larger on noisy samples than on clean samples. We find this also holds for Transformer models. Thus, we adopt the cosine similarity between the hidden states of the first layer and the last layer as another feature. Formally, FLS (First-Last Similarity) is defined as

FLS_i = (1/n_i) Σ_j cos(h_ij^F, h_ij^L) (6)

where h_ij^F and h_ij^L represent the hidden states of the j-th token in the first and last layers respectively, n_i is the number of tokens, and cos refers to cosine similarity. We normalize the FLS feature into the range [0, 1].
Finally, we concatenate all the above features as the input to the selection model S:

sg_i = [IL_i, EMAL_i, IE_i, HE_i, FLS_i]

where [,] denotes concatenation.
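Four of the five features can be sketched in a few lines (FLS is omitted here because it requires the model's hidden states; the function names and signatures are our own illustration, not the paper's code):

```python
import math

def instant_loss(p_label):
    """IL: negative log-probability the main model assigns to the given label."""
    return -math.log(p_label)

def ema_loss(prev_emal, il, gamma=0.9):
    """EMAL: exponential moving average of the instant loss over epochs,
    with discounting factor gamma in [0, 1]."""
    return gamma * prev_emal + (1.0 - gamma) * il

def _norm_entropy(probs, k):
    """Entropy normalized by tau = log(k) so the value lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)

def instant_entropy(pred_dist, k):
    """IE: normalized entropy of the current prediction distribution."""
    return _norm_entropy(pred_dist, k)

def history_entropy(pred_history, k):
    """HE: normalized entropy of the empirical distribution of predicted
    labels over the epochs seen so far."""
    counts = {}
    for y in pred_history:
        counts[y] = counts.get(y, 0) + 1
    emp = [c / len(pred_history) for c in counts.values()]
    return _norm_entropy(emp, k)
```

For a binary task (k = 2), a sample whose predictions flip between the two classes every epoch gets history entropy 1, while a consistently predicted sample gets 0, matching the intuition that unstable samples are likelier to be noisy.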

Overall Training Procedure
We now present the training procedure with pseudo code in Algorithm 1 and an illustration in Figure 1. In the first box of Figure 1, we clarify our data setting, where a clean set D_select and a corrupted set D_train are the inputs to our method. In the second box, we learn the noise distribution P of the training set by fitting model M on D_train and then transfer P to D_select; after this step, we have both noisy and clean labels on D_select (lines 2 to 4 in Algorithm 1). We then move to the repeated training stage for models S and M (boxes 3 and 4): we use M to infer the aforementioned signals on D_train and D_select, train the selection model S on D_select, apply S to select a clean subset D'_train from D_train, and train M on D'_train. D'_train is not guaranteed to be entirely clean but is expected to be cleaner than D_train. This repeated training phase corresponds to lines 6 to 13 in Algorithm 1.

Adaptation to Self-Training
Our framework can be directly applied to learning scenarios with noisy labels. In this section, we further discuss how to adapt it to self-training, a classic semi-supervised learning paradigm. Generally, self-training trains a teacher model on labeled data to infer pseudo labels on unlabeled data and adds them back to the original training set. A student model is then trained on the combined data. After that, the student becomes the new teacher and the above process is repeated. Obviously, the process of pseudo labelling introduces noise, since it cannot ensure 100% accuracy on the predicted labels; it therefore fits the nature of our proposed framework. Specifically, the noise distribution P in self-training is known, because the source of noise is the teacher model. Thus, it is natural to directly leverage the teacher model to infer noisy labels on D_select. The whole pipeline of adapting our framework to classic self-training is displayed in Figure 2, and the corresponding algorithm is shown in Algorithm 2 in the Appendix. After training the teacher model on labeled data L, we use the teacher to infer on the unlabeled data U and the dev set D_select. This is followed by training the selection model based on the signals of D_select. Then, we use the selection model to predict on the unlabeled data U to decide whether to keep each pseudo-labeled sample. Finally, we train the student model on the combination of the original labeled data and the selected samples from the unlabeled data. The above procedure is repeated as in classical self-training.
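One round of this adapted procedure can be sketched as below (all arguments — `train_fn`, `predict_fn`, `select_fn` — are hypothetical stand-ins for the teacher/student training, teacher inference, and the learned selection model S, and the toy data is ours):

```python
def self_training_round(train_fn, predict_fn, select_fn, labeled, unlabeled):
    """One round of selection-enhanced self-training (a sketch).

    The teacher both produces pseudo labels on the unlabeled pool and, since
    it is the known noise source, would also supply the noisy labels on
    D_select used to train the selection model (folded into select_fn here).
    """
    teacher = train_fn(labeled)
    pseudo = [(x, predict_fn(teacher, x)) for x in unlabeled]
    kept = select_fn(teacher, pseudo)      # S filters pseudo-labeled data
    student = train_fn(labeled + kept)     # retrain on clean + selected data
    return student, kept

# Toy stand-ins: the "model" is just its training data; predictions follow a
# fixed rule; the selector keeps only positively-labeled pseudo samples.
labeled = [(1, 1), (-1, 0)]
student, kept = self_training_round(
    train_fn=lambda data: data,
    predict_fn=lambda model, x: 1 if x > 0 else 0,
    select_fn=lambda model, pseudo: [p for p in pseudo if p[1] == 1],
    labeled=labeled,
    unlabeled=[2, -2],
)
```

In the full method, the round is repeated with the student promoted to teacher, exactly as in classical self-training.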

Experiments

Overview
To verify the effectiveness of SENT, we conduct extensive experiments on text classification and automatic speech recognition (ASR) benchmarks.
We use text classification tasks to evaluate the self-training setting. We perform finetuning based on the BERT (Devlin et al., 2018) model. The tasks include spam classification on YouTube comments and SMS messages, question classification on TREC, topic classification on AG News (Zhang et al., 2015), and sentiment classification on IMDB movie reviews (Maas et al., 2011). Our data splits follow previous work (Karamanolakis et al., 2021). Related details are shown in Table 1. For the SMS and TREC datasets, we consider two separate versions.
The datasets marked with * have smaller development sets D_eval and D_select, while the ones without * have larger development sets. Smaller development sets are more challenging for noisy label learning because selection learning has to be performed on a smaller clean set. We use these two versions to test the robustness of our approach.
For ASR in the label corruption setting, we use AISHELL-1 (Bu et al., 2017) as the benchmark. Our approach and the other baselines are built on top of an encoder-decoder transformer network; details of the model configuration can be found in the Appendix. Following prior work, we model IDN using DNNs' prediction error (Du and Cai, 2015; Menon et al., 2018). Specifically, we train three small transformer models to corrupt the training set to different corruption levels: hard, medium, and easy. The higher the error rate, the harder the corrupted dataset. In the following experiments with SENT, the prior noise information (i.e., how the training set is corrupted) is assumed to be unknown. Related statistics are shown in Table 8 in the Appendix.
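The corruption scheme can be illustrated as follows (a toy sketch with a hypothetical `weak_predict`; the paper's actual corruption uses small transformer ASR models):

```python
def corrupt_with_model(weak_predict, dataset):
    """Instance-dependent corruption: replace each gold label with a weak
    model's prediction. Samples the weak model gets wrong become noisy, so
    the noise depends on the input itself rather than only on the class,
    and the noise rate equals the weak model's error rate."""
    corrupted = [(x, weak_predict(x)) for x, _ in dataset]
    n_flipped = sum(1 for (_, y), (_, y_noisy) in zip(dataset, corrupted)
                    if y != y_noisy)
    return corrupted, n_flipped / len(dataset)

# A very weak stand-in model that always predicts class 0.
data = [(1, 0), (2, 1), (3, 1), (4, 0)]
corrupted, noise_rate = corrupt_with_model(lambda x: 0, data)
```

Training weaker or stronger corruption models then yields the hard, medium, and easy noise levels described above.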

Evaluation
For text classification, we report micro F1 for SMS and accuracy for the rest of the datasets.For ASR, we follow Bu et al. (2017) to use the character error rate (CER) for evaluation.
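For reference, CER is the edit distance between the hypothesis and reference character sequences divided by the reference length; a minimal implementation is sketched below (standard dynamic programming, not the paper's evaluation code):

```python
def cer(ref, hyp):
    """Character error rate: (substitutions + insertions + deletions)
    between hypothesis and reference, divided by the reference length."""
    n = len(hyp)
    prev = list(range(n + 1))  # edit distances from the empty ref prefix
    for i in range(1, len(ref) + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n] / len(ref)
```

A lower CER indicates a better model; for example, one wrong character in a three-character reference gives CER = 1/3.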
Since our approach relies on an additional clean set D select , we split the normal development set into two halves.We use one half as D select for selection learning and the other half D eval for standard model tuning, so as to set up fair comparison with the baselines.

Baselines
For text classification, we compare with the following baselines: (a) "Supervised" refers to supervised learning using only labeled data; (b) "self-train" is standard self-training that utilizes both labeled and unlabeled data for iterative training; (c) "self-train (thres)" is self-training that uses the development set to select a confidence threshold for filtering pseudo-labeled data; (d) "noisy student" (Xie et al., 2020) adds dropout noise to the student model in self-training; (e) "co-teaching+" (Yu et al., 2019) uses two neural networks that select small-loss samples for each other and applies a disagreement strategy; (f) "L2R" (Ren et al., 2018) learns to re-weight noisy labels via meta learning; (g) "SELF" (Nguyen et al., 2019) utilizes self-ensemble predictions to progressively remove noisy labels. We also evaluate the performance when combining noisy student and our method, denoted as "ours + noisy".

Results in Text Classification.
As shown in Table 2, the self-training baseline improves text classification performance. In comparison, our selection approach stably lifts performance further, which shows that the selection model has learned an informative selection strategy. Last, although using SENT alone outperforms self-train and noisy student, our approach can be combined with the noisy student approach to achieve even better performance. Overall, this combined approach achieves the best performance on all the datasets we consider.

Results in ASR. Table 3 shows that Co-Teaching+ is not able to handle our setting, as it only achieves results similar to the vanilla model. L2R is effective in our problem setting with improved performance; however, meta learning based L2R underperforms RoCL and SELFIE. Compared with the above baselines, our approach consistently excels at all error levels, which demonstrates its effectiveness.

Empirical Analysis
Case Study: Key Metrics During Self-Training. To gain a deeper understanding of how our method improves over traditional self-training, we investigate some key metrics regarding pseudo labelling and selection performance. We consider self-train, self-train (thres), and our model. Specifically, we display the accuracy of pseudo labelling on unlabeled data (Pse-Acc.), the precision of sample selection (Sel-Pre.), the recall of sample selection (Sel-Rec.), and the number of selected samples in the best round (i.e., the training round that achieves the best performance in the repeated self-training process). As can be seen in Table 4, our approach achieves better pseudo labeling accuracy. This is because our approach obtains a more balanced trade-off between selection precision and selection recall compared to self-training. Because of a more rigorous selection model, our approach tends to only select samples with a higher probability of having clean labels, i.e., increasing the selection precision. This is also reflected in the small decrease in the number of selected samples. We believe this is crucial for mitigating error propagation (Xie et al., 2020).

Ablation Study: Substituting with Simpler Models. To further test the robustness of our approach, we substitute the base model from BERT to simple multi-layer perceptrons (MLPs) without pretraining.
As shown in Table 5, performance decreases after switching to the simpler MLPs. However, our approach remains effective compared to the other baselines; both the relative gain and the absolute improvement from "Supervised" to our approach are still significant.
Ablation Study: Features. We study the effects of the model-based features introduced in Section 3.3 with an ablation. Note that we did not use FLS for ASR because the first layer and the last layer have different lengths. As shown in Table 6 and Table 7, all of the features contribute to the final performance. Among the features, adding the instant loss (IL) feature yields the largest relative gain. We also run similar experiments on the text classification tasks. As seen in Table 7, basic feature engineering can improve the final performance. Among all signals, IL again leads to the greatest relative gain, but across all the datasets each signal plays its own role and contributes to the final performance.

Performance Under the Few-shot Setting. LST (Li et al., 2019) has shown that the self-training paradigm can be customized for few-shot classification. Here, we also investigate the effectiveness of our method when applied to the few-shot setting. Specifically, we evaluate on selected text classification datasets (i.e., YOUTUBE, SMS and IMDB). Figure 5 shows that self-train generally performs better than "Supervised" (using only labeled data), while our model achieves the best performance in most cases, indicating the robustness of our method. More results are displayed in Appendix A.3.2.
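As a side note, the selection precision and recall examined in the case study above reduce to a standard set comparison; a sketch (the id sets `selected` and `clean` are hypothetical):

```python
def selection_metrics(selected, clean):
    """Precision and recall of sample selection. `selected` holds the ids
    the selection model kept; `clean` holds the ids whose labels are
    actually correct."""
    tp = len(selected & clean)  # correctly kept clean samples
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(clean) if clean else 0.0
    return precision, recall
```

A more rigorous selector trades recall (it keeps fewer samples) for precision (the kept samples are cleaner), which is the balance discussed above.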

Conclusions
In this paper, we propose SENT to address the problem of label noise. Compared with meta learning based models, our selection model is trained with full supervision using a cross entropy loss, which facilitates convergence. Meanwhile, we model IDN without prior knowledge of the noise distribution. We also unify the settings of self-training and label corruption in the framework of noisy label learning and conduct extensive experiments on both settings. For future work, extending our approach to sequence-level tasks such as named entity recognition (NER) and machine translation will also be interesting.
Besides, the selection model in our framework is feature-based. These features are informative but may be limited in expressivity, which leaves room for further improvement by learning more data-driven features under our framework.

A Appendix
We provide details of our datasets (Section A.1) and experimental results (Section A.2).

A.1.1 Baselines
In the first experiment of text classification, we use the same rule-based baselines and report the same results as ASTRA (Karamanolakis et al., 2021). For the ASR baselines: (a) Vanilla is a plain encoder-decoder transformer network without any denoising modules; all the following baselines and our approach are built on top of this vanilla model. (b) Co-Teaching+ trains two networks, with each network selecting its small-loss samples as clean samples for its peer network. (c) L2R is the same as mentioned before. (d) RoCL starts learning with easy and clean samples and gradually moves to noisy-labeled data with pseudo labels produced by a time-ensemble of the model and data augmentations. (e) SELFIE selects refurbishable samples based on the entropy of model predictions and refurbishes them with model predictions.

A.2 Details of Experiments
For the ASR models, the transformer contains 12 encoder layers and 6 decoder layers. For each transformer block, the number of heads in the multi-head attention module is 4, the dimension of the encoder and decoder input is 256, and the dimension of the feedforward network is 2048. We use 80-dimensional filter bank coefficients as input features. The hyperparameters for training are shown in Table 9. Batch_size_in_s2 means the maximum allowed length of audio in seconds in one batch. History_length represents the maximum allowed length of the stored history of predicted labels; these history predictions are used to calculate entropy.
It should be noted that in both the text classification and ASR tasks, we split the total development set into dev_select and dev_eval, where dev_select is used to train the selection model and dev_eval is used to evaluate it.
Three details in ASR are worth noting. (a) We add a correction module for ASR. The correction module has the same architecture as the selection module and takes the same signals as input, but it outputs three weights which sum to one. The weights are assigned to the noisy label (NL), the model's predicted probabilities (Pred), and the accumulated corrected label, respectively. The corrected label (CL) at the T-th epoch is:

CL(T) = w_NL · NL + w_Pred · Pred + w_CL · CL(T−1)

After correction, we apply the normal selection module to select clean labels from the corrected labels. (b) Since ASR is a sequence-level problem, we cannot correct and select each token independently, which would ignore word dependencies. We first align the predicted word sequence to the noisy target sequence according to their longest common subsequence, and then correct and select the tokens that are not in the common subsequence. (c) As ASR is a generation problem where input and output lengths differ, we do not extract FLS as a feature for our approach.
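Under our reading, the correction update above is a convex combination of the three label sources; a minimal sketch (variable names are ours, not the paper's code):

```python
def corrected_label(weights, noisy_onehot, pred_probs, prev_corrected):
    """Corrected label (CL) at epoch T: a convex combination of the noisy
    label (NL), the model's predicted distribution (Pred), and the
    accumulated corrected label CL(T-1). `weights` = (w_nl, w_pred, w_cl)
    is produced by the correction module and must sum to one."""
    w_nl, w_pred, w_cl = weights
    assert abs(w_nl + w_pred + w_cl - 1.0) < 1e-9
    return [w_nl * a + w_pred * b + w_cl * c
            for a, b, c in zip(noisy_onehot, pred_probs, prev_corrected)]

# Toy two-class example: half the weight stays on the noisy label.
cl = corrected_label((0.5, 0.25, 0.25), [1.0, 0.0], [0.2, 0.8], [0.6, 0.4])
```

Because the weights sum to one, the corrected label stays a valid distribution over classes at every epoch.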

Figure 1: The pipeline of our general framework. First, we are given a noisy training set and a small clean set for selection (which also serves as the development set). Second, we transfer the noise from the training set to the selection set. Third, we compute predefined signals on both sets and train our selection model S using the selection set. Fourth, we apply the trained selection model to the training set to distinguish clean samples for training model M. Finally, we repeat steps 3 and 4.
Algorithm 1:
1: Input: M; S; D_train; D_select; total_epochs; pretrain_epochs
2: Train M on D_train. {Learn noise distribution P}
3: Infer on D_select using M. {Noise Transfer}
4: epoch = 0; reinitialize M
5: Pretrain M for pretrain_epochs
6: while epoch < total_epochs do
7:   Infer signals [IL, EMAL, HE, IE, FLS] on D_train and D_select using M
8:   while not early stopping S do
9:     Train S on D_select
10:  end while
11:  Select a clean subset D'_train out of D_train using S
12:  Train M on D'_train
13: end while
Output: M

Figure 2: The pipeline of our general system applied to the classical self-training paradigm.

Figure 3: Evaluation on SMS under the few-shot setting.
(a) Majority predicts the majority vote of the rules, with ties resolved by predicting a random class. (b) Snorkel+Labeled (Ratner et al., 2017) trains classifiers using weakly-labeled data with a generative model. (c) L2R (Ren et al., 2018) learns to re-weight noisy labels from rules by meta learning. (d) PosteriorReg (Hu et al., 2016) uses rules as soft constraints for training by posterior regularization. (e) ImplyLoss (Awasthi et al., 2020) learns both instance-specific and rule-specific weights by minimizing an implication loss. (h) ASTRA (Karamanolakis et al., 2021) introduces a rule attention network to leverage multiple sources of weak supervision with trainable weights to compute soft weak labels for unlabeled data.

Figure 5: Evaluation on YouTube, SMS and IMDB under the few-shot setting.

Figure 6: Mean of signals on the hard-level training and development set.

Figure 7:

Table 1: Statistics of the text classification datasets.

Table 2: Results for text classification in the self-training setting. We compare our approach with baselines under BERT-based models.

Table 3: Comparison with baselines on the AISHELL-1 test set in the label corruption setting. We use CER as the performance metric; a lower CER indicates a better model.

Table 4: Key metrics of pseudo labeling and sample selection during self-training. We report the accuracy of pseudo labeling, the precision and recall of sample selection, and the number of selected samples.

Table 5: Comparison with baselines under the MLP models.

Table 6: Ablation experiments for features on AISHELL-1. 'All' denotes all features; each line removes one feature from the previous line.

Table 7: Ablation experiments for features on text classification. 'All' denotes all signals; each line removes one feature from the previous line. YT represents the YouTube dataset, and AG represents the AG News dataset.
Algorithm 2:
1: Input: Labeled data L; unlabeled data U; total dev data D_total; dev_select data D_select
2: Initialize selected_train_set L_chosen = L
3: Initialize the teacher model M_teacher and selection model S
4: Train the teacher model M_teacher on L_chosen
5: Infer on U and D_select using M_teacher
6: Infer signals [IL, EMAL, HE, IE, FLS] on U and D_select using M_teacher to get sg_unlabeled and sg_select
7: Train S on D_select
8: P_select = S(U, sg_unlabeled)
9: sr = argmax(P_select)
10: L_chosen = L ∪ U[sr]
11: Train the student model M_student on L_chosen
12: Make the student the new teacher, go back to step 5, and repeat

Additional ablation results are shown in Table 11. We can see that the trend is the same as on the test set.

Table 11: Ablation experiments for signals on the dev set of AISHELL-1. Each line removes one signal from the previous line.