Adaptive Textual Label Noise Learning based on Pre-trained Models

The label noise in real-world scenarios is unpredictable and can even be a mixture of different types of noise. To meet this challenge, we develop an adaptive textual label noise learning framework based on pre-trained models, which consists of an adaptive warm-up stage followed by a hybrid training stage. Specifically, an early stopping method, relying solely on the training set, is designed to dynamically terminate the warm-up process based on the model's fit level to different noise scenarios. The hybrid training stage incorporates several generalization strategies to gradually correct mislabeled instances, thereby making better use of noisy data. Experiments on multiple datasets demonstrate that our approach performs on par with or even better than the state-of-the-art methods in various noise scenarios, including scenarios with a mixture of multiple noise types.


Introduction
In recent years, deep neural networks (DNNs) have been successfully applied in many fields (Pouyanfar et al., 2018; Alinejad et al., 2021; Liu et al., 2023), and their performance largely depends on well-labeled data. However, accessing large-scale datasets with expert annotation in the real world is difficult due to the significant time and labor costs involved. Instead, noisy data obtained directly from the real world is often utilized in practical scenarios, even though it inevitably contains some incorrect labels. Thus, research on learning with noisy labels has gained attention in various fields such as natural language processing (NLP) (Jindal et al., 2019; Jin et al., 2021; Wu et al., 2022).
There are two main types of label noise in NLP: class-conditional noise (CCN) and instance-dependent noise (IDN). CCN assumes that label noise depends on the true class, which can simulate the confusion between similar classes like "Game" and "Entertainment". On the other hand, IDN assumes that label noise depends on the instance, which simulates the confusion caused by the characteristics of the instance. For example, a piece of news containing the phrase "played on a pitch as slow as a bank queue" may be misclassified as Economic news instead of Sports news due to the specific wording. However, most studies focus on a particular type of noise. For example, Jindal et al. (2019) introduce a non-linear processing layer to learn the noise transition matrix of CCN. Qiao et al. (2022) design a class-regularization loss according to the characteristics of IDN. However, applying these methods presupposes that the noise is known and of a single type.
In the real world, noise scenarios are more complex and involve a mixture of multiple noise types arising from various factors, such as data ambiguity, collection errors, or annotator inexperience. Methods that specifically target one type of noise are less effective when dealing with other types. This limitation hinders their applicability in real scenarios where the noise is unknown and variable.
To address the challenges posed by real noise scenarios, we develop an adaptive textual label noise learning framework based on pre-trained models. This framework handles various noise scenarios well, including different types of noise and mixed noise types. Specifically, our approach begins with an adaptive warm-up stage, then divides the data into clean and noisy sets by the correctness statistic of samples, and applies different generalization strategies to them. In particular, there are three key designs in our approach. First, the warm-up stage is designed to automatically stop early based on the model's fit level to the noise scenario, which effectively prevents the model from overfitting erroneous labels, especially under IDN scenarios or with a high noise ratio. Notably, the adaptive warm-up method relies solely on the raw training set, rather than a clean validation set, making it more suitable for practical scenarios. Second, unlike previous works (Li et al., 2020; Qiao et al., 2022) that fit a GMM (Permuter et al., 2006) on the training losses to separate data, the data is separated according to the correctness statistic of each sample. The correctness statistic is accumulated by assessing the consistency between the model's predictions and the given labels during the warm-up stage. This prolonged observation provides more accurate grounds for data partitioning. Third, a linear decay fusion strategy is designed to gradually correct potential wrong labels and generate more accurate pseudo-labels by adjusting the fusion weights of the original labels and the model outputs.
We conduct extensive experiments on four classification datasets, considering different types of noise: class-conditional noise, instance-dependent noise, and a mixture of multiple noises. To the best of our knowledge, previous methods have not explored such mixed noise. The experimental results demonstrate that our method surpasses existing general methods and approaches the performance of methods specifically designed for particular noise types in different noise settings. Our contributions can be summarized as follows: • We design an early stopping method for fine-tuning pre-trained models, which adapts to various noise scenarios and datasets and achieves near the best test accuracy by relying solely on the training set.
• We develop an adaptive noise learning framework based on pre-trained models, which can make good use of different types of noisy data while effectively preventing the model from overfitting erroneous labels.
• Experimental results under various noise settings show that our approach performs comparably to or even surpasses the state-of-the-art methods in various noise scenarios, which proves the superiority of our proposed method in practical scenarios.

Related work
Universal Label Noise Learning. Label noise learning methods can be divided into two groups: loss correction and sample selection methods (Liang et al., 2022). Loss correction tries to reduce the effect of noisy labels during training by adding a regularization term to the loss, designing network structures robust to noisy labels, and so on. For example, Wang et al. (2019) add a reverse cross-entropy term to the traditional loss to reduce the disturbance brought by noise. ELR (Liu et al., 2020) adds a regularization term to prevent the model from memorizing noisy labels, since such labels are not fitted in the early training stage. Sample selection divides the data into clean and noisy subsets and uses different methods for each subset. For example, Co-Teaching (Han et al., 2018) maintains two networks. During training, the two networks respectively pick out small-loss samples as clean data for each other to learn. DivideMix (Li et al., 2020) uses a Gaussian mixture model to separate clean and noisy samples. The noisy samples are treated as unlabeled data, whose pseudo-labels are generated by the model. Finally, the MixMatch (Berthelot et al., 2019) method is adopted for mixed training on the clean and noisy sets.
Label Noise Learning in NLP. The above works mainly focus on vision tasks; research on textual scenarios is relatively scarce. Jindal et al. (2019) and Garg et al. (2021) add additional noise modules on top of lightweight models such as CNNs and LSTMs to learn the noise transition probabilities. Several studies (Liu et al., 2022; Tänzer et al., 2021; Zhu et al., 2022; Qiao et al., 2022) conduct research on pre-trained models and find that they demonstrate superior performance in noise learning compared to models trained from scratch. However, most works focus on one certain type of noise such as CCN. Qiao et al. (2022) study both CCN and IDN, but still conduct experiments in settings where the type of noise is known and design a specific regularization loss for IDN.
Few works have focused on general methods for label noise in NLP. Zhou and Chen (2021) develop a general denoising framework for information retrieval tasks, which reduces the effect of noise by adding a regularization loss to samples whose model predictions are inconsistent with the given labels. Jin et al. (2021) propose an instance-adaptive training framework to address the problem of dataset-specific parameters and validate its versatility across multiple tasks. However, these methods rely on additional components, such as supplementary structures or auxiliary datasets, which limits their practicality. In contrast, our method relies solely on the raw noisy training set.

Adaptive warm-up
The goal of this warm-up stage is to obtain a classification model that fits clean data well but not noisy data. As shown in Figure 2, however, the optimal warm-up time may vary significantly across noise scenarios. To meet this challenge, we design an early stopping method that adaptively terminates the warm-up process.
Training data. We directly use the raw noisy dataset D to warm up the classification model and determine when to stop early.
Learning objective. The standard cross-entropy loss is used to warm up the model:

$\mathcal{L}_{warm} = -\frac{1}{N}\sum_{i=1}^{N}\log p(\tilde{y}_i \mid x_i; \theta)$

Overfitting regarding noisy data. Our stopping condition is based on the following Assumption 1 and Assumption 2.
Assumption 1: As learning progresses, the clean data is fitted faster than the noisy data. We empirically demonstrate that clean samples are recalled earlier than noisy samples during learning (see Figure 2), regardless of the noise type. This is mainly caused by the memorization effect (Arpit et al., 2017), whereby DNNs tend to learn simple patterns before fitting noise. As a result, the whole learning process can be roughly divided into a Clean Rising (CR) phase (where most clean samples are quickly learned) and a Noisy Rising (NR) phase (where the noisy samples are fitted slowly), which are differentiated by backgrounds in Figure 2.
Assumption 2: The prediction for noisy data tends to swing between the true label and the noisy label. This originates from another memorization phenomenon regarding noisy samples (Chen et al., 2021). The results in Figure 3 also demonstrate that the predictions for noisy samples exhibit a higher level of inconsistency with the given labels due to the activation of the true label.
Correctness statistic. According to these two assumptions, we update the correctness statistic of the training set at intervals to judge whether the model begins to overfit the noisy data. Specifically, for a given sample $x_i$ with label $\tilde{y}_i$, its correctness coefficient $m_i^t$ is obtained by

$r_i^t = \begin{cases} +1, & \text{if } \arg\max_c p(c \mid x_i; \theta^t) = \tilde{y}_i \\ -1, & \text{otherwise} \end{cases}, \qquad m_i^t = \sum_{t' \leq t} r_i^{t'} \quad (3)$

where the correctness $r_i^t$ indicates whether the prediction for sample $x_i$ is consistent with the given label at moment $t$, and $m_i^t$ is the statistic of correctness within a given time range.
The more correct predictions a sample accumulates relative to incorrect ones, the better the model fits that sample. Intuitively, a positive $m_i^t$ indicates that $x_i$ has been fitted by the model at moment $t$.
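As an illustration, the correctness update can be sketched as follows (a minimal sketch using the $\pm 1$ convention implied above; the function name is ours):

```python
import numpy as np

def update_correctness(m, preds, labels):
    """Accumulate the correctness statistic m_i for each sample.

    r_i = +1 when the model's prediction matches the given label,
    -1 otherwise; summing r_i over monitoring steps means a
    positive m_i suggests the sample has been fitted."""
    r = np.where(preds == labels, 1, -1)
    return m + r

# Toy usage: 4 samples monitored at two moments.
m = np.zeros(4)
m = update_correctness(m, np.array([0, 1, 2, 0]), np.array([0, 1, 1, 0]))
m = update_correctness(m, np.array([0, 1, 1, 2]), np.array([0, 1, 1, 0]))
# m is now [2, 2, 0, 0]
```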
Furthermore, we compute a proportion $ratio^t$, which indicates the sample fitting level for the whole dataset of $N$ samples:

$ratio^t = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[m_i^t > 0]$

Early stopping condition. By observing the change of $ratio^t$, we can determine whether the learning process has entered the NR phase. According to Assumption 1, in the CR phase, $ratio^t$ should increase quickly since the model fits clean samples rapidly. In the NR phase, $ratio^t$ should stay within a certain range for an extended period because the noisy samples are difficult to fit (Assumption 2).
Consequently, if $ratio^t$ stays within a range ε for η consecutive checks, we can assume that the learning process has entered the NR phase. To approach the optimal stopping point, we continue to warm up the model until the training accuracy improves by a certain amount; the magnitude of this improvement is set to a fraction ρ1 of the current remaining accuracy. The pseudo-code for the adaptive warm-up process is shown in Appendix B. Note that this adaptive warm-up process is suitable for different noise scenarios because Assumptions 1 and 2 are widely satisfied across noise types.
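The stopping rule above can be sketched as follows (a hedged sketch: the plateau test and headroom target follow the description, but the exact formulas are our assumption):

```python
def should_enter_nr(ratio_history, eps=0.01, eta=3):
    """Detect the Noisy Rising phase: the fitted-sample ratio has
    stayed within a band of width eps over the last eta checks."""
    if len(ratio_history) < eta:
        return False
    recent = ratio_history[-eta:]
    return max(recent) - min(recent) <= eps

def target_accuracy(acc_at_nr, rho1=0.1):
    """After entering NR, keep warming up until training accuracy
    improves by a fraction rho1 of the remaining headroom."""
    return acc_at_nr + rho1 * (1.0 - acc_at_nr)

# Example: the ratio plateaus around 0.61, so NR is detected;
# with training accuracy 0.8 at that point, stop near 0.82.
```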

Hybrid training
To further leverage the underlying value of noisy samples, we propose a hybrid training method that applies different training strategies to clean and noisy samples, respectively.
Data. Based on the correctness statistic $M = \{m_i^{t'}\}_{i=1}^{N}$ (where $t'$ is the stopping time of the warm-up stage), the whole training set $D = \{(x_i, \tilde{y}_i)\}_{i=1}^{N}$ can be divided into the "clean" set $\hat{D}_c$ and "noisy" set $\hat{D}_n$ as follows:

$\hat{D}_c = \{(x_i, \tilde{y}_i) \mid m_i^{t'} \geq l\}, \qquad \hat{D}_n = D \setminus \hat{D}_c$
where $l$ is the $N\rho_2$-th largest correctness statistic value in $M$, because a larger value indicates that the sample was fitted earlier and has a higher probability of being clean. $\rho_2$ is a given percentage, which is set to 20%.
Pseudo-labeling. To minimize the side-effects of noisy labels, we regenerate labels for both the clean set $\hat{D}_c$ and the noisy set $\hat{D}_n$ through a pseudo-labeling method that combines the original labels and the predictions of the classification model. Note that there may inevitably be some noisy samples in $\hat{D}_c$, albeit fewer than in $\hat{D}_n$.
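The top-$\rho_2$ partition by correctness statistic described above can be sketched as (index-based, illustrative):

```python
import numpy as np

def split_by_correctness(m, rho2=0.2):
    """Return (clean, noisy) index arrays: the top rho2 fraction of
    samples by correctness statistic form the 'clean' set."""
    k = max(1, int(len(m) * rho2))
    order = np.argsort(-m)  # indices sorted by descending m_i
    return order[:k], order[k:]

clean_idx, noisy_idx = split_by_correctness(np.array([5, -3, 2, 0, 4]), rho2=0.4)
# clean_idx holds the samples with m = 5 and m = 4
```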
For each sample $(x_i, \tilde{y}_i)$ in the clean set $\hat{D}_c$, the corresponding pseudo-label $\hat{y}_i$ is obtained by

$\hat{y}_i = w_c^t\, \tilde{y}_i + (1 - w_c^t)\, p(x_i; \theta)$

where the weight $w_c^t$ decays linearly with the training step $t$:

$w_c^t = 1 - (1 - \delta_1)\, t/T$

where $T$ is the total number of training steps of this stage; $w_c^t$ decays from 1 to $\delta_1$ because the labels in the clean set $\hat{D}_c$ are relatively reliable.
Likewise, for each sample in the noisy set $\hat{D}_n$, its pseudo-label $\hat{y}_i$ is obtained by

$\hat{y}_i = w_n^t\, \tilde{y}_i + (1 - w_n^t)\, p(x_i; \theta), \qquad w_n^t = \delta_2\,(1 - t/T)$

where $w_n^t$ decays from $\delta_2$ to 0 because most labels in $\hat{D}_n$ are incorrect.
By linearly decaying the weight of original labels, a good balance is achieved between leveraging the untapped label information and discarding noise.As the accuracy of the classification model's predictions continues to improve, the generated pseudo-labels will be closer to the true labels.
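A minimal sketch of the linear decay fusion, assuming one-hot original labels and a linear weight schedule from w_start to w_end over T steps (the endpoint values below are chosen purely for illustration):

```python
import numpy as np

def fuse_label(y_onehot, pred, step, total_steps, w_start, w_end):
    """Linearly decay the original-label weight from w_start to
    w_end over total_steps and fuse it with the model prediction."""
    w = w_start + (w_end - w_start) * step / total_steps
    return w * y_onehot + (1.0 - w) * pred

y = np.array([1.0, 0.0])   # original (possibly noisy) label
p = np.array([0.3, 0.7])   # current model prediction
early = fuse_label(y, p, step=0, total_steps=100, w_start=1.0, w_end=0.1)
late = fuse_label(y, p, step=100, total_steps=100, w_start=1.0, w_end=0.1)
# early equals y; late leans mostly on the model prediction
```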
Furthermore, for each pseudo-label $\hat{y}_i$, we adopt the sharpen function to encourage the model to generate low-entropy predictions, i.e.,

$\hat{y}_i = \hat{y}_i^{1/\tau} \,/\, \lVert \hat{y}_i^{1/\tau} \rVert_1$

where $\tau$ is the temperature parameter and $\lVert\cdot\rVert_1$ is the $l_1$-norm. Finally, the new whole set $\hat{D} = \{(x_i, \hat{y}_i)\}_{i=1}^{N}$ is re-formed from the clean and noisy sets.
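The sharpen step, for instance, can be written as:

```python
import numpy as np

def sharpen(y, tau=0.5):
    """Temperature sharpening: raise probabilities to 1/tau and
    renormalize with the l1-norm to lower the entropy."""
    s = y ** (1.0 / tau)
    return s / np.abs(s).sum()

out = sharpen(np.array([0.6, 0.4]), tau=0.5)
# tau = 0.5 squares then renormalizes: [0.36, 0.16] / 0.52
```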
Mixup. To enhance the generalization of the model, the mixup technique (Zhang et al., 2017; Berthelot et al., 2019) is adopted to introduce diverse and novel examples during training. Unlike vision tasks, where images can be mixed directly, text inputs cannot be mixed directly due to the discreteness of words. Following previous NLP work (Berthelot et al., 2019; Qiao et al., 2022), the sentence representations encoded by the pre-trained encoder are used to perform mixup:

$\lambda \sim \mathrm{Beta}(\alpha, \alpha), \qquad h'_i = \lambda\, p(x_i; \theta) + (1-\lambda)\, p(x_j; \theta), \qquad y'_i = \lambda\, \hat{y}_i + (1-\lambda)\, \hat{y}_j$

where $\alpha$ is the parameter of the Beta distribution, $p(x; \theta)$ is the sentence embedding corresponding to the "[CLS]" token, and $(x_i, \hat{y}_i)$ and $(x_j, \hat{y}_j)$ are randomly sampled from the corrected set $\hat{D}$.
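A sketch of mixup on fixed-length sentence embeddings (λ drawn from Beta(α, α); clamping λ with max(λ, 1−λ), as in MixMatch, is our assumption and is not stated in the text):

```python
import numpy as np

def mixup(h_i, h_j, y_i, y_j, alpha=0.75, rng=None):
    """Mix two sentence embeddings and their pseudo-label targets.
    Clamping lam with max(lam, 1 - lam) keeps the mixed example
    closer to the first input (assumed, as in MixMatch)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * h_i + (1.0 - lam) * h_j, lam * y_i + (1.0 - lam) * y_j

h_mix, y_mix = mixup(np.ones(4), np.zeros(4),
                     np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# y_mix is still a valid distribution; h_mix stays in the convex hull
```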
The mixed hidden states $\{h'_i\}_{i=1}^{N}$ and targets $\{y'_i\}_{i=1}^{N}$ are used to train the classifier with the cross-entropy loss:

$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{i=1}^{N} (y'_i)^{\top} \log p(h'_i; \theta)$

and the R-Drop loss is applied on the noisy samples:

$\mathcal{L}_{kl} = \frac{1}{2M}\sum_{x_i \in X_n} \big(\mathrm{KL}(p_1(x_i;\theta)\,\|\,p_2(x_i;\theta)) + \mathrm{KL}(p_2(x_i;\theta)\,\|\,p_1(x_i;\theta))\big)$

where $M$ is the number of noisy samples in $X_n$, and $p_1(x_i; \theta)$ and $p_2(x_i; \theta)$ are two predictions of $x_i$ under different dropout masks.
Learning objective. The total loss is:

$\mathcal{L} = \mathcal{L}_{ce} + \beta\, \mathcal{L}_{kl}$

where $\beta$ is the weight of the KL loss, which is set to 0.3 in experiments.
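Putting the pieces together, the total loss could be assembled as follows (the `kl` helper and list-based interface are illustrative, not the paper's implementation):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL(p || q) for discrete distributions, with eps for stability."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def total_loss(l_ce, preds1, preds2, beta=0.3):
    """Cross-entropy on mixed examples plus a symmetric KL (R-Drop)
    term averaged over noisy samples, weighted by beta."""
    m = len(preds1)
    l_kl = sum(0.5 * (kl(p1, p2) + kl(p2, p1))
               for p1, p2 in zip(preds1, preds2)) / m
    return l_ce + beta * l_kl

# Identical dropout predictions contribute zero KL loss:
same = np.array([0.5, 0.5])
loss = total_loss(1.0, [same], [same])
# loss == 1.0
```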

Experimental settings
Datasets. Experiments are conducted on four text classification datasets: Trec (Voorhees et al., 1999), Agnews (Zhang et al., 2015), IMDB (Maas et al., 2011), and Chnsenticorp (Tan and Zhang, 2008). Note that IMDB and Chnsenticorp are binary classification datasets, so their symmetric and asymmetric noises are the same.
Instance-dependent noise: Following Qiao et al. (2022), an LSTM classifier is trained to determine which sample features are likely to be confused. The labels of the samples closest to the decision boundary are flipped to the opposite category based on the classifier's prediction probability. All datasets except Trec are used, as Trec has a small sample size and uneven categories. The main experiments only show results for a single type of noise at a ratio of 40%; results for other ratios can be found in Appendix C.
Mixture of multiple noises: We mix different types of noise to verify the algorithm's ability to deal with complex and unknown scenarios. For datasets with only two types of noise, including Trec, IMDB, and Chnsenticorp, we mix the two noises evenly. Specifically, Sym and Asym are evenly mixed on Trec, and Asym and IDN are evenly mixed on IMDB and Chnsenticorp. For Agnews, we mix three noises unevenly for a greater challenge. The result of evenly mixing two types of noise on Agnews is in Appendix C.

Baselines
BERT-FT (Devlin et al., 2018): the benchmark classification model without special methods, directly trained on noisy data. Co-Teaching (CT for short) (Han et al., 2018): a well-known noise learning method, which maintains two models simultaneously and lets each model select clean samples for the other to train on. ELR (Liu et al., 2020): ELR designs a regularization term that prevents the model from memorizing noisy labels by increasing the magnitudes of the coefficients on clean samples to counteract the effect of wrong samples. SelfMix (Qiao et al., 2022): SelfMix first warms up the model, then uses a GMM to separate data for semi-supervised self-training, and designs a normalization method according to the characteristics of IDN, resulting in a significant performance improvement over other methods lacking specific designs.
For fair comparison, the backbone model of each method is the same, consisting of a pre-trained BERT encoder and a two-layer MLP. The training time of all methods except SelfMix is set to 6 epochs, and their hyper-parameters are the same under different noise settings. For SelfMix, we follow the instructions of its paper to set different hyper-parameters for different datasets and noise types. Specifically, the warm-up time for symmetric and asymmetric noise is 2 epochs, the warm-up time for instance-dependent noise is 1 epoch, and the semi-supervised self-training time is 4 epochs. Results are averaged over five runs on the code base provided by Qiao et al. (2022).

Parameter settings
The same hyper-parameters as the baselines remain unchanged: the maximum sentence length is 256, the learning rate is 1e-5, the dropout rate is 0.1, the size of the middle layer of the MLP is 768, and the optimizer is Adam. Additionally, the temperature τ is 0.5 and the α of the Beta distribution is 0.75.
There are some specific hyper-parameters in our method. Specifically, in the warm-up stage, the interval s for monitoring the training set is 1/10 epoch, the convergence range ε is 0.01, the repetition count η is 3, the fraction ρ1 is 0.1, and the batch size is 12. In the hybrid training stage, the fraction ρ2 for dividing the data is 0.2, the threshold δ1 is 0.9, δ2 is 0.4, and the loss weight β is 0.3. The batch size of the clean set in the hybrid training stage is 3 and the batch size of the noisy set is 12, matching the ratio ρ2 of their sizes.
The duration of the warm-up stage varies under different noise settings. The hybrid training stage lasts for 4 epochs to compare with SelfMix. For generality, the parameters above are used unchanged in all cases.

Main Results
Table 2 presents the main results of our method and the baselines.
Baselines. BERT-FT is highly affected by noise, with its performance and stability decreasing significantly as the complexity of the noise increases. CT and ELR demonstrate strengths in handling simple noise scenarios, such as Trec with 40% symmetric noise and a mixture of symmetric and asymmetric noise. However, they perform poorly under more complex noise settings, such as IDN. In contrast, SelfMix excels in cases involving IDN due to its class-regularization loss and reduced warm-up time. Among all the methods, only SelfMix needs to adjust specific hyper-parameters based on the dataset and noise type.
Our method. In contrast to the baselines, our method consistently performs well across all noise scenarios. Under the simple CCN setting, our method exhibits further performance improvements over well-performing baselines like ELR. Under the IDN setting, our method outperforms general baselines and even surpasses SelfMix in certain cases, such as on Agnews and Chnsenticorp. Moreover, when confronted with mixed noise, our method consistently achieves top or near-top results while maintaining stability. In conclusion, our approach demonstrates superior performance and adaptability in various noise scenarios, making it more practical for real-world applications.

Analysis
In this section, some experiments are designed to provide a more comprehensive analysis of our proposed method. The main results are shown in Table 3 and the corresponding analyses are as follows. More analyses can be found in Appendix D. The highlighted rows denote the ratio of correct labels in the separated clean set.

Ablation experiment
To verify that each component contributes to the overall approach, we remove each component separately and check the test accuracy. To remove linear decay fusion, the original labels of the clean set are kept and the pseudo-labels of the noisy set are generated by the model. For the design of the correctness statistic, the relevant parts, including early stopping and linear decay fusion, are removed and replaced by normal training. The standard cross-entropy loss is directly applied to all samples and the corresponding pseudo-labels to remove the mixup operation. For R-Drop, the KL-divergence term is eliminated.
The first part of Table 3 shows that each component is helpful to the final performance. The correctness statistic and R-Drop are more critical in the complex setting (i.e., 40% IDN), because the model is greatly affected by this noise. Moreover, due to the introduction of linear decay fusion, R-Drop becomes more important in keeping the model stable.

Analysis of early stopping
Stopping warm-up properly is important for adapting to various noise cases and datasets. The comparison with fixed warm-up times is shown in the second part of Table 3. We can observe that no single fixed warm-up time is suitable for all situations. For simple cases such as asymmetric noise, warming up for 3 epochs is beneficial. For complex cases such as instance-dependent noise, warming up for 1 epoch is enough. However, adaptive warm-up handles all cases better than fixed times. Therefore, finding an appropriate stopping point according to the noisy training dataset may be a good strategy for noise learning based on pre-trained models.
To check whether our early stopping method finds a stopping point close to the maximum obtainable test accuracy (MOTA), we plot some examples compared with two other heuristic approaches: stopping via the noise ratio (Arpit et al., 2017) and stopping via a clean validation set (we use the test set for validation here). As shown in Figure 4, the estimated start and end points of early stopping (ES) are close to the MOTA under different noise settings. In some cases, ES is closer to the MOTA than stopping via the noise ratio (Ratio), especially under IDN, because IDN is more easily fitted by deep models, resulting in high training accuracy at an early stage. Compared with these two methods that rely on additional conditions, our method relies only on the original noisy training set, which is more adaptable to the real world.

The design of correctness statistic
During the warm-up stage, a correctness statistic is maintained to judge the time of early stopping. In addition, it is used to separate the training data into clean and noisy sets for training in the hybrid stage. We directly select the top 20% of samples with the highest values in the correctness statistic as the clean set.
This differs from previous noise-learning works (Li et al., 2020; Qiao et al., 2022; Garg et al., 2021), where the sets are separated by a GMM or BMM. For comparison, a GMM is used in the hybrid training stage to divide the data at each epoch. In addition to the results, the ratios of correct separation (i.e., how many samples in the clean sets are truly clean) are also listed in Table 3, highlighted with a gray background. As shown, the clean ratio of our method is slightly higher than that of the GMM, and the performance is better, especially under higher noise ratio settings.

Computational cost
Compared to previous noise learning approaches that include a normal warm-up phase (Han et al., 2018; Qiao et al., 2022), our method incurs an additional cost from inferring the entire training set at intervals during the adaptive warm-up phase. This cost is directly proportional to the number of training samples and the frequency of inference. To strike a balance between monitoring and cost, we set the interval to 1/10 epoch in our experiments. Additionally, during the hybrid training stage, we avoid the cost of inferring the training set and fitting the GMM that SelfMix incurs. Table 4 provides an example of the time costs of BERT-FT, SelfMix, and our method under the same settings, with all methods trained on a single GeForce RTX 2080Ti.

Conclusion
Based on the unknown and complex nature of noise, we propose a noise learning framework based on pre-trained models that can adapt to various textual noise scenarios. This framework automatically stops the warm-up process based on the magnitude and complexity of the noise to prevent the model from overfitting noisy labels. To further leverage mislabeled data, the linear decay fusion strategy is combined with mixup and R-Drop to improve performance while maintaining stability. Experimental results demonstrate that our method achieves performance comparable to the state of the art in all settings within the common noise range.

Limitations
We

A Examples of Assumption
More examples of the two assumptions are shown in Figures 5 and 6.

B Pseudo-code of Adaptive Warm-up
We summarize the pseudo-code of the adaptive warm-up process in Algorithm 1.

C Experiments C.1 Detailed Results
The detailed results are given in this section. Table 5 shows the results for different ratios of CCN, Table 6 shows the results for different ratios of IDN, and the results for the mixture of two types of noise on Agnews are supplemented in Table 7.
Class-conditional noise: Under this simple assumption, our proposed method achieves the best scores across nearly all noise cases in Table 5. All methods obtain very good performance under a low noise ratio because deep models are robust to simple noise. However, our approach still brings gains in some cases, such as the results on Trec and Chnsenticorp with 20% asymmetric label noise. Under a high noise ratio, i.e., 40%, the baselines are all affected to a greater or lesser extent, but our method performs well on different datasets and outperforms all baselines. Additionally, we list the results on clean datasets without noise. The performance gap between clean data and noisy data becomes smaller when applying our method.
Instance-dependent noise: Table 6 shows that the complexity of IDN results in a substantial performance decrease for the general methods CT and ELR, indicating that they cannot cope well with this noise setting. In these experiments, SelfMix obtains more best scores, which is related to its noise-type-specific designs, such as reducing the warm-up time and designing the class-regularization loss. In contrast, our method exhibits strong performance in all cases and achieves scores close to or even better than SelfMix without making assumptions about the type of noise. Particularly noteworthy are the cases where SelfMix underperforms, such as Trec and Chnsenticorp with 10% noise, where our method surpasses even the strongest baseline. This further highlights the versatility and effectiveness of our method across different scenarios.
Mixture of multiple noises: As shown in Table 7, mixed noise poses challenges, and the baselines exhibit distinct characteristics. BERT-FT is highly impacted by noise, particularly with a significant drop in the Last score. CT and ELR demonstrate their strengths in handling simple noise scenarios, such as a mixture of symmetric and asymmetric noise. On the other hand, SelfMix excels in cases involving IDN. In contrast, our method consistently achieves top or near-top performance while maintaining stability across all scenarios. It offers a more flexible and adaptable solution for unknown noise cases.
In conclusion, our method achieves strong scores under all noise conditions without knowing the noise type, the noise ratio, or the characteristics of the dataset. Moreover, the same hyper-parameters are used for all datasets and noise settings. These factors make our approach more practical in real scenarios.

D More Analysis D.1 Analysis of linear decay fusion
In linear decay fusion, we set δ1 = 0.1 to let the label weight of the cleaner data decay from 1 to 0.1, and δ2 = 0.6 to let that of the noisier data decay from 0.6 to 0. To understand the role of the two thresholds, Table 8 lists results obtained by fixing one and varying the other. The first three rows show that δ1 has little effect on asymmetric noise but limits the performance on IDN. Because IDN is easily fitted by the model, some noisy samples are inevitably assigned to the cleaner set, so it is more appropriate to reduce the weight to a small value. As can be observed from the last two rows, a small value of δ2 is also detrimental to IDN because it introduces much noise from the noisier set, while a large value loses some valuable information; the trade-off value of 0.6 is a good choice. The setting of the two thresholds allows the proposed method to handle complex noise cases and cover possible noise situations, making it well applicable to other datasets and the real world.

D.2 Stability Analysis
In the main experiments, the hybrid training time is set to 4 epochs for fair comparison with SelfMix. The stability of noise learning methods is also an important aspect, so the training time is extended to verify stability. Since the warm-up time is adaptive, we set different hybrid training times.

Figure 1 :
Figure 1: The overall diagram of the proposed method

Figure 2 :
Figure 2: The memorization observation of different noise cases. Label recall is shown in the top figures; train and test accuracy in the bottom figures. MOTA is short for the maximum obtainable test accuracy. More cases can be found in Appendix A.

Figure 3 :
Figure 3: Examples of the model output during training under different noise settings. The given label of clean samples is true. More examples can be found in Appendix A.

Figure 4 :
Figure 4: The instances of early stopping under different noise scenarios.

Table 1 :
The statistics of datasets.

R-Drop. Since the prediction of the model is fused in pseudo-labeling, it needs to be as accurate as possible. Therefore, the R-Drop strategy (Liang et al., 2021) is applied to the noisy samples X_n to encourage the model to produce a consistent output distribution for the same input. R-Drop minimizes the Kullback-Leibler divergence between two distributions predicted by the model with the dropout mechanism for the same sample.

Table 2 :
Results (%) of five runs on Trec, Agnews, IMDB, and Chnsenticorp under different noise settings, where Chnsenticorp is a Chinese dataset and the rest are English datasets. The statistics of the datasets are presented in Table 1; the training set of Chnsenticorp is a merger of the original training and validation sets. Bolded results are the highest among all best scores and underlined results are the highest among all last scores. Chn is short for Chnsenticorp. * means some hyper-parameters are adjusted according to the type of noise and the composition ratio of the mixture noise.

Table 3 :
Ablation study results in terms of test accuracy (%) on Trec and Chnsenticorp under different noise.

Table 4 :
The time cost example of different methods under the same noise setting.