Noise-Robust Fine-Tuning of Pretrained Language Models via External Guidance

Adopting a two-stage paradigm of pretraining followed by fine-tuning, Pretrained Language Models (PLMs) have achieved substantial advancements in the field of natural language processing. However, in real-world scenarios, data labels are often noisy due to the complex annotation process, making it essential to develop strategies for fine-tuning PLMs with such noisy labels. To this end, we introduce an innovative approach for fine-tuning PLMs using noisy labels, which incorporates the guidance of Large Language Models (LLMs) like ChatGPT. This guidance assists in accurately distinguishing between clean and noisy samples and provides supplementary information beyond the noisy labels, thereby enhancing the learning process when fine-tuning PLMs. Extensive experiments on synthetic and real-world noisy datasets further demonstrate the superiority of our framework over the state-of-the-art baselines.


Introduction
In recent years, the development of language models has significantly expanded the applications within the field of natural language processing (NLP). Fine-tuning Pretrained Language Models (PLMs) like BERT (Devlin et al., 2018) for specific downstream tasks has become an essential step in real-world implementations (Alt et al., 2020; Wang et al., 2021). In general, achieving significant performance gains in fine-tuning PLMs necessitates the availability of task-specific data for fine-tuning (Zhou and Chen, 2021; Wang et al., 2022). However, obtaining high-quality labeled datasets for this purpose poses significant challenges due to the expensive, complex, and labor-intensive nature of the annotation process (Yu et al., 2019; Bae et al., 2022). For example, large-scale datasets, often derived from web-crawling (Li et al., 2017; Song et al., 2019) or crowd-sourcing (Yan et al., 2014; Williams et al., 2017; Sakaguchi et al., 2021), frequently suffer from the presence of noisy labels. Prior research (Arpit et al., 2017; Cheng et al., 2021; Zhang et al., 2021c; Wang et al., 2023a) has shown that PLMs are prone to overfitting and generally deliver subpar performance when fine-tuned on datasets containing label noise. Since fine-tuning incorporates supervision information to enhance performance on downstream tasks, the presence of noisy labels will mislead the training process and significantly impede the efficacy of PLMs (Zhu et al., 2022). Therefore, there is an immediate need to design effective algorithms for fine-tuning PLMs in the presence of noisy labels.
In the context of learning from noisy labels, an intuitive approach is to separate clean samples from the noisy ones in the training set for model training (Han et al., 2018). For the remaining noisy samples, a prevalent strategy is to pseudo-label them based on model predictions to reduce the adverse impact of noise (Berthelot et al., 2019; Sohn et al., 2020). However, it remains challenging to apply this paradigm to PLMs. This is because PLMs, equipped with prior knowledge encoded in a large number of parameters, tend to easily memorize noisy samples early on during the fine-tuning process. As illustrated in Figure 1, PLMs exhibit similar prediction confidences for both clean and noisy samples, which results in two challenges when utilizing existing methods: (1) Existing approaches often rely on confidences generated by models trained on potentially noisy samples (Li et al., 2020; Karim et al., 2022). However, as shown in Figure 1, it is challenging to accurately distinguish clean and noisy samples solely based on confidences. (2) Existing methods are susceptible to erroneous information during pseudo-labeling, as they directly utilize model predictions as pseudo-labels, which can be detrimental when the predictions are inaccurate. PLMs, being capable of easily remembering noisy samples, further exacerbate the risk of capturing and amplifying erroneous information during pseudo-labeling. For example, in the 20Ng dataset (Lang, 1995), when an "autos" sample is assigned the wrong label "hardware", PLMs can easily memorize the erroneous information and thus cannot infer the correct label in predictions, resulting in incorrect pseudo-labeling.
To overcome these challenges, we propose a novel framework, named LAFT, that harnesses the power of Large LAnguage Models (LLMs) for Fine-Tuning PLMs. We leverage LLMs trained on extensive corpora and fine-tuned with human instructions, such as GPT-4 (OpenAI, 2023), to obtain external guidance in the form of confidence values. Our approach addresses the first challenge of separating reliable samples by utilizing LLM-generated confidences for each class of all training samples. By comparing these confidences with the assigned labels (i.e., the given noisy labels) of training samples, we categorize them into three disjoint subsets: the Easy Clean (EC) Set, the Hard Clean (HC) Set, and the True Noisy (TN) Set. Regarding the second challenge, we propose a novel method to incorporate LLM-generated confidence scores as robust supervision information for all samples to ensure that PLMs learn useful information from them. As LLM-generated confidences are not affected by label noise, they can provide potentially relevant labels that are useful even if they are not entirely accurate. In summary, our contributions are as follows:

• We are the first to explore the potential of leveraging supervision information (i.e., confidence scores) generated by LLMs to tackle the noisy label problem in fine-tuning PLMs.
• We propose a novel framework LAFT that can effectively separate clean and noisy samples and learn from noisy labels, based on the LLM-generated confidences.
• We conduct extensive experiments on synthetic and real-world noisy datasets and demonstrate the superiority of our framework in fine-tuning PLMs with noisy labels.
Related Work

Learning from Noisy Labels
Various strategies for learning from noisy labels have been proposed, falling primarily into three categories. The first category is Sample Selection methods, which typically employ loss values or confidences to identify reliable samples for model optimization. These methods generally necessitate a predefined threshold (Li et al., 2021; Karim et al., 2022) or prior knowledge concerning the noise label rate (Han et al., 2018; Yu et al., 2019) to choose the instances. The second category, dubbed Label Transition methods, aims to learn (Tanaka et al., 2018) or generate pseudo-labels (Zhang et al., 2021c; Sohn et al., 2020) to replace the original noisy labels. Lastly, Regularization methods design robust loss functions (Liu et al., 2020; Englesson and Azizpour, 2021) or regularization techniques (Zhang and Sabuncu, 2018) that can effectively utilize all samples to enhance model robustness against label noise. Nevertheless, these methods generally do not consider the scenario of fine-tuning a pretrained model with noisy labels.

Confidence-Guided Sample Separation
To separate noisy samples, existing works leverage the loss values or confidences that provide insights into the model's prediction behavior as training proceeds (Yu et al., 2019; Karim et al., 2022). In the context of learning from noisy labels, the key concept is to leverage these training dynamics as criteria for identifying and separating noisy samples. Several works propose to identify samples with lower training loss as the clean subset (Li et al., 2020; Han et al., 2018; Jiang et al., 2018; Zhao et al., 2022; Wang et al., 2023b); however, such criteria are generally simplistic and inflexible, resulting in the selection of only easy samples. To address this limitation, alternative approaches have been proposed to more effectively utilize the loss or confidences during training, as demonstrated in (Zhang et al., 2021a) and (Nishi et al., 2021). In contrast to existing methods that rely only on confidences generated by PLMs, we leverage LLMs as external guidance that provides more precise separations for fine-tuning PLMs.

Problem Definition
In this work, we study the problem of fine-tuning PLMs for text classification with noisy labels. Formally, consider a noisy training dataset containing n samples: D_tr = {(x_i, y_i), i = 1, 2, ..., n}, where x_i is an input sample and y_i denotes the assigned label of x_i. Note that y_i is potentially corrupted, and we denote the true label of x_i as y*_i, which is inaccessible during fine-tuning. Specifically, we aim to fine-tune PLMs with samples in D_tr to achieve satisfactory prediction performance on test samples, while utilizing LLMs as external guidance. Notably, the LLMs remain fixed in our framework, which means we can also use black-box LLMs. In practice, we implement PLMs for text classification by attaching an additional classifier, which takes the output of the PLM as input and generates class probabilities for classification.

Methodology
The overall framework of LAFT is illustrated in Fig. 2. We propose to divide all training samples into three subsets based on the accordance among LLM-generated confidences, PLM-generated confidences, and the assigned labels of samples. In particular, we query an LLM to provide confidences for each training sample, spanning all classes. Combining confidences obtained from both LLMs and PLMs, we perform two steps of separation to segregate the entire training set into three subsets: the Easy Clean (EC) Set, the Hard Clean (HC) Set, and the True Noisy (TN) Set. Each of these subsets, displaying unique behaviors, is then subjected to specific noise-robust fine-tuning strategies.

Confidences
Existing studies establish a correlation between confidences and the degree to which deep models memorize specific samples during training (Arpit et al., 2017; Karim et al., 2022). It has been observed that as the model's memorization of a particular sample strengthens, the model tends to assign higher confidence to this sample (Li et al., 2023). Therefore, these methods generally employ confidences to distinguish between clean and noisy samples based on the assumption that the model cannot easily memorize noisy samples (Pleiss et al., 2020; Swayamdipta et al., 2020). However, applying these strategies to fine-tuning PLMs is suboptimal, as PLMs also present high confidence for noisy samples even in the early stage of fine-tuning, as shown in Fig. 1. To deal with this issue, we propose the utilization of external guidance, in the form of confidences generated by LLMs. Before delving into the specifics of our framework, we provide a formal definition of confidences. Denote the output of the final layer (i.e., the classifier) in PLMs for sample x_i as z(x_i) ∈ R^N, where N is the number of classes. The confidence of x_i for the j-th class c_j can be represented as follows:

p(c_j; x_i) = exp(z(c_j; x_i)) / Σ_{k=1}^{N} exp(z(c_k; x_i)),    (1)

where z(c_j; x_i) ∈ R is the j-th value in z(x_i). Notably, the confidences are obtained from z(x_i) after a softmax function and thus sum up to one.
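The softmax confidence described above can be sketched in a few lines. The function name is ours; the list of logits stands in for the classifier output z(x_i):

```python
import math

def confidences(logits):
    """Softmax over the classifier logits z(x_i): confidences sum to one.

    Entry j of `logits` plays the role of z(c_j; x_i)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

In a real pipeline these logits would come from the PLM's classification head; the subtraction of the maximum logit does not change the result but avoids overflow for large logits.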
Although LLM-generated confidences can provide external guidance, they are not completely accurate. Thus, we conduct a two-step sample separation process based on both LLM-generated and PLM-generated confidences, with the second step providing a more granular distinction.

Coarse-Grained Separation
For the first step of separation, we aim to select samples that can easily be identified as clean data with guidance from LLMs. Thus, we perform Coarse-Grained Separation, utilizing confidences generated by LLMs with the raw text data included in the prompt. Here we provide an example of the prompt for querying the LLM to obtain the confidence value for each class:

Classify the following content: {input text}. Select the label from {Class 1}, {Class 2}, ..., {Class N} and output a confidence value for each of them.
We denote the LLM-generated confidence for sample x_i regarding class c_j as p̃(c_j; x_i), where c_j ∈ C = {c_1, c_2, ..., c_N}, and C is the class set. Then the label obtained from the LLM can be represented as

ŷ_i = argmax_{c_j ∈ C} p̃(c_j; x_i).    (2)

To perform coarse-grained separation, we first provide an assumption as the justification:

Assumption 1. Samples whose assigned labels are the same as the LLM-generated labels (i.e., ŷ_i = y_i) can be considered almost clean.
Note that Assumption 1 is empirically verified in Table 5 in Sec. 5.5. The underlying intuition is that the probability of an LLM-generated label being identical to the assigned label yet inconsistent with the ground-truth label is significantly low and thus negligible. Concretely, this assumption allows us to segregate the training samples into two distinct subsets based on the concordance between the LLM-generated label ŷ_i and the assigned (potentially noisy) label y_i of x_i. More specifically, we define the resulting two subsets, the Easy Clean (EC) Set E and the Disagreed Set D, as follows:

E = {(x_i, y_i) ∈ D_tr : ŷ_i = y_i},  D = {(x_i, y_i) ∈ D_tr : ŷ_i ≠ y_i}.    (3)

It is naturally satisfied that E ∪ D = D_tr and E ∩ D = ∅, where D_tr is the training set. Since the samples in E are already clean, we can directly fine-tune PLMs using the LLM-generated labels ŷ_i based on the cross-entropy loss as follows:

L_E = − Σ_{(x_i, y_i) ∈ E} Σ_{j=1}^{N} ŷ_{i,j} log p(c_j; x_i),    (4)

where p(c_j; x_i) is the PLM-generated confidence for x_i regarding the j-th class c_j. Here ŷ_{i,j} = 1 if ŷ_i = c_j, and ŷ_{i,j} = 0 otherwise.
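The coarse-grained split under Assumption 1 amounts to a single agreement check per sample. A minimal sketch (function and variable names are ours, not from the paper):

```python
def coarse_grained_separation(samples, llm_labels, assigned_labels):
    """Split the training set into the Easy Clean (EC) set and the Disagreed set.

    A sample is Easy Clean when the LLM-generated label agrees with its
    assigned (possibly noisy) label; all other samples are Disagreed."""
    easy_clean, disagreed = [], []
    for x, y_llm, y in zip(samples, llm_labels, assigned_labels):
        (easy_clean if y_llm == y else disagreed).append((x, y))
    return easy_clean, disagreed
```

The two returned lists are disjoint and together cover the whole training set, mirroring E ∪ D = D_tr and E ∩ D = ∅.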

Fine-Grained Separation
It is noteworthy, however, that the samples in D are not completely noisy. This is because the LLM-generated labels are not perfectly correct, as shown in Table 5. Thus, samples that are clean but incorrectly classified by LLMs will still be categorized into D. Therefore, although we can learn from samples in E directly with their LLM-generated labels ŷ_i, it is still challenging to learn from samples in D, which are only partially clean. To this end, we further propose to separate the samples in the disagreed set D into two subsets: the Hard Clean (HC) Set H and the True Noisy (TN) Set N, referred to as fine-grained separation.
The intuition is that within the disagreed set D, the LLM-generated labels can be incorrect for specific hard samples whose assigned labels are correct, referred to as Hard Clean (HC) samples. Specifically, the ideal separation for D is as follows:

H* = {x_i ∈ D : y*_i = y_i},  N* = {x_i ∈ D : y*_i ≠ y_i},    (5)

where y*_i is the true label of x_i. Note that although this ideal separation would be completely precise for separating noisy samples in D, it is infeasible in practice, as the true labels are unknown. Therefore, to precisely separate the true noisy samples in D, we propose two thresholds, for LLM-generated and PLM-generated confidences, respectively. In order to achieve more robust LLM-generated confidences distributed over the label space C, we adopt M different augmentations for each input sample to encourage input diversity while keeping the semantics immutable. Denote the augmented samples of x_i as v_m(x_i), where m = 1, 2, ..., M, and M is the number of augmentations. We can obtain the M LLM-generated confidences p̃(c_j; v_m(x_i)) for x_i regarding the j-th class c_j, where m = 1, 2, ..., M and j = 1, 2, ..., N. We aggregate the LLM-generated confidences over the M augmentations as follows:

p̃_a(c_j; x_i) = (1/M) Σ_{m=1}^{M} p̃(c_j; v_m(x_i)),    (6)

where v_m(x_i) is the input example after applying the m-th augmentation. As LLM-generated confidences remain fixed during fine-tuning PLMs, we adopt a fixed threshold τ for fine-grained separation based on p̃_a(c_j; x_i).
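The aggregation over augmentations is a per-class mean of the M LLM confidence vectors. A small sketch (the helper name is ours):

```python
def aggregate_llm_confidences(conf_per_aug):
    """Average LLM-generated confidences over M augmented views of one sample.

    `conf_per_aug` is an M x N nested list where conf_per_aug[m][j]
    corresponds to the LLM confidence for class c_j on the m-th augmentation."""
    M = len(conf_per_aug)
    N = len(conf_per_aug[0])
    return [sum(conf_per_aug[m][j] for m in range(M)) / M for j in range(N)]
```

Since each per-augmentation vector sums to one, the aggregated vector does too, so it can be compared against the fixed threshold τ directly.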
On the other hand, PLM-generated confidences for each sample change as the fine-tuning proceeds, which results in subpar separation performance if we adopt a fixed threshold to select high-confidence samples as clean ones. Intuitively, the confidences should be lower for HC samples at the beginning of fine-tuning, as the model cannot easily fit these hard samples within the first several epochs. In contrast, noisy samples are memorized more quickly, and thus their confidences rise higher early in fine-tuning. Therefore, we propose an adaptive threshold τ(t) that increases as fine-tuning proceeds while never reaching an excessively high value:

τ(t) = τ · (1 − exp(−λ t)),    (7)

where λ and τ are hyper-parameters that control the value of the threshold τ(t) as fine-tuning proceeds, and t denotes the current number of fine-tuning epochs.
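The two-threshold idea can be sketched as follows. Both the saturating schedule for τ(t) and the exact selection rule are illustrative assumptions on our part, chosen to match the stated intuition (the PLM threshold grows with the epoch t but stays bounded; a disagreed sample is kept as Hard Clean only when both the aggregated LLM confidence and the current PLM confidence on its assigned label clear their respective thresholds):

```python
import math

def adaptive_threshold(t, lam, tau_max):
    """An illustrative saturating schedule for the PLM threshold tau(t):
    monotonically increasing in the epoch t, bounded above by tau_max."""
    return tau_max * (1.0 - math.exp(-lam * t))

def fine_grained_separation(disagreed, llm_conf, plm_conf, tau_llm, tau_t):
    """Split the Disagreed set into Hard Clean (H) and True Noisy (N).

    Hypothetical rule: a sample stays Hard Clean when both the aggregated
    LLM confidence and the PLM confidence on its assigned label index y
    exceed their thresholds; everything else falls into True Noisy."""
    hard_clean, true_noisy = [], []
    for (x, y), p_llm, p_plm in zip(disagreed, llm_conf, plm_conf):
        if p_llm[y] > tau_llm and p_plm[y] > tau_t:
            hard_clean.append((x, y))
        else:
            true_noisy.append((x, y))
    return hard_clean, true_noisy
```

The LLM threshold is fixed because LLM confidences never change during fine-tuning, whereas `adaptive_threshold` would be re-evaluated at every epoch.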
Combining the two thresholds, we can perform fine-grained separation by selecting the HC set H as follows:

H = {x_i ∈ D : p̃_a(y_i; x_i) > τ and p(y_i; x_i) > τ(t)},    (8)

where τ and τ(t) are the thresholds for the LLM-generated confidences p̃_a(c_j; x_i) and the PLM-generated confidences p(c_j; x_i), respectively. In this manner, the process of separating HC and TN samples, i.e., fine-grained separation, can benefit from both the LLMs and the PLMs. Then the remaining samples are categorized into the True Noisy (TN) Set N:

N = D \ H.    (9)

Learning from the Hard Clean (HC) Set

Now we have divided the disagreed set D into H and N. Recall that, ideally, samples in H are hard yet correctly labeled. Thus, for these samples, we can directly utilize their assigned labels y_i as training labels.
Nevertheless, since the fine-grained separation of H and N cannot be perfect, the samples in H may still be noisy. As both LLMs and PLMs fail to provide a confident prediction for samples in H, we propose to prioritize the assigned label while also incorporating additional information from LLMs and PLMs. Specifically, we employ a weighted loss based on cross-entropy, where the weight is enlarged if the summed confidence of the LLM and the PLM is high. We define the loss as follows:

L_H = − Σ_{x_i ∈ H} Σ_{j=1}^{N} (y_{i,j} + ϕ_{i,j}) log p(c_j; x_i),    (10)

where p(c_j; x_i) is the PLM-generated confidence for x_i regarding the j-th class c_j. Here y_{i,j} = 1 if y_i = c_j, and y_{i,j} = 0 otherwise. ϕ_{i,j} acts as a weight adjustment that increases when the sum of the LLM confidence p̃_a(c_j; x_i) and the PLM confidence p(c_j; x_i) is larger. Intuitively, if the LLM and the PLM are both highly confident regarding a specific class, then the information in their confidences is useful, as they are more likely to be correct. Therefore, we define ϕ_{i,j} as follows:

ϕ_{i,j} = max(p̃_a(c_j; x_i) + p(c_j; x_i) − α · τ(t), 0),    (11)

where α > 1 is a hyper-parameter that controls the threshold α · τ(t) for ϕ_{i,j}. As such information can still be inaccurate, we subtract the threshold α · τ(t) to control its magnitude, such that ϕ_{i,j} also acts as an adaptive loss weight for these samples.
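The weighted loss of Eq. (10) for a single Hard Clean sample can be sketched as below. The clipped-at-zero form of the weight adjustment is our reading of the text ("we subtract the threshold ... to control its magnitude"); function names are ours:

```python
import math

def phi(llm_conf_j, plm_conf_j, alpha, tau_t):
    """Weight adjustment phi_{i,j}: positive only when the summed LLM and
    PLM confidence for class j exceeds the threshold alpha * tau(t)."""
    return max(0.0, llm_conf_j + plm_conf_j - alpha * tau_t)

def hc_loss(assigned_label, plm_conf, llm_conf, alpha, tau_t):
    """Weighted cross-entropy for one Hard Clean sample:
    -sum_j (y_{i,j} + phi_{i,j}) * log p(c_j; x_i)."""
    loss = 0.0
    for j, p in enumerate(plm_conf):
        y_ij = 1.0 if j == assigned_label else 0.0
        loss -= (y_ij + phi(llm_conf[j], p, alpha, tau_t)) * math.log(p)
    return loss
```

When the combined confidences stay below α·τ(t), every ϕ_{i,j} is zero and the loss degenerates to plain cross-entropy on the assigned label.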

Learning from the True Noisy (TN) Set
After the fine-grained separation, most samples in the True Noisy (TN) Set should be identified as noisy ones. Nevertheless, it is still challenging for PLMs to use their own outputs as pseudo-labels for fine-tuning, as the prediction errors will accumulate and affect subsequent pseudo-labeling results. Fortunately, the confidences generated by LLMs can provide additional guidance to identify the potential labels for samples in the TN set. We first provide Remark 1 to justify the effectiveness of using LLM-generated labels for optimization.
Remark 1. LLM-generated labels on the True Noisy Set preserve the same accuracy as that on the whole dataset, as LLMs are not affected by noise.
Remark 1, as empirically verified in Sec. 5.5, demonstrates that even when we categorize most noisy samples into the True Noisy Set, the LLM-generated labels can still provide decent guidance without sacrificing accuracy. Given the confidences provided by LLMs, we employ a loss that enables the PLMs to benefit from them. Specifically,

L_N = − Σ_{x_i ∈ N} ( Σ_{j=1}^{N} p̃_a(c_j; x_i) log p(c_j; x_i) + δ(x_i) · log p(ĉ_i; x_i) ),    (12)

where ĉ_i = argmax_{c_j ∈ C} p(c_j; x_i) is the class predicted by the PLM. Here, in the first term, we leverage the confidences generated by LLMs to learn from potentially correct labels. This is because although LLMs cannot always predict the correct labels, the confidences still preserve potentially useful information in other incorrect but relevant labels. Such benefits cannot be provided by pseudo-labeling, which outputs a single definitive label. For the second term, we utilize the model predictions to exploit useful information from PLMs. The intuition is that, with the relevant label information provided by LLMs, the PLMs can learn accurate label information from the noisy samples in the TN set. Consequently, if the model output tends to be confident on specific samples, we can utilize the prediction to further enhance the learning from them. Thus, we further set a threshold for this term via δ(x_i), defined as follows:

δ(x_i) = 1 if max_{c_j ∈ C} p(c_j; x_i) > β · τ(t), and δ(x_i) = 0 otherwise,    (13)

where τ(t) is computed by Eq. (7). To reduce the effect of confirmation bias, we multiply τ(t) by β, a hyper-parameter that controls the final adaptive threshold β · τ(t).
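The two-term loss for a True Noisy sample can be sketched as below. The soft first term follows the LLM confidence distribution, and the gated second term is our reading of the self-training component (names and the exact gating form are assumptions):

```python
import math

def tn_loss(plm_conf, llm_conf, beta, tau_t):
    """Loss for one True Noisy sample: a soft cross-entropy term driven by
    aggregated LLM confidences, plus a self-training term that is active
    only when delta(x_i) = 1, i.e., the PLM's top confidence clears
    beta * tau(t) (limiting confirmation bias)."""
    # first term: learn from the LLM confidence distribution
    loss = -sum(q * math.log(p) for q, p in zip(llm_conf, plm_conf))
    # second term: the PLM's own most confident prediction, gated by delta
    if max(plm_conf) > beta * tau_t:  # delta(x_i) = 1
        k = max(range(len(plm_conf)), key=lambda j: plm_conf[j])
        loss -= math.log(plm_conf[k])
    return loss
```

With the gate closed, the sample is trained only against the LLM confidence distribution, so a confidently wrong PLM prediction cannot reinforce itself.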

Fine-tuning Objective
After we separate all training samples into E, H, and N, we can combine the three individual losses for PLM fine-tuning. Our final fine-tuning objective can be represented as follows:

L = L_E + λ_H · L_H + λ_N · L_N,    (14)

where λ_H and λ_N are hyper-parameters that control the importance of L_H and L_N, respectively.
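The combination of the three per-subset losses is a plain weighted sum; a one-line sketch (the function name is ours):

```python
def total_loss(loss_E, loss_H, loss_N, lam_H, lam_N):
    """Final fine-tuning objective: L = L_E + lam_H * L_H + lam_N * L_N."""
    return loss_E + lam_H * loss_H + lam_N * loss_N
```

In practice this scalar would be computed per batch and back-propagated through the PLM and its classifier.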
Experiments

Baselines. We compare our framework to state-of-the-art baselines for learning from noisy labels.
In particular, we compare to (1) Base (Devlin et al., 2018), which performs fine-tuning with the standard cross-entropy loss; (2) Regularization Methods: Mixup (Zhang et al., 2018) and GCE (Zhang and Sabuncu, 2018); (3) Sample-Selection Methods: Co-teaching (Han et al., 2018), Co-teaching+ (Yu et al., 2019), JoCoR (Wei et al., 2020), CR (Zhou and Chen, 2021), and NPC (Bae et al., 2022). Additional details are provided in Appendix C.

Implementation Details. We use BERT (Devlin et al., 2018) as the text encoder for all datasets except Hausa, for which we use mBERT. The classifier is implemented as a fully-connected layer and randomly initialized at the beginning, while both the encoder and the classifier are updated via gradient descent during fine-tuning. All experiments are evaluated on a clean test set, and the average accuracy along with the standard deviation over ten runs is reported for each dataset.
We provide more details about the implementation in Appendix D, and our code is provided at https://github.com/SongW-SW/LAFT.

Comparison on Synthetic Datasets
In this subsection, we compare our framework with other baselines on the synthetic datasets 20Ng and AGNews, considering different noise types and ratios. Specifically, for Symmetric Noise (SN), we conduct experiments on three different noise rates: 20%, 40%, and 60%. For the other two types of noise, i.e., Asymmetric Noise (AN) and Instance-Dependent Noise (IDN), we adopt two noise rates: 20% and 40%. We present the results in Tables 1, 2, and 3. The key observations drawn from the outcomes are as follows: (1) Regardless of the noise type, our LAFT framework persistently surpasses other state-of-the-art baselines, showcasing its effectiveness in fine-tuning PLMs with noisy labels.
(2) LAFT's performance improvement over other baselines is slightly more pronounced on the 20Ng dataset compared to AGNews. This is attributable to 20Ng containing a greater number of classes (N = 20) as opposed to AGNews (N = 4). Consequently, our strategy of utilizing LLM-generated confidences can capitalize on the similar labels within 20Ng for PLM fine-tuning, even if LLM predictions are not entirely accurate. (3) Certain regularization methods, such as Mixup and GCE, exhibit subpar performance when applied to datasets with a higher noise ratio of 60%. This is because these methods rely on all training samples for PLM fine-tuning, making them more susceptible to strong label noise.

Comparison on Real-world Datasets
In this subsection, we conduct experiments on three real-world datasets: SemEval (Zhou et al., 2020), TREC (Awasthi et al., 2020), and Hausa (Hedderich et al., 2020). From the results in Table 4, we observe that LAFT outperforms other baseline methods on all three datasets. Moreover, the results of LAFT are noticeably competitive on TREC and Hausa, which have relatively higher noise ratios, further demonstrating the strong robustness and generalization ability of LAFT under label noise.

Ablation Study
In this subsection, we systematically remove specific components from our framework and analyze the resulting impact on performance. We conduct experiments with the following variants: (1) LAFT\C performs coarse-grained separation without LLMs and thus only separates the Easy Clean Set based on PLM predictions. (2) LAFT\F performs fine-grained separation without LLMs, which means that when separating the Hard Clean Set, only PLM-generated confidences are used.
(3) LAFT\N removes our proposed loss L_N for the True Noisy Set and replaces it with pseudo-labeling based on the most confident PLM prediction. From the results presented in Figure 3, we observe that LAFT outperforms all variants, which verifies the effectiveness of these designs in LAFT. Specifically, removing the learning loss L_N leads to significant performance degradation, indicating that this design effectively alleviates the adverse impact of noisy labels. Moreover, without the fine-grained separation strategy, the performance deteriorates rapidly as the noise ratio grows, indicating the importance of fine-grained separation in the presence of higher noise ratios.

Evaluation of LLMs-generated Labels
In this subsection, we evaluate the effectiveness of LLMs regarding the quality of the generated confidences, presented in Table 5. Specifically, we report the accuracy of LLM-generated predictions on the training sets of 20Ng and AGNews. The statistics are from experiments on 20Ng and AGNews with 20% Symmetric Noise (SN). From the results, we can observe that: (1) For the Easy Clean Set, the LLM-generated labels exhibit a remarkably high degree of correctness. This empirical finding provides the justification for Assumption 1, which allows for leveraging LLMs to perform coarse-grained separation in our framework.
(2) For the TN set, LLM-generated labels are not entirely correct, while still preserving similar accuracy compared to that on all samples. This result empirically verifies Remark 1 and justifies our strategy of utilizing LLM-generated confidences for learning from the TN set.
(3) Considering the overall results, LLM-generated labels generally exhibit lower accuracy than the PLM-based baselines. This indicates that LLMs cannot be directly used to replace PLMs for the text classification task when the noise ratio is not extremely high. Nevertheless, our strategy can effectively leverage the LLM-generated confidences, despite their lack of complete correctness, to enhance the fine-tuning performance of PLMs with noisy labels.

Results with Different Noise Ratios
Figure 4 presents the evaluation of our proposed framework across a spectrum of noise ratios on 20Ng, ranging from 0.1 to 0.9, while considering three distinct types of noise: SN, AN, and IDN. From the results, we make several significant observations: (1) With increasing noise ratios, all baseline methods consistently show a deterioration in performance. This decline can be tied to the amplified difficulty of label noise during PLM fine-tuning. (2) Despite the escalating noise ratios, our framework maintains a relatively more robust performance. This is attributed to our sample separation strategy, which adeptly discerns clean samples within the training set, thus alleviating the adverse effects of label noise. (3) Our framework demonstrates a slower performance decrease with AN noise compared to other types of noise. This can be credited to the adaptive nature of our approach, which employs LLM-generated confidences to effectively learn from similar labels for each sample.

Conclusion
This paper delves into the issue of fine-tuning Pretrained Language Models (PLMs) with noisy labels for text classification tasks. We address the detrimental effects of label noise by harnessing external guidance from Large Language Models (LLMs) in the form of confidences. Our approach entails a two-step separation strategy that accurately segregates all training samples into three distinct subsets, each treated with innovative noise-resilient strategies. In summary, our framework adeptly fine-tunes PLMs in the face of noisy samples, requiring minimal external guidance from LLMs.

Limitation
Despite the promising results, our work presents several limitations that we must acknowledge: (1) Dependence on Large Language Models (LLMs): Our framework leverages guidance from LLMs in the form of confidences. This implies that the quality of these confidences, and hence the effectiveness of our model, is closely tied to the performance of the LLM. Unsatisfactory LLM performance could thus directly impact our framework's efficacy. (2) Noise Types and Ratios: The experiments in this paper mainly focus on three types of noise: Symmetric Noise (SN), Asymmetric Noise (AN), and Instance-Dependent Noise (IDN), with noise ratios up to 60%. However, there may be other noise types or higher noise ratios in practice, and it remains to be investigated how well our framework performs under these circumstances. (3) Limitations in Sample Separation Strategies: The efficacy of our framework relies on accurately dividing the training samples into clean and noisy subsets. If this distinction is not clear or becomes more complex, our current approach may encounter difficulties. (4) Domain-Specific Text Classification: While the experiments are performed on specific text classification tasks, we do not investigate the effectiveness of our approach in more domain-specific contexts where the nature of the noise could be different. (5) Computational Costs: Finally, our approach entails using LLMs, which can be computationally expensive and could thus pose a challenge when applied to large datasets or in resource-constrained environments. In summary, future research could focus on overcoming these limitations and exploring the adaptability of our proposed framework to other noise types, higher noise ratios, more complex noise patterns, as well as different task domains.

Ethics Statement
This research adheres to the principles of ethical conduct in artificial intelligence research. The development and utilization of our work are carried out with full recognition of the potential implications. While our method improves the performance of Pretrained Language Models (PLMs) with noisy labels in text classification tasks, we acknowledge that it could potentially be used in contexts that may involve ethical concerns, such as disinformation or manipulation of opinion on social platforms. Our experiments are performed on five publicly available datasets, namely 20Ng, AGNews, SemEval, TREC, and Hausa. The datasets are already anonymized and do not contain any personally identifiable information. Moreover, we have complied with all guidelines and standards for the use of these datasets. Furthermore, the use of Large Language Models (LLMs) in our work is subject to the terms and conditions put forth by the creators and hosts of these models. We only employ them for the purpose of enhancing the fine-tuning performance of PLMs and do not exploit them for any inappropriate or unauthorized purposes. While our work focuses on improving the robustness of PLMs to label noise in data, we recognize the broader societal concerns related to the potential misuse of our framework. We advocate for the responsible use of our research findings and encourage continued discourse around the ethical use of machine learning and artificial intelligence technologies in real-world applications. Finally, it is crucial to note that the development of technologies and methodologies in machine learning should be done in conjunction with ongoing considerations about their ethical implications, potential misuse, and societal impacts. It is our responsibility to foster an environment of ethical vigilance in AI research and development.
2. Hard Clean (HC) Set. In this (ideal) category, the labels predicted by LLMs deviate from the assigned labels, while the true labels are in accordance with the assigned labels. In this case, we know that these samples are also clean. However, due to the semantic difficulty of these samples, the LLMs cannot easily predict their correct labels. Nevertheless, since the assigned labels match the true labels, we can use the assigned labels for model training on these samples. It is noteworthy that in practice, it is infeasible to achieve a perfect separation of this subset. Therefore, in our framework, we propose to leverage LLM-generated and PLM-generated confidences to improve the separation performance.
3. True Noisy (TN) Set. In this (ideal) category, the labels predicted by LLMs and the true labels both differ from the assigned labels.
In other words, these samples are all noisy, as their assigned labels differ from the true labels. Since they are noisy, the labels predicted by LLMs are also likely to differ from the assigned labels. As these samples are noisy, we cannot directly use their assigned labels for training. However, since we have narrowed the potential range of noisy samples down to this category, we can resort to specific techniques to learn from them. In our framework, we utilize LLM-generated confidences as additional supervision to learn effectively from these noisy samples.
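The separation into the three subsets and the use of LLM confidences on the True Noisy set can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the averaged-confidence threshold `tau` and the KL-divergence loss are our own illustrative choices, and `llm_conf`/`plm_conf` are assumed to be each sample's confidence in its assigned label.

```python
import numpy as np

def separate_samples(assigned, llm_preds, llm_conf, plm_conf, tau=0.5):
    """Split sample indices into Easy Clean (E), Hard Clean (H), and
    True Noisy (N) sets. Samples where the LLM prediction agrees with
    the assigned label go to E; for the rest, an averaged confidence
    score decides between H and N (tau is a hypothetical threshold)."""
    assigned = np.asarray(assigned)
    llm_preds = np.asarray(llm_preds)
    agree = llm_preds == assigned                       # LLM agrees with assigned label
    score = (np.asarray(llm_conf) + np.asarray(plm_conf)) / 2.0
    easy_clean = np.where(agree)[0]
    hard_clean = np.where(~agree & (score >= tau))[0]   # likely clean despite disagreement
    true_noisy = np.where(~agree & (score < tau))[0]    # likely mislabeled
    return easy_clean, hard_clean, true_noisy

def noisy_set_loss(plm_probs, llm_probs, eps=1e-12):
    """Loss on the True Noisy set: supervise the PLM with LLM-generated
    class confidences as soft targets (here via KL divergence)."""
    p = np.asarray(plm_probs, dtype=float)
    q = np.asarray(llm_probs, dtype=float)
    return float(np.mean(np.sum(q * (np.log(q + eps) - np.log(p + eps)), axis=-1)))
```

In this sketch, samples in E and H would be trained with their assigned labels, while samples in N are trained only against the LLM-generated soft targets.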

B Datasets
In this section, we introduce the details of the datasets used in our experiments. In particular, 20Ng (Lang, 1995) and AGNews (Li and Roth, 2002; Zhang et al., 2015) are news topic classification datasets that are widely used for text classification tasks. We manually inject different types of noise (SN, AN, and IDN) into these two datasets for evaluation. For real-world datasets, we utilize SemEval (Zhou et al., 2020), TREC (Awasthi et al., 2020), and Hausa (Hedderich et al., 2020). Specifically, SemEval is a relation extraction dataset, and we follow the process introduced in (Zhou et al., 2020) to obtain noisy labels. TREC is a question classification dataset from the weak supervision benchmark WRENCH (Zhang et al., 2021b). Moreover, Hausa is a text classification dataset in the Hausa language, the second most spoken indigenous language in Africa, with 40 million native speakers. For this dataset, gazetteers are used for automatic labeling, which results in feature-dependent label noise. We provide the detailed statistics of these datasets in Table 6.

C Baselines
In this section, we provide the details of the baselines used in our experiments.
• Base (Devlin et al., 2018) is the BERT base model fine-tuned with the standard cross-entropy loss. Note that for the Hausa dataset, we utilize the multilingual BERT model.
• Mixup (Zhang et al., 2018) is a semi-supervised approach that performs a linear interpolation between clean and noisy samples.
• GCE (Zhang and Sabuncu, 2018) denotes Generalized Cross-Entropy loss, which can be regarded as a general loss that combines mean absolute error (MAE) and cross-entropy (CE).
• Co-teaching (Han et al., 2018) trains two different models and selects small-loss samples to feed them into each other for optimization.
• Co-teaching+ (Yu et al., 2019) is an updated version of Co-teaching that explicitly maintains the disagreement between the two models.
• JoCoR (Wei et al., 2020) trains two models and selects samples based on the sum of loss from the two models.
• CR (Zhou and Chen, 2021) also trains multiple models with a regularization strategy based on a soft target.
• NPC (Bae et al., 2022) utilizes a generative model to estimate the transition matrix from noisy predictions to the ground-truth labels of samples and uses it to correct noisy labels.
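As a concrete reference for the loss-based baselines above, the GCE loss of Zhang and Sabuncu (2018) can be sketched in a few lines. This is our own minimal NumPy illustration, not the authors' code; `q=0.7` follows the value commonly reported in the original paper.

```python
import numpy as np

def gce_loss(probs, labels, q=0.7):
    """Generalized Cross-Entropy: L_q = (1 - p_y^q) / q, where p_y is the
    predicted probability of the labeled class. As q -> 0 this recovers
    cross-entropy; q = 1 is equivalent to mean absolute error (up to a
    constant), which is more robust to label noise."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    p_y = probs[np.arange(len(labels)), labels]   # probability of each assigned label
    return float(np.mean((1.0 - p_y ** q) / q))
```

The single hyperparameter q interpolates between the noise-robust MAE regime and the fast-converging CE regime, which is why GCE serves as a standard robust-loss baseline.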

Figure 1: The average confidence of clean and noisy samples during fine-tuning BERT on datasets 20Ng and AGNews with 20% noisy labels.

Figure 2: The detailed process of our framework LAFT. We perform two steps of separation to divide all training samples into three subsets with different degrees of label noise: Easy Clean Set E, Hard Clean Set H, and True Noisy Set N. We further propose three different losses to effectively learn from them: L_E, L_H, and L_N.

Figure 4: The results of our framework and the best baseline CR on 20Ng with different noise ratios for three types of noise: SN, AN, and IDN.

Table 1: The overall performance of various models on synthetic noisy datasets 20Ng and AGNews with Symmetric Noise (SN), where accuracy and standard deviation are reported in %, and the best results are in bold.

Table 2: The overall performance on 20Ng and AGNews with Asymmetric Noise (AN).

Table 3: The overall performance on 20Ng and AGNews with Instance-Dependent Noise (IDN).

Table 5: The LLM-generated prediction accuracy on the training sets of 20Ng and AGNews under ideal separation and real separation (shown in gray) during fine-tuning.

Table 6: The detailed statistics of the five datasets used in our experiments.