Instance-adaptive training with noise-robust losses against noisy labels

In order to alleviate the huge demand for annotated datasets across tasks, many recent natural language processing datasets have adopted automated pipelines for fast-tracking usable data. However, model training with such datasets poses a challenge because popular optimization objectives are not robust to the label noise induced by the annotation generation process. Several noise-robust losses have been proposed and evaluated on computer vision tasks, but they generally use a single dataset-wise hyperparameter to control the strength of noise resistance. This work proposes novel instance-adaptive training frameworks that turn the single dataset-wise noise-resistance hyperparameter of such losses into an instance-wise one. These instance-wise hyperparameters are predicted by instance-level label quality predictors, which are trained along with the main classification models. Experiments on noisy and corrupted NLP datasets show that the proposed instance-adaptive training frameworks increase the noise-robustness provided by such losses, promoting the use of the frameworks and associated losses in NLP models trained with noisy data.


Introduction
The wide availability of neural network models has allowed the development of novel and complex natural language processing tasks, many of which are in low-resource settings. With new definitions of tasks come the challenges of constructing new datasets, which is still an expensive and time-intensive endeavor. Many researchers have resorted to constructing datasets using completely automated pipelines (e.g. Lan et al., 2017; Joshi et al., 2017; Paul et al., 2019; Lange et al., 2019; Sousa et al., 2019). However, silver labels collected this way are still quite noisy compared to expert annotation. Because such methods have been gaining popularity and practicality, it is important to explore ways to ensure good performance in spite of noisy labels in training data. The widely used cross entropy (CE) loss, the standard optimization objective in classification tasks, has been shown to overfit to label noise (Ghosh et al., 2017). Several noise-robust losses have been designed for training models with noisy labels (Reed et al., 2015; Zhang and Sabuncu, 2018; Wang et al., 2019c); they are a convenient way to address the noisy label issue and have been shown to be more robust than CE. Experiments are usually conducted on computer vision datasets such as CIFAR (Krizhevsky, 2009) and MNIST (LeCun et al., 1998).
These noise-robust losses usually have hyperparameters for determining the strength of the noise-robustness at the dataset level. However, individual training instances may have different amounts of noise, derived from biases within the models used in the automated pipeline. Moreover, noisy labels in natural language datasets potentially pose a greater challenge because instances of the same true label may not share similar surface features. Therefore, this work focuses on improving training with noisy labels using noise-robust losses in NLP. We propose two robust training frameworks in which the noise-robustness hyperparameters are instance-specific. They are predicted by label quality predictors, which are trained either jointly or iteratively with the main models in order to take advantage of any correlation between label quality and input features. These frameworks are tested with many noise-robust losses on several noisy and corrupted NLP datasets. Results from the experiments show that:

1. Instance-adaptive noise-robust training proposed in this work enhances the noise-robustness of the losses on noisy and corrupted datasets, resulting in large performance gains when instance-specific noise-resistance hyperparameters are used.
2. Noise-robust losses are an effective way to combat noise in silver-standard NLP datasets, especially when the noise rate is high. The ER-GCE loss proposed in this work achieves the best performance on all datasets compared to the noise-robust losses from previous work.

Noise-robust losses
We first define a dataset $D$ for single-label classification as a set of input features and corresponding labels $\{x_i, y_i\}_{i=1}^{N}$, where $y_i \in \{1, \ldots, K\}$ is the annotated label and $\mathbf{y}_i \in \{0,1\}^K$ is its one-hot representation, with $K$ total possible classes for training instance $i$. Given a classification model $f$ with trainable parameters $\theta$, the predicted conditional distribution over the classes is $\mathbf{d}_i = f(x_i)$. Training the model $f$ amounts to finding the set of parameters $\theta^*$ which minimizes the empirical risk $\theta^* = \arg\min_{\theta} \sum_{i=1}^{N} L(f(x_i), \mathbf{y}_i)$, with $L$ being a loss function which takes the model output and the annotated label and returns a non-negative value. CE is the commonly used loss for classification, defined as the negative log-likelihood of the annotated class: $\mathrm{CE}(\mathbf{y}, \mathbf{d}) = -\mathbf{y}^{\top} \log \mathbf{d}$.
Theoretical results (Du Plessis et al., 2014; Ghosh et al., 2017) have shown that losses which satisfy
$$\sum_{k=1}^{K} L(f(x), \bar{\mathbf{y}}_k) = C, \quad \forall x \in D, \ \forall f, \qquad (1)$$
with $C$ being some constant and $\bar{\mathbf{y}}_k$ being a one-hot representation of a label at class $k$, are robust against symmetric and label-dependent noise with noise rate $\eta < \frac{K-1}{K}$, where $\eta$ is the probability that the annotated label $y$ is not the true label $\hat{y}$. Losses which cannot satisfy this condition, i.e. for which the sum of loss values with respect to all classes is not constant, are still more noise-robust if this sum is bounded rather than unbounded. Examples of each case: CE is unbounded, mean squared error is bounded, and mean absolute error (MAE) has a constant sum.
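The symmetry condition above can be checked numerically. The sketch below (a minimal illustration with generic NumPy implementations of CE and MAE over a probability vector, not code from this work) verifies that the per-class sum of MAE is the constant $2(K-1)$ for any prediction, while the CE sum varies with the prediction:

```python
import numpy as np

def ce(d, k):
    # cross entropy against the one-hot label at class k
    return -np.log(d[k])

def mae(d, k):
    # mean absolute error (L1 distance) against the one-hot label at class k
    y = np.zeros_like(d)
    y[k] = 1.0
    return np.abs(d - y).sum()

K = 4
rng = np.random.default_rng(0)
for _ in range(3):
    logits = rng.normal(size=K)
    d = np.exp(logits) / np.exp(logits).sum()  # a random softmax prediction
    mae_sum = sum(mae(d, k) for k in range(K))
    ce_sum = sum(ce(d, k) for k in range(K))
    # MAE satisfies the condition in Eqn 1 with C = 2(K - 1); CE does not.
    assert np.isclose(mae_sum, 2 * (K - 1))
    print(f"sum_k MAE = {mae_sum:.4f}, sum_k CE = {ce_sum:.4f}")
```

The constant follows because each per-class MAE equals $2(1 - d_k)$, so summing over classes gives $2K - 2\sum_k d_k = 2(K-1)$.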

Overview of noise-robust losses
Many noise-robust losses have been proposed and evaluated, mostly on vision datasets. The noise-robust losses examined in this work include the soft and hard variants of the bootstrapping loss (BSL, Reed et al., 2015), generalized cross entropy (GCE, Zhang and Sabuncu, 2018), symmetric cross entropy (SCE, Wang et al., 2019c), a new loss — entropy-regularized generalized cross entropy (ER-GCE) — and two baselines based on simple modifications of CE: weighted cross entropy (WCE) and label smoothing (LS, Szegedy et al., 2016). These noise-robust losses are formulated below with a hyperparameter $\beta$ which is negatively correlated with the noise-robustness. When $\beta$ approaches 1, they become the least noise-robust but converge fast. When $\beta$ approaches 0, they become the most noise-robust, but may underfit the training data (Wang et al., 2019c).

Weighted cross entropy (WCE): One simple way to use CE to combat noise is to weight different training instances according to their quality:
$$\mathrm{WCE}(\mathbf{y}, \mathbf{d}) = \beta\,\mathrm{CE}(\mathbf{y}, \mathbf{d}),$$
where $\beta$ is a noise-robustness hyperparameter. With a dataset-specific $\beta$, WCE is equivalent to CE and provides no noise-robustness; noise-robustness may be achieved when each training instance $x_i$ gets its own $\beta_i$, as described in Section 3.

Label smoothing (LS): Another simple way to use CE to combat noise is to convert the one-hot targets into soft targets:
$$\tilde{\mathbf{y}} = \beta\,\mathbf{y} + (1-\beta)\,\mathbf{u},$$
where $\mathbf{u}$ is the uniform distribution over the $K$ classes and $\beta$ controls how smooth a target is.
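The soft-target construction for label smoothing can be made concrete with a short sketch (assuming the standard blend of the one-hot target with the uniform distribution; the exact parameterization is an assumption consistent with $\beta \to 1$ recovering CE):

```python
import numpy as np

def smooth_targets(y_onehot, beta):
    # Blend the one-hot target with the uniform distribution over K classes;
    # beta -> 1 recovers the plain one-hot target (ordinary CE training).
    K = y_onehot.shape[0]
    return beta * y_onehot + (1.0 - beta) * np.full(K, 1.0 / K)

y = np.array([0.0, 1.0, 0.0, 0.0])
t = smooth_targets(y, beta=0.8)
print(t)  # [0.05 0.85 0.05 0.05]
assert np.isclose(t.sum(), 1.0)  # still a valid probability distribution
```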
Bootstrapping loss (BSL): BSL combines two components: the distance to the noisy training target, measured by CE, and the model's confidence in its predictions, measured by the entropy of the model prediction $H(\mathbf{d})$. The soft BSL is the weighted sum of both terms:
$$\mathrm{BSL}_s(\mathbf{y}, \mathbf{d}) = \beta\,\mathrm{CE}(\mathbf{y}, \mathbf{d}) + (1-\beta)\,H(\mathbf{d}).$$
For the hard BSL, the entropy function is replaced by a max:
$$\mathrm{BSL}_h(\mathbf{y}, \mathbf{d}) = \beta\,\mathrm{CE}(\mathbf{y}, \mathbf{d}) - (1-\beta)\,\log \max_k d_k.$$
It has been shown empirically (Reed et al., 2015) that BSL is noise-robust.

Generalized cross entropy (GCE): GCE is the negative Box-Cox transformation (Box and Cox, 1964) of the predicted probability $d_y$ of the annotated class:
$$\mathrm{GCE}(\mathbf{y}, \mathbf{d}) = \frac{1 - d_y^{\,1-\beta}}{1-\beta}.$$
GCE is equivalent to MAE when $\beta = 0$, and to CE when $\beta$ approaches 1. GCE is therefore a generalization of CE and MAE, with the sum of loss values with respect to all classes in Eqn 1 bounded. This makes it more noise-robust than unbounded losses like CE.

Symmetric cross entropy (SCE): SCE is the weighted sum of CE and reverse cross entropy (RCE):
$$\mathrm{SCE}(\mathbf{y}, \mathbf{d}) = \beta\,\mathrm{CE}(\mathbf{y}, \mathbf{d}) + (1-\beta)\,\mathrm{RCE}(\mathbf{y}, \mathbf{d}), \qquad \mathrm{RCE}(\mathbf{y}, \mathbf{d}) = -\mathbf{d}^{\top} \log \mathbf{y},$$
with $\log 0$ defined to be a negative constant $A$. RCE reduces to MAE when $A = -2$, and has been shown robust to label noise (Wang et al., 2019c). Similar to BSL, SCE includes a noise-robust part (RCE) and a non-noise-robust part (CE).1

Entropy-regularized GCE (ER-GCE): The noise-robustness of GCE can be further improved by interpolating it with an entropy regularizer. Because both GCE and the entropy are bounded, the sum of the two results in a noise-robust loss with tighter bounds than GCE by itself. ER-GCE is defined as
$$\mathrm{ER\text{-}GCE}(\mathbf{y}, \mathbf{d}) = \mathrm{GCE}(\mathbf{y}, \mathbf{d}) + (1-\beta)\,H(\mathbf{d}).$$
Here $\beta$ controls both the importance of CE in the GCE and the weight of the entropy term. When $\beta$ approaches 1, ER-GCE is still equivalent to CE; when $\beta = 0$, ER-GCE is equivalent to MAE regularized by the entropy of the predicted label distribution. When $\beta$ satisfies a certain condition, we can show that the bounds of ER-GCE are tighter than those of GCE, indicating that ER-GCE is theoretically more robust than GCE. Proofs regarding the noise-robust properties of ER-GCE can be found in the appendix.

The noise-resistance hyperparameter $\beta$ in the noise-robust losses listed above controls how much noise resistance the loss function may provide, and it is a single real number tuned and kept fixed for each dataset. However, different training instances may have labels of varying quality, and we propose that noise resistance should be assessed and applied at the instance level, as explained below.

1 One recent noise-robust loss derived from SCE is the normalized cross entropy with reverse cross entropy (NCE-RCE, Ma et al., 2020). Although similar in surface form to the other losses, both parts of NCE-RCE are noise-robust, and its $\beta$ mostly controls the importance of the active loss NCE, which leads to a much larger range than the losses mentioned here and makes it harder to tune. More discussion can be found in the appendix.
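The limiting behavior of GCE and ER-GCE described above can be checked numerically. The sketch below assumes the parameterization $\mathrm{GCE} = (1 - d_y^{1-\beta})/(1-\beta)$, so that $\beta = 0$ gives an MAE-like loss and $\beta \to 1$ recovers CE, and $\mathrm{ER\text{-}GCE} = \mathrm{GCE} + (1-\beta)H(\mathbf{d})$; both forms are reconstructions consistent with the text, not verbatim from the paper:

```python
import numpy as np

def gce(d, k, beta):
    # (1 - d_k^(1-beta)) / (1-beta); undefined at beta == 1, where CE is the limit
    q = 1.0 - beta
    return (1.0 - d[k] ** q) / q

def entropy(d):
    return -(d * np.log(d)).sum()

def er_gce(d, k, beta):
    # GCE plus an entropy regularizer whose weight vanishes as beta -> 1
    return gce(d, k, beta) + (1.0 - beta) * entropy(d)

d = np.array([0.7, 0.2, 0.1])
# beta -> 1: GCE and ER-GCE both approach CE = -log d_k.
assert np.isclose(gce(d, 0, 0.999), -np.log(d[0]), atol=1e-2)
assert np.isclose(er_gce(d, 0, 0.999), -np.log(d[0]), atol=1e-2)
# beta = 0: GCE equals 1 - d_k, i.e. half the L1 distance to the one-hot label.
assert np.isclose(gce(d, 0, 0.0), 1.0 - d[0])
```

The $\beta \to 1$ limit follows from $d^q \approx 1 + q \log d$ for small $q = 1 - \beta$.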
Instance-adaptive noise-robust training frameworks

When supervised models are used in pipelines to generate silver labels, the resulting machine-annotated dataset reflects biases and inaccuracies learned by such models, which may be caused by spurious relations between instance-level features, such as words, phrases and syntactic constructions, and labels in the datasets on which these models were trained. This in turn makes some instances more likely to receive noisy silver labels than others. Ideally, each training instance should have its own noise-robustness value $\beta$, which is certainly hard to tune manually.
We propose that each instance should be assigned a different $\beta$, with its value calculated by a label quality function $Q: (x_i, \mathbf{y}_i) \mapsto \beta_i$, $\beta_i \in (0, 1)$.2 Motivated by the intuition that label errors in automated pipelines are correlated with difficult input features or learned biases, we model the function $Q$ with a neural network, which is expected to capture the complex relationship between inputs and the quality of silver labels. We propose two instance-adaptive frameworks for training classification models with noise-robust losses with instance-specific $\beta$: the first jointly trains the main model and the label quality predictor (Section 3.1), and the second takes additional supervision of label quality and iteratively trains the main model and the quality predictor (Section 3.2), as shown in Figures 1b and 1c.

Joint instance-adaptive training
Algorithm 1 as well as Figure 1b describe the joint instance-adaptive training method. We consider a classification model to have two main components: an input encoder $E$ to encode input tokens $x$ into vectors $H$, and a classifier $C$ which makes a label prediction based on the encoded input. For example, $E$ may be a neural network with an embedding layer and a multi-layered BiLSTM, and $C$ may be a neural network with an average pooling layer and a feedforward layer.

[Figure 1: (a) Common training for classification with CE. (b) The joint instance-adaptive noise-robust training. (c) The iterative instance-adaptive noise-robust training. Differently colored dashed shapes indicate the models being updated in the two steps of the iterative process.]

Given the encoded input $H$ and the annotated label $\mathbf{y}$, the label quality predictor $Q$ learns to compare them and predicts a $\beta$ value. Although $Q$ can take many complex forms depending on prior knowledge about the relation between the inputs and the labels, two simple variants are explored in this work. The feedforward $Q$ is a generic model for abstract labels such as the binary labels in paraphrase detection:
$$\beta_i = \sigma\big(\max_m f_Q([h_m; E_Q\,\mathbf{y}_i])\big),$$
where $h_m$ is the $m$-th row of $H$ and the encoding of the $m$-th input token, and $E_Q$ is the embedding matrix of labels. $f_Q$ is a neural network with feedforward layers, and $\sigma$ is the sigmoid function. Intuitively, the quality predictor looks for features in the input that have the highest correlation with label quality. For tasks where labels and inputs share a direct semantic relationship, such as relation extraction, the similarity-based quality predictor may be used, which scores the similarity between the encoded input tokens and the embedding of the annotated label.

Noise-robust losses require $\beta$ to be set above a threshold for a good balance of robustness and fast convergence, as discussed in Section 2.1. Therefore, the final $\beta$ value is lower-bounded by $\beta_\mu$:
$$\beta_i^{\mathrm{final}} = \beta_\mu + (1 - \beta_\mu)\,\beta_i.$$
The common CE training scheme can be recovered when $\beta_\mu$ approaches $\beta_{\mathrm{upper}} = 1$ for the losses described in this work. A lower $\beta_\mu$ indicates higher robustness and slower convergence.
Finally, because randomly initialized models are not reliable in providing meaningful β values, the joint training framework takes advantage of a warming-up period by setting β to be 1 for a number of epochs before joint training of the quality predictor and the classification models.
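The lower-bounding of $\beta$ and the warm-up schedule can be sketched as follows (a minimal illustration; the quality predictor is stood in for by a raw logit, and `beta_for_epoch` is a hypothetical helper, not code from this work):

```python
import numpy as np

def instance_beta(q_logit, beta_mu):
    # Squash the quality predictor's raw score into (0, 1) with a sigmoid,
    # then map it into (beta_mu, 1) so beta never drops below beta_mu.
    b = 1.0 / (1.0 + np.exp(-q_logit))
    return beta_mu + (1.0 - beta_mu) * b

def beta_for_epoch(epoch, warmup_epochs, q_logit, beta_mu):
    # During warm-up, beta is pinned to 1, which recovers plain CE training.
    if epoch < warmup_epochs:
        return 1.0
    return instance_beta(q_logit, beta_mu)

# A freshly initialized predictor outputs logits near 0, so training after
# warm-up starts near the midpoint beta_mu + (1 - beta_mu) / 2.
assert np.isclose(instance_beta(0.0, 0.4), 0.4 + 0.6 / 2)
assert beta_for_epoch(0, warmup_epochs=5, q_logit=2.0, beta_mu=0.4) == 1.0
```

This midpoint is the same initial value referred to in the model analysis section.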

Iterative instance-adaptive training
The iterative training framework utilizes an auxiliary dataset $A = \{x_j, y_j, y_j^{A}\}_{j=1}^{J}$ to provide supervision to the quality prediction model, where $y_j^{A} \in \{0, 1\}$ and $x_j$ and $y_j$ are from $D$. Instead of correcting the original annotation, which can be expensive, only manual annotation of the correctness of a label is needed: if the original label is incorrect, the auxiliary label for the instance is 0, and otherwise 1. This supervision of data quality may help the quality predictor better capture the relationship between the input, the original label and the noise level of the instance.
Algorithm 2 and Figure 1c show how the iterative training framework is executed. E and Q are first trained by using training instances sampled from the auxiliary dataset. In the training phase of E and C, the β values from Q are used for computing the losses, but Q is not updated.
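The alternation can be sketched schematically as below (hypothetical stand-in steps that only record the schedule; real training would update the actual modules $E$, $Q$ and $C$):

```python
def iterative_train(rounds, aux_batches, main_batches):
    # Alternate between (i) updating encoder E and quality predictor Q on
    # auxiliary binary quality labels, and (ii) updating E and classifier C
    # with the noise-robust loss, using Q's beta values but keeping Q frozen.
    log = []
    for r in range(rounds):
        for batch in aux_batches:
            log.append(("update E+Q", r, batch))
        for batch in main_batches:
            log.append(("update E+C (Q frozen)", r, batch))
    return log

schedule = iterative_train(2, ["aux_0"], ["main_0", "main_1"])
assert schedule[0] == ("update E+Q", 0, "aux_0")
assert len(schedule) == 6  # (1 aux + 2 main) batches x 2 rounds
```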

Datasets and models
Two sets of experiments with the noise-robust losses and the adaptive training frameworks are conducted to show the effectiveness of the frameworks against label noise in NLP datasets. The first set of experiments is conducted on two real noisy datasets generated by automated pipelines: a user attribute extraction dataset, Getting to Know You (GTKY), and the English Conversational Semantic Role Labeling dataset (eCSRL). The GTKY dataset was created by automatically adding user attribute annotation to the PersonaChat dataset. The eCSRL dataset was created by first automatically translating the hand-annotated Chinese CSRL dataset to English and then aligning words and annotation (Daza and Frank, 2020) from the CSRL dataset with multilingual BERT (Devlin et al., 2019). The test sets of both datasets used in the experiments, which are subsets of the original noisy test sets, have been manually corrected by annotators.
The models used for evaluation on the noisy datasets are the user attribute extractor for GTKY and the biaffine semantic role labeler (Cai et al., 2018) for eCSRL. The user attribute extractor has three modules: a context encoder, a predicate classifier and an entity generator. The context encoder is a BiGRU encoder, and the predicate classifier is a multi-hop memory network (Sukhbaatar et al., 2015b) which uses all possible predicates to query the encoded input and predicts which predicates appear in the input. The entity generator is a GRU decoder with the copy mechanism, where the predicate and the encoded input tokens are used as input to generate the arguments of the predicate. For example, the sentence now I live in Florida for long. has a predicate live_in, and the entities of this predicate are I and Florida. The reported score is the average F1 of predicate prediction and entity prediction.

The semantic role labeler (Cai et al., 2018) for eCSRL has a BiLSTM encoder and a biaffine scorer. The encoder first encodes the input tokens, such as a dialogue consisting of several sentences, into representations of argument candidates and predicates. The biaffine scorer compares the representation of a predicate, usually a verb in the dialogue, with representations of argument candidates through a biaffine and a feedforward layer, computing the scores of the argument candidates having an argument label. For example, for the sentence above and the predicate live, I has the arg0 label, but long has no label. The reported score is argument token F1.

Auxiliary datasets are created for the noisy datasets with a budget of 12 person-hours, with human annotators labeling a small portion of the training instances as 0 (wrong label) or 1 (correct label), which is much easier than correcting the noisy labels. The second set of experiments is conducted on a clean dataset with corrupted labels from the GLUE benchmark (Wang et al., 2019a).

Experiments
For all experiments, a model selection procedure similar to a common use case is adopted: a $\beta_\mu$ is first selected from 0.1 to 0.95 in increments of 0.05, and the performance of models trained with the chosen $\beta_\mu$ on the noisy development set is compared across 3 different random seeds. Performance of the model with the best development result is reported for the noisy datasets to simulate the common use case. Means and variances of the three models with the best-performing $\beta_\mu$ are reported for the clean datasets for a better understanding of model behavior. All other hyperparameters of the models, such as the learning rate and the batch size, are tuned with the CE loss and kept fixed.3 The experiment conditions include six different noise-robust losses (WCE, BSL_s, BSL_h, GCE, SCE and ER-GCE), along with LS and CE as baselines. $A$ for SCE is set to $-4$ following previous work (Wang et al., 2019c). They also include three training settings: Fix, using a fixed dataset-level $\beta$; Joint, using the joint noise-robust training for instance-adaptive $\beta$; and Iterative, using the iterative framework for noise-robust training with auxiliary data. The similarity-based quality predictor is used with models trained on the noisy datasets, and the feedforward one with models trained on the corrupted datasets, as explained in Section 3.1. The warm-up period for the joint training framework is set to 5 epochs.

Noisy datasets
Results of the noisy dataset experiments, shown in Table 2, confirm the effectiveness of the instance-adaptive training frameworks. Comparing the three training settings, the joint training framework outperforms fixed training with a dataset-level $\beta$, showing that instance-adaptive $\beta$ can help models become more noise-resistant. The iterative framework achieves the highest results, indicating that distant supervision of data quality can help models further combat noisy labels. In fact, with the help of a small auxiliary set and the iterative training framework, the mean performance gains reach 4.2% and 6.5% respectively compared to CE, and 3.5% and 1.6% compared to models trained with fixed $\beta$ values. Finally, comparison between noise-robust losses shows that models trained with ER-GCE are the most robust against label noise.

Clean datasets with corrupted labels
The relationship between noise rates, training frameworks and noise-robust losses is further explored on the clean SST-2 dataset, where the noise rate and the size of the auxiliary dataset can be easily manipulated. In these experiments, we randomly corrupt the original labels at a noise rate $r \in \{0.2, 0.3, 0.4\}$4 and construct an auxiliary dataset from 10% or 30% of the corrupted training set. The label corruption is done in two different ways: the uniform noisy datasets are created by randomly corrupting labels to reach a noise rate, and the model-based noisy datasets are created with a five-fold cross-corruption process: an ALBERT-based classifier is trained on four-fifths of the clean training set, and labels of instances in the held-out fifth in which the trained model has the lowest confidence are corrupted to reach a noise rate.5

Table 3 shows the model performances under various experiment conditions. First, there are significant differences between the best-performing models and the baseline models trained with CE, reaching 9.7%. Similarly, the performance difference between the Iterative models and the Fix models can reach 5.4%. These significant performance gains showcase again the value of the proposed frameworks for training with noisy datasets. Experiments with different noise-robust losses again indicate that ER-GCE is the most noise-robust among the losses explored. Second, the performance trend across training frameworks is similar to what has been shown on the noisy datasets: the joint training framework consistently outperforms the fixed training framework, and the iterative training framework provides a further boost in model performance over both.
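The uniform corruption procedure can be sketched as follows (`corrupt_uniform` is an illustrative reimplementation under stated assumptions, not the exact script used in the experiments):

```python
import random

def corrupt_uniform(labels, num_classes, rate, seed=0):
    # Flip a `rate` fraction of labels to a different class chosen uniformly
    # at random, so the resulting dataset has the requested noise rate.
    rng = random.Random(seed)
    labels = list(labels)
    for i in rng.sample(range(len(labels)), int(rate * len(labels))):
        labels[i] = rng.choice([c for c in range(num_classes) if c != labels[i]])
    return labels

clean = [0, 1] * 50
noisy = corrupt_uniform(clean, num_classes=2, rate=0.3)
flipped = sum(a != b for a, b in zip(clean, noisy))
assert flipped == 30  # with 2 classes, every sampled index changes class
```

For SST-2 (binary labels), excluding the original class guarantees every corrupted label is actually wrong, so the realized noise rate equals $r$ exactly.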

Model analysis
β values reflect instance quality: Figure 2a shows how different training frameworks influence the distribution of $\beta$ for the best-performing $\beta_\mu$ values on eCSRL, which are 0.9 for ER-GCE and 0.4 for SCE. The dashed line indicates the initial value at $\beta_\mu + (1-\beta_\mu)/2$. The final instance-specific $\beta$ values trained with the joint framework tend to concentrate around a value different from both the original $\beta_\mu$ and the initial value, showing the adaptive nature of the training framework. The small amount of auxiliary quality data is able to increase the variance of the individual $\beta$ values, indicating that the quality model has learned to assign different $\beta$ values to different training instances according to their label quality. This can be seen in Figure 2b, which shows means and standard deviations of predicted $\beta$ values for test instances of SST-2 when a portion of the test labels is also corrupted. The models are trained with ER-GCE with 30% model-based noise in the training set. Because the test instances are not seen in training, the predicted $\beta$ values represent the model's assessment of data quality. The $\beta$ values from models trained iteratively are much higher when no label is corrupted than when all labels are corrupted, indicating that the quality predictors are able to make generalizable judgments about data quality.
Finetuning benefits from noise-robust training: Finally, the BiLSTM encoder is replaced by a pretrained ALBERT (Lan et al., 2019) base model to evaluate the proposed methods in the finetuning paradigm. Table 4 shows the average accuracy values in various experiment conditions as well as the performance differences compared to the BiLSTM models in Table 3. Results show that the proposed methods also work with the popular finetuning paradigm, achieving better results in all experiment conditions and further weakening the harmful influence of noisy labels.

Related work
There have been many different approaches to addressing the noisy label problem. One approach relies on knowledge of clean labels (Xiao et al., 2015; Li et al., 2017; Lee et al., 2018), while another tries to estimate label-dependent (Natarajan et al., 2013; Patrini et al., 2017) or annotator-dependent (Khetan et al., 2018) noise distributions, many with neural network layers (Sukhbaatar et al., 2015a; Bekker and Goldberger, 2016; Goldberger and Ben-Reuven, 2017). Such methods have seen some application in natural language processing (Hedderich and Klakow, 2018; Lange et al., 2019; Wang et al., 2019b). Different training strategies have also been proposed to increase robustness (Huang et al., 2020), many of which require training auxiliary networks to reweight samples (Jiang et al., 2018; Han et al., 2018; Wang et al., 2019b). Complementary labels (Ishida et al., 2017) are also used for negative learning for robustness (Kim et al., 2019; Shu et al., 2019). Regularization techniques such as dropout (Srivastava et al., 2014; Li et al., 2020) also show positive results in combating noisy labels. Hu et al. (2020) proposed adding auxiliary variables into normal loss functions for regularization, which act as instance-specific priors over the predicted distributions to ease training difficulty when labels are noisy. Conceptually similar to the joint training with instance-adaptive $\beta$ proposed in this work, the regularization method of Hu et al. (2020) may be complementary to the noise-robust losses explored here, because such losses may enjoy further improvement when combined with trainable instance-specific priors.
Noise-robust losses are another way to counter label noise (Beigman and Beigman Klebanov, 2009). The noise-robustness of some losses, such as MAE, was shown theoretically by Ghosh et al. (2017). New noise-robust losses have also been proposed: some attach a passive component to CE for noise-robustness (Reed et al., 2015; Wang et al., 2019c), which was used by  for reading comprehension with noisy data, while others behave like a mixture of MAE and CE (Zhang and Sabuncu, 2018). Other methods, such as normalization of losses (Ma et al., 2020) and the use of determinant-based mutual information  as a finetuning loss, have also been shown to be robust to noise.

Conclusion
This work focuses on combating noisy labels in NLP datasets by means of adaptive training with noise-robust losses. Two novel instance-adaptive training frameworks are proposed and investigated along with several noise-robust losses, including a new ER-GCE loss. Experiments on different datasets show the effectiveness of the approach: the adaptive training frameworks help models achieve the best performance on noisy datasets, and ER-GCE shows the strongest noise-robustness among the losses examined.
and log 0 is defined to be a negative constant A. Because both component losses are noise-robust, β here tunes how much active learning enters the interpolation, which may be correlated with dataset complexity (Ma et al., 2020). The β values for NCE-RCE in the previous work (Ma et al., 2020) include {0.001, 0.01, 0.99, 0.999}. This indicates that the hyperparameter β serves a different function than in the other noise-robust losses examined in this paper, where it usually reflects how noisy an instance is. Preliminary experiments also support this observation: NCE-RCE tends to underfit and perform poorly with β values close to the true error rate.
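To make the interpolation concrete, the following is a minimal numeric sketch of an NCE-RCE-style loss under a single convex-combination weight β. This is an illustrative reading, not the exact parameterization of Ma et al. (2020), which uses separate scale factors for the two components; `A` stands in for log 0 as defined above:

```python
import numpy as np

def nce_rce(probs, label, beta, A=-4.0):
    """Convex combination of Normalized CE (active) and Reverse CE
    (passive): L = beta * NCE + (1 - beta) * RCE.
    `A` stands in for log 0 inside RCE, as in the text."""
    logp = np.log(probs)
    # normalized cross entropy: CE on the label over the sum of CE
    # terms across all classes
    nce = logp[label] / logp.sum()
    # reverse CE with a one-hot target: -sum_k p_k * log q_k, where
    # log q_k = A for k != label and 0 for k == label
    rce = -A * (1.0 - probs[label])
    return beta * nce + (1.0 - beta) * rce
```

With a confident correct prediction the loss is small, and it grows as probability mass moves away from the observed label; β shifts how much of that growth comes from the active (NCE) versus passive (RCE) term.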

B Lower bound of sum of losses with respect to all classes
Typically, loss functions penalize prediction distributions that are further from the gold labels at least as much as ones that are closer.

Proof. Suppose $\bar{d}$ is the vector representation of a uniform categorical distribution, where $\bar{d}_k = \frac{1}{K}$ for $k \in \{1, \dots, K\}$. If probability mass $\Delta d_{k_1} \ge 0$ is moved from $\bar{d}_{k_1}$ to $\bar{d}_{k_2}$ for $k_1, k_2 \in \{1, \dots, K\}$, $k_1 \neq k_2$, then because
$$\bar{d}_{k_2} + \Delta d_{k_1} \ge \bar{d}_{k_2} = \bar{d}_{k_1} \ge \bar{d}_{k_1} - \Delta d_{k_1},$$
according to Eqn 15 the change in the sum of losses with respect to all classes satisfies
$$\sum_{k \in \{1, \dots, K\}} \Delta L_k \ge 0.$$
This shows that any change to the uniform vector causes the sum to increase, thus proving that the lower bound is found at the uniform vector.
In the case of ER-GCE, when $\beta \in [0, 1)$, the sum of the GCE part of the loss with respect to all classes is bounded by:
$$\frac{K - K^{1-\beta}}{\beta} \le \sum_{k=1}^{K} \frac{1 - d_k^{\beta}}{\beta} \le \frac{K - 1}{\beta},$$
where $d_k$ is the $k$-th element of the prediction vector (Zhang and Sabuncu, 2018). The lower bound of GCE is attained when $d$ is a uniform categorical distribution, and the upper bound when $d$ is a one-hot categorical distribution. However, the entropy part of ER-GCE, summed with respect to all classes, behaves in the opposite direction: it has its lower bound of 0 when $d$ is one-hot, and its upper bound of $K \log K$ when $d$ is uniform. Therefore, for the ER-GCE loss, where the GCE part and the entropy part are summed, $\beta$ needs to satisfy the condition that the sum of losses with regard to all classes for ER-GCE at uniform (the lower bound) stays below the sum at one-hot (the upper bound) for good learning behavior.
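The GCE bounds can be verified numerically. The sketch below is a sanity check (not the paper's code), using GCE's hyperparameter as the exponent, the role β plays in the text:

```python
import numpy as np

def gce(p_y, q):
    # Generalized cross entropy term (Zhang and Sabuncu, 2018)
    return (1.0 - p_y ** q) / q

def gce_sum(probs, q):
    # Sum of the GCE loss with respect to all K possible labels
    return sum(gce(p, q) for p in probs)

K, q = 4, 0.7
uniform = np.full(K, 1.0 / K)
one_hot = np.eye(K)[0]
# closed-form bounds quoted in the text
lower = (K - K ** (1 - q)) / q   # attained at the uniform distribution
upper = (K - 1) / q              # attained at a one-hot distribution
```

Evaluating `gce_sum` at the uniform and one-hot distributions recovers the two closed forms exactly, confirming where each bound is attained.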

C Theorem 1: Noise-robustness of ER-GCE under uniform noise
Under uniform noise with rate $\eta \le 1 - \frac{1}{K}$, which is the probability of the true label $\hat{y}$ being corrupted to a different observed label $y$, the empirical risk of a model $f$ on a dataset $D$ is
$$R_{L_\beta}(f) = \mathbb{E}_{x, \hat{y}}\left[L_\beta(f(x), \hat{y})\right].$$
When the noise is uniform with noise rate $\eta$, where $\eta_{jk} = 1 - \eta$ for $j = k$ and $\eta_{jk} = \frac{\eta}{K - 1}$ for $j \neq k$, we have:
$$R^{\eta}_{L_\beta}(f) = \left(1 - \frac{\eta K}{K - 1}\right) R_{L_\beta}(f) + \frac{\eta}{K - 1}\, \mathbb{E}_{x}\left[\sum_{k=1}^{K} L_\beta(f(x), k)\right].$$
Let the bounds of ER-GCE in Eqn 18 be $[\varphi, \psi]$; the bounds of the risk with noise can then be written as:
$$\left(1 - \frac{\eta K}{K - 1}\right) R_{L_\beta}(f) + \frac{\eta \varphi}{K - 1} \le R^{\eta}_{L_\beta}(f) \le \left(1 - \frac{\eta K}{K - 1}\right) R_{L_\beta}(f) + \frac{\eta \psi}{K - 1}.$$
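The uniform-noise risk decomposition can also be checked numerically on arbitrary per-class loss values; this is an illustrative sketch, with the loss matrix and labels as placeholders:

```python
import numpy as np

def noisy_risk_identity(losses, true_labels, eta):
    """Check that the expected loss under uniform label noise equals
    (1 - eta*K/(K-1)) * clean risk + eta/(K-1) * E[sum_k L(f(x), k)].
    `losses` is an (N, K) matrix of per-class values L(f(x_i), k)."""
    N, K = losses.shape
    on_label = losses[np.arange(N), true_labels]
    clean_risk = on_label.mean()
    total = losses.sum(axis=1)
    # left side: expectation under eta_jk = 1 - eta (j == k),
    # eta / (K - 1) (j != k)
    lhs = ((1 - eta) * on_label
           + (eta / (K - 1)) * (total - on_label)).mean()
    # right side: the decomposition quoted in the text
    rhs = (1 - eta * K / (K - 1)) * clean_risk + eta / (K - 1) * total.mean()
    return lhs, rhs
```

The two sides agree to floating-point precision for any loss matrix, which is what makes the $[\varphi, \psi]$ bounds on the class-wise sum translate directly into bounds on the noisy risk.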
Let $f^*$ be the global minimizer of the risk $R_{L_\beta}(f)$ and $\hat{f}$ be the global minimizer of the risk $R^{\eta}_{L_\beta}(f)$. When $R^{\eta}_{L_\beta}(f^*) - R^{\eta}_{L_\beta}(\hat{f}) = 0$, the loss $L$ at $\beta$ is completely noise-robust, meaning the optimal model trained with noisy or clean data has no difference in risk. For $\hat{f}$ and the ER-GCE loss,
$$R^{\eta}_{L_\beta}(f^*) - R^{\eta}_{L_\beta}(\hat{f}) \le \frac{\eta(\psi - \varphi)}{K - 1} = A.$$
As $\beta$ decreases and $\psi$ approaches $\varphi$, $A$ approaches 0, making the loss more tolerant to noise.
Proof. We can compare the lower and upper bounds of ER-GCE and GCE. Since $K > 1$, $K^{\beta} > 1$; therefore $1 - K^{\beta} < 0$ and $-(1 - \beta) K \log K < 0$. Since the range difference between the bounds is negative, ER-GCE has a smaller range, i.e. tighter bounds, than GCE for a given $\beta$.

F Model structures and hyperparameters F.1 GTKY
The model proposed by  serves as our baseline model for the GTKY dataset trained with different noise-robust losses. There are three modules in the model: a context encoder that consumes the given word sequence $w_1, w_2, \dots, w_N$, a relation classifier that predicts each associated relation (e.g. $r$), and an entity generator that generates the subject $s$ and object $o$ strings for a given relation $r$.

Context Encoder
The context encoder encodes input tokens with embeddings, which are then consumed by a bi-directional GRU (Cho et al., 2014).

Relation Classifier This classifier is an $I$-hop ($I$ is set to 3 following ) end-to-end memory network (Sukhbaatar et al., 2015b), which takes the hidden states from the context encoder as its input queries. The memory $M^i \in \mathbb{R}^{K \times D}$ at each hop $i$ is a trainable parameter matrix containing the representations for all candidate relations, where $K$ and $D$ denote the number of candidate relations and the memory depth, respectively. The memory representation for each relation (e.g. live_in) is initialized by averaging the embeddings of its words (live and in). At each hop $i$, the attention scores between the query $q_i \in \mathbb{R}^D$ and the corresponding memory are computed as:
$$\alpha_i = \mathrm{softmax}(M^i q_i).$$
Here $\alpha_i$ is a distribution over all relations, showing model confidence at hop $i$ on which relations are mentioned in the given text. The memory update is computed as the weighted sum of the current memory matrix:
$$o_i = (M^i)^\top \alpha_i.$$
The first query $q_1$ is initialized as $h_N$, and the query at each hop $i$ is updated by:
$$q_{i+1} = q_i + o_i.$$
In the final layer, we apply a sigmoid function to trigger relations independently such that we can extract all possible relations from the given text:
$$P(r_j) = \sigma\big(q_{I+1}^\top m^{I+1}_j\big),$$
where $m^{I+1}_j \in M^{I+1}$ corresponds to the $j$-th relation.
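A single memory hop can be sketched as follows. The shapes follow the description above, but the exact update rule is one common end-to-end memory network reading and may differ in detail from the original implementation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memory_hop(q, M):
    """One hop over the relation memory: attend over the K relation
    rows of M (K x D) with query q (D,), then update the query with
    the attention-weighted sum of memory rows (residual update)."""
    alpha = softmax(M @ q)   # distribution over the K candidate relations
    o = alpha @ M            # weighted sum of memory rows, shape (D,)
    return q + o, alpha      # next-hop query and this hop's attention
```

Stacking this hop `I` times and applying a per-relation sigmoid to the final query-memory scores mirrors the multi-hop classifier described above.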
Entity Generator Given each predicted relation $r$, the entity generator aims to generate the corresponding subject $s$ and object $o$ phrases to complete the final user attribute $(s, r, o)$. The entity generator generates the word sequence $(\tilde{w}_1, \dots, \tilde{w}_M)$ of the concatenated subject and object, where the boundary is represented by a semicolon. For instance, the corresponding word sequence for the triplet "(My son, misc_attr, shy)" is "my son ; shy". The model is a GRU decoder (Cho et al., 2014) with a copy mechanism (See et al., 2017) for easier generation of the words that also appear in the inputs. The final distribution over the vocabulary at timestep $t$ is calculated as
$$P_t = P_{gen} P^{vocab}_t + (1 - P_{gen}) P^{source}_t,$$
where $P^{vocab}_t = \mathrm{softmax}(W h^{dec}_t)$ is a predicted distribution over the whole vocabulary, $h^{dec}_t$ is the hidden state of the GRU, and $P^{source}_t = \mathrm{softmax}(H h^{dec}_t)$ is a distribution over the input tokens. Finally, $P_{gen}$ controls how they mix:
$$P_{gen} = \sigma\big(W' [h^{dec}_t; v_c]\big),$$
where $v_c = \mathrm{diag}(P^{source}_t)\, H$, and $W$, $W'$ are model parameters.
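The vocabulary/copy mixture can be sketched as below; this is a standard pointer-generator reading (See et al., 2017) with an illustrative scatter of the source distribution into vocabulary space, not the paper's exact code:

```python
import numpy as np

def copy_mixture(p_vocab, p_source, src_to_vocab, p_gen):
    """Mix the vocabulary distribution with the copy distribution:
    P_t = p_gen * P_vocab + (1 - p_gen) * scatter(P_source).
    `src_to_vocab[i]` maps source position i to its vocabulary id;
    this mapping is an illustrative assumption about the scatter step."""
    mixed = p_gen * p_vocab
    for i, v in enumerate(src_to_vocab):
        # add the copy probability of source token i to its vocab entry
        mixed[v] += (1.0 - p_gen) * p_source[i]
    return mixed
```

Because both input distributions sum to one, the mixture remains a valid distribution for any gate value $P_{gen} \in [0, 1]$.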
Hyperparameters The hidden state sizes for all modules are set to 400, with the input embeddings initialized with GloVe (Pennington et al., 2014) and character embeddings (Hashimoto et al., 2017). The batch size is set to 32. The models are optimized with Adam with the learning rate set to $1 \times 10^{-3}$. Dropout layers with dropout rate 0.6 are applied to all layer transitions. Model performance is evaluated on the development set every epoch, and training is stopped whenever there is no observed improvement in development F1 score in 6 evaluations, starting at the 10-th epoch.
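The early-stopping rule described here (a patience of 6 evaluations, never stopping before the 10-th epoch, with one evaluation per epoch) can be sketched as:

```python
def should_stop(dev_scores, patience=6, warmup=10):
    """Patience-based early stopping: `dev_scores` holds one dev score
    per evaluation. Never stop during the warm-up period; afterwards,
    stop once the best score is `patience` or more evaluations old."""
    if len(dev_scores) <= warmup:
        return False
    best_idx = max(range(len(dev_scores)), key=lambda i: dev_scores[i])
    return len(dev_scores) - 1 - best_idx >= patience
```

The same rule applies to the other models in this appendix with different patience values and evaluation frequencies.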

F.2 eCSRL
The model proposed by Cai et al. (2018) serves as our baseline model trained with different noise-robust losses. There are two modules in the model: a context encoder that takes the given word sequence $w_1, w_2, \dots, w_N$, and a biaffine role scorer which predicts the semantic role of each input token given a predicate.
Context Encoder The context encoder encodes input tokens with embeddings, which are then consumed by a bi-directional LSTM with 3 layers, yielding hidden states $h_p$ and $h_a$, where $p$ is the word index of the predicate and $a$ is the word index of a candidate argument word. Finally, the biaffine layer computes the score for each semantic role an argument candidate is able to take for predicate $p$:
$$g_p = \mathrm{ReLU}(W_{pred}\, h_p + b_{pred}),$$
$$g_a = \mathrm{ReLU}(W_{arg}\, h_a + b_{arg}),$$
$$s(p, a) = g_p^\top U_{role}\, g_a + W_{role} [g_p; g_a] + b_{role},$$
where $W_{pred}$, $b_{pred}$, $W_{arg}$, $b_{arg}$, $W_{role}$, $U_{role}$, $b_{role}$ are model parameters.
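The biaffine scoring step can be sketched as below; parameter names mirror those listed above, but the exact composition (e.g. whether separate MLP projections are applied before the biaffine product) is an illustrative assumption:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def biaffine_role_scores(h_p, h_a, params):
    """Score each of R semantic roles for a (predicate, argument) pair:
    project both hidden states, then apply a biaffine transformation
    s_r = g_p^T U_r g_a + W_r [g_p; g_a] + b_r."""
    g_p = relu(params['W_pred'] @ h_p + params['b_pred'])
    g_a = relu(params['W_arg'] @ h_a + params['b_arg'])
    pair = np.concatenate([g_p, g_a])
    # bilinear term per role r: sum over d, e of g_p[d] U[r,d,e] g_a[e]
    bilinear = np.einsum('d,rde,e->r', g_p, params['U_role'], g_a)
    return bilinear + params['W_role'] @ pair + params['b_role']
```

The output is one unnormalized score per role, to which a softmax (or per-role comparison) can be applied for role prediction.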
Hyperparameters The hidden state sizes for both the encoder and the scorer are set to 768. Only randomly initialized embeddings are used for this model. The batch size is set to 8 dialogues, which may include different numbers of predicates. The models are optimized with Adam with learning rate set to 2 × 10 −5 . Dropout layers with dropout rate 0.1 are applied to all layer transitions. Model performance is evaluated on the development set twice every epoch, and training is stopped whenever there is no observed improvement in development F1 score in 10 evaluations starting at the 10-th epoch.

F.3 SST-2
Two kinds of models are used on these two datasets to evaluate the noise-robust losses: the simple BiLSTM models and the large ALBERT (Lan et al., 2019) models. The classification layers for both models are the same, but they have different encoders. The BiLSTM models use the same encoder as the model for eCSRL, which is a 3-layered BiLSTM, whereas the ALBERT models use the pretrained ALBERT base (v2) model as the encoder. The classification layer for both models is a simple one-layer feedforward neural network.
Hyperparameters The hidden state sizes for all models are set to 768. The BiLSTM uses randomly initialized embeddings. The batch size is set to 128 for SST-2 with the BiLSTM encoder and 32 with the ALBERT encoder. The models are optimized with Adam with the learning rate set to $2 \times 10^{-5}$ for BiLSTM and $1 \times 10^{-6}$ for ALBERT. Dropout layers with dropout rate 0.1 are applied to all layer transitions. Model performance is evaluated on the development set twice every epoch for the BiLSTM, and eight times for the ALBERT. Training is stopped whenever there is no observed improvement in development accuracy score in 20 evaluations, starting at the 10-th epoch.

Table 5: Accuracy results from experiments with different warm-up cutoffs for the joint training framework. The loss used in these experiments is ER-GCE, the dataset is SST-2 and the corruption method is model-based.
G Development experiments with different warm-up cutoffs

Table 5 shows the development results for the joint training framework with different numbers of epochs for warm-up. This shows that when training with the joint training framework, where the quality predictor and the main classifier are trained together, training the main classifier first for 5 epochs achieves the best performance.

H Model-based label corruption
We utilize a pretrained ALBERT (Lan et al., 2019) base model for model-based label corruption in order to simulate the automatic label generation process. A five-fold corruption process is used. We first split the concatenation of the training and development datasets into five equal proportions, and further divide each proportion into a training, development and test set following a 3.9:0.1:1 split. For each proportion, a pretrained ALBERT base classifier is finetuned on the training set for two epochs, and the model with the highest performance on the development set is saved as the labeler. Finally, for each label, an equal number of test instances with the lowest labeler confidence in that label are chosen for corruption until a target noise rate is reached, with their gold labels swapped for a different label. This creates a different scenario in terms of how noise interacts with training, which can be seen in Table 3. At low noise rates, models trained with model-based noise are generally more accurate than models trained with uniform noise, indicating that the corrupted instances are generally located at the decision boundary and have only limited negative influence on the majority of the test instances. As the noise rate increases, models trained with model-based noise see a quicker decline in performance than models trained with uniform noise, indicating that the model-based corruption process creates more harmful correlations between inputs and labels than the uniform corruption process.
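The per-label selection step can be sketched as follows; the quota rounding, tie-breaking, and the choice of replacement label are illustrative assumptions rather than the exact procedure:

```python
import numpy as np

def corrupt_lowest_confidence(labels, confidences, noise_rate, num_labels):
    """For each gold label, pick the instances whose labeler confidence
    in that label is lowest, in equal amounts per label, until the
    target noise rate is reached; swap each picked gold label for a
    different label (here, the next label cyclically, as a placeholder)."""
    labels = np.asarray(labels)
    confidences = np.asarray(confidences)
    per_label = int(round(noise_rate * len(labels) / num_labels))
    corrupted = labels.copy()
    for y in range(num_labels):
        idx = np.where(labels == y)[0]
        # lowest-confidence instances of this label get corrupted
        picked = idx[np.argsort(confidences[idx])[:per_label]]
        corrupted[picked] = (y + 1) % num_labels
    return corrupted
```

Selecting by lowest labeler confidence is what concentrates the corrupted instances near the decision boundary, matching the behavior discussed above at low noise rates.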