LABO: Towards Learning Optimal Label Regularization via Bi-level Optimization

Regularization techniques are crucial for improving the generalization performance and training efficiency of deep neural networks. Many deep learning algorithms rely on weight decay, dropout, and batch/layer normalization to converge faster and generalize better. Label Smoothing (LS) is another simple, versatile and efficient regularization technique that can be applied to various supervised classification tasks. Conventional LS, however, assumes that each non-target class is equally likely, regardless of the training instance. In this work, we present a general framework for training with label regularization, which includes conventional LS but can also model instance-specific variants. Based on this formulation, we propose an efficient way of learning LAbel regularization by devising a Bi-level Optimization (LABO) problem. We derive a deterministic and interpretable solution of the inner loop as the optimal label smoothing, without the need to store the parameters or the outputs of a trained model. Finally, we conduct extensive experiments and demonstrate that LABO consistently yields improvement over conventional label regularization on various tasks, including seven machine translation and three image classification benchmarks.


Introduction
Deep neural networks (DNNs) form the backbone of current state-of-the-art algorithms in various fields including natural language processing (Vaswani et al., 2017), computer vision (He et al., 2016; Dosovitskiy et al., 2021) and speech recognition (Schneider et al., 2019; Chen et al., 2021). However, heavily overparameterized models may incur overfitting and suffer from poor generalization (Goodfellow et al., 2016). To address the issue, many regularization techniques have been developed in the literature: weight decay, which constrains the optimization space (Krogh and Hertz, 1991); batch or layer normalization, which speeds up the training of feed-forward NNs (Ioffe and Szegedy, 2015; Ba et al., 2016); and dropout, which implicitly approximates the effect of averaging the predictions of all sparse subnetworks (Srivastava et al., 2014). Label smoothing (LS) is another simple regularization technique; it is widely applied in many applications, including image classification (Szegedy et al., 2016) and token-level sequence generation (Pereyra et al., 2017), to enhance generalization without incurring additional computational costs. It encourages a model to treat each non-target class as equally likely by using a uniform distribution to smooth one-hot labels (He et al., 2019a; Vaswani et al., 2017). Although combining the uniform distribution with the original one-hot label is beneficial for regularization, conventional LS does not take into account the true relations between different label categories. More specifically, for token-level generation, uniformly allocating the probability mass to non-target words disregards the semantic relationship between non-target words and the context. Moreover, the probability of the target is predefined and unchanged. However, the distribution of natural language exhibits remarkable variation in per-token perplexity (Holtzman et al., 2020), which encourages us to adapt the corresponding target
probabilities for different contexts.
One instance-dependent technique for learning the relation between different target categories is Knowledge Distillation (KD) (Hinton et al., 2015; Buciluǎ et al., 2006), a popular transfer-learning technique that utilizes knowledge from related tasks or teacher models (Caruana, 1997; Pan and Yang, 2010; Lu et al., 2019). It is widely applied for model compression and ensembling across applications ranging from computer vision (He et al., 2019b; Xu et al., 2020; Lu et al., 2021) to natural language processing (Jiao et al., 2020; Sanh et al., 2019). However, KD requires training a separate model as a teacher for every new task. Besides, it either introduces an extra inference pass to get the teacher's prediction for each instance during training, or requires saving the teacher's predictions for all training samples to avoid the extra forward pass of the teacher model. This greatly increases the time and space complexity in practice. Especially for token-level generation tasks, e.g. machine translation, saving all the output probabilities of teacher models costs O(NLV) space, where N is the number of sequences, L is the average length of all sequences and V is the vocabulary size. Despite the empirical success of KD, it is unclear how student networks benefit from these smoothed labels. A series of investigations have looked at regularization and demonstrated that the success of both KD and label smoothing is due to a similar regularization effect of smoothed targets (Yuan et al., 2020; Zhang and Sabuncu, 2020). Based on this finding and the low training overhead of LS, there is significant interest in the community in algorithms that can enhance conventional LS. Zhang and Sabuncu (2020) demonstrate the importance of an instance-specific LS regularization. They achieve better performance than LS, but use a trained model to infer prior knowledge on the label space and thereby sacrifice some of the efficiency of LS.
In this work, we first revisit conventional LS and generalize it to an instance-dependent label regularization framework with a constraint on overconfidence. Within this framework, we demonstrate that both LS and KD can be interpreted as instances of a smoothing distribution with a confidence penalty. Finally, we propose to learn the optimal smoothing function along with the model training by devising a bi-level optimization problem. We solve the inner problem with a deterministic and interpretable solution and apply gradient-based optimization to solve the outer problem.
Our contributions can be summarized as follows:
• We explicitly formulate a unified label regularization framework in which the regularization distribution can be efficiently learned along with model training by casting it as a bi-level optimization problem.
• We derive a closed-form solution for the inner loop of the bi-level optimization, which not only improves efficiency but also makes the algorithm interpretable.
• We conduct extensive experiments on machine translation (IWSLT'14, WMT'14 and multilingual IWSLT'17 {DE, FR}-EN) and image classification (CIFAR10, CIFAR100 and ImageNet), and show that our method consistently outperforms label smoothing while maintaining training efficiency.

Background
We will provide a brief overview of existing label regularization methods.
Label Smoothing. LS (Szegedy et al., 2016) constructs the training target by mixing the one-hot label with a uniform distribution over all K classes:

q̃(j|x) = (1 − α)·1[j = k] + α/K,    (1)

where 1[·] is the indicator function and k is the ground-truth class. We note from Eq. 1 that a higher α leads to a smoother label distribution and a lower α to a peakier one.
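As a concrete sketch, conventional LS amounts to a convex mixture of a one-hot vector and the uniform distribution; the class count K = 5 and α = 0.1 below are illustrative, not values from the paper:

```python
import numpy as np

def smooth_label(k: int, K: int, alpha: float) -> np.ndarray:
    """Mix the one-hot label for class k with a uniform distribution."""
    one_hot = np.zeros(K)
    one_hot[k] = 1.0
    return (1.0 - alpha) * one_hot + alpha * np.full(K, 1.0 / K)

q = smooth_label(k=2, K=5, alpha=0.1)
# The target class keeps most of the mass; the rest is spread uniformly.
```

Every non-target class receives the same probability α/K, which is exactly the instance-independence that LABO later relaxes.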
Confidence Penalty. Another technique similar to label smoothing is the confidence penalty (CP) (Pereyra et al., 2017), in which a regularization term H(p_θ(x)) is added to the objective function to directly penalize overconfident predictions:

L(θ) = H(q(x), p_θ(x)) − β·H(p_θ(x)),

where q(x) is the one-hot label, p_θ(x) is the output distribution of the model, H(q(x), p_θ(x)) is the cross-entropy loss between the label and the model's output, H(p_θ(x)) is the entropy of the output, and β controls the strength of the confidence penalty.
Knowledge Distillation. Given access to a trained teacher model, suppose we want to train a student. Denote by P_T(x) and p_θ(x) the teacher's and student's predictions, respectively. For a classification problem, the total KD loss is defined as:

L_KD(θ) = (1 − α)·H(q(x), p_θ(x)) + α·KL(P_T(x) ∥ p_θ(x)),

where KL is the Kullback-Leibler (KL) divergence and α is the scaling parameter. Note that we assume a temperature of 1 and omit it without loss of generality. KD training is equivalent to training with a smoothed label P̃(x), where

P̃(j|x) = (1 − α)·1[j = k] + α·P_T(j|x),

and k is the ground-truth class.
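The equivalence between the KD loss at temperature 1 and cross-entropy against the transformed label P̃(x) can be checked numerically (up to the constant teacher-entropy term inside the KL divergence); the teacher and student distributions below are illustrative:

```python
import numpy as np

def kd_smoothed_label(teacher_probs: np.ndarray, k: int, alpha: float) -> np.ndarray:
    """Smoothed label implied by KD at temperature 1: alpha times the
    teacher's distribution, plus (1 - alpha) extra mass on the target."""
    label = alpha * teacher_probs.copy()
    label[k] += 1.0 - alpha
    return label

p_teacher = np.array([0.1, 0.6, 0.2, 0.1])   # illustrative teacher output
p_tilde = kd_smoothed_label(p_teacher, k=1, alpha=0.5)

# Cross-entropy against p_tilde decomposes into the two KD loss terms
# (with KL replaced by cross-entropy, which differs only by a constant).
p_student = np.array([0.25, 0.25, 0.25, 0.25])
lhs = -(p_tilde * np.log(p_student)).sum()
rhs = 0.5 * (-np.log(p_student[1])) + 0.5 * (-(p_teacher * np.log(p_student)).sum())
```

Since the teacher's entropy does not depend on the student, both forms produce identical gradients for θ.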
Both LS and KD involve training a model with smoothed labels. KD tends to perform better because the teacher's predictions over non-target classes can capture the similarities between different classes (Müller et al., 2019; Shen et al., 2021). However, the LS and CP techniques are more efficient for training, since they do not require training a separate teacher network for every new task.

Methodology
In this section, we interpret our optimal label regularization from the bi-level optimization perspective. First, we take a close look at conventional uniform label regularization and show that the generalized framework bridges the objectives of conventional LS and KD methods. Then, we introduce a closed-form solution to our optimal instance-dependent label smoothing and describe an online implementation under this formulation.

Generalized Label Regularization: A Close Look
Suppose that we have a K-class classification task to be learned by a neural network S_θ(·). Given a training set of N samples {x_i, y_i}_{i=1}^N, where x_i is a data sample and y_i is the ground-truth label, the model S_θ(·) outputs the probability p_θ(j|x_i) of each class j ∈ {1, . . ., K}:

p_θ(j|x_i) = exp(z_{i,j}) / Σ_{j'=1}^{K} exp(z_{i,j'}),

where z_{i,j} = [S_θ(x_i)]_j is the logit for the j-th class of input x_i.
A general label smoothing can be formally written as:

P̃_ls(j|x_i) = (1 − α)·1[j = k] + α·P_ls(j|x_i),    (6)

where P̃_ls is the smoothed label, P_ls is a smoothing distribution, k is the ground-truth class of x_i, and α is a hyperparameter that determines the amount of smoothing. If α = 0, we obtain the original one-hot encoded label. For the original label smoothing method, the probability P_ls(·|x) is independent of the sample x and is taken to be the uniform distribution P_ls(j) = U(j). However, training can benefit from instance-dependent regularization (Yuan et al., 2020; Zhang and Sabuncu, 2020). In this work, we consider the general form of label smoothing P̃_ls(·|x_i), which is instance-dependent and not necessarily uniform.
Let us consider the Cross Entropy (CE) loss with the smoothed labels:

L(θ) = (1/N) Σ_{i=1}^{N} Σ_{j=1}^{K} −P̃_ls(j|x_i) log p_θ(j|x_i).    (7)

Note that the weighted sum of the negative log-likelihoods over the classes can be viewed as the expectation of the negative log-likelihood over the label space under the distribution P̃_ls. We modify this loss by adding a KL divergence term KL(P_ls(·|x_i) ∥ U(·)) to Eq. 7, which encourages the sample-wise smoothing distribution P_ls(·|x_i) to stay close to the uniform distribution to control over-confidence.
R_θ(P_ls) = (1/N) Σ_{i=1}^{N} [ E_{P̃_ls}[−log p_θ(j|x_i)] + β·KL(P_ls(·|x_i) ∥ U(·)) ],    (8)

where β is a hyper-parameter. The instance-dependent smoothing P_ls(·|x_i) can be viewed as prior knowledge over the label space for a sample x_i. The KL divergence term can be understood as a measure of the 'smoothness' or 'over-confidence' of each training label. Specifically, for token-level generation tasks, over-confidence of the model results in the output of repetitive, or frequent but irrelevant, text, which is detrimental to the generalization of the model (Chorowski and Jaitly, 2017; Holtzman et al., 2020; Meister et al., 2020a). We choose the uniform distribution to constrain P_ls in the KL term because we would like P_ls to carry more information on the non-target classes.
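The per-instance objective above — cross-entropy under the smoothed label plus a KL pull towards uniform — can be sketched in a few lines; all probabilities here are illustrative:

```python
import numpy as np

def instance_loss(p_model, p_smooth, k, alpha, beta):
    """Cross-entropy against the smoothed label plus a KL penalty
    pulling the smoothing distribution towards uniform."""
    K = len(p_model)
    label = alpha * p_smooth.copy()
    label[k] += 1.0 - alpha                      # smoothed label (Eq. 6)
    ce = -(label * np.log(p_model)).sum()        # E over the label space
    kl_to_uniform = (p_smooth * np.log(p_smooth * K)).sum()
    return ce + beta * kl_to_uniform

p_model = np.array([0.7, 0.2, 0.1])
uniform = np.full(3, 1.0 / 3.0)
# With a uniform smoothing distribution the KL term vanishes (Remark 1),
# so the value of beta has no effect.
loss_uniform = instance_loss(p_model, uniform, k=0, alpha=0.1, beta=1.0)
```

The KL term uses the identity KL(P∥U) = Σ_j P(j) log(K·P(j)), which is zero exactly when P is uniform.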
The hyper-parameter β plays a role similar to the temperature in KD and controls the sharpness of the distribution P_ls. On the one hand, in the case of KD, it regulates the amount of prior knowledge we inject into the smoothed label. On the other hand, it reduces the overconfidence of the trained network by increasing the entropy of the smoothed labels (a phenomenon studied by Zhang and Sabuncu (2020)).
Next, we show the post-hoc interpretability of this formulation. The following two remarks discuss the relationship of this objective with conventional LS, the confidence penalty, and KD.
Remark 1. When P_ls(·|x_i) is taken to be the uniform distribution U(·) for any x_i, the objective in Eq. 8 reduces to the one in Eq. 7, since KL(U(·) ∥ U(·)) = 0.

Remark 2. This framework includes KD with temperature τ = 1 as a special case. Let P_T(·|x_i) be the output probability of a teacher model T(·) for any x_i. The objective of KD can be rewritten as an expectation of the negative log-likelihood w.r.t. a transformed distribution P̃_T, plus a KL term between the teacher output and the uniform distribution,
where P̃_T(j|x_i) = 1 − α + α·P_T(k|x_i) for j = k, the ground truth, and P̃_T(j|x_i) = α·P_T(j|x_i) for j ≠ k. The objective in Eq. 8 thus bridges label smoothing and knowledge distillation, since it reduces to KD or LS under an appropriate choice of P_ls(·|x_i). There is an inherent relation among these methods: KD and LS, in particular, can be interpreted as instances of our method with a fixed smoothing distribution, which is consistent with recent work (Yuan et al., 2020; Zhang and Sabuncu, 2020).

Learning Label Regularization via Bi-level Optimization
The choice of the smoothing distribution P_ls determines the label regularization method. As described above, in KD the smoothing distribution is the output of a pre-trained teacher model, and in LS it is the uniform distribution. Generally speaking, the optimal smoothing distribution is unknown, and ideally we would like to learn the optimal P_ls.
In this regard, we set up the following two-stage optimization problem:

min_θ R_θ(P*_ls),  s.t.  P*_ls = argmin_{P_ls} R_θ(P_ls).    (10)

This optimization setting, also called bi-level optimization (Colson et al., 2007), is strongly NP-hard (Jeroslow, 1985), so obtaining an exact solution is difficult. To solve this problem in our case, we first prove that the inner loop is a convex optimization problem by computing the Hessian matrix H_i of R_θ(P_ls) w.r.t. P_ls(·|x_i) for each training instance x_i.
When β is greater than zero, the Hessian is positive definite. Therefore, for the inner optimization loop, we can derive the closed-form solution using a Lagrange multiplier. The following Theorem 1 gives the explicit form of the solution P*_ls. For the details of this derivation, please refer to Appendix A.
Theorem 1. The solution to the inner loop of the optimization in Eq. 10 is given by:

P*_ls(j|x_i) = p_θ(j|x_i)^{α/β} / Σ_{j'=1}^{K} p_θ(j'|x_i)^{α/β},

where p_θ(j|x_i) is the output probability of the j-th class of the model S_θ(·), α is the smoothing coefficient defined in Eq. 6, and β is defined in Eq. 8.
As a result, we reduce the two-stage optimization problem in Eq. 10 to the regular single-stage minimization of R_θ(P*_ls) over θ, where P*_ls(j|x_i) is given in Theorem 1 and the smoothed label P̃_ls(j|x_i) is defined in Eq. 6. Note that the solution P*_ls is deterministic and interpretable. Moreover, the two remarks below demonstrate the relation of our LABO with the LS and KD methods.
Remark 3. When β is very large, P*_ls will be close to the uniform distribution, and our objective function becomes equivalent to optimizing the CE loss with a uniform LS regularizer.
Remark 4. There is an intrinsic connection between the P*_ls distribution and softmax outputs with a temperature factor. Specifically, when β = α·τ, we have

P*_ls(j|x_i) = exp(z_{i,j}/τ) / Σ_{j'} exp(z_{i,j'}/τ).

The smoothing distribution in this case becomes the temperature-smoothed output of the model, which is similar to the smoothed targets used in KD methods.¹
1 The derivation can be found in Appendix B.
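Both the closed form of Theorem 1 and Remark 4 are straightforward to verify numerically: raising the model's probabilities to the power α/β and renormalizing coincides with a softmax of the logits at temperature τ = β/α. The logits and hyper-parameters below are illustrative:

```python
import numpy as np

def optimal_smoothing(p_model: np.ndarray, alpha: float, beta: float) -> np.ndarray:
    """Closed-form inner-loop solution: a power-scaled, renormalized
    copy of the model's own output distribution."""
    scaled = p_model ** (alpha / beta)
    return scaled / scaled.sum()

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 0.5, -1.0, 0.0])   # illustrative logits
alpha, tau = 0.1, 2.0
beta = alpha * tau
p_star = optimal_smoothing(softmax(z), alpha, beta)
# p_star should match softmax(z / tau), the temperature-smoothed output.
```

Because p_θ(j) ∝ exp(z_j), taking the power α/β = 1/τ and renormalizing gives exp(z_j/τ) up to a common factor, which the normalization removes.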
To summarize, our method can be expressed as an alternating two-stage process. In the first stage, we generate the optimal smoothed labels P*_ls using Theorem 1. In the second stage, we fix P*_ls, compute the loss R_θ(P*_ls) and update the model parameters.

Implementation Details
Two-stage training. Our solution provides the closed-form answer to the inner-loop optimization (first stage); in the outer loop (second stage), the model parameters θ are updated by gradient descent. LABO conducts a one-step update for the second stage: at each training step, we compute the optimal smoothing P*_ls and update the model, which eliminates the need for additional memory or storage for the parameters or outputs of a prior trained model. The training process is shown in Algorithm 1.
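Algorithm 1 is not reproduced here, but one LABO step can be sketched on a toy linear softmax model; the model, learning rate, iteration count and label index below are illustrative, not the paper's setup. The closed-form smoothing is computed first and treated as a constant, matching the hypergradient discussion:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def labo_step(W, x, k, alpha, beta, lr):
    """One alternating step: closed-form inner solution, then SGD on theta."""
    z = W @ x
    p = softmax(z)
    # Stage 1: optimal smoothing distribution (detached from the graph).
    p_star = p ** (alpha / beta)
    p_star /= p_star.sum()
    # Smoothed label (Eq. 6 with P_ls = P*_ls).
    label = alpha * p_star
    label[k] += 1.0 - alpha
    # Stage 2: the gradient of CE w.r.t. the logits is p - label.
    grad_z = p - label
    W -= lr * np.outer(grad_z, x)
    return W, -(label * np.log(p)).sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3)) * 0.1
x = np.array([1.0, -0.5, 0.3])
losses = [labo_step(W, x, k=2, alpha=0.1, beta=0.2, lr=0.1)[1] for _ in range(300)]
```

Since P*_ls is recomputed from the current model at every step, no teacher parameters or cached predictions need to be stored.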
Adaptive α. The value of α determines the probability mass to smooth. To avoid hyper-parameter search, we make α instance-specific as a function of the entropy of the model's output distribution, controlled by a coefficient ρ ∈ [0.5, 1]; in our experiments, we use ρ = 0.5. We fix the ratio β/α as a hyper-parameter, so the value of β changes accordingly.
Hypergradient. In the outer loop, the derivative of the loss R(P*_ls(θ), θ) w.r.t. θ consists of two components: the direct gradient and the hypergradient flowing through P*_ls(θ). Because P*_ls is the global solution of the objective R in the inner loop, ∇_{P*_ls}R equals zero; therefore the hypergradient vanishes, and we neglect this component in computation for efficiency.

Experiments on Machine Translation
We evaluate our method on six machine translation benchmarks: IWSLT'14 German to English (DE-EN), English to German (EN-DE), English to French (EN-FR) and French to English (FR-EN), and WMT'14 English to German (EN-DE) and German to English (DE-EN). We use a 6-layer encoder-decoder Transformer as the backbone model for all experiments. We follow the hyper-parameter settings for the architecture and training reported in Gehring et al. (2017) and Vaswani et al. (2017). Specifically, we train the model with a maximum of 4,096 tokens per mini-batch for 150 or 50 epochs on the IWSLT and WMT datasets, respectively. For optimization, we apply the Adam optimizer with β1 = 0.9, β2 = 0.98 and weight decay 1e-4. For LABO, we explore {1.15, 1.25} for the only hyper-parameter τ = β/α. We report the BLEU-4 (Papineni et al., 2002) metric to compare the different models.
Baselines. We compare our method with three baselines that address the over-confidence problem. LS uses the combination of a one-hot vector and a uniform distribution to construct the target distribution. CP (Pereyra et al., 2017) punishes over-confidence by regularizing the entropy of model predictions. FL (Lin et al., 2017) applies the focal loss. The backbone Transformers show BLEU scores close to or better than the numbers reported in Vaswani et al. (2017). All confidence-penalizing techniques except FL improve the performance of the Transformer models. The drop in performance of FL is consistent with the finding of long-tail phenomena in neural machine translation systems (Raunak et al., 2020). Our method consistently outperforms the baseline Transformer (w/ LS) across different language pairs and dataset sizes, which demonstrates its effectiveness.
Multilingual Translation. We also evaluate LABO on the IWSLT'17 ({DE, FR}-EN) dataset using multilingual Transformers. We learn a joint BPE code for all three languages and use sacrebleu for evaluating the test set. Tab. 2 shows that LABO achieves consistent improvement over the original label smoothing on the multilingual translation dataset.

Experiments on Image Classification

Setup. We sampled 10% of the images from the training split as a validation set. The models are trained for 200 epochs with a batch size of 128. For optimization, we used stochastic gradient descent with a momentum of 0.9 and weight decay set to 5e-4. The learning rate starts at 0.1 and is then divided by 5 at epochs 60, 120 and 160. All experiments are repeated 5 times with random initialization. For the KD experiments, we used ResNeXt29 as the teacher. All teacher models are trained from scratch and picked based on their best accuracy. To explore the best hyper-parameters, we conduct a grid search over parameter pools: {0.1, 0.2, 0.5, 0.9} for α and {5, 10, 20, 40} for the KD temperature.
Results. Next, we conduct a series of experiments on two models on the two CIFAR datasets to compare our approach with other methods that do not require any pre-trained models. Note that KD is reported as a reference. All experiments are repeated 5 times with random initialization. The baseline methods include CS-KD, which does not require training a teacher and constrains the smoothed outputs of different samples of the same class to be close (Yun et al., 2020); TF-reg, which regularizes the training with a manually designed teacher (Yuan et al., 2020); Beta-LS, which leverages a similar-capacity model to learn an instance-specific prior over the smoothing distribution (Zhang and Sabuncu, 2020); and KR-LS, which utilizes a class-wise target table that captures the relations of classes (Ding et al., 2021). We follow their best hyper-parameter settings. The time for all baselines is measured based on the original implementations. Tab. 3 shows the test accuracy and training time for 200 epochs of the different methods on the two models. Our method consistently improves the classification performance on both lightweight and complex models, which indicates its general applicability. Moreover, the training time of our method is close to that of the Base or LS methods, which shows its efficiency advantage over baselines such as Beta-LS, which still requires a separate model to output a learned prior over the smoothing distribution. Our method consistently achieves better performance than the other strong baselines. We emphasize that our computation of the smoothing distribution is deterministic and excluded from the computation graph for the gradient calculation. Therefore, our method requires less time and space during training, as we do not need to train a separate model.

Discussion
An important problem of neural language models is the large discrepancy between the predicted probabilities of tokens with low and high frequency; in other words, the long-tailed phenomenon in neural language models (Zhao and Marcus, 2012; Demeter et al., 2020). In this section, we study the impact of our method on the prediction of tokens with different frequencies.
We first computed the average frequency of the tokens in every source sentence x = [x_i]_{i=1}^N for the validation and test sets of IWSLT'14 (DE-EN). This Frequency Score (FS), defined in Raunak et al. (2020), is

FS(x) = (1/N) Σ_{i=1}^{N} f(x_i),

where f(x_i) is the frequency of the token x_i in the training corpus. Next, we divide each dataset into three splits of 2,400 sentences in order of decreasing FS. Tab. 4 shows the results of LS and LABO on the different splits. All models perform much better on split-most than on split-medium and split-least. Our method demonstrates consistent improvements on all three splits for both the validation and test sets. Moreover, LABO provides at least the same magnitude of improvement on the least and medium splits as on split-most. Next, we investigate the probability distribution of the selected prediction at each step of beam search, namely, the probability of the top hypothesis finally chosen during decoding. Figure 2 shows the histograms for LS and LABO on the three splits. The probabilities of our LABO method concentrate around 0.4, while the corresponding probabilities of LS concentrate around 0.9. Our method reduces the discrepancy between the predicted probabilities of different tokens, which facilitates beam-search inference by avoiding extremely large probabilities.
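Assuming FS is the mean training-corpus frequency of a sentence's tokens, as described above, it can be computed with a simple counter; the toy corpus below stands in for the training split and is purely illustrative:

```python
from collections import Counter

def frequency_score(sentence, corpus_counts):
    """Mean training-corpus frequency of the tokens in one sentence."""
    return sum(corpus_counts[tok] for tok in sentence) / len(sentence)

# Toy corpus standing in for the training split (illustrative).
corpus = "the cat sat on the mat the end".split()
counts = Counter(corpus)
fs = frequency_score("the cat".split(), counts)  # (3 + 1) / 2
```

Sorting the sentences of a dataset by FS and slicing into thirds reproduces the most/medium/least splits used in Tab. 4.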

Related Work
There is a great body of work inspired by the KD technique. Some of it focuses on boosting performance with explicitly regularized smoothed targets. Zhang and Sabuncu (2020) interpret student-teacher training as an amortized maximum a posteriori estimation and derive an equivalence between self-distillation and instance-dependent label smoothing. This analysis helped them devise a regularization scheme, Beta Smoothing. However, they still use an extra model to infer a prior distribution for their smoothing technique during training. Yun et al. (2020) introduce an additional regularization to penalize divergent predictive outputs between different samples of the same class. Several works discuss the empirical impact of KD and give different practical suggestions. Kobyzev et al. (2023) conducted extensive experiments to explore the effect of label regularization on the generalization ability of compact pre-trained language models. Other works propose to learn class-wise label smoothing or progressively refine smoothed labels. Ding et al. (2021) propose to capture the relations of classes by introducing a decoupled label table, which increases space complexity by O(K × K). The concurrent work of Kim et al. (2021) utilizes the model trained at the i-th epoch as the teacher to regularize training at the (i + 1)-th epoch, along with annealing techniques. However, contrary to our work, it either requires a separate model as a teacher or stores the labels generated at the i-th epoch for training the next epoch, thereby increasing space complexity by O(N × K). Meister et al. (2020b) investigate the relationship between generalized entropy regularization and label smoothing and provide empirical results on two text generation tasks. Chen et al. (2022) propose a masked label smoothing that removes conflicts by reallocating the smoothed probabilities based on the differences among languages. Lu et al. (2022) propose to learn confidence estimates for NMT by jointly training a ConNet model to estimate the confidence of the prediction. In contrast, we develop LABO from a principled starting point: to generalize the smoothing distribution to a general form with neither time nor space complexity increase during training and inference.

Conclusion
Our aim in this work is to fill the accuracy gap between Label Smoothing and Knowledge Distillation while maintaining the training efficiency of LS regularization. We proposed learning an instance-dependent label smoothing regularization simultaneously with training the model on the target task. We began by generalizing the classical LS method and introduced our objective function by substituting the uniform distribution with a general, instance-dependent, discrete distribution. Within this formulation, we explained the relationship between LS and KD. Then, using a bi-level optimization approach, we obtained an approximation of the optimal smoothing function. We conducted extensive experiments comparing our model with conventional LS, KD and various state-of-the-art label regularization methods on popular MT and CV benchmarks and showed the effectiveness and efficiency of our technique. Finally, we analyzed the impact of our method on the predictions of neural machine translation systems under different average token frequency settings and showed that it greatly reduces the discrepancy between the predicted probabilities of different tokens.
In practice, apart from general regularization techniques like dropout and weight decay, many advanced techniques are designed for specific tasks, such as FixNorm (Nguyen and Chiang, 2018), CutMix (Yun et al., 2019) and SwitchOut (Wang et al., 2018). We leave combining these methods with our technique to future work. Moreover, we plan to explore practical applications of our method for large-scale model training; one specific application is improving the pre-training of large language models and vision transformers.

A Derivation of the proof for Theorem 1
Proof. First, observe that the function R_θ defined in Eq. 8 is a convex combination of N non-negative functions R_i = E_{P̃_ls}[−log p_θ(j|x_i)] + β·KL(P_ls(·|x_i) ∥ U(·)), for i = 1, . . ., N. We will show that each R_i is a convex function of the components of the simplex P_ls(·|x_i) by computing the Hessian matrix of R_i with respect to P_ls(·|x_i). Since the expectation term is linear in P_ls(·|x_i), only the KL term contributes, and the Hessian is diagonal:

H_i = β · diag(1/P_ls(1|x_i), . . ., 1/P_ls(K|x_i)).
When β is greater than zero, the Hessian is positive definite. Therefore, each R_i is a convex function of the components of P_ls(·|x_i). As a result, R_θ is a convex function of the collection of components of every simplex P_ls(·|x_i).
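The positive-definiteness claim is easy to spot-check numerically: only the KL penalty is nonlinear in P_ls, and its Hessian diagonal is β/P_ls(j). A finite-difference sketch, ignoring the simplex constraint and using illustrative values:

```python
import numpy as np

def kl_to_uniform(P, beta):
    """beta * KL(P || U) for a K-dimensional probability vector P."""
    K = len(P)
    return beta * (P * np.log(P * K)).sum()

P = np.array([0.5, 0.3, 0.2])
beta, eps = 2.0, 1e-5

# Finite-difference estimate of the Hessian diagonal of the KL term.
hess_diag = []
for j in range(3):
    e = np.zeros(3)
    e[j] = eps
    second = (kl_to_uniform(P + e, beta) - 2 * kl_to_uniform(P, beta)
              + kl_to_uniform(P - e, beta)) / eps**2
    hess_diag.append(second)
# hess_diag should approximate beta / P, which is strictly positive.
```

Since all diagonal entries β/P_ls(j) are positive and the off-diagonal entries vanish, the Hessian is positive definite whenever β > 0.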
For simplicity, we derive the global optimum of each R_i with a Lagrange multiplier:

L(P_ls, λ) = Σ_j [(1 − α)·1[j = k] + α·P_ls(j)](−log p_θ(j)) + β Σ_j P_ls(j) log(K·P_ls(j)) + λ(Σ_j P_ls(j) − 1),

where we omit the dependency on x_i to simplify the notation. We set the corresponding gradients to 0 to obtain the global optimum for j = 1, . . ., K:

−α log p_θ(j) + β(log(K·P_ls(j)) + 1) + λ = 0,  which gives  P_ls(j) = (1/K)·p_θ(j)^{α/β}·exp(−(λ + β)/β).
Since Σ_{j'=1}^{K} P_ls(j') = 1, the normalization constant is determined, so the optimal P*_ls(j) is given by the formula:

P*_ls(j) = p_θ(j)^{α/β} / Σ_{j'=1}^{K} p_θ(j')^{α/β}.

B Derivation from optimal smoothing to softmax output with temperature

When τ = β/α, we have

P*_ls(j) = p_θ(j)^{1/τ} / Σ_{j'} p_θ(j')^{1/τ} = exp(z_j/τ) / Σ_{j'} exp(z_{j'}/τ).

C Additional Results on ImageNet
Setup for ImageNet experiments. We evaluated our method on two model architectures, ResNet50 and ResNet152, with standard data augmentation schemes including random resized cropping and random horizontal flips. The models are trained for 90 epochs with a batch size of 256. We use SGD for optimization with a momentum of 0.9 and weight decay set to 1e-4. The learning rate starts at 0.1 and is then divided by 10 at epochs 30, 60 and 80.

Figure 1: Comparison of Label Smoothing and LABO. (a) shows Label Smoothing; (b) shows LABO; (c) gives an illustration of a one-hot target, label smoothing and our LABO method for a translation pair sampled from the IWSLT'14 (DE-EN) dataset. It shows part of the probability vector when predicting the next ground-truth word "video". Our method assigns larger probabilities to context-relevant tokens compared to label smoothing.
FL utilizes the focal loss to assign smaller weights to well-learned tokens at every training iteration. AFL (Raunak et al., 2020) is a generalized focal loss that establishes a trade-off for penalizing low-confidence predictions. Bilingual Translation. Tab. 1 shows the results of the IWSLT'14 DE-EN, EN-DE, EN-FR, FR-EN and WMT'14 EN-DE, DE-EN translation tasks.

Figure 2: The distribution of probabilities of the selected predictions at each step of beam search for LS (left) and LABO (right) on splits with the most, medium and least average token frequencies. All experiments use beam size 5.

Table 2: BLEU scores for LS and LABO on multilingual translation tasks.

Table 3: Comparison between different smoothed-label methods. Average test accuracy and training time are reported. The training time is measured on a single NVIDIA V100 GPU.

Table 4: Performance of LS and LABO on splits with different average token frequencies.

Table 5: Comparison between different smoothed-label methods. Validation accuracy and training time are reported. The training time is measured on 4 NVIDIA V100 GPUs.