Partitioned Gradient Matching based Data Subset Selection for Compute-Efficient & Robust ASR Training


These successes in ASR have come at a cost, as most practical RNN-T models are trained on thousands of hours of labeled data (Rao et al., 2017; Zhao et al., 2021). Model training on these massive datasets leads to significantly increased training time, energy requirements, and consequently carbon footprint (Sharir et al., 2020; Strubell et al., 2019; Schwartz et al., 2020; Parcollet and Ravanelli, 2021). As per Parcollet and Ravanelli (2021), training an RNN-T model on Librispeech 960H (Panayotov et al., 2015) emits more than 10 kg of CO2 if trained in France, and considerably more in countries with more carbon-intensive energy grids. This is exacerbated by the many additional training runs required for hyper-parameter tuning. This warrants greener training strategies that rely on significantly fewer resources while still achieving state-of-the-art results.
One way to make ASR training more efficient is to train on a subset of the training data while ensuring minimal performance loss (Killamsetty et al., 2021a; Wei et al., 2014; Kaushal et al., 2019; Coleman et al., 2020; Har-Peled and Mazumdar, 2004; Clarkson, 2010; Mirzasoleiman et al., 2020; Killamsetty et al., 2021b; Liu et al., 2017). Since training on a subset reduces end-to-end training time, hyper-parameter tuning time is also reduced. While greedy subset selection algorithms employ various criteria to identify suitable training points, the process of forming the subset remains sequential and typically requires all candidate loss gradients in memory at once. For a large-scale speech corpus such as Librispeech (Panayotov et al., 2015), this requirement is difficult to meet. In this work, we propose a Partitioned Gradient Matching (PGM) approach that scales to the huge datasets used in ASR and takes advantage of distributed setups. To the best of our knowledge, this is the first such study for ASR systems.

Contributions of this work
The PGM Algorithm: We present PGM, a data subset selection algorithm that constructs partial subsets from partitions of the original dataset. This circumvents the need to load the entire dataset into memory at once, which is otherwise prohibitively expensive for ASR systems such as RNN-T (see Section 3).

PGM is a distributable algorithm: Training on a subset of the training data is beneficial only when the cost of selecting the subset is also low. Therefore, for subset selection algorithms to scale to the larger datasets used in speech recognition, they must work across multiple GPUs, just as training for ASR systems is distributed. In Section 4, we present PGM, which is well suited to ASR systems, more specifically RNN-T.

Trade-off between efficiency and accuracy: A subset selection algorithm has to balance the competing goals of efficiency and accuracy. We perform extensive experiments to demonstrate this trade-off for PGM and provide a general recipe for controlling it.

Effectiveness of PGM in a noisy ASR setting: A subset selection algorithm should work well even when the training data is corrupted with noise. We show the efficacy of PGM when a fraction of the labeled dataset is augmented with noise across varying signal-to-noise ratios.

Background: RNN Transducer
The RNN-T model (Graves et al., 2013; Graves, 2012) maps an input acoustic signal (x_1, x_2, ..., x_T) to an output sequence (y_1, y_2, ..., y_U), where each output symbol y_i ∈ M and M is the vocabulary. An RNN-T model consists of three components: (i) Transcription Network, which maps an acoustic signal (x_1, x_2, ..., x_T) to an encoded representation (h_1, h_2, ..., h_T), T being the length of the acoustic signal and x_i a W-dimensional feature representation; (ii) Prediction Network, a language model that maps the previously emitted non-blank tokens y_{<u} = y_1, y_2, ..., y_{u-1} to an output representation g_u for the next output token; (iii) Joint Network, which combines the Transcription Network representation h_t and the Prediction Network representation g_u to produce z_{t,u} using a feed-forward network J and a combination operator ⊕ (typically a sum).
During training, the output probability P_rnnt(y_{t,u}) over the output sequence y is marginalized over all possible alignments using an efficient forward-backward algorithm to compute the log-likelihood. The training objective is to minimize the negative log-likelihood of the target sequence.
P_rnnt(y_{t,u} | y_{<u}, x_t) = softmax(J(h_t ⊕ g_u))   (3)

For inference, decoding algorithms (Graves, 2012; Saon et al., 2020) attempt to find the best (t, u) and the corresponding output sequence y using a beam search. In this work, we use the gradients of the joint network layer (J) for PGM, since this linear layer fuses the audio (h_t) and text (g_u) representations.
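The fusion in Eq. (3) can be sketched in a few lines of NumPy. The additive ⊕ and the softmax follow the equation above, while the weight matrix W, the bias b, and all dimensions here are illustrative placeholders, not the sizes used in the actual recipe.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_network(h, g, W, b):
    """z[t, u] = softmax(J(h_t + g_u)) with J a single linear layer."""
    # h: (T, H) transcription-net outputs, g: (U, H) prediction-net outputs
    fused = h[:, None, :] + g[None, :, :]   # (T, U, H), ⊕ taken as a sum
    logits = fused @ W + b                  # (T, U, |M|)
    return softmax(logits)

rng = np.random.default_rng(0)
T, U, H, V = 5, 3, 8, 10                    # toy sizes, not the recipe's
h, g = rng.normal(size=(T, H)), rng.normal(size=(U, H))
W, b = rng.normal(size=(H, V)), np.zeros(V)
P = joint_network(h, g, W, b)
print(P.shape)                              # (5, 3, 10)
```

Each slice P[t, u] is a distribution over the vocabulary M, which the forward-backward algorithm then marginalizes over alignments.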

Limitations of existing subset selection algorithms
One approach to selecting a subset of points from the entire dataset is to rank points based on their suitability. This ranking can be done either via a static metric such as diversity or feature representativeness (Wei et al., 2014; Kaushal et al., 2019), or via a dynamic metric using instance-wise loss gradients to construct the subset greedily (Mirzasoleiman et al., 2020; Killamsetty et al., 2021b,a). In the latter case, ranking and re-ranking happen using instance-wise loss gradients. Specifically, during the selection process, the loss gradients of the entire set of instances must be available in memory to perform greedy selection; otherwise, subset selection time would be prohibitively large owing to disk reads.
As keeping all the loss gradients in memory would be resource intensive, we employ the following approximations, also previously employed by (Mirzasoleiman et al., 2020; Killamsetty et al., 2021b,a): (i) only last-layer gradients are used, and (ii) subsets are constructed per class. The latter technique is not applicable to ASR systems, since ASR requires sequential decoding into a large vocabulary. Analogous to the last-layer approximation, for the RNN-T model we use the gradients of the joint network layer (J), which performs the important task of fusing speech (h_t) and text (g_u) features for sequence transduction. In Table 1 we present the memory footprint of the last-layer gradients obtained while training ResNet18 (He et al., 2016) on CIFAR10 (Krizhevsky, 2009), and of the joint-network-layer gradients of RNN-T trained on Librispeech 100H. We compare against ResNet18 on CIFAR10 because most subset selection algorithms have been applied in image classification settings. In the first column of Table 1, we present the memory footprint of a single instance's loss gradient. Clearly, the loss gradients used to train RNN-T have a much higher footprint than those used in image classification. The CIFAR10 dataset has 50,000 instances and Librispeech 100H has 20,539. In the second column, we present the total memory required to store all instance-wise loss gradients. The memory requirement for RNN-T's loss gradients is prohibitively huge. Thus, storing all instance-wise loss gradients at once is not feasible for RNN-T systems.

Figure 1: As PGM is an adaptive DSS algorithm, it is invoked after every R epochs of training RNN-T with stochastic gradient descent. At every selection step, using the latest set of parameters, PGM forms partial subsets via Gradient Matching (GM) across GPUs. These partial subsets are combined and used for the next R epochs of RNN-T training. This is repeated until the final set of parameters is obtained.

Killamsetty et al. (2021a) propose another technique, viz., the PerBatch variant, wherein one selects mini-batches (as used in SGD) instead of individual instances. The memory reduction from this technique is also limited for ASR systems such as RNN-T, since the batch sizes used are often small. For example, the batch size for CIFAR10 is typically 128, as proposed by He et al. (2016), whereas the batch size is 4 for Librispeech 100H in the SpeechBrain (Ravanelli et al., 2021) Librispeech RNN-T recipe. We present the total memory required to store all batch-wise loss gradients in the third column of Table 1. Although this requirement may seem satisfiable with high-end computing resources, the figures shown cover only the storage of the loss gradients. Adding other memory needs, such as space for the RNN-T model itself and for feature processing and gradient computation, one effectively needs much more GPU memory than the figures in Table 1 suggest. These memory issues become even more pronounced when performing subset selection on Librispeech 960H.
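The order of magnitude of these memory figures can be reproduced with a back-of-the-envelope calculation. The parameter counts below (a 512×10 ResNet18 classifier head, a 1024×1000 linear joint layer as described in Section 5) are our own assumptions for illustration, not numbers taken from Table 1.

```python
# Rough fp32 storage cost of keeping one gradient per unit (instance or batch).

def grad_storage_gb(n_params, n_units, bytes_per_float=4):
    """Gigabytes needed to store `n_units` gradients of `n_params` floats."""
    return n_params * n_units * bytes_per_float / 1e9

# ResNet18 last layer on CIFAR10: 512*10 weights + 10 biases, 50,000 instances
resnet_last = 512 * 10 + 10
# Assumed RNN-T joint layer: 1024 -> 1000 BPE linear layer, 20,539 instances
rnnt_joint = 1024 * 1000 + 1000

print(f"CIFAR10, per-instance:  {grad_storage_gb(resnet_last, 50_000):.2f} GB")
print(f"RNN-T,  per-instance:   {grad_storage_gb(rnnt_joint, 20_539):.2f} GB")
# Per-batch storage divides the number of stored gradients by the batch size:
print(f"RNN-T,  per-batch (B=4): {grad_storage_gb(rnnt_joint, 20_539 // 4):.2f} GB")
```

Even under these toy assumptions, the per-instance RNN-T gradients run into tens of gigabytes, while the CIFAR10 gradients stay around a gigabyte, matching the qualitative gap described above.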
Another problem with existing subset selection algorithms is that they are sequential in nature. This prevents the selection algorithm from enjoying the speedups achievable with standard techniques such as parallelizing across multiple GPUs, and may make subset selection a bottleneck when training RNN-T on datasets of the scale of Librispeech. Therefore, there is a need for a data subset selection algorithm that does not need all the loss gradients at once to form a subset and that can be distributed across GPUs.

Partitioned Gradient Matching Algorithm
Let U = {(x_i, y_i)}_{i=1}^N denote the set of training examples and V = {(x_j, y_j)}_{j=1}^M the validation set. Let θ denote the ASR system's parameters, with θ^t the parameters at the t-th epoch. The training loss associated with the i-th instance is L^i_T(θ) = L_T(x_i, y_i, θ) = −ln Pr(y_i | x_i), and L_V denotes the validation loss. The training data is divided into D partitions, U = d_1 ∪ d_2 ∪ ... ∪ d_D. Let B be the batch size, b_n = N/B the total number of mini-batches, and b_k = k/B the number of mini-batches to be selected; each partition thus contains l = b_n/D mini-batches. Let ∇_θ L^{d_p}_{B_i}(θ) denote the loss gradient of the i-th mini-batch in partition d_p, and ∇_θ L^{d_p}_T(θ) the loss gradient of the entire partition. For each data partition d_p, we perform gradient matching (GM) by optimising

w^t_{d_p}, X^t_{d_p} = argmin_{w, X : |X| ≤ b_k/D} E_λ(w, X, L^{d_p}_T, ∇_θ L^{d_p}_T, θ^t),   (5)

where

E_λ(w, X, L, ∇_θ L, θ) = ∥ Σ_{i∈X} w_i ∇_θ L_{B_i}(θ) − ∇_θ L(θ) ∥ + λ∥w∥².

This selects a subset of batches X^t_{d_p} and associated weights w^t_{d_p} such that the weighted sum of the selected mini-batch loss gradients best approximates the loss gradient of the entire data partition d_p while honoring the budget constraint. We perform gradient matching on mini-batch loss gradients only, as this reduces memory needs. Similarly, we can define the gradient matching problem against the validation loss by replacing L^{d_p}_T with L_V in the objective above. The optimization problem in Eq. (5) is weakly submodular (Killamsetty et al., 2021a; Natarajan, 1995). Hence, we can effectively solve it with a greedy algorithm with approximation guarantees; we use the orthogonal matching pursuit (OMP) algorithm (Elenberg et al., 2018) to find the subset and its associated weights. The l_2 regularization term λ∥w∥² in Eq. (5) discourages large weights on any selected instance, thereby preventing the model from overfitting on a few samples.
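A minimal sketch of the per-partition gradient-matching step, assuming the partition's mini-batch gradients are stacked as rows of a matrix. This is a simplified OMP variant (greedy correlation-based selection followed by a ridge re-fit of the weights), intended only to illustrate the objective above, not the exact Elenberg et al. (2018) algorithm or the authors' implementation.

```python
import numpy as np

def omp_gradient_matching(G, target, budget, lam=1.0):
    """Pick `budget` rows of G (mini-batch gradients) and weights w so that
    w @ G[S] approximates `target` (the partition's full gradient)."""
    selected = []
    residual = target.copy()
    for _ in range(budget):
        # choose the batch gradient most correlated with the current residual
        scores = G @ residual
        scores[selected] = -np.inf          # never re-select a batch
        selected.append(int(np.argmax(scores)))
        # re-fit weights on the selected set (ridge-regularised least squares)
        A = G[selected]
        w = np.linalg.solve(A @ A.T + lam * np.eye(len(selected)), A @ target)
        residual = target - w @ A
    return selected, w

rng = np.random.default_rng(1)
G = rng.normal(size=(40, 16))       # 40 mini-batch gradients of dimension 16
target = G.mean(axis=0)             # gradient of the whole partition
S, w = omp_gradient_matching(G, target, budget=8)
err = np.linalg.norm(w @ G[S] - target)
print(len(S), err < np.linalg.norm(target))
```

The ridge term plays the role of the l_2 regularizer in Eq. (5): it shrinks the weights, so the weighted approximation error is strictly below the trivial w = 0 error.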
The complete algorithm is presented in Algorithm 1. In the algorithm, 'Val' is a boolean flag indicating whether to match the subset loss gradient with the validation-set loss gradient, as in noisy settings ('Val=True'), or with the training-set loss gradient ('Val=False'). Depending on this choice, we perform gradient matching with L^{d_p}_T (or L_V), the current model parameters θ^t, a budget b_k/D, and a stopping criterion ϵ. We describe gradient matching in detail in Algorithm 2. Once the batches to select are determined, we form X^t by adding all the samples constituting the selected mini-batches. The model is then trained using mini-batch SGD: we randomly shuffle the elements of the subset X^t, divide them into mini-batches of size B, and run mini-batch SGD with instance weights.
The complete block diagram of PGM is presented in Figure 1. As the subset selection process depends on the model parameters, we repeat the subset selection every R epochs. For each data partition d_p, we perform gradient matching (GM) individually and obtain partial subsets X^t_{d_p}, sequentially, one after another. In a multi-GPU setting, however, since gradient matching within a data partition is independent of gradient matching in the other partitions, the gradient matchings can be executed in parallel. This allows one to take advantage of multi-GPU setups, which is critical for efficiently processing the large datasets typically used to train RNN-T. In Figure 1, we illustrate the parallelization of PGM on a system with G GPUs: G partial subsets are obtained in parallel, and this process is repeated D/G times.
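The partition-level parallelism can be illustrated with a toy scheduler. The GM stand-in below is deliberately simplistic (it keeps the batches nearest to the partition's mean gradient), and a thread pool stands in for the G GPUs; both are our illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def gm_on_partition(part_grads, budget):
    """Stand-in for Algorithm 2 on one partition: keep the `budget` batches
    whose gradients are closest to the partition's mean gradient."""
    target = part_grads.mean(axis=0)
    dists = np.linalg.norm(part_grads - target, axis=1)
    return np.argsort(dists)[:budget]

rng = np.random.default_rng(2)
N, D, per_partition_budget = 120, 6, 4
grads = rng.normal(size=(N, 8))                 # one gradient row per batch
partitions = np.array_split(np.arange(N), D)    # d_1, ..., d_D (disjoint)

# Each partition's GM is independent, so the D calls can run concurrently
# (on a real system: one partition per GPU, repeated D/G times for G GPUs).
with ThreadPoolExecutor(max_workers=3) as ex:
    partial = list(ex.map(
        lambda idx: idx[gm_on_partition(grads[idx], per_partition_budget)],
        partitions))

subset = np.concatenate(partial)                # X^t = union of partial subsets
print(len(subset))                              # D * budget = 24 indices
```

Because the partitions are disjoint, the partial subsets never overlap, and the union respects the overall budget b_k = D · (b_k / D).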

Connection to existing work
In this section we discuss the connection of PGM with GRAD-MATCHPB (Killamsetty et al., 2021a), where the subset is selected by solving

w^t, X^t = argmin_{w, X : |X| ≤ b_k} E_λ(w, X, L, L^{bn}_T, θ^t),

where L^{bn}_T denotes the set of all mini-batch gradients over the full dataset, and L is either the training loss of the entire dataset, L_T = E(L^{d_p}_T), or the validation loss L_V, depending on what sort of matching we seek. The problem finds a subset and associated weights such that the gradients of the selected mini-batches best approximate either the gradient of the full dataset or that of the validation set.

Algorithm 2 Gradient Matching (GM)
Require: loss of the entire dataset (train or validation) L, set of mini-batch gradients ∇_θ L^{bn}_T, current parameters θ^t, budget k, tolerance ϵ.

We show that the GRAD-MATCHPB objective is a lower bound on the PGM objective, that is,

E(E_λ(w^t_{d_p}, X^t_{d_p}, L^{d_p}_T, ∇_θ L^{d_p}_T, θ^t)) ≥ E_λ(w^t, X^t, L_T, L^{bn}_T, θ^t).

For the proof, we refer the reader to Appendix A.

Experiments
Datasets. We perform all our experiments on the Librispeech dataset (Panayotov et al., 2015). We present results on the medium-scale Librispeech 100H as well as the large-scale Librispeech 960H dataset.
Along with the standard Librispeech benchmark, we also perform experiments on a noisy variant, which we refer to as Librispeech-noise, where up to 30% of the examples in the original dataset are augmented with noise across varying signal-to-noise ratios (up to 15 dB).
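Mixing noise into a clean signal at a target SNR can be sketched as follows; the signals are synthetic and the function name is our own.

```python
import numpy as np

def add_noise_at_snr(signal, noise, snr_db):
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    ps = np.mean(signal ** 2)                       # signal power
    pn = np.mean(noise ** 2)                        # raw noise power
    scale = np.sqrt(ps / (pn * 10 ** (snr_db / 10)))
    return signal + scale * noise

rng = np.random.default_rng(3)
clean = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))  # toy 1 s "speech"
noise = rng.normal(size=16000)

noisy = add_noise_at_snr(clean, noise, snr_db=15)
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(achieved, 1))   # 15.0 dB by construction
```

Sampling `snr_db` per utterance (up to the 15 dB ceiling) and applying this to a random 30% of the corpus would reproduce the corruption scheme described above.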
Architecture. We perform all our experiments with SpeechBrain's (Ravanelli et al., 2021) Librispeech transducer recipe. The transcription network of the RNN-T consists of a CRDNN encoder with 2 CNN blocks followed by 4 layers of bi-LSTMs and then 2 DNN layers. The prediction network consists of an embedding layer followed by a single-layer GRU. The joint network is a single linear layer that projects 1024-dimensional representations onto a vocabulary of 1000 BPE units. Decoding uses a time-synchronous decoding algorithm (Graves, 2012; Hannun et al., 2019) with a beam size of 4 and an external transformer language model trained on the Librispeech corpus (Kannan et al., 2018; Hrinchuk et al., 2020; Wolf et al., 2019).
Training Details. We employ a learning rate of 2.0 with an annealing factor of 0.8, applied whenever the relative improvement on the validation loss falls below 0.0025 (sometimes referred to as the newbob scheduler). Training on Librispeech 100H is performed on two A100 40GB GPUs with an effective batch size of 8, whereas for Librispeech 960H we employ two A100 80GB GPUs with an effective batch size of 24. All training runs last 30 epochs. In all our experiments, the PGM algorithm is invoked every 5th epoch (R = 5) after a warm start (training on the full data) of 7 and 2 epochs on Librispeech 100H and Librispeech 960H respectively. The results for each setting are averaged over 3 runs with different random seeds.
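The annealing rule can be sketched as below; this is our illustration of the described newbob behaviour, not SpeechBrain's scheduler implementation.

```python
def newbob_anneal(lr, prev_loss, curr_loss, factor=0.8, threshold=0.0025):
    """Multiply the LR by `factor` when the relative validation-loss
    improvement falls below `threshold`; otherwise keep it unchanged."""
    rel_improvement = (prev_loss - curr_loss) / prev_loss
    return lr * factor if rel_improvement < threshold else lr

lr = 2.0
val_losses = [10.0, 9.0, 8.96, 8.95]   # improvement stalls on the last epoch
for prev, curr in zip(val_losses, val_losses[1:]):
    lr = newbob_anneal(lr, prev, curr)
print(lr)   # annealed once: 2.0 * 0.8 = 1.6
```

Only the last transition (8.96 → 8.95, a relative improvement of about 0.0011) falls below the 0.0025 threshold, so the learning rate is annealed exactly once.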
PGM Details. For subset selection with PGM, we use the gradients of the joint network parameters, which we believe carry the most concentrated information about the sequence. We freeze the rest of the network while computing the gradients of the RNN-T's joint network. We use D = 7 and D = 50 data partitions for Librispeech 100H and 960H respectively. Subset selection is performed using training-set loss gradients in the experiments on Librispeech 100H (Figures 2, 3) and Librispeech 960H (Table 2). For the experiments on Librispeech-noise (Table 3), we employ validation-set gradients for subset selection, since there we are also concerned with robustness in the presence of noise.
Baselines. We compare PGM against three intuitive baselines: (i) Random-Subset, in which the subset is obtained by choosing points with uniform probability; (ii) LargeOnly, in which each subset contains only the longest utterances by duration; (iii) LargeSmall, in which, for each subset size, half of the subset is filled with the shortest utterances and the other half with the longest, to remove the length bias of the LargeOnly baseline.
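The three baselines reduce to simple selection functions over utterance durations; a sketch with our own function names and a toy duration list:

```python
import random

def random_subset(durations, frac, seed=0):
    """(i) uniform sampling of a `frac` fraction of the indices."""
    idx = list(range(len(durations)))
    random.Random(seed).shuffle(idx)
    return sorted(idx[: int(len(idx) * frac)])

def large_only(durations, frac):
    """(ii) keep only the longest utterances."""
    k = int(len(durations) * frac)
    return sorted(sorted(range(len(durations)),
                         key=lambda i: -durations[i])[:k])

def large_small(durations, frac):
    """(iii) half shortest + half longest utterances."""
    k = int(len(durations) * frac)
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    return sorted(order[: k // 2] + order[len(order) - (k - k // 2):])

durs = [1.0, 9.0, 2.0, 8.0, 3.0, 7.0, 4.0, 6.0, 5.0, 10.0]  # seconds
print(large_only(durs, 0.4))    # the four longest utterances
print(large_small(durs, 0.4))   # two shortest + two longest
```

All three are gradient-free, which is what makes them cheap, and why Random-Subset attains higher raw speed-ups than PGM in Figure 3.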

Results
To assess the efficacy of PGM, we compare the word error rate (WER), relative test error, and speed-up against training on the entire dataset. We compute these metrics on both the Librispeech 100H and Librispeech 960H benchmarks. Additionally, we present the energy-ratio vs. relative-test-error trade-off on Librispeech 100H. In Figure 2, we compare the WER of PGM against the baselines for various subset sizes. With just 20% of the data, PGM yields a WER of 10.66, as opposed to 10.08 obtained by training on the full dataset. For Librispeech 100H, PGM consistently outperforms all the baselines, illustrating the benefit of selecting subsets via gradient matching. Also note that the Random-Subset baseline is consistently better than the heuristic baselines LargeOnly and LargeSmall. In Figure 3, we plot the speed-up against the relative test error for Librispeech 100H. While the Random-Subset baseline attains a higher speed-up than PGM because of its simple selection strategy, it also incurs a higher relative test error.
In Figure 4, we plot relative test error against energy efficiency relative to full training. We use pyJoules (https://pypi.org/project/pyJoules/) to measure the energy consumed by the GPU cores. With PGM, training time is halved and energy efficiency is doubled while incurring a relative test error of less than 5% compared to training on the entire dataset. At higher speed-ups, where WER degrades, the loss is relatively smaller for PGM than for the baseline. We do not show the energy efficiency of the LargeOnly and LargeSmall baselines, as their relative test error is consistently worse than that of the Random-Subset baseline, as shown in Figure 3.
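The reported metrics can be computed as below. The WER pair (10.66 vs. 10.08) is the 20% subset result quoted above; the timing numbers in the usage example are purely illustrative.

```python
def relative_test_error(wer_subset, wer_full):
    """Relative WER degradation w.r.t. full-data training, in percent."""
    return 100.0 * (wer_subset - wer_full) / wer_full

def speedup(time_full, time_subset):
    """End-to-end training-time ratio relative to full-data training."""
    return time_full / time_subset

print(round(relative_test_error(10.66, 10.08), 2))  # ~5.75 % for 20% subset
print(speedup(30.0, 15.0))                           # halved time -> 2.0x
```

A 2x speed-up at under 5% relative test error is exactly the operating point Figure 4 highlights.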
For the ASR task we recommend using at least 30% of the dataset for training the model, or using more warm-start epochs as described in Section 5.2.

Results on Librispeech-noise: We augment randomly selected signals from the dataset with noise across varying signal-to-noise ratios to mimic a more practical setting, where subset selection algorithms need to cope with noise while selecting useful subsets. We show results on the Librispeech-noise 100H and 960H datasets for different subset sizes in Table 3. PGM consistently outperforms the Random-Subset baseline across subset sizes, with lower relative test error compared to full training, while still yielding significant speed-ups and maintaining robustness.

Table 4:
                      Random-Subset   PGM
Overlap Index             20.2%      6.37%
Noise Overlap Index        0.82%     0.83%

Ablation Study
Next, we perform an ablation study to understand the effect of the learning rate on PGM for the Librispeech 100H dataset. Since subset selection reduces the amount of training data, the original recipes (especially the learning rate) for full-data training do not work as-is for PGM, owing to the distributed nature of the training.
In Table 6, we show the effect of the learning rate on multi-GPU training with PGM. The single-GPU recipe, borrowed as-is for the multi-GPU setting, performed poorly because the number of gradient updates in the distributed setting is halved. To overcome this, we doubled the learning rate to take larger steps and reach convergence within the same number of epochs.
We also perform ablations to understand why subsets selected by PGM tend to outperform the relatively simple Random-Subset baseline. We compute the following two metrics. Overlap Index (OI): the number of points common to the subsets selected in the last two selection rounds, divided by the subset size. This metric captures the diversity of the points selected across subsequent selection rounds.
Noise Overlap Index (NOI): the number of noisy points selected by the subset selection method, divided by the total number of noisy points. Both metrics are averaged over all runs with the same subset selection method.
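Both indices are straightforward set computations; the toy subsets below are illustrative.

```python
def overlap_index(prev_subset, curr_subset):
    """Fraction of the current subset also selected in the previous round."""
    prev, curr = set(prev_subset), set(curr_subset)
    return len(prev & curr) / len(curr)

def noise_overlap_index(subset, noisy_ids):
    """Fraction of all noisy points that ended up in the subset."""
    return len(set(subset) & set(noisy_ids)) / len(set(noisy_ids))

prev_round = [0, 1, 2, 3, 4, 5, 6, 7]
curr_round = [4, 5, 6, 7, 8, 9, 10, 11]
noisy = [5, 9, 20, 21]
print(overlap_index(prev_round, curr_round))    # 4 shared of 8 -> 0.5
print(noise_overlap_index(curr_round, noisy))   # 2 of 4 noisy selected -> 0.5
```

A low OI means the method keeps refreshing its subset between rounds, which is how Table 4 quantifies PGM's greater diversity.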
As shown in Table 4, PGM selects more diverse points across subset selection rounds, which explains its better generalization on the TEST-OTHER test set. At the same time, both methods select a similar number of noisy points during subset selection, indicating that PGM's greater diversity comes from the non-noisy points.
Finally, we study the effect of warm start on the performance of the PGM algorithm. Since PGM is an adaptive data selection algorithm, it needs a good starting point to compute reasonable gradient estimates for subset selection. Table 5 shows the warm-start epoch ablation on the TEST-CLEAN test set of Librispeech 960H. As we increase the warm start, the performance of the PGM algorithm improves at the cost of speed-up.

Comparing PGM and GRAD-MATCHPB
Running GRAD-MATCHPB on Librispeech is prohibitively expensive, since the memory required to store all the gradients would exceed the capacity of available commercial GPUs, as described in Section 3. To enable a comparison, we therefore report the phone error rate (PER) on the TIMIT phone recognition dataset (Garofolo, 1993), which contains 3,680 utterances from 630 speakers. Table 7 shows the error rates obtained with PGM, GRAD-MATCHPB, Random-Subset, and the other subset selection baselines LargeSmall and LargeOnly. For PGM we use D = 2 data partitions. The error rate of PGM is slightly higher than that of GRAD-MATCHPB, as the objective PGM minimises is an upper bound on the objective minimised by GRAD-MATCHPB, as discussed in Section 4.1. However, PGM's error rate is very close to that of GRAD-MATCHPB, indicating that partitioning does not deteriorate the bounds, while allowing PGM to scale to larger datasets and utilize multiple GPUs, and thereby to enjoy better speed-ups than GRAD-MATCHPB.
Statistical Significance: WER reductions using PGM over the Random-Subset baseline are statistically significant at p < 0.001 using a matched-pairs test.

Conclusion
We propose PGM, a distributable data subset selection algorithm that avoids the need to load the entire dataset at once by constructing partial subsets from smaller data partitions. PGM is an adaptive subset selection algorithm that improves the training time of ASR models while maintaining low relative test error compared to a model trained on the entire dataset. This speed-up improves the efficiency of the training process and consequently reduces the carbon footprint of training such models. Our approach performs consistently better than the Random-Subset baseline while providing good speed-up and robustness in the presence of noise. Although we test the method on the RNN-T model, we believe the approach is applicable to other end-to-end ASR architectures as well.

A Connections between PGM and GRAD-MATCHPB

Lemma 1 (triangle inequality). Let v_1, ..., v_τ be τ vectors in R^d. Then ∥Σ_{i=1}^τ v_i∥ ≤ Σ_{i=1}^τ ∥v_i∥.

Let the training data be divided into D partitions, i.e., U = d_1 ∪ d_2 ∪ ... ∪ d_D, where each partition comprises N/D instances. Let B be the batch size, b_n = N/B the total number of mini-batches, and b_k = k/B the number of batches to be selected. Let L^{d_p}_T be the training loss associated with a data partition d_p and ∇_θ L^{d_p}_T its gradient.

Table 1 :
Memory footprint of the last-layer gradients obtained while training ResNet18 on CIFAR10, and of the joint-network-layer gradients of RNN-T trained on Librispeech 100H. We use a batch size of 128 for CIFAR10 and 4 for Librispeech 100H.

Table 2 :
Results showing WER (Relative Test Error) and Speed Up on TEST-CLEAN and TEST-OTHER test splits of Librispeech 960H.

Table 3 :
Results showing WER on the TEST-CLEAN test set of Librispeech 100H trained on the noisy Librispeech dataset using PGM and Random-Subset.

Table 4 :
Overlap Indices -measures the overlap between consecutive subsets for PGM and Random-Subset methods.

Table 6 :
Effect of Learning Rate on WER for PGM on the TEST-CLEAN test set of Librispeech 100H.

In Table 2, we present a comparison of the PGM method with the baseline on the Librispeech 960H dataset for both the TEST-CLEAN and TEST-OTHER test sets. As shown in the table, with just 30% of the training data, PGM is within 10% relative test error (1% absolute error difference) of training on the full data, while yielding a speed-up of 2.64. Similar results hold on the challenging TEST-OTHER test set, which shows the better generalization of PGM in comparison to the Random-Subset baseline.
Corollary 1. The following inequalities hold between the objectives of PGM and GRAD-MATCHPB:

E(E_λ(w^t_{d_p}, X^t_{d_p}, L^{d_p}_T, ∇_θ L^{d_p}_T, θ^t)) ≥ E_λ(w^t, X^t, L_T, L^{bn}_T, θ^t)

and

E(E_λ(w^t_{d_p}, X^t_{d_p}, L_V, ∇_θ L^{d_p}_T, θ^t)) ≥ E_λ(w^t, X^t, L_V, L^{bn}_T, θ^t).

Proof sketch. Since the partitions are disjoint and of equal size, the full-data gradient is the average of the partition gradients, ∇_θ L_T(θ^t) = (1/D) Σ_{p=1}^D ∇_θ L^{d_p}_T(θ^t). Applying Lemma 1 to the partition-wise residuals Σ_{i∈X^t_{d_p}} w^t_i ∇_θ L^{d_p}_{B_i}(θ^t) − ∇_θ L^{d_p}_T(θ^t), which are obtained via gradient matching, gives

E(E_λ(w^t_{d_p}, X^t_{d_p}, L^{d_p}_T, ∇_θ L^{d_p}_T, θ^t)) ≥ ∥ Σ_{i∈X^t} w^t_i ∇_θ L_{B_i}(θ^t) − ∇_θ L_T(θ^t) ∥ + λ∥w^t∥² = E_λ(w^t, X^t, L_T, L^{bn}_T, θ^t).