DITTO: Data-efficient and Fair Targeted Subset Selection for ASR Accent Adaptation

State-of-the-art Automatic Speech Recognition (ASR) systems are known to exhibit disparate performance on varying speech accents. To improve performance on a specific target accent, a commonly adopted solution is to fine-tune the ASR model using accent-specific labeled speech. However, acquiring large amounts of labeled speech for specific target accents is challenging. Choosing an informative subset of speech samples that is most representative of the target accents thus becomes important for effective ASR fine-tuning. To address this problem, we propose DITTO (Data-efficient and faIr Targeted subseT selectiOn), which uses Submodular Mutual Information (SMI) functions as acquisition functions to find the most informative set of utterances matching a target accent within a fixed budget. An important feature of DITTO is that it supports fair targeting for multiple accents, i.e., it can automatically select representative data points from multiple accents when the ASR model needs to perform well on more than one accent. We show that, compared to other speech selection methods, DITTO is 3-5 times as label-efficient in achieving its WER improvements on the IndicTTS and L2-Arctic datasets.


Introduction
State-of-the-art speech recognition systems have seen tremendous progress in the last few years, with end-to-end architectures becoming a default modeling choice. While end-to-end models yield impressive Word Error Rates (WERs) and work well for certain user populations (Rao et al., 2017; Chiu et al., 2018), they severely underperform when confronted with out-of-domain test utterances in target accents that are unseen or rarely seen during training (Feng et al., 2021; Koenecke et al., 2020).
A common solution (Shor et al., 2019; Sim et al., 2019) to address such mismatched settings is to adapt a well-trained, speaker-independent ASR model to the target setting using a small amount of accent-specific target data. While these works propose different fine-tuning schedules that would be most beneficial given the limited amount of target data, the question of which utterances should be chosen for transcription and subsequent fine-tuning has received far less attention. This question is extremely important, since procuring and labeling accent-specific data is challenging and expensive. Awasthi et al. (2021) present a method to select, within a fixed budget, the sentences most likely to induce ASR errors and record accented audio for them, resulting in higher-quality personalized ASR models for target accents compared to random selection. However, they assume access to a small seed set of labeled utterances from the target speaker. We address a more realistic setting in which we have access only to a limited number of unlabeled utterances from the target domain, and no access to accented speakers who could read out selected texts.

Our Contributions
In this work, we propose DITTO, a data-efficient and fair targeted subset selection approach that makes use of a suite of submodular mutual information (SMI) functions (originally defined in Iyer et al. (2021)). For a specific target accent, we are given access to a small number (20 in our experiments) of unlabeled speech utterances, called the target (or query) set. We aim to identify the most informative subset of speech utterances, from a large unlabeled pool of diverse accents, that best matches the target set. We procure the best matching subset by maximizing an SMI function instantiated using pairwise similarities between speech representations. We find DITTO to be an effective targeted subset selection technique for adapting ASR models to accents at multiple granularities: within Indian accents and across accents around the world. DITTO uses a limited transcription budget, i.e., just around 20-35% of that required by random selection. Furthermore, we show that DITTO can fairly select subsets that cover multiple target accents using a facility location based SMI function.

Related Work
A number of works have studied subset selection for speech recognition. Wei et al. (2014a,b, 2013) use submodular function-based subset selection on generated transcripts to find a minimal set of ASR training data, and Wu et al. (2007) use an entropy measure for the same. Asami et al. (2015) employ a joint Kullback-Leibler divergence-based subset selection on out-of-domain samples for ASR adaptation across acoustic characteristics such as speaker, noise, and recording devices. Similarly, Liu et al. (2015) study subset selection to obtain low-vocabulary speech corpora for ASR, while Kirchhoff and Bilmes (2014) use a submodular approach for data selection in machine translation. Many recent papers (Yuan et al., 2019; Syed et al., 2017) have studied uncertainty- and gradient-based approaches for active learning to reduce the transcription time for ASR models, while Hamanaka et al. (2010) use a committee-based active learning method to select speech utterances.
A number of approaches have studied adaptation to atypical speech patterns like accented and dysarthric speech, such as Shor et al. (2019) and Tomanek et al. (2021), which fine-tune a subset of layers using labeled data from targeted accents. Sun et al. (2018) employ domain adversarial training to adapt across accents. Awasthi et al. (2021) address the reverse of our setting: they determine the sentences on which a model is most error-prone and record utterances for them. While this can be effective for user-driven personalization, our method is suited to settings in which the speech utterances are fixed, and the only actionable item is to transcribe a subset of them. All these approaches need data specifically from the target domain to be labeled for training/fine-tuning.
Finally, a number of recent works on data selection have leveraged the submodular mutual information functions used in this work for targeted subset selection. Kaushal et al. (2020) employ the SMI functions for query focused and privacy-preserving summarization, while Kothawade et al. (2022) utilize the SMI functions for improving the model performance on targeted slices. Recently, Kothawade et al. (2021) proposed an active learning approach using the SMI functions for rare classes, redundancy, and OOD data.

Submodular Mutual Information (SMI) Functions
Submodular Functions: Let U denote the ground set of n data points, U = {1, 2, 3, ..., n}, and let f : 2^U -> R be a set function. The function f is submodular (Fujishige, 2005) if it satisfies the diminishing marginal returns property: f(S ∪ {j}) - f(S) >= f(T ∪ {j}) - f(T) for all S ⊆ T ⊆ U and j ∈ U \ T. Submodularity ensures that a greedy algorithm achieves a bounded approximation factor when maximizing f (Nemhauser et al., 1978). Submodular Mutual Information (SMI): Given sets S, T ⊆ U, the submodular mutual information (SMI) (Gupta and Levin, 2020; Iyer et al., 2021) is defined as I_f(S; T) = f(S) + f(T) - f(S ∪ T). Intuitively, this function measures the similarity between S and T, and we refer to T as the target set. In the setting considered in this paper, the target set T (also called the query set) consists of a small set of unlabeled utterances from an accent, and U is a large unlabeled set of utterances from multiple accents. To find an optimal subset given a target set T, we define g_T(S) = I_f(S; T) for S ⊆ U and maximize it. Using a greedy algorithm, these submodular functions can be efficiently optimized to within an approximation factor (1 - 1/e) of the global maximum.
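The SMI definition above can be made concrete with a small sketch. The following is an illustrative numpy implementation (ours, not the paper's code): facility location serves as the underlying submodular function f, and the toy similarity values are made up. The SMI of two sets of mutually similar points is high, while it is near zero for dissimilar sets.

```python
import numpy as np

def facility_location(sim, S):
    """f(S) = sum_i max_{j in S} sim[i, j], summed over the whole ground set."""
    if len(S) == 0:
        return 0.0
    return float(sim[:, list(S)].max(axis=1).sum())

def smi(f, sim, S, T):
    """Submodular mutual information I_f(S; T) = f(S) + f(T) - f(S u T)."""
    return f(sim, S) + f(sim, T) - f(sim, set(S) | set(T))

# Toy 1-D "utterances": two nearby points and two far-away points,
# with an RBF similarity kernel.
x = np.array([0.0, 0.1, 5.0, 5.1])
sim = np.exp(-(x[:, None] - x[None, :]) ** 2)

close = smi(facility_location, sim, {0}, {1})  # similar pair -> large SMI
far = smi(facility_location, sim, {2}, {1})    # dissimilar pair -> near zero
```

Here `close` is much larger than `far`, matching the intuition that I_f(S; T) measures how well S matches T.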

SMI functions used in DITTO
We use the SMI functions recently introduced in (Iyer et al., 2021) and their extensions introduced in (Kaushal et al., 2020;Kothawade et al., 2022). For any two data points i ∈ U and j ∈ T , let s ij denote the similarity between them.
Graph Cut MI: The submodular mutual information (SMI) instantiation of graph cut (GCMI) is defined as (Kothawade et al., 2022; Iyer et al., 2021): I_f(S; T) = 2 Σ_{i ∈ S} Σ_{j ∈ T} s_ij. (1) Since maximizing GCMI maximizes the joint pairwise similarity with the target set, it leads to a selection similar to the target set T. GCMI models only query relevance and does not select based on diversity (Kothawade et al., 2022).
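A small numpy sketch (ours; names hypothetical) makes the query-relevance-only behavior explicit: under GCMI, the marginal gain of a candidate utterance depends only on its similarity to the target set, not on what has already been selected, so no diversity is enforced.

```python
import numpy as np

def gcmi(sim_ST):
    """Eq. (1): I_f(S; T) = 2 * sum of pairwise similarities between S and T.
    sim_ST is the |S| x |T| block of the similarity kernel."""
    return 2.0 * float(sim_ST.sum())

def gcmi_gain(sim_UT, j, S):
    """Marginal gain of adding ground-set point j: 2 * sum_k sim_UT[j, k].
    Note the current selection S does not appear at all."""
    return 2.0 * float(sim_UT[j].sum())

rng = np.random.default_rng(0)
sim_UT = rng.random((5, 3))  # toy 5-point ground set, 3-point target set

# Gain is identical whether S is empty or already contains points:
g_empty = gcmi_gain(sim_UT, 2, [])
g_full = gcmi_gain(sim_UT, 2, [0, 1])
```

Because GCMI is modular in S, its value decomposes as the sum of per-point gains, which is why it "models only query-relevance."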

[Figure 1: ASR Accent Adaptation using DITTO. A pre-trained ASR network that underperforms on target accents is fine-tuned on target accent-specific utterances, selected from an unlabeled set with all accents using a few target accent exemplars, yielding improved performance on the target accents.]
Facility Location MI: The Facility Location Mutual Information (FLMI) function (Kothawade et al., 2022) takes the expression: I_f(S; T) = Σ_{i ∈ T} max_{j ∈ S} s_ij + Σ_{j ∈ S} max_{i ∈ T} s_ij. (2) FLMI jointly models representation and query relevance: it measures a bidirectional similarity, rewarding selected data points that are most relevant to the query set and vice versa.
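A minimal numpy sketch of Eq. (2) (ours, not the authors' implementation; `sim_TS` denotes the |T| x |S| block of the similarity kernel):

```python
import numpy as np

def flmi(sim_TS, eta=1.0):
    """Eq. (2): sum over T of the best match in S, plus (scaled by eta)
    the sum over S of the best match in T -- a bidirectional similarity."""
    if sim_TS.size == 0:
        return 0.0
    return float(sim_TS.max(axis=1).sum() + eta * sim_TS.max(axis=0).sum())

# Toy example: 2 target points, 2 selected points.
sim = np.array([[0.9, 0.1],
                [0.2, 0.8]])
value = flmi(sim)  # (0.9 + 0.8) + (0.9 + 0.8)
```

Both directions contribute: every target point must be covered by some selected point (representation), and every selected point must resemble some target point (query relevance).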

DITTO: Our Data-efficient and Fair Targeted Subset Selection Method
In this section, we discuss DITTO, our data-efficient and fair targeted subset selection method for ASR accent adaptation. We show that DITTO can select fair and target-relevant subsets, which is critical for fine-tuning ASR models on one or more accents. The main idea of our method is to instantiate a submodular mutual information (SMI) function using appropriate similarity kernels and to jointly optimize it for targeting and fairness. We summarize our method in Algorithm 1 and illustrate it in Fig. 1. Concretely, we are provided a few unlabeled utterances (a target set T) from the accent to which we would like the ASR model M to be adapted. The goal is to select the most informative subset S with respect to the target T from a large corpus U of unlabeled data, called the ground set. We are given a budget constraint on the total duration of the selected utterances. This corresponds to the transcription budget, since the selected utterances must later be transcribed by a human.
We begin by extracting accent feature representations for the unlabeled set U and the target set T; we discuss the feature representation in Sec. 5. Next, we compute a similarity matrix X, an RBF kernel containing pairwise similarities X_ij between all data points i ∈ T and j ∈ U. We use X to instantiate one of the SMI functions I_f(S; T) discussed in Sec. 3. Specifically, we optimize g_T(S) = I_f(S; T) over S ⊆ U subject to the knapsack (budget) constraint c(S) <= B, where c(S) is the total duration (in seconds) of the selected utterances and B is the time budget. We use the greedy algorithm (Mirzasoleiman et al., 2015; Nemhauser et al., 1978; Lin and Bilmes, 2010) with memoization (Iyer and Bilmes, 2019) under this knapsack constraint. Specifically, given the current set S, we select the item i = argmax_{j ∈ U \ S} g_T(j | S), stopping when no remaining item fits within the budget c(S) <= B. Once we obtain the set S as the solution of this optimization problem, we obtain transcriptions of S from a human and fine-tune the ASR model using S and its labels.
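The selection loop above can be sketched as follows (an illustrative simplification of Algorithm 1, in our own notation and without the memoization mentioned above; the `gain` callback stands in for the marginal gain g_T(j | S) of whichever SMI function is instantiated):

```python
def greedy_select(gain, durations, budget):
    """Budget-constrained greedy selection: repeatedly add the affordable
    point with the largest marginal gain until nothing else fits.

    gain(j, S) -> marginal gain of adding point j to the current set S.
    durations[j] -> duration (seconds) of utterance j, i.e., its cost c(j).
    """
    S, spent = [], 0.0
    remaining = set(range(len(durations)))
    while True:
        feasible = [j for j in remaining if spent + durations[j] <= budget]
        if not feasible:
            break
        best = max(feasible, key=lambda j: gain(j, S))
        S.append(best)
        spent += durations[best]
        remaining.remove(best)
    return S, spent

# Toy usage with a modular gain (each point has a fixed value):
vals = [5.0, 1.0, 3.0]
dur = [10.0, 4.0, 6.0]
chosen, used = greedy_select(lambda j, S: vals[j], dur, budget=12.0)
```

In the toy run the highest-gain utterance (10 s) is picked first, after which no remaining utterance fits in the 12 s budget, so the loop terminates with c(S) <= B.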

Scalability of DITTO:
The selection time of DITTO is dominated by the instantiation and maximization of the SMI function. Since all SMI functions used in this work are graph-based, they require the computation of a similarity kernel. Hence, the main components contributing to the time complexity of DITTO are the similarity kernel computation and the greedy maximization. The FLMI and GCMI functions require a t x u similarity matrix, where t = |T| is the number of points in the target set and u = |U| is the number of points in the unlabeled ground set. This leads to an O(tu) complexity for computing the kernel. Given a selection budget of B, the time complexity of the greedy maximization for FLMI and GCMI is O(tuB), which is linear in both the budget and the ground-set size.
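For illustration, the t x u RBF kernel can be computed as below (a sketch under the assumption of generic d-dimensional feature vectors; `gamma` is a hypothetical bandwidth parameter, not one the paper specifies):

```python
import numpy as np

def rbf_kernel(Q, G, gamma=1.0):
    """t x u RBF kernel between target features Q (t x d)
    and ground-set features G (u x d): exp(-gamma * ||q_i - g_j||^2)."""
    sq_dists = ((Q[:, None, :] - G[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Shapes: 20 target utterances, 1000 ground-set utterances, 39-dim features.
Q = np.random.rand(20, 39)
G = np.random.rand(1000, 39)
K = rbf_kernel(Q, G)  # K.shape == (20, 1000), i.e., O(tu) entries
```

The t x u shape is exactly what makes DITTO's kernel cost O(tu) rather than the O(u^2) needed by purely submodular baselines over the whole ground set.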

Datasets
We experiment with adapting ASR models on two public datasets, viz., IndicTTS (Vignesh et al., 2016) and L2-Arctic (Zhao et al., 2018). IndicTTS contains around 4.3K samples per accent (average utterance duration 4.92s) with a single speaker per accent, while L2-Arctic contains around 3K samples per accent (average duration 3.6s) with multiple speakers per accent.

ASR Model Description and Fine-tuning Details
Following Awasthi et al. (2021), our pre-trained model is based on the QuartzNet-15x5 (Kriman et al., 2020) architecture, which is fully convolutional with residual connections. It is trained on LibriSpeech (Panayotov et al., 2015) for 400 epochs using the CTC loss (Graves et al., 2006) and yields a Word Error Rate (WER) of 3.90 on the test-clean split of LibriSpeech. This model is fine-tuned on our selected targeted subsets S of accented speech to minimize the CTC loss using the NovoGrad optimizer (Ginsburg et al., 2019) for 100 epochs, with a batch size of 16, a linearly decaying learning rate of 10^-5, and early stopping based on the dev set. In all our experiments, we report results averaged over three runs using three different seeds and report error bars in all plots. We used an NVIDIA GTX 1080 Ti GPU for all runs.

Experimental Procedure and Results
We use a transcription budget of 20.5 minutes for single-accent targeting and 41 minutes when an accent pair is targeted. The average utterance durations are 4.92s in IndicTTS and 3.6s in L2-Arctic, so these budgets correspond to roughly 250 and 500 samples on IndicTTS and 340 and 680 samples on L2-Arctic, respectively. In our proposed method, we use the approach outlined in Algorithm 1, with the SMI function I_f set to either FLMI or GCMI. We consider these two since they are computationally efficient (see Sec. 4) and model different characteristics in their selections. As discussed in Sec. 3.1, GCMI models only query relevance, while FLMI models both query relevance and diversity. As we shall see in Sec. 6.2 and Sec. 6.3, GCMI is an apt choice for targeting in some scenarios, whereas FLMI outperforms all methods when fairness and diversity must be jointly modeled with targeting.
We also compare against two submodular functions that are well-known for subset selection tasks: Facility Location and Log Determinant. Facility Location ("FL"): The facility location function is known to select a representative subset and has been used extensively for speech data subset selection (Wei et al., 2014a, 2013). Using the same notation as in Sec. 4, where S denotes the subset of utterances selected from the unlabeled set U, the FL function is defined as: f(S) = Σ_{i ∈ U} max_{j ∈ S} s_ij. Log Determinant ("LogDet"): The log determinant function models diversity and is central to determinantal point processes (DPPs). The LogDet function is defined as: f(S) = log Det(X_S), where Det(.) is the determinant and X_S denotes the rows and columns of the similarity matrix indexed by the elements of S. For a fair evaluation, the FL and LogDet functions are optimized using the same greedy strategy (Mirzasoleiman et al., 2015) as the SMI functions used in DITTO. Note that FL and LogDet are computationally expensive since they require the computation of an O(n^2) similarity matrix over the ground set, whereas the SMI functions can be instantiated and optimized in time linear in the ground-set size.
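To illustrate why LogDet models diversity, consider a toy sketch (ours, with made-up kernel values): the log-determinant of the similarity submatrix for a near-duplicate pair is far lower than for a dissimilar pair, so greedy maximization of LogDet avoids redundant selections.

```python
import numpy as np

def logdet(sim, S, eps=1e-8):
    """f(S) = log Det(X_S): log-determinant of the S-by-S similarity
    submatrix, with a tiny ridge for numerical stability."""
    X_S = sim[np.ix_(S, S)] + eps * np.eye(len(S))
    sign, ld = np.linalg.slogdet(X_S)
    return ld

# RBF-style similarity matrices for two candidate pairs:
dup = np.array([[1.00, 0.99],
                [0.99, 1.00]])   # near-duplicates -> det ~ 0.02
div = np.array([[1.00, 0.05],
                [0.05, 1.00]])   # dissimilar pair -> det ~ 1.0

penalty = logdet(dup, [0, 1])    # strongly negative
reward = logdet(div, [0, 1])     # near zero
```

The redundant pair is heavily penalized relative to the diverse pair, which is precisely the DPP-style diversity behavior the baseline relies on.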

Targeted Subset Selection for Single-accents
In this section, we analyze the performance of DITTO for procuring a subset targeted to a single accent, followed by fine-tuning an ASR model using the selected subset. For evaluation, we study the Word Error Rate (WER) of the targeted accent by evaluating the fine-tuned model on a held-out test set containing utterances of the targeted accent. Along with the WER, we also report a Targeted percentage (T%) that denotes the fraction of the total budget spent on utterances selected from the targeted accent. We conduct extensive experiments on the IndicTTS (Vignesh et al., 2016) and L2-Arctic (Zhao et al., 2018) datasets (see Sec. 5.1 for details), targeting all accents in both datasets, one accent at a time. For each accent (around 4.3K samples in IndicTTS and 3K samples in L2-Arctic), we create data splits by using 70% of the data as the unlabeled set U and drawing a small target set T of size 20. Of the remaining 30%, we use 27% as the test set and 50 samples from the remaining 3% as the fine-tuning dev set. In L2-Arctic, which has an equal number of samples from each speaker of an accent, we ensure an equal split across speakers in our accent-specific query, test, and dev sets.
We present the targeted subset selection and fine-tuning results for the IndicTTS dataset in Tab. 1 and for L2-Arctic in Tab. 2. We observe that the SMI functions, GCMI and FLMI, outperform the other methods in terms of WER for all target accents. This is because the SMI functions identify utterances from the target accent almost perfectly in most cases. Interestingly, GCMI performs better than FLMI on IndicTTS in 6 out of 8 accents due to its strong predilection towards query relevance. On the other hand, GCMI performs worse than FLMI (although better than the other methods) in terms of WER on the L2-Arctic dataset. This is because FLMI is significantly better at targeting than GCMI and the other methods. Note that IndicTTS is simpler since it contains data from only one speaker per accent, whereas L2-Arctic has comparatively more complex acoustics as it contains data from multiple speakers for each accent. We believe that the representation modeling capability and bidirectional similarity of FLMI allow for better targeting on complex datasets like L2-Arctic, whereas for datasets with lower acoustic complexity like IndicTTS, GCMI, which models only query relevance, works well. We also present the variation in WER improvements across a range of budgets in Fig. 2 for one accent picked from each dataset. The marked horizontal lines indicate how much budget each method needed for the same WER gain. For Assamese (ASM), Random needs 80 minutes to improve WER by 8 points, while FLMI and GCMI achieve this in 18 and 19.5 minutes respectively. For Chinese (CHN), for a gain of 5.1 points, Random needs 80 minutes, while FLMI and GCMI need only 17 and 27 minutes respectively. The SMI functions are thus 3-5 times as label-efficient as random selection.

Fair Targeted Subset Selection for Multiple-accents
Another important setting of high practical value is adapting an ASR model to multiple target accents. In a real-world scenario, practitioners may want to improve the performance of the ASR model on accents that are under-performing. In another deployment scenario, one may need to fine-tune the ASR model on multiple accents in order to deploy it in a region where the population speaks in more than one accent. To tackle such scenarios, an ideal selection function would model fairness and select an approximately equal number of utterances from each accent. To study this, we evaluate the performance of DITTO when targeting pairs of accents, followed by fine-tuning the ASR model on the selected targeted subset. For evaluation, we study the WER and average WER for both targeted accents by evaluating the fine-tuned ASR model on separate held-out test sets containing utterances from each of the targeted accents. As in the single-accent experiments, we also report the Targeted percentage (T%) for each targeted accent. In addition, we report a Targeted Fairness (TF) score for the accent pair, computed as the product of the targeted percentages of the two targeted accents. We multiply the final score by 4 so that the TF score is 1 when the selected subset perfectly targets both accents, i.e., achieves a 50% targeted percentage for each targeted accent.
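The TF score is a one-line computation; the sketch below (ours, with targeted percentages expressed as fractions of the budget) shows how it rewards balanced selections and penalizes skew towards one accent.

```python
def targeted_fairness(t_pct_a, t_pct_b):
    """TF = 4 * T%_A * T%_B. Equals 1.0 when each target accent
    receives exactly 50% of the selected budget."""
    return 4.0 * t_pct_a * t_pct_b

balanced = targeted_fairness(0.5, 0.5)  # perfect fairness -> 1.0
skewed = targeted_fairness(0.9, 0.1)    # one accent dominates -> 0.36
```

Note that a selection that mostly picks non-target accents (e.g., 20% and 20%) also scores low (0.16), so TF jointly reflects targeting and fairness.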
For our experiments, we consider pairs of the three worst-performing accents from the IndicTTS and L2-Arctic datasets as target accent pairs. We present results for three target accent pairs from IndicTTS in Tab. 3: i) Assamese and Malayalam, ii) Malayalam and Rajasthani, and iii) Assamese and Rajasthani. We use the data splits created in Sec. 6.2: the ground and test sets remain the same, whereas the query set for each accent pair is formed by taking 10 utterances per accent from the accent-specific query sets of Sec. 6.2.
We observe that the SMI functions (GCMI and FLMI) outperform the other methods in terms of Avg. WER and TF score. Interestingly, we also observe that GCMI often favors a single accent: MAL when MAL and ASM are targeted, RAJ when RAJ and MAL are targeted, and RAJ when RAJ and ASM are targeted. As a result, GCMI obtains a lower TF score than FLMI. It is worth noting that FLMI achieves a TF score as high as 1 (see Tab. 3's Asm-Mal section) due to its ability to jointly model representation. We find that GCMI tends to favor a particular accent A when the pairwise similarity values between utterances belonging to accent A are higher than those for accent B. In Fig. 3, for the ASM-MAL accent pair, we illustrate the Avg. WER improvement and the duration of utterances selected from both accents across a wide range of budgets. Notably, FLMI continues to select fairly from both accents, while GCMI favors the MAL accent over ASM.
To compare the Targeted Fairness between FLMI and GCMI, we visualize a t-SNE plot of the IndicTTS dataset embedded using MFCC features in Fig. 5. As shown in the legend of Fig. 5, each color represents a particular accent, and the selected data points are denoted in black. The query data points from the ASM and MAL accents are shown by yellow stars. We observe that the data points selected by FLMI are representative of the query, as they are spread well across MAL (cluster A) and ASM (clusters B, C and D). On the other hand, data points selected by GCMI are mainly concentrated in the bigger cluster centers of MAL (cluster A) and ASM (cluster D), while completely missing clusters B and C. This is again a consequence of the fact that FLMI jointly models representation and query relevance, whereas GCMI focuses only on query relevance.
We conduct a similar analysis for the L2-Arctic dataset. We present results for pairs of the three worst-performing target accents from L2-Arctic in Tab. 4: i) Arabic and Chinese, ii) Arabic and Vietnamese, and iii) Chinese and Vietnamese. Consistently, the SMI functions outperform the other methods in terms of Avg. WER and TF score, and FLMI performs best across all accent pairs. In Fig. 4, for the CHN-VTN accent pair, we show the Avg. WER improvement and the duration of utterances selected from both accents across a wide range of budgets. We observe that FLMI achieves the highest Avg. WER improvement and selects the most utterances from both target accents, demonstrating that it achieves robust targeting and fairness.

Conclusion
In this work, we propose DITTO, a data-efficient and fair targeted subset selection method for ASR accent adaptation. DITTO utilizes submodular mutual information (SMI) functions to find representative speech utterances belonging to the target accent within a limited budget. We show that the SMI functions consistently outperform other methods for targeting. We also demonstrate that DITTO can target multiple accents fairly, which is beneficial for deploying ASR models in regions with populations that speak more than one accent.

Limitations
Like existing selection methods, our method needs a reasonable feature embedding for accent representation in order to target accents effectively, and MFCC features are not the best choice for representing accent information. Some accents may be more difficult to represent than others, which also lowers fairness scores for those accents. For instance, in one of our experiments where the Manipuri accent was paired with the Rajasthani or Assamese accents, we observe that acquiring a fair subset using any selection strategy is challenging (see Tab. 5). Although FLMI achieved a higher TF score than the other methods, it was lower than the TF scores obtained on other accent pairs (see Tab. 3 and Tab. 4). This is because the pairwise similarity scores of utterances within the Manipuri accent are lower than those of other accents. These lower pairwise similarity scores lead to lower marginal gains during greedy maximization and are a consequence of poor feature representations that encode insufficient information about the Manipuri accent. On another note, a risk associated with the targeting ability of DITTO is that it could be misused to create models that are unfair to certain populations. For future work, it will be interesting to evaluate the performance of DITTO on larger datasets and in other diverse settings (e.g., out-of-distribution accents).

Acknowledgments and Disclosure of Funding
This work is supported by an Amazon Research Award to Preethi Jyothi, Ganesh Ramakrishnan and Rishabh Iyer, and by the National Science Foundation under Grant No. IIS-2106937 awarded to Rishabh Iyer. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of Amazon or the National Science Foundation.