Accented Speech Recognition With Accent-specific Codebooks

Speech accents pose a significant challenge to state-of-the-art automatic speech recognition (ASR) systems. Degradation in performance across underrepresented accents is a severe deterrent to the inclusive adoption of ASR. In this work, we propose a novel accent adaptation approach for end-to-end ASR systems using cross-attention with a trainable set of codebooks. These learnable codebooks capture accent-specific information and are integrated within the ASR encoder layers. The model is trained on accented English speech, while the test data also contains accents that were not seen during training. On the Mozilla Common Voice multi-accented dataset, we show that our proposed approach yields significant performance gains not only on the seen English accents (up to $37\%$ relative improvement in word error rate) but also on the unseen accents (up to $5\%$ relative improvement in WER). Further, we illustrate benefits in a zero-shot transfer setup on the L2-Arctic dataset. We also compare the performance with other approaches based on accent adversarial training.


Introduction
Accents in speech typically refer to the distinctive way in which words are pronounced by diverse speakers. While a speaker's accent may be primarily derived from their native language, speech accents are also influenced by various other factors related to geographic location, educational background, and socio-economic and socio-linguistic factors like race, gender and cultural diversity (Benzeghiba et al., 2007). It is therefore infeasible to build automatic speech recognition (ASR) systems that comprehensively cover speech accents during training. In such scenarios, novel speech accents continue to have an adverse effect on ASR performance (Beringer et al., 1998; Aksënova et al., 2022). While humans effectively recognize speech from new and unseen accents (Clarke and Garrett, 2004), ASR systems show substantial degradation in performance when dealing with new accents that are unseen during training (Chu et al., 2021).
Prior works attempting to address accent-related challenges for ASR can be categorized into three groups: i) multi-accent training (Huang et al., 2014; Elfeky et al., 2016), ii) accent-aware training using accent embeddings (Jain et al., 2018) or adversarial learning (Sun et al., 2018), and iii) accent adaptation using supervised (Rao and Sak, 2017; Winata et al., 2020) or unsupervised techniques (Turan et al., 2020). While partial success has been achieved using most of these approaches, the development of robust speech recognition systems that are invariant to accent differences between training and test remains a challenging problem.
In this work, we propose a new codebook-based technique for accent adaptation of state-of-the-art Conformer-based end-to-end (E2E) ASR models (Gulati et al., 2020). For each of the accents observed in the training data, we define a codebook with a predefined number of randomly-initialized vectors. These accent codes are integrated with the self-attended representations in each encoder layer via the cross-attention mechanism, similar to the perceiver framework (Jaegle et al., 2021). The ASR model is trained on multi-accented data with standard E2E ASR objectives. The codes capture accent-specific information as the training progresses. During inference, we propose a beam-search decoding algorithm that searches over a combined set of hypotheses obtained by using each set of accent-specific codes (once for each seen accent) with the trained ASR model. On the Mozilla Common Voice (MCV) corpus, we observe significant improvements on both seen and new accents at test-time compared to the baseline and existing supervised accent-adaptation techniques.
Our main contributions are:
• We propose a new accent adaptation technique for Conformer-based end-to-end ASR models using cross-attention over a set of learnable codebooks. Our technique comprises learning accent-specific codes during training and a new beam-search decoding algorithm to perform an optimized combination of the codes from the seen accents. We demonstrate significant performance improvements on both seen and unseen accents over competitive baselines on the MCV dataset.
• Even in a zero-shot setting involving a new accented evaluation set, L2-Arctic (Zhao et al., 2018), we show significant improvements using our codebooks trained on MCV.
• We publicly release our train/development/test splits spanning different seen and unseen accents in the MCV corpus. Reproducible splits on MCV have been entirely missing in prior work and we hope this will facilitate fair comparisons across existing and new accent-adaptation techniques.

Related Work
Traditional cascaded ASR systems (García-Moral et al., 2007) handled accents by either modifying the pronunciation dictionary (Humphries and Woodland, 1997; Weninger et al., 2019) or modifying the acoustic model (Fraga-Silva et al., 2014; Yoo et al., 2019). More recent work on accented ASR has focused on building end-to-end accent-robust ASR models. Towards this, there are two sets of prior works: accent-agnostic approaches and accent-aware approaches.
Accent-agnostic ASR. Such approaches force the model to disregard the accent information present in the speech and focus only on the underlying content. Prior work based on this approach uses adversarial training (Ganin et al., 2015) or similarity losses. Using domain adversarial training, with the discriminator being an accent classifier, has shown significant improvements over standard ASR models (Sun et al., 2018). Pre-training the accent classifier (Das et al., 2021b) and clustering-based accent relabelling (Hu et al., 2020) have also led to further performance improvements. The use of generative adversarial networks for this task has also been explored (Chen et al., 2019). Rather than being explicitly domain-adversarial, other accent-agnostic approaches use cosine losses (Unni et al., 2020) or contrastive losses (Khosla et al., 2020; Han et al., 2021) to make the model accent-neutral.
These losses force the model to output similar representations for inputs with the same underlying transcript.
Accent-aware ASR. Accent-aware approaches feed the model additional information about the accent of the input speech. Early work in this category focused on using the multi-task learning (MTL) paradigm (Zheng et al., 2015; Jain et al., 2018; Das et al., 2021a) that jointly trains accent-specific auxiliary tasks with ASR. Different types of embeddings, like i-vectors (Saon et al., 2013; Chen et al., 2015), dialect symbols (Li et al., 2017), embeddings extracted from TDNN models (Jain et al., 2018) or from wav2vec2 models trained as classifiers (Li et al., 2021a; Deng et al., 2021), have also been explored for accented ASR. Many simple ways of fusing accent information with the input speech have been previously investigated. This fusion can either be a sum (Jain et al., 2018; Viglino et al., 2019; Li et al., 2021a), a weighted sum (Deng et al., 2021) or a concatenation (Li et al., 2021a,b). A few works also explore the possibility of merging both accent-aware and accent-agnostic techniques within the same model (Zhou et al., 2023).
Our work also proposes an accent-aware approach. However, unlike prior work that relies on accent information computed beforehand, we learn accent information embedded within codebooks during training. Additionally, instead of simply concatenating input speech with accent embeddings, we propose a learned fusion of accent information with speech representations using cross-attention. Prior work by Deng et al. (2021) demonstrates fine-grained integration of accent information. However, our proposed framework integrates accent information as part of end-to-end training, resulting in robust adaptation.

Methodology
Base model. Our base architecture uses the standard joint CTC-Attention framework (Kim et al., 2016) with an encoder (ENC), an attention-based decoder (DEC-ATT), and a Connectionist Temporal Classification (CTC) (Graves et al., 2006) decoder (DEC-CTC). ENC converts the input speech into hidden representations h, which are used by DEC-ATT and DEC-CTC to jointly predict the output token sequence $y = \{y_1, \ldots, y_j, \ldots, y_U\}$. DEC-ATT is an autoregressive decoder that maximizes the conditional likelihood of producing an output token $y_j$ given h and the previous labels $y_1, \ldots, y_{j-1}$. In contrast, DEC-CTC uses CTC to maximize the likelihood of y given h by marginalizing over all alignments. The encoder is implemented using Conformer layers (Gulati et al., 2020) and the decoder is implemented using Transformer layers (Vaswani et al., 2017).
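The CTC branch marginalizes over all frame-level alignments via the forward (alpha) recursion. The sketch below is a pedagogical NumPy implementation of this marginalization for a single utterance, not the ESPnet code used in our experiments; the interpolation weight of 0.3 in the joint objective is illustrative.

```python
import numpy as np

def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """CTC loss for one utterance via the forward algorithm.
    log_probs: (T, V) per-frame log-probabilities; target: list of label ids."""
    ext = [blank]
    for t in target:                       # interleave labels with blanks
        ext.extend([t, blank])
    T, S = log_probs.shape[0], len(ext)
    NEG = -1e30                            # stand-in for log(0)

    def logsumexp(vals):
        m = max(vals)
        return NEG if m <= NEG / 2 else m + np.log(sum(np.exp(v - m) for v in vals))

    alpha = np.full((T, S), NEG)
    alpha[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [alpha[t - 1, s]]                      # stay on same symbol
            if s > 0:
                cands.append(alpha[t - 1, s - 1])          # advance one symbol
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])          # skip over a blank
            alpha[t, s] = logsumexp(cands) + log_probs[t, ext[s]]
    # Valid paths end in the last label or the trailing blank.
    return -logsumexp([alpha[T - 1, S - 1], alpha[T - 1, S - 2]])

def joint_loss(ctc_nll, att_nll, ctc_weight=0.3):
    """Joint CTC-Attention objective: interpolate the two negative
    log-likelihoods (Kim et al., 2016)."""
    return ctc_weight * ctc_nll + (1 - ctc_weight) * att_nll
```

For a 2-frame uniform distribution over 3 symbols and target [1], the three valid alignments (1,1), (1,blank), (blank,1) each have probability 1/9, so the NLL is log(3). Production toolkits compute this batched on GPU; the recursion itself is the same.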
For our proposed technique, we introduce the following three essential modifications to the base architecture: i) constructing codebooks that can encode accent-specific information (Section 3.1); ii) enabling fine-grained integration of accent information with a Conformer-based ASR model using cross-attention (Section 3.2); iii) modifying beam-search decoding for inference in the absence of accent labels at test-time (Section 3.3).

Codebook Construction
Consider M seen accents that are observed during training. We generate M codebooks, one per accent, where the i-th codebook learns latent codes specific to the i-th accent. During training, we use a deterministic gating scheme to select the codebook specific to the underlying accent of the training example. To support the selection of a single accent codebook during inference, when the accent labels for the test utterances are unknown, we modify the beam-search decoder to search across all seen accents. We found such hard gating to be critical to achieving ASR performance improvements. In Section 6, we compare the proposed model with a soft gating mechanism that works with standard beam-search decoding.
Each codebook contains P d-dimensional vectors that we refer to as codebook entries. The entries belonging to the i-th codebook are generated as $c^i_p = \mathrm{Embedding}(p)$ for $p \in \{1, \ldots, P\}$, where Embedding is a standard embedding layer. In the following sections, we use $c = \{c_1, \ldots, c_P\}$ to refer to the codebook corresponding to the underlying accent label for a given training example.
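As a concrete illustration, codebook construction and the hard-gating selection can be sketched as follows; NumPy arrays stand in for the trainable embedding layer, and the initialization scale is an assumption:

```python
import numpy as np

def build_codebooks(M, P, d, seed=0):
    """Create M codebooks, one per seen accent, each holding P randomly
    initialized d-dimensional entries (trainable in the real model)."""
    rng = np.random.default_rng(seed)
    return rng.normal(scale=0.02, size=(M, P, d))

def select_codebook(codebooks, accent_id):
    """Deterministic hard gating: during training, pick the codebook of the
    utterance's accent label; at test-time the accent is unknown, so the
    modified beam search tries every codebook instead."""
    return codebooks[accent_id]

codebooks = build_codebooks(M=5, P=50, d=256)
c = select_codebook(codebooks, accent_id=2)   # codebook of the third accent
```

The values M = 5, P = 50 and d = 256 match the configuration used in our main experiments.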

Encoder with Accent Codebooks
Figure 1 illustrates the overall architecture with the proposed integration of a codebook into each encoder layer via a cross-attention sub-layer. We refer to this new accent-aware encoder module as $\mathrm{ENC}_a = \{\mathrm{ENC}_a^1, \ldots, \mathrm{ENC}_a^L\}$, consisting of a stack of L identical Conformer layers. The i-th encoder layer $\mathrm{ENC}_a^i$ takes both $h^{i-1}$ and c as inputs, and produces $h^i$ as output. Codebook c is shared across all the encoder layers. All the vectors involved in the computation of attention scores are d = 256 dimensional.
A cross-attention sub-layer integrates accent information from codebook c into each encoder layer. This sub-layer takes both the self-attended contextualized representations H and the codebook c as its inputs and generates codebook-specific information relevant to the speech frames of this contextual representation. More formally, the operations within encoder layer $\mathrm{ENC}_a^i$ can be written as follows:

$H = \mathrm{LayerNorm}(h^{i-1} + \mathrm{MultiHeadAttn}_{self}(h^{i-1}, h^{i-1}, h^{i-1}))$
$C = \mathrm{LayerNorm}(H + \mathrm{MultiHeadAttn}_{cb}(H, c, c))$
$F = \mathrm{LayerNorm}(C + \mathrm{Convolution}(C))$
$h^i = \mathrm{LayerNorm}(F + \mathrm{Linear}_{pw}(F))$

where the second equation, the cross-attention over codebook entries, is our change to the standard Conformer encoder layer. The above equations can be viewed as a stack of four independent blocks, each having a residual connection and being separated by layer normalization.
$\mathrm{MultiHeadAttn}(Q, K, V)$ refers to a standard multi-head attention module (Vaswani et al., 2017), with Q, K and V denoting queries, keys and values respectively. $\mathrm{MultiHeadAttn}_{self}$ is a self-attention module where each frame of $h^{i-1}$ attends to every other frame, thus adding contextual information. Convolution is a stack of three convolution layers: a depth-wise convolution sandwiched between two point-wise convolutions, each having a single stride. The input and output of the Convolution block are d-dimensional vectors. A position-wise feed-forward layer $\mathrm{Linear}_{pw}$ is made up of two linear transformations with a ReLU activation. This takes a d-dimensional output from the convolution module as input and produces a d-dimensional output vector, with a hidden layer of 2048 dimensions.
$\mathrm{MultiHeadAttn}_{cb}(H, c, c)$ is our proposed cross-attention module over codebook entries, where each frame of input H attends to all the entries in the codebook c to generate attention scores. These attention scores are further used to generate frame-relevant information $\hat{C}$ as a weighted average of codebook entries. We elaborate further on the attention computation for a single attention head in $\hat{C}$. Let $H_j$ refer to the j-th frame in H. The attention distribution $\{\alpha_{j,1}, \ldots, \alpha_{j,P}\}$, where $\alpha_{j,k}$ is the attention probability given by $H_j$ to the k-th codebook entry, is computed as:

$\alpha_{j,k} = \mathrm{softmax}_k\!\left(\frac{(H_j W^i_q)(c_k W^i_k)^\top}{\sqrt{d}}\right)$

where $W^i_q$, $W^i_k$ and $W^i_v \in \mathbb{R}^{d \times d}$ are learned projection matrices. These attention scores are further used to generate the weighted average of codebook entries in $\hat{C}$:

$\hat{C}_j = \sum_{k=1}^{P} \alpha_{j,k} \, (c_k W^i_v)$

The cross-attention sub-layer is further modified with a residual connection and layer normalization to generate the final codebook-infused representations in C.
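A single head of this cross-attention can be sketched in NumPy as follows; random matrices stand in for the learned projections $W^i_q$, $W^i_k$, $W^i_v$, and the residual connection and layer normalization are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def codebook_cross_attention(H, c, Wq, Wk, Wv):
    """Each speech frame H_j attends to all P codebook entries; returns the
    frame-relevant information C_hat (one vector per frame) and the
    attention distribution alpha over codebook entries."""
    Q = H @ Wq                       # (T, d): queries from speech frames
    K = c @ Wk                       # (P, d): keys from codebook entries
    V = c @ Wv                       # (P, d): values from codebook entries
    d = Q.shape[-1]
    alpha = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (T, P)
    return alpha @ V, alpha          # C_hat: (T, d)

rng = np.random.default_rng(0)
T, P, d = 7, 50, 256
H = rng.normal(size=(T, d))
c = rng.normal(size=(P, d))
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
C_hat, alpha = codebook_cross_attention(H, c, Wq, Wk, Wv)
```

Note that, unlike self-attention, the key/value length here is the fixed codebook size P rather than the utterance length, so the added cost per layer is linear in the number of frames.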
Algorithm 1: Inference algorithm that performs joint beam-search over all accents. Our modifications to the standard beam search (Meister et al., 2020) are highlighted. Each beam entry is a triplet ⟨s, y, A⟩ where A refers to a seen accent. score_A() is a modified scoring function which uses the codebook for accent A during the forward pass.

Input: x: speech input; V: list of vocabulary tokens; n_max: maximum hypothesis length; k: maximum beam width; y: output prediction so far; score_A(., ., .): scoring function

Modified Beam-Search Algorithm
Since we do not have access to accent labels at test-time, we rely on either using a classifier to predict the accent or modifying beam-search to accommodate the prediction streams generated by all seen accent choices. Due to a large imbalance in the accent distribution during training, with certain seen accents dominating the training set, we find the classifier to be ineffective during inference. We elaborate on this further in Section 6.
Algorithm 1 shows our inference algorithm that performs a joint beam search over all the seen accents. Each beam entry is a triplet that expands each hypothesis using each seen accent. Scores for each seen accent are computed using a forward pass through our ASR model by invoking the codebook specific to that accent. The beam width threshold k is then applied to the expanded predictions across all seen accents.
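A toy version of this joint search can be sketched as follows; `score_fn` is a hypothetical stand-in for the accent-conditioned forward pass through the ASR model (real decoding scores with the joint CTC-Attention model and a language model):

```python
def joint_beam_search(score_fn, accents, vocab, n_max, k, eos="</s>"):
    """Joint beam search over seen accents. Each beam entry is a triplet
    (score, tokens, accent); hypotheses from all accents compete for the
    same k beam slots at every step."""
    beam = [(0.0, [], a) for a in accents]          # one seed per accent
    finished = []
    for _ in range(n_max):
        cand = []
        for s, y, a in beam:
            if y and y[-1] == eos:                  # hypothesis is complete
                finished.append((s, y, a))
                continue
            for v in vocab:                         # expand under accent a's
                cand.append((s + score_fn(a, y, v), y + [v], a))
        if not cand:
            break
        # Single top-k cut applied across ALL accents' expansions.
        beam = sorted(cand, key=lambda e: e[0], reverse=True)[:k]
    finished.extend(beam)
    return max(finished, key=lambda e: e[0])
```

Because the cut is shared, accents whose codebooks fit the utterance poorly are pruned early, while the surviving hypotheses commit to the best-matching seen accent.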

Datasets
All our experiments are conducted on the MCV_ACCENT dataset, extracted from the "validated" split of the Mozilla Common Voice English (en) corpus (Ardila et al., 2019). Overall, the 14 English accents present in MCV_ACCENT are divided into two groups of seen and unseen accents. Table 1 lists the accents belonging to each of these groups.

We create train, dev, and test splits that are speaker-disjoint. We construct two train sets, MCV_ACCENT-100 and MCV_ACCENT-600, comprising approximately 100 hours and 620 hours of labeled accented speech, respectively. Since MCV_ACCENT consists of many utterances that correspond to the same underlying text prompts, a careful division into dev/test sets that are disjoint in transcripts from the train set was performed. We train and validate the ASR models only on the seen accents, while the test data consists of both seen and unseen accents. The detailed statistics of our datasets are given in Table 2. For a quick turnaround, most of our experiments were conducted on MCV_ACCENT-100. For the experiments in Section 5.3, we use the 620-hour MCV_ACCENT-600. All the data splits mentioned above are available in our codebase, which should enable direct comparisons across accent adaptation techniques. More details about the construction of the datasets are provided in Appendix A.
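The speaker-disjointedness constraint can be illustrated with a minimal sketch; the 80/10/10 proportions and the (speaker, transcript) data layout are illustrative assumptions, and the actual splits are the ones released with our codebase:

```python
def speaker_disjoint_split(utterances, train_frac=0.8, dev_frac=0.1):
    """Split (speaker, transcript) pairs so that every speaker's utterances
    land in exactly one of train/dev/test."""
    speakers = sorted({spk for spk, _ in utterances})
    n = len(speakers)
    train_spk = set(speakers[: int(n * train_frac)])
    dev_spk = set(speakers[int(n * train_frac): int(n * (train_frac + dev_frac))])
    splits = {"train": [], "dev": [], "test": []}
    for spk, text in utterances:
        if spk in train_spk:
            splits["train"].append((spk, text))
        elif spk in dev_spk:
            splits["dev"].append((spk, text))
        else:
            splits["test"].append((spk, text))
    return splits
```

The released splits additionally control transcript overlap across splits (Appendix A), which this sketch does not attempt.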

Models and Implementation Details
We use the ESPnet toolkit (Watanabe et al., 2018) for all our ASR experiments. As is standard practice, we further add 3-way speed perturbation to our dataset before training. We use the default configurations specified in the train_conformer.yaml file provided in the ESPnet toolkit, with 12 encoder layers and 6 decoder layers, using the joint CTC-Attention loss (Kim et al., 2016). We use four attention heads to attend over 256-dimensional tensors. The position-wise linear layer operates with 2048 dimensions. We train the model for 50 epochs using 80-dimensional filter-bank features with pitch. In all our experiments, we apply a stochastic depth rate of 0.3, which we found to yield an absolute 2% WER improvement compared to a baseline system without this regularization enabled. During inference, we use a two-layer RNN language model (Mikolov et al., 2010) trained for 20 epochs with a batch size of 64. We conducted all our experiments on NVIDIA RTX A6000 GPUs.

Experiments and Results
Table 3 shows word error rates (WERs) comparing our best system (the codebook attend (CA) system) with five approaches:
1. Transformer baseline (Dong et al., 2018).
2. Conformer baseline (Gulati et al., 2020).
3. Adding i-vector features (Chen et al., 2015) to the filterbank features, as input to the Conformer baseline.
4. Conformer jointly trained with an accent classifier using multi-task learning (Jicheng et al., 2021).
5. Conformer with Domain Adversarial Training (DAT) (Das et al., 2021b), with an accent classifier at the 10th encoder layer.

Table 3: Comparison of the performance (WER %) of our architecture (codebook attend (CA)) with the baseline and other techniques on the MCV_ACCENT-100 dataset. Numbers in bold denote the best across baselines, and the green highlighting denotes the best WER across all experiments. Ties are broken using overall WER. CA: Codebook attend, i.e., cross-attention applied at all layers with 50 entries in each learnable codebook. † indicates statistically significant results compared to DAT (at p < 0.001 using the MAPSSWE test (Gillick and Cox, 1989)).
From Table 3, we observe that the Conformer baseline performs significantly better than the Transformer baseline. Adding i-vectors and multi-task training with an auxiliary accent classifier objective perform equally well and are comparable to the Conformer baseline, while using DAT improves over the Conformer baseline. Our system significantly outperforms DAT (at p < 0.001 using the MAPSSWE test (Gillick and Cox, 1989)) and achieves the lowest WERs across all the seen and unseen accents. We use 50 codebook entries for each accent and incorporate accent codebooks into each of the 12 encoder layers. Unless specified otherwise, we use this configuration in all subsequent experiments. Further ablations of these choices are detailed in Section 5.5.

To further validate the efficacy of our proposed approach using accent-specific codebooks, we perform zero-shot evaluations on the L2-Arctic dataset. We note here that we do not use any L2-Arctic data for finetuning; our ASR model is trained on MCV_ACCENT-100. Such a zero-shot evaluation helps ascertain whether our codebooks transfer well across datasets. The L2-Arctic dataset (Zhao et al., 2018) comprises English utterances spanning six non-native English accents, namely Arabic (ARA), Hindi (HIN), Korean (KOR), Mandarin (MAN), Spanish (SPA), and Vietnamese (VIA). Table 4 shows WERs achieved by our system in comparison to the baseline and other techniques. Our proposed method significantly outperforms all these approaches on every single accent (p < 0.001 using the MAPSSWE test (Gillick and Cox, 1989)).

Table 5: Comparison of the performance (WER %) of our approach with other methodologies on the MCV_ACCENT-600 dataset.

Effect of Training Data Size
Table 5 compares our proposed system with DAT and the Conformer baseline on the 600-hour MCV_ACCENT dataset. Compared to the Conformer baseline and DAT, the proposed CA approach shows a steady improvement on unseen accents, while resulting in a minor drop in performance on the seen accents. To discount the possibility that improvements using our proposed model could be attributed to an increase in the number of parameters, in Table 6 we compare our proposed system with multiple variants of the baseline Conformer model (referred to as Conf. in Table 3) where parameters are increased to be commensurate with our proposed model by either (1) increasing the number of encoder units (from 2048 → 2320) or (2) increasing the dimension used for attention computation (from 256 → 272). We observe a slight improvement over the standard baseline when the attention dimension is increased. However, compared to all these baselines, our proposed model still shows a statistically significant improvement at p < 0.001. To check the effectiveness of our approach on a balanced dataset, in Table 7 we compare our proposed system with the Conformer baseline on a 100-hour accent-balanced data split. Even on such a balanced dataset, our architecture shows a statistically significant improvement (at p = 0.005) compared to the baseline.

Ablation Studies
We present two ablation analyses examining the effect of changing the number of accent-specific codebook entries (P ) and the effect of applying cross-attention at different encoder layers.
The first five rows in Table 8 refer to the addition of codebooks to all encoder layers via cross-attention, with accent-specific codebook sizes (P) varying from 25 to 500. As P increases, the experiments show improved performance on seen accents but degraded performance on unseen accents, indicating that the codebooks begin to overfit to the seen accents. Our best-performing system with P = 50 performs well on seen accents while also generalizing to the unseen accents. As expected, using lower-capacity codebooks (P = 25) shows performance degradation.
The next five rows in Table 8 refer to codebooks with cross-attention introduced at varying encoder layers, with the number of codebook entries fixed at 50.
Since accent effects can be largely attributed to acoustic differences, we see that the early encoder layers closer to the speech inputs benefit most from the codebooks. Adding codebooks only to the last four or eight encoder layers is not beneficial.
Randomly initialized codebooks were observed to be as useful as learnable codebooks for self-supervised representation learning in Chiu et al. (2022). Motivated by this result, we experiment with randomly-initialized accent-specific codebooks that are not learned during training. The last row of Table 8 shows that random codebooks only cause a slight degradation in performance compared to the best-performing system, echoing the observations in Chiu et al. (2022).

Inference with a Single Accent
To understand the effectiveness of accent-specific codebooks, we conduct five experiments by committing to a single seen accent during inference. That is, we decode all the test utterances using a fixed accent label. Table 9 shows results from inferring with a single accent across both seen and unseen accents. For the seen accents, the diagonal contains the lowest WERs, indicating that the information learned in our codebooks benefits the accented samples. Furthermore, similar accents from geographically-close regions benefit each other. New Zealand-accented English speech achieves the best WERs using Australian accent-specific codebooks; Hong Kong, Indian, Philippines and Singapore accented test utterances prefer US-accented codebooks; and the Wales accent achieves its best results using England-specific codebooks.

Table 9: Comparison of the performance (WER %) of inference done using fixed accent labels.
The WER results achieved by our best-performing system in Table 3 are much lower than the best WER results achieved in these single-accent experiments. This indicates that one cannot directly map an unseen accent to an appropriate seen accent; therefore, making this decision independently for each utterance (as we propose to do in the joint beam search) is crucial.

Beam-Search Decoding Variants
All the results reported thus far use joint beam-search decoding. Table 10 shows a comparison of our proposed joint beam search (elaborated in Section 3.3) with other beam-search variants that incur varying inference overheads. $B_0$ in Table 10 refers to standard beam-search decoding over the Conformer baseline with a beam width of k. The settings $B_1$ and $B_2$ refer to running beam-search M times, once for each seen accent, and picking the best-scoring hypothesis among all predictions. For the $B_1$ setting, we use a beam width of k for each seen accent. Naturally, this incurs a large decoding overhead with a factor of M increase in inference time and changes the effective beam width to Mk. In the $B_2$ setting, we divide the beam width into M parts, each occupied by a specific accent, thus making the effective beam width k/M. The setting $B_1$ performs best, but significantly increases inference overhead. The $B_2$ setting is efficient but under-performs due to each accent being given only a restricted share of the beam.

Table 10: WER (%) of various inference algorithms described in Section 5.7 on the MCV_ACCENT-100 setup. Inference time gives a relative comparison of the time taken by each decoding variant, with standard beam search as the reference.

Active Accents during Joint Beam-search: Using joint beam-search decoding, it is possible for samples from different accents to get pruned in early iterations, leaving only one or two dominant accents active from the start. To check for this, we compute the distribution of samples in the beam across the five seen accents and plot the average entropy of this distribution across all test instances in Figure 3. It is clear that four to five seen accents remain active until time-step 20, after which certain accents gain more prominence.
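The entropy diagnostic underlying Figure 3 can be sketched as follows; a toy beam is represented here as a list of (score, hypothesis, accent) triplets, matching the beam-entry triplets of Algorithm 1:

```python
import math
from collections import Counter

def beam_accent_entropy(beam):
    """Entropy (in bits) of the distribution of beam entries over seen
    accents; high entropy means many accents are still active."""
    counts = Counter(accent for _, _, accent in beam)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A beam split evenly over four seen accents: entropy = log2(4) = 2 bits.
even_beam = [(0.0, [], a) for a in ["australia", "england", "india", "us"]]
# A beam collapsed onto a single accent: entropy = 0 bits.
collapsed = [(0.0, ["a"], "us"), (-1.0, ["b"], "us")]
```

Tracking this quantity per decoding step, averaged over the test set, produces the curve in Figure 3.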
Figure 4 shows both the probabilities across seen accents appearing in the beam for a single Walesaccented test sample, along with the entropy of this distribution.This shows how nearly all accents are active at the start of the utterance, with England becoming the dominant accent towards the end.
Alternatives to Joint Beam-search: We also explore two alternatives that learn accent labels within the ASR model itself: i) We jointly trained an accent classifier with ASR. During inference, this classifier provides pseudo-accent labels across seen accents that we use to choose the codebook.
ii) We adopted a gating mechanism inspired by Zhang et al. (2021) that adds a learnable gate to each codebook entry. Unlike our current deterministic policy of picking a fixed subset of codebook entries, the learned gates are trained jointly with ASR to pick a designated codebook entry corresponding to the underlying accent of the utterance.
During inference, the learned gates determine the codebook entries to be used for each encoder layer. Both these techniques performed better than the Conformer baseline but were equivalent in performance to the DAT approach (Das et al., 2021b). We hypothesize that this could be due to the lack of a strong accent classifier (or a lack of appropriate learning in the gates to capture accent information).
Our joint beam-search decoding bypasses this requirement by searching across all seen accents.
Why do we see Performance Improvements on Unseen Accents? For test utterances from unseen accents, our model is designed to choose (seen) accent codebooks that best fit the underlying (unseen) accent. This is somewhat analogous to how humans use familiar accents to tackle unfamiliar ones (Anderson, 2018; Levy et al., 2019). During inference, our model searches through seen accent codebooks and chooses entries that are most like the unseen accents in the test instances.

Conclusion
In this work, we propose a new end-to-end technique for accented ASR that uses accent-specific codebooks and cross-attention to achieve significant performance improvements on seen and unseen accents at test time. We experiment with the Mozilla Common Voice corpus and show detailed ablations over our design choices. We also empirically analyze whether our codebooks encode information relevant to accents. The effective use of codebooks for accents opens up future avenues to encode non-semantic cues in speech that affect ASR performance, such as types of noise, dialects, and emotional styles of speech.

Limitations
We identify a few key limitations of our proposed approach: • The codebook size is a hyperparameter that needs to be finetuned for each task.
• We currently employ accent-specific codebooks, one for each accent. This does not scale very well and also does not enable sharing of codebook entries across accent codebooks. Instead, we could use a single (large) codebook and use learnable gates to pick a subset of codebook entries corresponding to the underlying accent of the utterance.
• Our proposed joint beam-search leads to a 16% increase in computation time at inference. This can be made more efficient as part of future work.
• Our joint beam-search allows each utterance at test-time to commit to a single seen accent. However, parts of an utterance might benefit from one seen accent, while other parts of the same utterance might benefit from a different seen accent. Such a mix-and-match across seen accents is currently not part of our approach. Accommodating such effects might improve our model further.

A Dataset Curation
To build MCV_ACCENT-600, we group the examples from MCV_ACCENT into seven buckets while preserving speaker disjointedness across the train, dev, and test sets. The buckets are visualized in Figure 5; bucket (4) refers to utterances that have exactly the same transcript but different speakers appearing across the train and dev sets. We wanted to include some transcript overlap across all combinations of the train, test, and dev splits, since the model could learn accent information from samples with the same transcripts and different underlying accents. We note that a majority of the dev and test samples are disjoint in both speakers and transcripts from the training set, for a true evaluation that does not benefit from having seen the same transcripts during training.
To create such a split, we loop over all the accents, and for every seen accent a_seen, we first filter out examples with transcripts that have been previously dealt with and then split the remaining unique transcripts from a_seen into seven buckets. For every bucket b, transcripts from b are further divided into n groups, where n is the number of benefactors for that bucket. As an example, for bucket (1) the value of n is 3. Let x_i be an utterance spoken by speaker s_j, which is put into the train set by bucket (1). Then, to maintain speaker disjointedness, we put all utterances spoken by s_j into the train set. The transcripts of these utter-

Numbers in bold denote the best across baselines, and the green highlighting denotes the best across all the experiments. Ties are broken using overall WER. CA L∈(i,...,j) (P = k): Codebook attend, with cross-attention applied at all layers from i to j and k entries per accent codebook. CA L∈(i,...,j) (P_rand = k): Similar to the previous setup, but with codebooks frozen during training.

Table 13: Comparison of the performance (WER %) of all the experiments mentioned for MCV_ACCENT-600. We follow the same notation as in Table 12.

Figure 2 :
Figure 2: Heatmap showing which codebooks are chosen during inference across seen and unseen accents. For example, the third cell in the first row shows that 92 out of 413 Australian-accented utterances used the codebook belonging to England during decoding.

Figure 3 :
Figure 3: Progression of average test entropy of the probability distribution across seen accents.

Figure 4 :
Figure 4: Progression of the probability/entropy across seen accents for a single Wales-accented test sample.

Figure 5 :
Figure 5: Illustration of the transcript-wise overlap between the train, test, and dev sets in terms of durations. (1) represents the duration of the group of utterances whose transcripts are present in all three splits. (7) denotes the duration of examples having transcripts found only in the train set.

Table 1 :
List of 5 seen and 9 unseen accents in the MCV_ACCENT corpus.

Table 7 :
Comparison of the performance (WER %) of our approach with the Conformer baseline on an accent-balanced MCV_ACCENT-100 dataset.

Table 8 :
Comparison of the performance (WER %) of different variants of our architecture. CA L∈(i,...,j) (P = k): Codebook attention applied at all layers from i to j with k entries per accent codebook. CA L∈(i,...,j) (P_rand = k): Similar to the previous setup, but with codebooks frozen during training. Accent-wise WER is shown in Appendix B and a few select examples are highlighted in Appendix C.

Table 12 :
Comparison of the performance (WER %) of all the experiments mentioned for MCV_ACCENT-100.