Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech

Automatic Speech Recognition (ASR) systems are often optimized to work best for speakers with canonical speech patterns. Unfortunately, these systems perform poorly when tested on atypical speech and heavily accented speech. It has previously been shown that personalization through model fine-tuning substantially improves performance. However, maintaining such large models per speaker is costly and difficult to scale. We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to those of model fine-tuning, while only updating a tiny fraction (less than 0.5%) of the model parameters. We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.


Introduction
Automatic Speech Recognition (ASR) systems have achieved great success on a diverse set of acoustic and linguistic conditions, domains and speech patterns. State-of-the-art ASR systems are typically trained on tens of thousands of hours of speech data, and they perform well as long as these domains and conditions are well represented in the training data.
Understandably, the distribution of such data typically focuses on the canonical and typical spoken language patterns of the target language, i.e., regional dialects, common accents and frequent non-native accents. As a result, these systems may perform poorly on the tail of the distribution, which may include "heavily" accented speech and/or speech with atypical speech patterns (Darley et al., 1975). Atypical speech includes dysarthric speech, speech impairments (due to, for example, ALS, stroke, traumatic brain injury, Down syndrome, cerebral palsy, and MS), stuttering, deaf speech, or severe hyper-nasality due to cleft lip and palate. The lack of sufficient training data for these accents and atypical speech in the training distribution may result in a poor experience for a large segment of the population, leaving the less fortunate communities behind when it comes to speech-enabled technologies (Moore et al., 2018).
Studies on accented speech showed word error rates (WER) two to three times as high for accented speech as for the more standard US accent (Sainath et al., 2020; Ghorbani and Hansen, 2018). Even worse performance is observed for speakers with speech impairments (Moore et al., 2018). Our goal in this paper is to efficiently build scalable models that can adapt to non-canonical or atypical speech.
It has been shown that speech models originally developed for typical speech can be successfully fine-tuned with limited amounts of data to accented or impaired speech (Zhu et al., 2019; Shor et al., 2019; Gale et al., 2019; Mustafa et al., 2014; Biadsy et al., 2019; Doshi et al., 2020). Nevertheless, one major challenge with adapting models to either individuals or small groups of speakers is that it is necessary to scale the number of models that need to be maintained and hosted. For example, for smart devices powerful enough to run ASR models on-device, having to deploy and store an additional (potentially large) model may take up valuable on-device resources. Similarly, providing personalized models for a large population of speakers in a centralized/server-based scenario is not feasible.
We propose to mitigate this issue by injecting residual adapter layers into the architecture. Particularly, we use a bottleneck architecture that requires a tiny number of parameters (< 0.5% in our scenario) compared to the full model update via fine-tuning. Then, while keeping the original pretrained model parameters frozen, we update only the parameters of the adapter layers as we train on the custom data of interest. This provides an easy way to deploy and store adapted models: a (generic) base model is deployed to all clients, and each individual or group can receive a personalized set of trained adapter layers that is small in size.
The main contributions of the paper are as follows. We show that residual adapters work extremely well for acoustic adaptation of different speech models. We present extensive experiments with adapter layers in two very different ASR use cases: personalized models for atypical speech, and group models for accented speech. We also demonstrate that adapter layers work well in two different, state-of-the-art end-to-end ASR architectures, Recurrent Neural Network Transducers (RNN-T) and Transformer Transducers (T-T). This emphasizes the flexibility of this approach and its suitability as a standard alternative to full fine-tuning for arbitrary models. Our results clearly demonstrate how adaptation via adapter layers solves the issue of parameter inefficiency while largely retaining the significant adaptation gains achievable through model adaptation with in-domain data.

Related Work
Model fine-tuning has been successfully applied to domain adaptation for a variety of NLP tasks (Devlin et al., 2019; Sun et al., 2019), Machine Translation (Freitag and Al-Onaizan, 2016), and speech recognition and conversation systems, including accented and atypical speech (Zhu et al., 2019; Shor et al., 2019; Gale et al., 2019; Biadsy et al., 2019; Doshi et al., 2020). A major disadvantage of model fine-tuning is its parameter inefficiency, since it retrains all (or a large portion of) the model parameters on given task- or domain-specific data, resulting in a copy of the model for that task/domain. This is especially problematic for personalization of models due to the resulting high number of specialized models.
Concatenating input features and speaker-dependent vectors, such as i-vectors, is a parameter-efficient speaker adaptive approach that has been applied to both acoustic models as well as end-to-end ASR models (Saon et al., 2013, 2021). However, only moderate improvements have been achieved, even on typical speech. We speculate that such a static, low-dimensional representation may not be sufficient to capture the complex acoustic-phonetic patterns (e.g., consonant dropping and vowel dropping, extreme vowel reduction or lengthening, missing phonemes and even syllables, very irregular speaking rate and rhythm) often found in impaired speech.
Residual adapters were originally introduced by Rebuffi et al. (2017) for computer vision tasks as an alternative to fine-tuning. These first residual adapter modules consisted of a single projection layer added between layers of a pre-trained network. Houlsby et al. (2019) proposed a variation consisting of a bottleneck structure (down-projection through a feed-forward layer, ReLU, up-projection) for task-specific adaptation of BERT models. Adapter modules were added after each sub-layer within a transformer layer, and the weights of the residual adapters as well as existing layer normalization parameters were updated during training. Finally, Bapna and Firat (2019) have formulated a simplification of residual adapters in the context of domain adaptation for Machine Translation. Each residual adapter module has its own layer normalization block, followed by a down- and up-projection feed-forward network. They argued that by including layer normalization in the residual adapter block, these modules are pluggable into arbitrary blocks of pre-trained modules because they learn the activation pattern of the layer into which they are injected. Adapters have also been proposed on top of multi-lingual ASR models to further improve their performance (with reported WER improvements of up to 9% for some of the 5 languages of the multilingual model). Our focus is different in that we consider residual adapters in speech personalization scenarios where the number of adapted models is several orders of magnitude higher (e.g., tens of thousands of speakers with atypical speech and potentially hundreds of accents and dialects) and also not static (e.g., speech impairments often progress over time).
Learning Hidden Unit Contributions (LHUC) (Swietojanski et al., 2016) is another approach to more parameter-efficient speaker adaptation. Instead of updating all weights of a model, LHUC adds learned factors to the output of each hidden unit, modulating their amplitude. However, Bapna and Firat (2019) have shown that using residual adapters is much more effective.

Methods
For our experiments, we chose two state-of-the-art end-to-end ASR architectures: the Recurrent Neural Network Transducer (RNN-T) and the Transformer Transducer (T-T). Both architectures consist of three main components: an encoder, a prediction network that incorporates label history and serves as a language model component (decoder), and a joint layer that combines predictions made by the encoder and the prediction network and feeds into a softmax. All components of the two architectures are identical except the encoder stack. The prediction network consists of 2 uni-directional LSTM layers. Inputs are 128-dimensional log Mel features computed every 10 milliseconds. 4 consecutive features are stacked with a stride of 3 frames to yield a 512-dimensional input to the encoder every 30 milliseconds. Our output vocabulary consists of 4096 word piece tokens. Figure 1a shows a high-level overview of both architectures.
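The frame stacking described above (4 consecutive 128-dimensional frames, advanced by 3 frames) can be illustrated with a minimal sketch. The function name `stack_frames` and the handling of trailing frames (incomplete windows are dropped) are assumptions made for illustration; the source does not specify how the end of an utterance is treated.

```python
from typing import List


def stack_frames(frames: List[List[float]], stack: int = 4, stride: int = 3) -> List[List[float]]:
    """Concatenate `stack` consecutive frames, advancing `stride` frames per output."""
    stacked = []
    # Assumption: trailing frames that cannot fill a complete window are dropped.
    for start in range(0, len(frames) - stack + 1, stride):
        window = frames[start:start + stack]
        stacked.append([value for frame in window for value in frame])
    return stacked


# One second of dummy input: 100 frames (10 ms hop), 128 log-Mel values each.
frames = [[float(t)] * 128 for t in range(100)]
encoder_inputs = stack_frames(frames)
print(len(encoder_inputs), len(encoder_inputs[0]))
```

With a stride of 3 frames at a 10 ms hop, each stacked vector corresponds to a 30 ms step, and consecutive windows overlap by one frame.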
For RNN-T, the encoder consists of 8 LSTM layers; for T-T, we use 15 Transformer layers in the encoder. Both architectures are trained with the RNN-T loss (Bagby et al., 2018). To make T-T streamable, the attention calculation pays attention to past contexts only, which makes this architecture analogous to a uni-directional RNN.
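Restricting attention to past contexts can be sketched as a causal mask applied before the softmax. The snippet below is a simplified single-head illustration that assumes precomputed attention scores and an unlimited left context; it is not the T-T implementation, and the name `causal_attention_weights` is hypothetical.

```python
import math
from typing import List


def causal_attention_weights(scores: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax where frame t may only attend to frames 0..t."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                       # past and current frames only
        peak = max(visible)                          # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in visible]
        total = sum(exps)
        # Future positions receive exactly zero weight.
        weights.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return weights


# Toy 4x4 score matrix (values are arbitrary).
scores = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
weights = causal_attention_weights(scores)
```

Because no output frame depends on future frames, the encoder can be run incrementally as audio arrives, which is what makes the model streamable.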
We propose to utilize residual adapter modules as outlined by Bapna and Firat (2019) for our adaptation approach. Each residual adapter block starts with layer normalization applied to the inputs, followed by a feed-forward layer with down-projection to dimension $d_b$, a non-linear activation (ReLU), and another feed-forward layer with up-projection to the original input dimension $d_i$. All weights of the residual adapter module are randomly initialized. Figure 1b shows such a residual adapter module and its integration within the Transformer encoder. We add residual adapters to each encoder layer, resulting in 8 adapter layers for RNN-T and 15 adapter layers for T-T. The bottleneck dimension $d_b$ enables control of the number of parameters of each residual adapter module and thus the capacity available during adaptation.
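A minimal sketch of one such adapter block, applied to a single frame, is given below. It assumes a Gaussian initialization with a small scale for the projection matrices, zero-initialized biases, and a learned gain and shift in the layer normalization; none of these details are stated in the text, and the class name `ResidualAdapter` is hypothetical. The output of the bottleneck is added back to the input, which is the residual connection that gives the module its name.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Affine map; `weight` has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class ResidualAdapter:
    def __init__(self, d_i: int, d_b: int, seed: int = 0, init_scale: float = 0.01):
        rng = random.Random(seed)
        # Layer normalization gain and shift (assumed learnable).
        self.gain = [1.0] * d_i
        self.shift = [0.0] * d_i
        # Random initialization; distribution and scale are assumptions.
        self.w_down = [[rng.gauss(0.0, init_scale) for _ in range(d_i)] for _ in range(d_b)]
        self.b_down = [0.0] * d_b
        self.w_up = [[rng.gauss(0.0, init_scale) for _ in range(d_b)] for _ in range(d_i)]
        self.b_up = [0.0] * d_i

    def layer_norm(self, x: Vector, eps: float = 1e-6) -> Vector:
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        return [g * (v - mean) / math.sqrt(var + eps) + s
                for v, g, s in zip(x, self.gain, self.shift)]

    def __call__(self, x: Vector) -> Vector:
        h = self.layer_norm(x)
        h = [max(0.0, v) for v in linear(h, self.w_down, self.b_down)]  # down-projection + ReLU
        h = linear(h, self.w_up, self.b_up)                             # up-projection
        return [a + b for a, b in zip(x, h)]                            # residual connection

    def num_parameters(self) -> int:
        d_b, d_i = len(self.w_down), len(self.w_up)
        return 2 * d_i + (d_i * d_b + d_b) + (d_b * d_i + d_i)


adapter = ResidualAdapter(d_i=8, d_b=2)
frame = [float(i) for i in range(8)]
output = adapter(frame)
```

Under these assumptions the parameter count per adapter grows linearly in $d_b$ (roughly $2 d_i d_b$ for the two projections), which is why a small bottleneck keeps the per-speaker footprint low.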

Experiments
We analyze the performance of residual adapters as an alternative to model fine-tuning in two scenarios: adaptation to (a) atypical speech and (b) accented speech. For atypical speech, we build a personalized, speaker-dependent model for each speaker based on their data. For accented speech, we build per-accent models (i.e. speaker-independent models) and also experiment with a multi-accent adaptation scenario where one model is used for all covered accents. We conduct experiments using both ASR transducer architectures (RNN-T and T-T).

Accented Speech Dataset
For the accented speech adaptation task, we use Mozilla's Common Voice corpus (v5.1) (Ardila et al., 2020). It contains spoken utterances of users reading sentences. Recordings were verified by other contributors using a simple voting system. While the full corpus contains 60 languages, for this work we use a subset containing only English recordings. We make use of Common Voice's metadata to extract accent information and use all 10 accents with more than 1k recordings, including (in order of decreasing number of recordings): England (en), India (in), Australia (au), Canada (ca), Scotland (sc), Ireland (ir), New Zealand (nz), Africa (af), Singapore (si), and Philippines (ph).
We randomly split all utterances from each accent into train/dev/test subsets. The resulting subset sizes per accent are shown in Table 5. Table 1 shows utterance counts and length (in words and seconds) aggregated across all accents.

Atypical Speech Dataset
We use the Euphonia corpus for the atypical speech personalization task. This corpus consists of over 1 million utterance recordings of over 1000 anonymized speakers with different types and severity levels of speech impairments. Similar to the Common Voice corpus, all recordings in the Euphonia corpus are prompted speech. All our experiments are performed on a random subset of 100 speakers who have each recorded more than 1000 utterances. The resulting subset is very diverse, covering speakers with 15 different etiologies (31% with amyotrophic lateral sclerosis (ALS), 20% Down Syndrome, 14% cerebral palsy, 6% Parkinson's Disease, 5% hearing impairment, etc.) and different speech impairment severity levels (47% mild, 32% moderate, 21% severe). We use the predefined per-speaker train, dev, and test splits (80%/10%/10%). Table 1 shows utterance counts and length (in words and seconds) aggregated across all 100 speakers. Note that speakers with a speech impairment often have a lower speaking rate and frequently pause between individual words and before speaking. This is reflected in the relatively low ratio of words per second in the Euphonia corpus.

Experimental Settings
We follow a fine-tuning recipe similar to those described in prior work. We start from a speaker-independent base model pre-trained on 162k hours of typical (mostly American English) speech. This base model has been optimized to (a) be robust across various application domains and acoustic conditions, and (b) generalize well to unseen conditions (Narayanan et al., 2019). The same base model is used across all of our experiments.
We use SpecAugment (Park et al., 2019) for data augmentation, limit training to a maximum of 50k steps (atypical speech) and 30k steps (accented speech) and employ small batch sizes (32 for atypical speech, 256 for accented speech with RNN-T, and 128 for accented speech with T-T).
We only update the weights of the encoder layers, as our focus is on learning acoustic-phonetic variability as opposed to vocabulary and language variability. Accordingly, weights of the joint layer and the prediction network are always kept frozen. When training with residual adapters, we freeze all parameters of the base model and only update the residual adapter layers. Table 2 shows the resulting number of parameters updated for the different adaptation strategies. For example, residual adapters with a bottleneck dimension of 16 yield more than 100× parameter reduction compared to the encoder fine-tuning scenario.
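The difference between the two update strategies can be sketched as a filter over parameter names followed by a masked gradient step. The parameter names, the scalar "weights", and the plain SGD update below are hypothetical placeholders chosen for illustration; the source does not describe the optimizer or the naming scheme.

```python
from typing import Dict, List


def trainable_names(param_names: List[str], strategy: str) -> List[str]:
    """Names of the parameters that receive updates under a given strategy."""
    if strategy == "adapters":
        # Base model fully frozen; only adapter weights are trained.
        return [n for n in param_names if "/adapter/" in n]
    if strategy == "encoder":
        # Encoder fine-tuning; joint layer and prediction network stay frozen.
        return [n for n in param_names if n.startswith("encoder/") and "/adapter/" not in n]
    raise ValueError(f"unknown strategy: {strategy}")


def sgd_step(params: Dict[str, float], grads: Dict[str, float],
             trainable: List[str], lr: float) -> Dict[str, float]:
    """Apply one SGD update to trainable parameters and leave the rest untouched."""
    updated = set(trainable)
    return {n: (v - lr * grads[n]) if n in updated else v for n, v in params.items()}


# Hypothetical parameter names; each "parameter" is a single scalar for brevity.
params = {
    "encoder/layer_0/lstm/kernel": 0.5,
    "encoder/layer_0/adapter/w_down": 0.1,
    "encoder/layer_0/adapter/w_up": -0.2,
    "prediction_network/lstm_0/kernel": 0.3,
    "joint/kernel": 0.7,
}
grads = {name: 1.0 for name in params}
adapted = sgd_step(params, grads, trainable_names(list(params), "adapters"), lr=1e-3)
```

Only the entries returned by `trainable_names` need to be stored per speaker or accent; everything else is shared with the base model.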
Table 2: ... for T-T. For atypical speech, we report median WER across all 100 speakers. For accented speech, we report mean WER across 10 accents for the per-accent adaptation scenario. The percentages in the WER columns are the relative WER improvement γ (Eq. 1) over the unadapted model.

Word error rate (WER) is measured on the respective test splits. The best checkpoints are chosen based on the WER on the dev split. For accented speech, we report per-accent (as opposed to multi-accent) adaptation results. Adaptation performance is compared with performance on the unadapted base model. In addition to WERs, we also report the relative WER improvement γ over the unadapted model:

$$\gamma = \frac{\mathrm{WER}_{\text{unadapted}} - \mathrm{WER}_{\text{adapted}}}{\mathrm{WER}_{\text{unadapted}}} \qquad (1)$$
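As a small illustration of the two metrics, the sketch below computes WER for a single utterance via word-level edit distance and the relative WER improvement over an unadapted model, both expressed in percent. The function names are hypothetical, and per-utterance scoring is a simplification: evaluation over a test split would normally accumulate errors and reference lengths across utterances.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row of the DP table at a time.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_improvement(wer_unadapted: float, wer_adapted: float) -> float:
    """Relative WER improvement over the unadapted model, in percent."""
    return 100.0 * (wer_unadapted - wer_adapted) / wer_unadapted


wer = word_error_rate("turn on the lights", "turn on lights")
gain = relative_improvement(50.0, 10.0)
```

In this toy example, one deleted word out of four gives a WER of 25%, and reducing WER from 50 to 10 corresponds to an 80% relative improvement.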

Results
For residual adapters, we identified the best learning rate and bottleneck dimensions during hyper-parameter tuning on the dev set (see Section 5.1). In addition to comparing residual adapters to a scenario where we fine-tune the entire encoder, we also test the impact of fine-tuning only a few layers (1-3) of the encoder. However, this alternative for reducing the number of updated parameters is less efficient than residual adapters, which have a much lower parameter footprint due to their bottleneck architecture. Table 2 shows that, on the atypical speech personalization task, adapting the full encoder per speaker yields a relative reduction of 80% in median WER across speakers for RNN-T.[2] However, this strategy requires 81% of the model parameters to be updated and stored per speaker. Using residual adapters, on the other hand, we achieve a relative WER reduction of 77% across all speakers for RNN-T. Although fine-tuning is slightly better than adaptation with residual adapter layers, the latter only needs to update about 0.2% of the parameters. We observe similar trends with T-T.
[2] Similar improvements have been reported for 500 speakers of the Euphonia corpus; that in-depth analysis shows that this holds across different severities and types of speech impairment.
Comparing to a scenario where we update only a few bottom layers of the encoder, we observe a significant[3] WER increase compared to full encoder fine-tuning. Residual adapters, while using less than 0.2% of the parameters, perform significantly better than updating only the first encoder layer (9% of parameters on RNN-T) and slightly better than updating the encoder layers 1-3 (32% of parameters on RNN-T).
[3] Throughout this paper, we use paired t-tests to measure statistical significance (results are indicated as significant for p-values < 0.05).
For accented speech, Table 2 shows that fine-tuning leads to more moderate improvements of 29% for RNN-T and 35% for T-T (averaged across all accents). Similar to the personalization task, residual adapters perform slightly worse than fine-tuning (24% improvement for RNN-T and 31% for T-T), but require the update of only a fraction of the parameters. The alternative of updating only the first encoder layer shows much poorer performance. Figure 2 shows the adapter performance drop δ, the relative WER reduction when switching from fine-tuning to residual adapters, calculated per speaker/accent (lower is better). T-T exhibits a lower average adapter performance drop δ compared to RNN-T on both tasks. This is likely due to the higher overall capacity of residual adapters when applied to T-T, given its higher number of encoder layers (15 encoder layers for T-T, 8 for RNN-T).
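Statistical significance in this paper is assessed with paired t-tests. As an illustration, the following sketch computes the paired t statistic and its degrees of freedom for two sets of per-speaker WERs; the example numbers are made up and the function name is hypothetical. Turning the statistic into a p-value requires the Student t distribution, which is not in the Python standard library; a statistics package (for example `scipy.stats.ttest_rel`) could be used for that step.

```python
import math
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # unbiased sample variance
    if var == 0.0:
        raise ValueError("all paired differences are identical")
    return mean / math.sqrt(var / n), n - 1


# Made-up per-speaker WERs for two systems (not data from the paper).
wer_adapters = [10.0, 12.0, 9.0, 11.0]
wer_finetune = [8.0, 11.0, 9.0, 8.0]
t_stat, dof = paired_t_statistic(wer_adapters, wer_finetune)
```

The pairing matters here because WER varies far more between speakers than between systems, so comparing per-speaker differences is more sensitive than comparing group means.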

Hyper-Parameter Tuning
The results reported in Table 2 are for bottleneck dimensions and learning rates found to work well in hyper-parameter tuning experiments where we ran a grid search over combinations of the two. For the learning rate, we evaluated 1e-5, 1e-4, 1e-3, and 1e-2, and for the bottleneck dimension, we evaluated 4, 16, 32, and 128. We use a random subset of 20 speakers for the atypical speech task for parameter tuning to make the search feasible; for accents, the search was run across all 10 accents. For the atypical speech personalization task, a learning rate of 1e-5 worked best for fine-tuning, and 1e-3 for residual adapters (both on RNN-T and T-T). For accents, we found the best learning rate for fine-tuned models using RNN-T to be 1e-5, and 1e-4 for T-T. For residual adapters, the best learning rate was 1e-4 (RNN-T) and 1e-3 (T-T). We found that accents with high amounts of training data tended to be more tolerant to higher learning rates compared to accents with limited training data.
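The grid search over learning rate and bottleneck dimension can be sketched as follows. The two value grids are taken from the text; the objective `toy_dev_wer` is a made-up stand-in for "train with these settings and measure dev-set WER" and does not reflect any measured result.

```python
import itertools
import math
from typing import Callable, Dict, Tuple

LEARNING_RATES = (1e-5, 1e-4, 1e-3, 1e-2)
BOTTLENECK_DIMS = (4, 16, 32, 128)

Config = Tuple[float, int]


def grid_search(dev_wer: Callable[[float, int], float]) -> Tuple[Config, Dict[Config, float]]:
    """Evaluate every (learning rate, bottleneck dim) pair and return the best one."""
    results = {(lr, d_b): dev_wer(lr, d_b)
               for lr, d_b in itertools.product(LEARNING_RATES, BOTTLENECK_DIMS)}
    best = min(results, key=results.get)  # lowest dev WER wins
    return best, results


def toy_dev_wer(lr: float, d_b: int) -> float:
    # Placeholder objective with an arbitrary minimum; NOT data from the paper.
    return abs(math.log10(lr) + 3) + abs(math.log2(d_b) - 4)


best, results = grid_search(toy_dev_wer)
```

In the personalization setting, `dev_wer` would aggregate over the tuning speakers (for example, the median across the 20-speaker subset), so that a single configuration is selected for all speakers.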
Overall, these experiments show that adapters require on average a learning rate one order of magnitude higher than fine-tuning the encoder for both T-T and RNN-T. This may be attributable to the much smaller capacity of adapters and/or to their random initialization.
For the bottleneck dimension, we found that $d_b = 4$ often leads to a higher adapter performance drop on the atypical speech task, although for some speakers (especially those with mild impairment and generally relatively low WER (≤ 25) on the unadapted models) even this bottleneck dimension worked very well. A bottleneck dimension of $d_b = 128$ rarely led to increased performance over $d_b = 16$ and $d_b = 32$. Between the latter two, we could not make out a clear pattern; they often performed equally well. For accented speech, when adapting for individual accents, we found that both $d_b = 4$ and $d_b = 16$ achieve similar performance. A bottleneck dimension of $d_b = 128$ on average leads to worse performance than $d_b = 16$. Given these results, we chose a bottleneck dimension of $d_b = 16$ for all reported experiments, unless otherwise noted.

Atypical Speech Personalization
In this section we further analyze the performance of residual adapters for the atypical speech personalization task, zooming in on aspects like speaker impairment severity and phrase types. Figure 3 shows the adapter performance drop δ per severity. For T-T, residual adapters seem to work similarly well across all 3 severity levels. On RNN-T models, we observe that δ is higher for speakers with moderate severity (median decrease of 18% for moderate vs. 14% for mild and severe). In particular, we found that δ is somewhat correlated with the relative WER improvement γ of the fine-tuned model: δ is higher for cases where the adaptation by fine-tuning helps the most (Spearman correlation coefficient of 0.344 (p < 0.001) for RNN-T, weaker correlation of 0.189 (p = 0.06) for T-T). Overall, adaptation by fine-tuning and residual adapters show similar behavior across severity levels, which suggests that residual adapters do not have disadvantages for specific severity levels.

Table 3: Results for atypical speech personalization task, broken down by severity (reported: median WER scores with relative WER improvement over the unadapted base model in brackets).

                    | RNN-T mild | RNN-T moderate | RNN-T severe | T-T mild  | T-T moderate | T-T severe
Unadapted           | ...        | ...            | ...          | ...       | ...          | 76.9
Fine-tune, full enc | 4.1 (76%)  | 6.5 (84%)      | 14.2 (79%)   | 4.5 (74%) | 6.9 (83%)    | 13.3 (78%)
Residual Adapters   | 4.8 (70%)  | 7.5 (80%)      | 16.3 (77%)   | 4.8 (71%) | 8.2 (81%)    | 15.3 (77%)
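The Spearman correlation between the per-speaker adapter performance drop δ and the relative improvement γ is the Pearson correlation of their ranks. A minimal sketch, assuming ties are resolved with average ranks (the text does not say how ties were handled), is shown below; the function names are hypothetical.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1          # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))
```

A rank-based measure is a natural choice here because per-speaker WER improvements are bounded and skewed, so a monotonic relationship is more informative than a linear one.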
The Euphonia corpus also comes with domain information for each utterance. This enables us to analyze the performance of fine-tuning and residual adapters on two different domains, home automation queries[5] (short phrases of 3.2 words on average) and conversational phrases (longer, with 7.4 words on average, open domain), to understand whether residual adapters have trouble with different phrase types. Table 4 shows T-T WERs for these two domains for a subset of 43 speakers who had a sufficient number of recordings of both home automation and conversational phrases. The conversational domain, being longer and with a more open vocabulary, is generally more challenging for ASR, and accordingly we observe higher WERs across all severity levels. Moreover, this domain seems to be harder to adapt to, resulting in lower WER improvements through both types of adaptation, most notably on the severe group, where the adaptation gain drops from ∼85% to ∼71% (fine-tuning approach). Despite these differences, both fine-tuning and residual adapters show similar behavior, and we conclude that even for more challenging domains with longer utterances, residual adapters work well.
[5] Examples: "turn on lights" or "play ABBA on Spotify".

Accent Adaptation
Table 5 presents the per-accent WER results. Depending on the accent and the amount of training data, we observe substantial variance with respect to WER of the unadapted and adapted models. Across all accents, the Scottish accent (sc) performs worst with extremely high WER, both for the unadapted and adapted models.[6] However, even for accents with fairly small amounts of training data, such as af, adaptation clearly improves the performance over the unadapted model.
[6] Note that beyond accent variation, results shown in Table 5 are affected by a domain mismatch between the training data used for the base model and the Common Voice corpus.
While fine-tuning the full encoder has slightly better performance than residual adapters, the adapter performance drop δ is relatively small (see Figure 2). This is consistent with our findings on the atypical speech personalization task (Section 5.2), where lower relative WER improvement is associated with much lower adapter performance drop. Analogously to the atypical speech personalization task, the adapter performance drop is smaller on T-T compared to RNN-T.
In order to provide a better context to the related work on accent recognition and to test the capability of the residual adapters in large group adaptation scenarios, we also ran experiments for multi-accent adaptation where a single model is fine-tuned (full encoder update) or adapted with residual adapters to all accents at once. We increased the bottleneck size to $d_b = 128$ so that the residual adapters have a sufficiently large capacity to handle this more complex task. A balanced training set of 11 accents (10 accents as described in Section 4.1 plus the US accent; 10k utterances per accent, up- or down-sampled depending on the amount of the training data per accent) was used for adaptation. Similarly, a balanced dev set was used to identify the best performing checkpoint. For WER calculation, we used the original test set per accent for comparability.
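Building such a balanced set can be sketched as follows. The text only states that accents were up- or down-sampled to a fixed size; the specific choices here (down-sampling without replacement, up-sampling by repeating randomly chosen utterances) and the function name `balance` are assumptions for illustration.

```python
import random
from typing import Dict, List


def balance(utts_by_accent: Dict[str, List[str]], target: int, seed: int = 0) -> Dict[str, List[str]]:
    """Return exactly `target` utterances per accent."""
    rng = random.Random(seed)
    balanced = {}
    for accent, utts in utts_by_accent.items():
        if not utts:
            raise ValueError(f"no utterances for accent {accent!r}")
        if len(utts) >= target:
            # Down-sample without replacement.
            balanced[accent] = rng.sample(utts, target)
        else:
            # Up-sample: keep every utterance once and repeat random ones to fill up.
            balanced[accent] = utts + rng.choices(utts, k=target - len(utts))
    return balanced


# Toy data with a target of 10 instead of 10k utterances per accent.
data = {
    "en": [f"en_{i}" for i in range(25)],
    "af": [f"af_{i}" for i in range(4)],
}
balanced = balance(data, target=10)
```

Balancing only the training and dev sets, while scoring on the original per-accent test sets, keeps the multi-accent WERs directly comparable to the per-accent results.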
Results in Table 5 show that even in a large group adaptation scenario, residual adapters perform well and their performance is comparable to the per-accent adaptation scenario (adapter performance drop δ ≈ 8% for both RNN-T and T-T).

Training and inference time
On the atypical speech personalization task, when updating residual adapter parameters only (as opposed to encoder fine-tuning), we observed a speed-up in training time as measured in global steps per second. We compared the number of training steps needed across the 100 speakers (i.e., up to the best checkpoint selected) as well as the global steps/second for fine-tuning the full encoder vs. residual adapter training. While residual adapters led to about the same (T-T) or fewer (RNN-T) number of steps for convergence, the global steps/second score increased by about 40%.[7] This gain is especially relevant for a personalization scenario where large numbers of user-specific models need to be trained. During inference, on the other hand, we did not observe a measurable increase in latency when adding residual adapters. To test this, we decoded test sets of around 300 utterances several times on personalized models trained with and without residual adapters.
[7] We used the same accelerator setup (2x2 Tensor Processing Unit slices) for fine-tuning and residual adapter training.

Table 4: T-T WERs for home automation and conversational phrases (subset of 43 speakers), broken down by severity, with relative WER improvement over the unadapted model in brackets.

                    | Home automation mild | Home automation moderate | Home automation severe | Conversational mild | Conversational moderate | Conversational severe
Fine-tune, full enc | 3.7 (76%)            | 3.6 (88%)                | 7.8 (85%)              | 6.2 (67%)           | 4.9 (83%)               | 12.4 (71%)
Residual Adapters   | 4.7 (76%)            | 5.0 (87%)                | 8.5 (84%)              | 6.0 (65%)           | 5.8 (79%)               | 15.0 (72%)

Conclusions and Future Work
In this work, we have shown that adaptation of ASR models using residual adapter layers leads to substantial WER improvements over unadapted models across two tasks, atypical speech (up to 77% relative WER reduction) and accented speech (up to 31% relative reduction), and in two architectures (RNN-T and T-T). In comparison, fine-tuning the entire encoder for each speaker or accent yields only small improvements compared to residual adapter training.
While similar in adaptation performance, residual adapters are much more parameter-efficient than model fine-tuning. In our scenario, using residual adapters on each encoder layer, less than 0.5% of the overall model parameters need to be trained and maintained per speaker or accent. On the other hand, fine-tuning the entire encoder affects over 80% of the model parameters. In addition to substantially improved parameter efficiency, we also observed a training time speed-up of about 40% due to the reduced number of parameter updates.
Overall, these findings demonstrate a feasible and scalable solution for personalized, speaker-dependent models as well as domain-specific or dialect/accent-focused models.
In future work, we plan to study to which encoder layers we need to add adapters for best performance and to potentially make residual adapters even more parameter efficient. Similarly, we plan to apply residual adapters with different bottleneck dimensions depending on the position in the encoder layer stack (bottom and middle layers likely require larger, top layers smaller capacity). Finally, we also plan to directly compare the effectiveness of residual adapters to approaches using statically fed speaker-dependent vectors for speaker adaptation, especially in the context of accent adaptation.