A Unified Speaker Adaptation Approach for ASR

Transformer models have been used successfully in automatic speech recognition (ASR) and yield state-of-the-art results. However, their performance is still affected by speaker mismatch between training and test data. Further finetuning a trained model with target speaker data is the most natural adaptation approach, but it requires considerable compute and may cause catastrophic forgetting of the existing speakers. In this work, we propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation. For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers by making use of speaker i-vectors to form a persistent memory. For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture, which, to the best of our knowledge, has never been explored in ASR. Specifically, we gradually prune less contributing parameters of the model encoder to a certain sparsity level, and use the pruned parameters for adaptation, while freezing the unpruned parameters to preserve the original model performance. We conduct experiments on the LibriSpeech dataset. Our proposed approach brings a relative 2.74-6.52% word error rate (WER) reduction on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to a 20.58% relative WER reduction, and surpasses the finetuning method by up to a relative 2.54%. Besides, with extremely low-resource adaptation data (e.g., 1 utterance), our method improves the WER by a relative 6.53% with only a few epochs of training.


Introduction
End-to-end models have yielded state-of-the-art performance on automatic speech recognition (ASR) over the past decade, such as the connectionist temporal classification (CTC) model (Miao et al., 2015; Graves, 2012), the attention-based encoder-decoder model (Zhang et al., 2017), the recurrent neural network transducer (RNN-T) (Graves, 2012), the transformer model (Dong et al., 2018) and the conformer model (Gulati et al., 2020). However, model performance deteriorates due to speaker mismatch between training and test data. Given a target speaker, finetuning the trained model can alleviate the speaker mismatch problem to some extent, but finetuning the entire model requires large amounts of compute to be effective, and it can in turn cause catastrophic forgetting (McCloskey and Cohen, 1989) of the existing speakers.
Currently, there are two lines of studies addressing the speaker mismatch problem in neural network based models. One category works on the acoustic features, i.e., either normalizing acoustic features to be speaker-independent (Seide et al., 2011; Tomashenko and Estève, 2018; Ochiai et al., 2018) or introducing additional speaker-related knowledge (e.g., i-vectors) to adapt the acoustic model (Saon et al., 2013; Senior and Lopez-Moreno, 2014; Pan et al., 2018; Fan et al., 2019). A summary vector of each utterance can be trained to replace the speaker i-vector (Vesely et al., 2016). To adapt to acoustic variability, Kim et al. (2017) add shifting and scaling parameters to the layer normalization layer.
The other category is model adaptation, i.e., training a speaker-dependent model from speaker-independent model parameters with extra adaptation data. To avoid overfitting, techniques such as L2 regularization (Liao, 2013), Kullback-Leibler divergence (Yu et al., 2013) and adversarial multitask learning (Meng et al., 2019) have been used. Because finetuning the entire model is computationally expensive, Yao et al. (2012); Siniscalchi et al. (2013); Samarakoon and Sim (2016a) only adapt specific layers or a subset of parameters. In particular, Swietojanski et al. (2016); Samarakoon and Sim (2016b); Xie et al. (2019) reparameterize each hidden unit with a speaker-dependent amplitude function in fully-connected or convolutional neural network layers. However, it is difficult to determine which model parameters to adapt for a target speaker, and choosing certain sub-layer(s) intuitively may not be optimal.
In this work, we propose a unified speaker adaptation model by making use of both feature adaptation and model adaptation. For feature adaptation, we propose the speaker-aware persistent memory model to generalize better to unseen test speakers. In particular, speaker i-vectors from the training data are sampled and concatenated to speech utterances in each encoder layer, and speaker knowledge is learnt through attention computation with the speaker i-vectors. Our method learns utterance-level speaker knowledge, which is more effective than learning time-step-dependent speaker knowledge (Fan et al., 2019), since it is more robust to the various variability factors along an utterance.
For model adaptation, we explore gradual pruning (Zhu and Gupta, 2018), which, to the best of our knowledge, is studied here for speaker adaptation for the first time. We gradually prune less contributing parameters of the model encoder, and then use the pruned parameters for target speaker adaptation while freezing the unpruned parameters to retain the model performance on general speaker data. In this way, our model can adapt to target speakers very fast by updating only a small percentage (10%) of encoder parameters, and it does not change the model architecture. Freezing the unpruned parameters alleviates the catastrophic forgetting problem as well.
Our proposed approach brings a relative 2.74-6.52% WER reduction on general speaker adaptation. On target speaker adaptation, our method outperforms the baseline with up to a 20.58% relative WER reduction, and surpasses the finetuning method by up to a relative 2.54%.

Speech Transformer
Speech transformer (Dong et al., 2018) is an extension of the transformer model (Vaswani et al., 2017) for ASR. We briefly introduce the speech transformer model here. For a speech input sequence, speech transformer first applies two convolution layers with stride two to reduce the hidden representation length. A sinusoidal positional encoding is added to encode position information. Both the encoder and the decoder of the speech transformer use multi-head attention networks. An attention network has three inputs, key, query and value, which are distinct transformations of an input sequence:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.

The multi-head attention network is computed by concatenating the single attention network h times:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(QW_i^Q, KW_i^K, VW_i^V).

Multi-head attention can learn input representations in different subspaces simultaneously. For the encoder, the three inputs all come from the speech input, so the attention network is called the self-attention network. For the decoder, the text input first goes through a self-attention network. To maintain autoregression in the decoder, a mask is applied to future tokens. To incorporate information from the speech input, in the next attention network the key and value vectors come from the encoder and the query vector comes from the decoder, so this attention network is called the cross-attention network. Layer normalization and residual connections are applied before and after each multi-head attention network. Afterwards, there is a position-wise feedforward network with rectified linear unit (ReLU) activation:

FFN(x) = max(0, xW_1 + b_1) W_2 + b_2.

A self-attention network and a position-wise feedforward network form an encoder layer; a decoder layer has an additional cross-attention network. There are N_e encoder layers and N_d decoder layers in total.
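As a concrete illustration, the attention computations above can be sketched in a few lines of NumPy. This is a minimal sketch with toy dimensions; the shapes, head split and parameter names are illustrative, not the paper's actual configuration:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, h):
    """Run h single attention heads on column slices, concatenate, project."""
    d_model = x.shape[-1]
    d_head = d_model // h
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        q, k, v = x @ w_q[:, s], x @ w_k[:, s], x @ w_v[:, s]
        heads.append(attention(q, k, v))
    return np.concatenate(heads, axis=-1) @ w_o

# Toy self-attention over a length-5 "utterance" with d_model=16, h=4 heads.
rng = np.random.default_rng(0)
t, d_model, h = 5, 16, 4
x = rng.standard_normal((t, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1
                      for _ in range(4))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, h)
assert out.shape == (t, d_model)
```

For self-attention, `q`, `k` and `v` all come from the same input `x`, as described above; cross-attention would pass encoder output as the key/value source instead.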

Speaker Adaptation
Speaker adaptation arises due to speaker mismatch between training and test data. It aims to adapt the model to a target speaker, and has been a critical component in HMM-based models (Keith and Matthias, 2005; Furui, 1980; Gauvain and Lee, 1994; Kuhn et al., 2000). For neural network based models, many approaches have been developed as well, as briefly discussed in Section 1.
Adapting an ASR model is challenging given that an ASR model is large and complex, with many parameters to update. Finetuning the entire model takes significant computational resources to reach optimal performance, and it potentially causes the catastrophic forgetting problem (McCloskey and Cohen, 1989; Kirkpatrick et al., 2017): when model parameters trained on existing speakers are adapted to a target speaker, knowledge learnt previously is lost.

I-vector
An i-vector is a low-dimensional vector that depends on the speaker and the channel. Its dimension is fixed regardless of the utterance length. It is extracted in a data-driven way by mapping the frames of an utterance to a low-dimensional vector space using a factor analysis technique (Dehak et al., 2011a). Scoring is based on support vector machines or directly uses the cosine distance as a final decision score (Dehak et al., 2011b). The i-vector was initially invented for audio classification and identification, but has recently been used for speaker adaptation as well (Saon et al., 2013; Dehak et al., 2011b; Karafiat et al., 2011).
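As an illustration of the cosine scoring mentioned above, the following is a minimal sketch of comparing two i-vectors directly by cosine similarity (real i-vector systems typically also apply length normalization and channel compensation, which are omitted here):

```python
import math

def cosine_score(w_target, w_test):
    """Cosine similarity between two i-vectors, used directly as a decision score."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_t = math.sqrt(sum(a * a for a in w_target))
    norm_s = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_t * norm_s)

# Identical directions score 1.0; orthogonal directions score 0.0.
assert abs(cosine_score([1.0, 0.0], [1.0, 0.0]) - 1.0) < 1e-12
assert abs(cosine_score([1.0, 0.0], [0.0, 1.0])) < 1e-12
```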

Proposed Method
We propose an efficient speaker adaptation model by making use of both feature adaptation and model adaptation. For feature adaptation, we embed speaker knowledge, represented by a number of fixed speaker i-vectors, into each input utterance (Zhao et al., 2020). This aims to capture speaker information through attention computation between each utterance and the speaker i-vectors. For model adaptation, an effective method is employed that can adapt to the target speaker very fast without sacrificing performance on existing speakers. In particular, we prune the model gradually and finetune a small subset of parameters to be speaker-specific.

Speaker-Aware Persistent Memory for Feature Adaptation
The speaker-aware persistent memory model learns speaker knowledge from i-vectors. We first randomly sample N speaker i-vectors m_1, ..., m_N ∈ R^{d_k}, which form the speaker space (Gales, 1998; Yu and Gales, 2006). Here we assume that linear combinations of the speaker space suffice to cover the speaker information space, i.e., any unknown speaker not seen in the training data can be represented approximately by the i-vectors sampled from the training data. We name the learned transformations of the speaker space the persistent memory vectors M_k and M_v:

M_k = U_k [m_1, ..., m_N],  M_v = U_v [m_1, ..., m_N].

Only the U_k and U_v matrices are learnable, while the sampled i-vectors are fixed in this method.
With the persistent memory vectors, we concatenate them respectively to the input vectors of the self-attention network X = [x_1, ..., x_t] to form the new key and value vectors. The attention network thus captures speaker-specific knowledge through attention computation between each utterance and the persistent memory vectors, as in Eq. 9:

Attention(X, [X; M_k], [X; M_v]) = softmax(X [X; M_k]^T / sqrt(d_k)) [X; M_v].   (9)

Since M_k and M_v are shared across all layers, they form the persistent memory. Given that this persistent memory is meant for capturing speaker knowledge, we name it speaker-aware persistent memory. The overall framework of the speaker-aware persistent memory model is shown in Figure 1.
Since our method aims at learning any speaker information from the speaker space, it effectively addresses the problem of unseen speakers in the test data. Furthermore, using static i-vectors saves the effort of computing the i-vectors of all speakers in the training data. Besides, the attention computation (Eq. 9) with the persistent memory vectors is taken over the entire utterance x_1 to x_t, so each input speech time step takes part in extracting speaker information. This holistic treatment is more effective than that of Fan et al. (2019), who compute time-step-dependent speaker representations, which may be more susceptible to the various variability factors along an utterance, such as speaking rhythm.
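The feature adaptation described above can be sketched as follows. This is a toy NumPy sketch: the dimensions, the random stand-in i-vectors, and the single-head attention are simplifying assumptions, not the paper's exact setup:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_k, t, d_model = 64, 100, 20, 100   # toy sizes: 64 i-vectors, length-20 utterance

# Fixed i-vectors sampled from training speakers (random stand-ins here).
M = rng.standard_normal((N, d_k))

# Learnable projections of the speaker space -> persistent memory vectors.
U_k = rng.standard_normal((d_k, d_model)) * 0.1
U_v = rng.standard_normal((d_k, d_model)) * 0.1
M_k, M_v = M @ U_k, M @ U_v

X = rng.standard_normal((t, d_model))   # encoder-layer input x_1..x_t
K = np.concatenate([X, M_k], axis=0)    # new key:   [x_1..x_t; M_k]
V = np.concatenate([X, M_v], axis=0)    # new value: [x_1..x_t; M_v]

# Attention between the utterance and the augmented key/value (cf. Eq. 9).
scores = X @ K.T / np.sqrt(d_model)
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)
out = w @ V                             # speaker-aware attention output
assert out.shape == (t, d_model)
```

During training only `U_k` and `U_v` would receive gradients, while `M` stays fixed, matching the description above.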

Gradual Pruning for Model Adaptation
The speaker-aware persistent memory method discussed above is for general speaker adaptation without knowing the target speaker profile. If target speaker data is available, finetuning the trained model with it could result in the catastrophic forgetting problem detailed in Section 2.2. To address this, we take advantage of an effective approach that adapts to the target speaker quickly while retaining the model performance on general speaker data.
Given an input speech sequence y = {y_1, ..., y_n} and output text z = {z_1, ..., z_m}, ASR models the conditional probability of the output text given the input speech as follows:

P(z | y; θ) = Π_{i=1}^{m} P(z_i | z_{<i}, y; θ),   (10)

where θ represents the model parameters. Given a training dataset D = {y_D^j, z_D^j}_{j=1}^{L}, θ is trained to maximize the following log-likelihood objective:

θ* = argmax_θ Σ_{j=1}^{L} log P(z_D^j | y_D^j; θ).   (11)

Given the target speaker dataset D_t = {y_{D_t}^j, z_{D_t}^j}_{j=1}^{L_t}, directly finetuning the trained model means continuing to train the model to maximize the log-likelihood:

θ_{D_t} = argmax_θ Σ_{j=1}^{L_t} log P(z_{D_t}^j | y_{D_t}^j; θ),   (12)

where θ_{D_t} is initialized with the trained parameters θ in Eq. 11. As shown in many recent studies (Frankle and Carbin, 2019; Zhu and Gupta, 2018), not all parameters in a neural network model contribute to the training objective. Pruning the redundant parameters leads to negligible performance degradation (Li et al., 2017; Han et al., 2015) or may even outperform the original model (Zhu and Gupta, 2018) due to better generalization.
Our experiments in Figure 2 show that up to 50% of encoder parameters can be pruned in ASR with negligible performance degradation. Therefore, we first prune the model gradually with the training data to a predetermined sparsity level by zeroing out low-magnitude parameters every 10k training steps, i.e., we ultimately retain only a certain percentage of high-magnitude unpruned parameters θ_UP. This unearths a sub-network whose performance matches the original model well. Different from Liang et al. (2021), who prune a well-trained model, we train and prune concurrently, as shown in Figure 3(b) (with warm-up training first), to reduce the total number of training steps and thus save computational resources. Because speech and speaker information is learnt through the encoder of an end-to-end ASR model, we only prune encoder parameters, including the embedding network, self-attention networks and feedforward networks in all encoder layers.
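Gradual magnitude pruning can be sketched as repeatedly zeroing out the lowest-magnitude fraction of parameters while ramping the sparsity toward the target. This is a minimal pure-Python sketch over a flat parameter list; the linear sparsity ramp, toy sizes, and five-checkpoint loop are illustrative assumptions (the paper prunes encoder tensors during training with a different step budget):

```python
import random

def prune_to_sparsity(params, sparsity):
    """Zero out the lowest-magnitude fraction of parameters; return (params, keep-mask)."""
    k = int(sparsity * len(params))
    order = sorted(range(len(params)), key=lambda i: abs(params[i]))
    mask = [True] * len(params)
    for i in order[:k]:
        mask[i] = False
    pruned = [p if keep else 0.0 for p, keep in zip(params, mask)]
    return pruned, mask

random.seed(0)
params = [random.gauss(0, 1) for _ in range(1000)]  # stand-in for encoder weights

# Gradual pruning: at each "10k-step" checkpoint, raise sparsity toward the 10% target.
target, checkpoints = 0.10, 5
for step in range(1, checkpoints + 1):
    sparsity = target * step / checkpoints
    params, mask = prune_to_sparsity(params, sparsity)

zeros = sum(1 for p in params if p == 0.0)
assert zeros == int(target * len(params))  # exactly 10% of parameters pruned
```

Between checkpoints the surviving parameters would continue training, so previously pruned low-magnitude weights stay at zero while the rest keep learning.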
Afterwards, we keep the informative sub-network untouched by freezing the unpruned parameters θ_UP to retain performance on existing speakers, as represented by the light gray connections in Figure 3(c), and only finetune the pruned free parameters θ_P for target speaker adaptation (blue connections in Figure 3(c)). The training objective is:

θ_P* = argmax_{θ_P} Σ_{j=1}^{L_t} log P(z_{D_t}^j | y_{D_t}^j; θ_UP, θ_P),   (13)

where θ_UP is frozen and θ_P is updated. Since the informative sub-network is already capable of performing the ASR task very well, we believe further finetuning the free parameters with target speaker data adds value to the speaker-specific model.
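The adaptation step can be illustrated as a masked parameter update, where only the freed (pruned) positions receive gradients. A toy sketch with made-up parameter and gradient values:

```python
def adapt_step(params, keep_mask, grads, lr=0.1):
    """One SGD step that updates only the freed (pruned) positions.
    Positions with keep_mask=True belong to the frozen sub-network θ_UP."""
    return [p if keep else p - lr * g
            for p, keep, g in zip(params, keep_mask, grads)]

params = [0.9, 0.0, -1.2, 0.0]        # toy: positions 1 and 3 were pruned to zero
keep   = [True, False, True, False]   # keep-mask from the pruning stage
grads  = [0.5, 0.5, 0.5, 0.5]         # made-up gradients from target speaker data

new = adapt_step(params, keep, grads)
assert new[0] == 0.9 and new[2] == -1.2      # frozen θ_UP unchanged
assert new[1] == -0.05 and new[3] == -0.05   # freed θ_P updated
```

In a real framework the same effect is usually achieved by multiplying each gradient tensor by the inverse keep-mask before the optimizer step.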
Our method does not change the model architecture, unlike approaches that attach an additional adapter module (Ding et al., 2020). Besides, we only need to finetune a small number of parameters compared to finetuning the entire model. Fixing the informative sub-network lets our model retain past knowledge without the catastrophic forgetting issue. It also prevents the model from easily overfitting on low-resource target speaker data to some extent.

Experiments
In this section, we present our experiments using the proposed speaker-aware persistent memory model and the gradual pruning method.

Datasets
We conduct experiments on the open-source LibriSpeech dataset (Panayotov et al., 2015) to confirm the effectiveness of the proposed model. LibriSpeech consists of 16kHz read English speech from audiobooks. We use the given train/development/test splits. The Test_clean data is clean speech, while the Test_other data contains noisy speech. See Appendix A.1 for the statistics of the LibriSpeech data used in our experiments.

Training Setup
We use the PyTorch and ESPnet toolkits for our experiments, and we train the model for 100 epochs (n = 100 in Figure 3(b)). We use the best set of hyperparameters previously tested for the transformer model without further tuning, and we pre-process the data following the ESPnet toolkit. The total number of model parameters is 31 million. Input features are 80-dimensional filterbanks with pitch on each frame, computed with a window size of 25ms shifted every 10ms. The acoustic features are mean and variance normalized. We exclude utterances longer than 3000 frames or 400 characters to keep memory usage manageable. For joint decoding of CTC and attention, the coefficients are 0.3 for CTC and 0.7 for attention. The convolutional frontend before the transformer encoder consists of two 2D convolution layers with stride two.

Adaptation for General Speakers
We first test adaptation for general speakers without knowing the target speaker profile. The speaker-aware persistent memory model introduced in Section 3.1 achieves this objective. Here we omit the hyperparameter tuning part and directly use the best hyperparameters tested by Zhao et al. (2020), including the number of speaker i-vectors in the speaker space and the number of layers applied with speaker-aware persistent memory. We randomly sample 64 speaker i-vectors and apply them to all encoder layers in the speech transformer. 64 i-vectors were tested to be a good choice for providing diverse speaker information (Zhao et al., 2020), and applying them to all encoder layers helps capture speaker knowledge from both low-level phonetic features and high-level global information. Furthermore, we also compare our model with the first persistent memory model used in ASR (You et al., 2019), in which persistent memory vectors are randomly initialized and meant to capture general knowledge. Different from them, our model addresses the speaker mismatch issue. Our method achieves the best results.

Adaptation for Target Speaker
If the target speaker profile is known beforehand, the gradual pruning method discussed in Section 3.2 can adapt the model to the target speaker. Directly finetuning the entire model takes high computational resources since all model parameters are updated, and can overfit easily if the amount of target speaker data is limited. We are interested in the performance of the gradual pruning method, especially on low-resource data, as well as in how much it alleviates the catastrophic forgetting problem. Therefore, we randomly choose a speaker from the LibriSpeech Test_other data as the target speaker, and select only 10 utterances of the target speaker as training data. The remaining utterances of the target speaker are used as test data. We repeat this four times and report the average performance to assess the generalizability of the proposed approach. The average baseline WER of the four speakers is 20.5, slightly smaller than the average WER of the Test_other speakers (21.9), so further improving the target speaker performance is a bit more challenging. The pruning rate is set to 10% here. We compare the performance of 1) Finetune: directly finetuning the entire model as in Eq. 12, 2) I-vec: the speaker-aware persistent memory method by adding i-vectors, 3) Pruning: gradual pruning, 4) Pruning+I-vec: combining the proposed feature adaptation and model adaptation methods.
For results on the target speaker in Figure 4(a), finetuning works better than the baseline. Adding i-vectors has the highest WER initially, and its performance is worse than simply finetuning the trained model after 20 epochs. We believe the speaker-aware persistent memory method works better on general speaker adaptation, given that the sampled i-vectors form the speaker space to capture any speaker knowledge; it is not designed to adapt to specific speakers. Using the gradual pruning method alone has lower WER than finetuning at the initial stage, but surprisingly it overfits more than the finetuning method after 20 epochs. More detailed analysis is needed, and we leave it to future work. Lastly, we combine the feature adaptation and model adaptation methods, and this achieves our best result. It outperforms the baseline with up to a 20.58% relative WER reduction, and surpasses the finetuning method by up to a relative 2.54%. We see that the feature adaptation method and the model adaptation method we propose complement each other, as the combined model surpasses each individual one.
We also analyze performance on the remaining non-target speaker data to see whether catastrophic forgetting happens. From Figure 4(b), all target-speaker-adapted models perform slightly worse than the baseline, which is expected. Combining feature adaptation and model adaptation alleviates the catastrophic forgetting problem effectively: it generally outperforms finetuning in Figure 4(b).

Analysis
In this section, we revisit our approach to reveal more details and explore the effectiveness of the gradual pruning method in combination with the speaker-aware persistent memory model.

Pruning Rate
We first test different pruning rates on the encoder. Results are shown in Figure 5. A lower pruning rate keeps more parameters for the general speaker data but has less learning capacity for the target speaker, so it is more suitable for simple adaptation tasks. A higher pruning rate generates a sparser network and is more flexible for speaker adaptation, but it retains fewer original model parameters and thus forgets more on the general speaker data. It can be seen from Figure 5 that pruning 10% of encoder parameters achieves the best result.

Gradual Pruning vs One-time Pruning
We use the gradual pruning method (Zhu and Gupta, 2018) to prune toward the target sparsity every 10k training steps. One-time pruning at the initial/middle/final stage of the overall training is tested for comparison. We train for 100 epochs, and initial/middle/final stage pruning is done at epoch 0/50/100 respectively. Gradual pruning and one-time pruning reach the same sparsity level after training. Here we use either gradual or one-time pruning at different stages during training, and show the best results after finetuning for 15 epochs. Table 3 shows that gradual pruning works better than one-time pruning, be it at the initial, middle or final stage of training. Compared with one-time pruning, gradual pruning can learn and prune at the same time. In particular, gradual pruning follows a train-prune cycle and is capable of iteratively relearning the unpruned parameters after less contributing parameters are pruned. For one-time pruning, pruning at an earlier stage has the advantage of letting the model learn the unpruned parameters based on the pruned ones during the remaining training, but it risks pruning important parameters since the model is not yet well learnt; vice versa for pruning late. Hence, gradual pruning works best.
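For reference, Zhu and Gupta (2018) define gradual pruning with a cubic sparsity ramp. A small sketch of that schedule follows; the 10% final sparsity, starting step and step counts here are illustrative defaults, not the paper's exact training configuration:

```python
def gradual_sparsity(t, s_i=0.0, s_f=0.10, t0=0, n=10, dt=10_000):
    """Cubic sparsity schedule from Zhu & Gupta (2018):
    s_t = s_f + (s_i - s_f) * (1 - (t - t0) / (n * dt))**3,
    ramping from initial sparsity s_i to final sparsity s_f over n pruning
    steps spaced dt training steps apart, starting at step t0."""
    frac = min(max((t - t0) / (n * dt), 0.0), 1.0)
    return s_f + (s_i - s_f) * (1.0 - frac) ** 3

# Sparsity starts at 0, rises steeply early (when redundant weights are
# plentiful), and flattens out as it approaches the 10% target.
assert gradual_sparsity(0) == 0.0
assert abs(gradual_sparsity(100_000) - 0.10) < 1e-12
assert gradual_sparsity(50_000) < gradual_sparsity(60_000)  # monotone ramp
```

The steep-then-flat shape is the reason gradual pruning prunes aggressively while many redundant parameters remain and cautiously near the target, matching the train-prune behaviour discussed above.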

Extremely Low-resource Adaptation Data
Lastly, we examine extremely low-resource adaptation data scenarios. We reduce the amount of adaptation data and compare the performance with the baseline, where no adaptation is performed. The characteristics of the selected adaptation data are listed in Table 4.

Table 4: Characteristics of utterances selected as the extremely low-resource adaptation data.

From Figure 6, when the amount of adaptation data is reduced from 10 utterances to 5 utterances, the results are similar to those of 10 utterances at the initial training stage, and can outperform the baseline by up to a relative 18.59%. With less adaptation data, the model overfits much faster, especially in the case of having only 1 utterance for adaptation. However, even with only 1 utterance, it can surpass the baseline by up to a relative 6.53% with only 5 epochs of training. Therefore, even with extremely low-resource adaptation data such as a single utterance, our method achieves fast and effective adaptation.

Conclusion
In this paper, we have proposed a unified speaker adaptation approach consisting of feature adaptation and model adaptation. The speaker-aware persistent memory model makes use of speaker i-vectors to adapt at the feature level, and the gradual pruning approach retrieves a subset of model parameters for adaptation at the model level. Gradual pruning is found to be better than one-time pruning because it can iteratively learn based on the pruned parameters. It also alleviates the catastrophic forgetting problem by retaining a sub-network whose performance matches the original network. We find that our proposed method is effective in both general speaker adaptation and specific target speaker adaptation. In particular, our method brings a relative 2.74-6.52% WER reduction on general speaker adaptation, and outperforms the baseline with up to a 20.58% relative WER reduction on target speaker adaptation. Even with extremely low-resource adaptation data, our method brings a 6.53% relative improvement with only a few training epochs. In the future, we are interested in investigating the overfitting issue with low-resource data, as well as multi-speaker adaptation with our method.

A.2 Average Runtime
In Table 6, we list the average runtime on one V100 GPU of 1) Baseline, 2) Finetune: directly finetuning the trained baseline model, 3) I-vec: the speaker-aware persistent memory method by adding i-vectors, 4) Pruning: gradual pruning, 5) Pruning+I-vec: combining the proposed feature adaptation and model adaptation methods. During training, all models are trained on the given 100h LibriSpeech training data, while during adaptation, all models are trained on 10 utterances of adaptation data, except for the baseline, where no adaptation is performed.

A.3 Evaluation Metrics
We evaluate model performance by word error rate (WER), computed as follows:

WER = (S + D + I) / N_r,

where S is the number of substitutions, D is the number of deletions, I is the number of insertions, and N_r is the number of words in the reference (N_r = S + D + C), with C the number of correct words.
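The WER above can be computed with a standard word-level Levenshtein distance, where the edit operations correspond to S, D and I. A minimal sketch:

```python
def wer(reference, hypothesis):
    """WER = (S + D + I) / N_r via word-level Levenshtein distance."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution/match
            d[i][j] = min(sub,
                          d[i - 1][j] + 1,   # deletion
                          d[i][j - 1] + 1)   # insertion
    return d[len(r)][len(h)] / len(r)

assert wer("the cat sat", "the cat sat") == 0.0
# One deleted word out of a 6-word reference -> WER = 1/6.
assert abs(wer("the cat sat on the mat", "the cat sat on mat") - 1 / 6) < 1e-12
```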

A.4 Computing Infrastructure
We conduct our experiments on NVIDIA V100 GPU and Intel(R) Xeon(R) Platinum 8163 32-core CPU @ 2.50GHz.