Joint Audio/Text Training for Transformer Rescorer of Streaming Speech Recognition

Recently, there has been increasing interest in two-pass streaming end-to-end speech recognition (ASR), which incorporates a 2nd-pass rescoring model on top of a conventional 1st-pass streaming ASR model to improve recognition accuracy while keeping latency low. One of the latest 2nd-pass rescoring models, Transformer Rescorer, takes the n-best initial outputs and audio embeddings from the 1st-pass model and then chooses the best output by re-scoring the n-best initial outputs. However, training this Transformer Rescorer requires expensive paired audio-text training data because the model uses audio embeddings as input. In this work, we present our Joint Audio/Text training method for Transformer Rescorer to leverage unpaired text-only data, which is relatively cheaper than paired audio-text data. We evaluate Transformer Rescorer with our Joint Audio/Text training on the Librispeech dataset as well as our large-scale in-house dataset and show that our training method can improve the word error rate (WER) significantly compared to the standard Transformer Rescorer, without requiring any extra model parameters or latency.


Introduction
Streaming end-to-end automatic speech recognition (ASR) models aim to transcribe the user's voice with minimal latency and have been widely used in numerous interactive ASR applications that support direct user interaction in real time. Unlike non-streaming end-to-end ASR (Chorowski et al., 2015; Chan et al., 2016; Bahdanau et al., 2016; Kim et al., 2017; Chiu et al., 2018), streaming end-to-end ASR models, such as RNN-T (Graves, 2012a; Prabhavalkar et al., 2017; Battenberg et al., 2017; He et al., 2019; Li et al., 2019), are limited to using a short audio context or no future context to satisfy low-latency constraints, and therefore suffer from higher word error rates (WER).
To address this issue of streaming ASR, Two-Pass architectures have recently been proposed to improve WER while keeping latency low (Sainath et al., 2019; Li et al., 2020; Xu et al., 2022). The main idea of the Two-Pass architectures is to use a non-streaming model, a so-called Rescorer (2nd-pass), on top of the conventional streaming model (1st-pass) to re-score and choose the best output among the initial n-best outputs generated from the 1st-pass model; the latest such Rescorer (Li et al., 2020) is based on the Transformer (Vaswani et al., 2017). Many techniques have been proposed to leverage text-only data (Sriram et al., 2017; Toshniwal et al., 2018; Kannan et al., 2018; Shan et al., 2019; McDermott et al., 2019; Variani et al., 2020; Weinstein et al., 2020; Kim et al., 2021b); however, these techniques require extra model parameters and latency. More recently, new approaches have been proposed that use text-only data without requiring extra model parameters or latency, via multi-task learning for LAS (Chan et al., 2016) (Sainath et al., 2020; Wang et al., 2020; Tang et al., 2022).
In this work, we present our Joint Audio/Text training method for Transformer Rescorer to leverage unpaired text-only data without requiring any extra parameters or latency. Unlike previous studies (Sainath et al., 2020; Wang et al., 2020), our method does not need Text-To-Speech (TTS), prior information (domain ID), or a tuning parameter, and it is based on the Transformer decoder. We evaluate Transformer Rescorer with our Joint Audio/Text training method on the Librispeech dataset as well as a large-scale in-house dataset and show that our model significantly outperforms the Transformer Rescorer trained with the standard method in WER, without requiring extra model parameters or computational latency.

Transformer Rescorer
The two-pass system (Sainath et al., 2019), which consists of a streaming 1st-pass RNN-T model (Graves, 2012a; Shi et al., 2021) and a non-streaming 2nd-pass Transformer Rescorer (Li et al., 2020), is illustrated in Figure 1. The standard way to train the two-pass system is two-step:
1. The Encoder of RNN-T takes the input audio x and generates audio embeddings h, and the Decoder of RNN-T takes h and y and generates the initial output ŷ_1st in a streaming fashion. The RNN-T model is first trained to maximize P(y = ŷ|x) (Graves, 2012b), and its parameters are then fixed.
2. Then, Transformer Rescorer is trained with the audio embeddings h generated from the fixed Encoder of RNN-T and the true transcription text y, to maximize Σ_u log P(y_u | h, y_<u) (Vaswani et al., 2017).
In step 2, Transformer Rescorer follows the Transformer Decoder setup (Vaswani et al., 2017): the true transcription y is the query, and the audio embeddings h are the keys and values in cross-attention.
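To make the query/key/value roles concrete, the following is a minimal single-head scaled dot-product cross-attention sketch (not the paper's multi-head, 1024-dimensional implementation): one text-position vector attends over a list of audio embedding vectors.

```python
import math

def cross_attention(query, keys, values):
    """Single-head scaled dot-product cross-attention (toy sketch).

    query: one text-position vector (from the transcription y).
    keys/values: audio embedding vectors h from the RNN-T Encoder.
    """
    d = len(query)
    # Similarity of the text query to each audio frame, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Numerically stable softmax over the audio frames.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]
```

In the real model this runs per head with learned query/key/value projections; the sketch only shows how a text position pools information from the audio embeddings.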
During inference, the 1st-pass RNN-T model generates the n-best initial hypotheses ŷ_1st = (ŷ¹_1st, …, ŷⁿ_1st) with the standard beam search process. Then, Transformer Rescorer takes the n-best initial hypotheses as well as the full audio embeddings h_1:T from the RNN-T model, computes the log-probability score for each hypothesis (re-scoring), and generates the final best hypothesis ŷ_2nd.
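The re-scoring step above amounts to summing per-token log-probabilities for each hypothesis and keeping the argmax. A minimal sketch, where `log_prob_fn` is a hypothetical stand-in for the Transformer Rescorer forward pass conditioned on the audio embeddings h:

```python
import math

def rescore(nbest, log_prob_fn):
    """Re-score n-best hypotheses and return the best one.

    nbest: list of token sequences from the 1st-pass beam search.
    log_prob_fn(hyp): returns the per-token log-probabilities
        log P(y_u | h, y_<u) for hypothesis hyp (stands in for the
        2nd-pass model's forward pass over audio embeddings h).
    """
    best_hyp, best_score = None, -math.inf
    for hyp in nbest:
        score = sum(log_prob_fn(hyp))  # sequence log-probability
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp, best_score
```

Since each hypothesis is scored independently, the n-best list can be processed in a single batch in practice.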

Joint Audio/Text Training
Our Joint Audio/Text training method aims to leverage a large amount of unpaired text-only data without requiring extra model parameters or latency. To do so, we allow the model to be trained on either 1) paired audio/text data, where both modality inputs are available, or 2) unpaired text data, where only text inputs are available. Our training method is also two-step, as described in Section 2.
In step 2, for training on an unpaired text-only example, we use the estimated averaged audio embeddings, h_avg, of the paired audio/text data in the training set. These averaged audio embeddings, h_avg, can be simply obtained by passing a 0-dummy audio sequence to the well-trained Encoder of RNN-T from step 1. Our Transformer Rescorer takes these averaged audio embeddings instead as the keys and values in cross-attention. Unlike (Sainath et al., 2020; Wang et al., 2020), our approach is based on the Transformer architecture and it does not need to change the original objective function, nor does it require a tuning parameter for multiple losses or any prior knowledge of the inputs. Figure 2 illustrates our proposed Joint Audio/Text training method for Transformer Rescorer. Note that the inference process is the same as for the general rescorer, as described in Section 2.
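The substitution rule can be sketched as a small routing function. All names here are illustrative (the paper does not specify an API); `encoder` stands for the frozen 1st-pass RNN-T Encoder, and `h_avg` is computed once from a 0-dummy audio sequence as described above.

```python
def rescorer_inputs(batch, encoder, h_avg):
    """Pick cross-attention key/value inputs per training example.

    batch: list of (audio, text) pairs; audio is None for unpaired
        text-only examples.
    encoder: the frozen 1st-pass RNN-T Encoder (a callable here).
    h_avg: averaged audio embeddings, obtained once by passing a
        0-dummy audio sequence through the frozen encoder.
    Returns (embeddings, text) pairs for the Transformer Rescorer.
    """
    out = []
    for audio, text in batch:
        # Paired example: real audio embeddings; text-only: h_avg.
        h = encoder(audio) if audio is not None else h_avg
        out.append((h, text))
    return out
```

Because the rescorer's forward pass and loss are unchanged, paired and text-only examples can be mixed freely within a batch.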

Data
Librispeech We first evaluate our approach on the Librispeech English corpus (Panayotov et al., 2015), which is publicly available. The training data contains 960 hours of labeled speech, and we apply spectrum augmentation (Park et al., 2019) and speed perturbation. The additional unpaired text-only corpus contains 810M words and 40M samples, which is almost 27 times bigger than the paired audio/text data. We discuss the effect of the mixing ratio of text-only data in Section 5.1.

Large-Scale In-house Voice Command We also evaluate our approach on our large-scale in-house English dataset. Our in-house training dataset has two sources: 20K hours of publicly shared video data and 20K hours of voice assistant domain data. All videos and audios are completely de-identified. We augment the training data with speed perturbation, simulated room impulse response, and background noise, resulting in 145K hours. Unlike Librispeech, our unpaired text-only corpus is 4 times smaller than the paired audio-text data; it contains 1M samples of in-domain (VA) text-only data and 25M samples of general-domain text-only data. Our evaluation data, VA, has 44.2K de-identified short-form utterances in the voice assistant domain, collected by a third-party data vendor.

Model
For the 1st-pass model, we use an RNN-T model architecture which is widely used in streaming ASR (Graves, 2012b; Graves et al., 2013). The RNN-T consists of an Encoder and a Decoder. Our Encoder has 20 Emformer layers (Shi et al., 2021). We extract 80-channel filterbank features and convert them to 320-dimensional inputs with a stride of 4. The context size is 160ms for the streaming restriction. The Predictor in the Decoder consists of 3 LSTM layers. The Joiner in the Decoder uses 5k word-pieces as our targets. Our 1st-pass RNN-T model has 79M parameters.
For the 2nd-pass model, Transformer Rescorer, we use 2 layers of the conventional Transformer decoder (Vaswani et al., 2017). The RNN-T and Transformer Rescorer are trained in two steps as described in Sections 2 and 3. For the Librispeech experiments, we trained RNN-T for 120 epochs with the ADAM optimizer and a base learning rate of 0.001. For the large-scale in-house experiments, we trained for 15 epochs with the same scheduler until full convergence.
During inference, we used the standard beam search with a beam size of 10 and generated 10-best hypotheses. As described in Section 2, once initial decoding is done by the RNN-T model, the 10-best hypotheses and full-context audio embeddings from RNN-T are passed to Transformer Rescorer in parallel. For the Librispeech experiments, we did not use any external neural language model. For the large-scale in-house experiments, we used a small neural language model (2.5M parameters) for decoding with shallow fusion (Kim et al., 2021a) and ILME (Meng et al., 2021) to compare our method with the best baseline system.
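For readers unfamiliar with the two LM-integration techniques mentioned above, a common way to combine the scores is: shallow fusion adds a weighted external LM log-probability, and ILME additionally subtracts an estimate of the ASR model's internal LM. The sketch below shows this standard combination; the lambda weights are illustrative, not the paper's tuned values.

```python
def fused_score(log_p_asr, log_p_ext_lm, log_p_ilm,
                lam_lm=0.2, lam_ilm=0.1):
    """Combine hypothesis scores with shallow fusion and ILME.

    log_p_asr: ASR sequence log-probability.
    log_p_ext_lm: external neural LM log-probability (shallow fusion).
    log_p_ilm: estimated internal LM log-probability (subtracted
        per ILME so the external LM is not double-counted).
    lam_lm, lam_ilm: interpolation weights (hypothetical values).
    """
    return log_p_asr + lam_lm * log_p_ext_lm - lam_ilm * log_p_ilm
```

During beam search, this fused score replaces the raw ASR score when ranking partial or complete hypotheses.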
We evaluated the speech engine perceived latency (SPL) for the latency analysis. The SPL measures the time from the end of the user's utterance until the ASR engine completes the result. We evaluated the SPL for three models: 1) the 1st-pass RNN-T baseline without the 2nd-pass Transformer Rescorer, 2) the baseline with the 2nd-pass Transformer Rescorer, and 3) the baseline with the 2nd-pass Transformer Rescorer trained with our proposed Joint Audio/Text training method. The averaged SPLs (measured in ms) were 633.0, 636.0, and 636.0 for models 1), 2), and 3), respectively. Overall, although the use of the 2nd-pass model can increase SPL, our proposed Joint Audio/Text training method itself does not increase the latency at all because it does not require any model architecture changes.

Effect of Mixing Ratio
Similar to a previous study (Wang et al., 2020), we observed that the text-only data mixing ratio is crucial for Joint Audio/Text training to succeed for Transformer Rescorer as well. Figure 3 shows the relative WER reduction of the Rescorer on Librispeech test-clean/test-other with different mixing ratios, defined as follows:

mixing ratio = (# of text-only samples) / (# of text-only + audio-text samples)

Note that when we use the entire text-only data that Librispeech provides, the mixing ratio is 96%. The baseline in Figure 3 has a 0% mixing ratio, which means that we used the Rescorer trained only on paired audio/text data. We observed that a 40% mixing ratio of text-only data performed best, and adding more than 80% text-only data performed even worse than the baseline Rescorer.
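One simple way to realize a target mixing ratio during training is to sample each example from the text-only pool with probability equal to the ratio. A sketch under that assumption (the paper does not specify its sampling scheme; names are illustrative):

```python
import random

def mix_stream(paired, text_only, mixing_ratio, n, seed=0):
    """Sample a training stream with a target text-only mixing ratio.

    mixing ratio = (# text-only) / (# text-only + # paired), so each
    draw comes from the text-only pool with probability
    `mixing_ratio`. Sampling is with replacement, so oversampling a
    small text-only pool is implicit.
    """
    rng = random.Random(seed)  # seeded for reproducibility
    stream = []
    for _ in range(n):
        pool = text_only if rng.random() < mixing_ratio else paired
        stream.append(rng.choice(pool))
    return stream
```

With replacement sampling, a 40% ratio can be met even when the text-only pool is much smaller or much larger than the paired pool.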
Based on this observation, we used 40% and 50% mixing ratios for the experiments in Sections 5.2 and 5.3, respectively.

Results on Librispeech
Table 1 shows the WER results on the Librispeech test-clean/test-other evaluation sets with the Baseline (BS), the Baseline with Transformer Rescorer (BS + RS), and the Baseline with Transformer Rescorer trained with our Joint Audio/Text method (BS + RS + Our Joint A/T). As discussed in Section 5.1, we used a 40% mixing ratio to obtain the best results. We observed 8.9% and 9.4% relative WER improvements by using the standard Transformer Rescorer (RS), and 14.9% and 12.2% relative WER improvements by using RS trained with our Joint A/T.

Large-Scale In-house dataset
Table 2 shows the WER results on the 44.2K in-house evaluation set with the Baseline (BS), the Baseline with Transformer Rescorer (BS + RS), and the Baseline with Transformer Rescorer trained with our Joint Audio/Text method (BS + RS + Our Joint A/T). As previously described in Section 4.1, the text-only data in the test domain was only 1% and the text-only data in the general domain was only 19% of the entire training data. In this experiment, we over-sampled the in-domain text-only data to 20% and the out-of-domain text-only data to 30%, thus using a 50% mixing ratio. We observed a 4.9% relative WER improvement by using the standard Rescorer, and a 7.8% relative WER improvement by using the Rescorer trained with our Joint A/T. Surprisingly, our approach was still effective with our strong baseline, which was trained on 145K hours. We also observed that using over-sampled, duplicated in-domain text-only data is more effective than using unique out-of-domain text-only data.

Conclusions
We have introduced a Joint Audio/Text training method for Transformer Rescorer in streaming two-pass end-to-end ASR. Unlike the standard training method for Transformer Rescorer, our method can leverage unpaired text-only data and consequently improves recognition accuracy without requiring extra model parameters or computational latency. We evaluated our approach on the Librispeech dataset as well as a large-scale in-house dataset and showed that Transformer Rescorer with our proposed method obtained a 3-7% relative improvement in WER compared to the standard Transformer Rescorer model.

Limitations
As with the majority of studies, this study has potential limitations. The primary limitation is that the benefit of our training approach, which leverages unpaired text-only data, may be diminished when paired audio/text training data is abundant and in the same domain as the test domain.

Figure 1 :
Figure 1: The two-pass system consists of streaming 1st pass RNN-T model and non-streaming 2nd pass Transformer Rescorer.

Figure 2 :
Figure 2: Our Joint Audio/Text training for Transformer Rescorer. The Rescorer contains both self-attention and cross-attention; the attention dimension is 1024, the feed-forward dimension is 4K, and we use 8 attention heads. Transformer Rescorer takes the hypothesis from the RNN-T Decoder as the query and the 1024-dimensional audio embeddings from the RNN-T Encoder as the keys/values in cross-attention. Our 2nd-pass Transformer Rescorer model has 44M parameters. Note that our Joint Audio/Text training does not require any extra model parameters. The architectures of the 1st-pass RNN-T model and the 2nd-pass Transformer Rescorer are illustrated in Figure 1.

Figure 3 :
Figure 3: WER improvement of the Rescorer with different mixing ratios on Librispeech.

Table 1 :
Comparison of WER on Librispeech with the baseline (BS), BS with the standard rescoring model (BS + RS), and BS with RS trained by our Joint Audio/Text (BS + RS + Our Joint A/T).

Table 2 :
Comparison of WER on the large-scale in-house dataset. VA is our in-house 20K hours of voice assistant domain data (described in Section 4.1).