BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumptions used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner/inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial towards downstream spoken language understanding tasks.


Introduction
The field of natural language processing (NLP) has witnessed remarkable improvements in performance thanks to advances in deep learning-based techniques (Collobert et al., 2011; Bahdanau et al., 2015; Sutskever et al., 2014; Vaswani et al., 2017; Young et al., 2018). Much of the recent progress in NLP lies in large-scale language models (LMs) (Devlin et al., 2019; Brown et al., 2020), which are pre-trained on a vast amount of text data to learn versatile linguistic knowledge (Tenney et al., 2019). Such pre-trained models have been shown to improve diverse NLP tasks, alleviating the heavy requirement of supervised training data. Inspired by the great success in NLP, pre-trained LMs have been actively adopted for speech processing tasks, including automatic speech recognition (ASR) (Shin et al., 2019; Huang et al., 2021), spoken language understanding (SLU) (Chuang et al., 2020; Chung et al., 2021), and text-to-speech synthesis (Hayashi et al., 2019; Kenter et al., 2020).
This paper focuses on leveraging pre-trained LMs for end-to-end ASR (E2E-ASR), which aims to model direct speech-to-text conversion (Graves and Jaitly, 2014; Chorowski et al., 2015; Chan et al., 2016). One of the challenges in E2E-ASR is the huge discrepancy between input and output sequences; the input is a continuous acoustic signal with fine-grained patterns, while the output is a sequence of discrete linguistic symbols (e.g., words) with long-range dependencies. Such an input-output gap makes it difficult for an E2E-ASR model to extract semantic/morphosyntactic information from speech, which is essential for generating proper text. We believe this limitation can be mitigated by taking advantage of the rich linguistic representations obtained from pre-trained LMs.
We explore a novel direction for adopting a pre-trained masked language model (MLM) for E2E-ASR, based on connectionist temporal classification (CTC) (Graves et al., 2006). Compared to other autoregressive approaches, such as RNN-Transducer (RNN-T) (Graves, 2012) and attention-based sequence-to-sequence (Chorowski et al., 2015), CTC's non-autoregressive formulation allows simple training and inference processes for realizing E2E-ASR. However, the performance of CTC is often limited due to a conditional independence assumption between output tokens (Chiu et al., 2018). In this work, we propose BERT-CTC that adapts BERT (Devlin et al., 2019) for CTC to mitigate the conditional independence assumption. BERT-CTC conditions CTC outputs on context-aware BERT embeddings, thereby incorporating explicit linguistic information into training/inference. The BERT-conditional formulation enables a model to attend to the full contexts of the input and hypothesized output sequences via the self-attention mechanism, while maintaining the benefits of a simple training algorithm in CTC. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines outputs with flexible length adjustment.
The key contributions of this work are summarized as follows: • We propose BERT-CTC, which efficiently adapts a pre-trained MLM for CTC-based E2E-ASR without fine-tuning. We provide a probabilistic formulation of our BERT-CTC and its close relation to conventional approaches, i.e., CTC and RNN-T.
• We evaluate BERT-CTC in various ASR tasks, which demonstrates its effectiveness regardless of variations in speaking styles and languages. We also show its potential application to end-to-end SLU.
• The code and recipes are open-sourced in ESPnet (Watanabe et al., 2018), the widely used toolkit for end-to-end speech processing. We hope our work encourages further research on combining ASR with pre-trained LMs, helping to bridge the ASR and NLP fields.

Background
To understand how BERT-CTC exploits BERT for relaxing the conditional independence assumption in CTC, we start with a brief review of probabilistic formulations of conventional E2E-ASR approaches, including CTC (Graves et al., 2006;Graves and Jaitly, 2014) and RNN-T (Graves, 2012;Graves et al., 2013).
Definition of End-to-End ASR  Let O = (o_t ∈ R^D | t = 1, …, T) be an input sequence of length T, and W = (w_n ∈ V | n = 1, …, N) be the corresponding output sequence of length N. Here, o_t is a D-dimensional acoustic feature at frame t, w_n is an output token at position n, and V is a vocabulary. In general, the output length is much shorter than the input length (i.e., N ≪ T). The objective of ASR is to find the most probable output sequence Ŵ that corresponds to a given input:

Ŵ = argmax_{W ∈ V*} p(W|O),    (1)

where V* denotes all possible token sequences. E2E-ASR aims to realize the direct mapping from O to W by modeling the posterior distribution p(W|O) with a single deep neural network.

Connectionist Temporal Classification
CTC formulates E2E-ASR by considering all possible alignments between an input sequence O and output sequence W. To align the sequences at the frame level, CTC augments an output sequence by allowing repetitions of the same token and inserting a blank symbol ε representing "no output token" (e.g., silence). Let A denote an augmented output sequence defined as A = (a_t ∈ V ∪ {ε} | t = 1, …, T), which we refer to as an alignment between O and W. With the introduction of the frame-level alignment, CTC factorizes p(W|O) as follows:

p(W|O) = \sum_{A ∈ B_ctc^{-1}(W)} p(W|A, O) p(A|O)    (2)
       ≈ \sum_{A ∈ B_ctc^{-1}(W)} p(A|O),    (3)

where B_ctc is the collapsing function (Graves et al., 2006) that maps A to W by suppressing repeated tokens and removing blank symbols, and B_ctc^{-1}(W) is the set of all possible CTC alignments compatible with W. To obtain Eq. (3), CTC makes a conditional independence assumption on O in Eq. (2), and we assume p(W|A) = 1, as W can be determined uniquely from A by the collapsing function.
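As a minimal sketch, the collapsing function B_ctc can be written in a few lines; here tokens are plain strings and "ε" stands in for the blank symbol (both are illustrative choices, not the paper's implementation):

```python
def ctc_collapse(alignment, blank="ε"):
    """B_ctc: merge repeated tokens, then drop blanks (a sketch)."""
    out = []
    prev = None
    for a in alignment:
        # a token is emitted only when it differs from the previous
        # frame's symbol and is not the blank
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return out
```

Note that the order matters: repeats are merged before blanks are removed, so a blank between two identical tokens (e.g., "a ε a") preserves both of them.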
The joint probability p(A|O) is further factorized using the probabilistic chain rule as

p(A|O) = \prod_{t=1}^{T} p(a_t | a_1, …, a_{t-1}, O) ≈ \prod_{t=1}^{T} p(a_t | O).    (4)

In Eq. (4), CTC makes a conditional independence assumption between output tokens, where p(A|O) is approximated as the product of token emission probabilities at each time frame. The conditional probability p(a_t|O) in Eq. (4) is computed as

p(a_t|O) = Softmax(Linear(h^ae_t)),    (5)
H^ae = AudioEnc(O).    (6)

In Eq. (5), Softmax(·) is a softmax function, and Linear(·) is a linear projection layer. AudioEnc(·) in Eq. (6) is an audio encoder network that embeds the speech input into a sequence of d_ae-dimensional hidden vectors H^ae = (h^ae_t ∈ R^{d_ae} | t = 1, …, T).

Training  The objective function of CTC is defined by the negative log-likelihood of Eq. (4) over all possible alignments:

L_ctc(O, W) = -\log \sum_{A ∈ B_ctc^{-1}(W)} \prod_{t=1}^{T} p(a_t|O).    (7)

The summation in Eq. (7) is efficiently computed via dynamic programming (Graves et al., 2006).
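The dynamic program behind Eq. (7) can be sketched as the standard forward recursion over the blank-extended label sequence. This is a toy reference implementation (not the optimized version used in practice), assuming per-frame log-probabilities are given as a T×|V ∪ {ε}| list of lists with the blank at index 0:

```python
import math

def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """-log p(W|O) summed over all alignments (Eq. (7)), computed with
    the forward recursion over the blank-extended sequence ε,w1,ε,w2,...,ε."""
    T = len(log_probs)
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    S = len(ext)
    NEG = float("-inf")

    def logadd(a, b):
        if a == NEG:
            return b
        if b == NEG:
            return a
        m = max(a, b)
        return m + math.log(math.exp(a - m) + math.exp(b - m))

    # initialization: start in the leading blank or the first label
    alpha = [NEG] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]
    for t in range(1, T):
        new = [NEG] * S
        for s in range(S):
            acc = alpha[s]                      # stay
            if s >= 1:
                acc = logadd(acc, alpha[s - 1])  # advance by one
            # skip over a blank, allowed only between distinct labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                acc = logadd(acc, alpha[s - 2])
            new[s] = acc + log_probs[t][ext[s]]
        alpha = new
    # valid paths end in the final label or the trailing blank
    return -logadd(alpha[S - 1], alpha[S - 2] if S > 1 else NEG)
```

For small T and |V|, the result can be checked against brute-force enumeration of all alignments collapsed by B_ctc.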
Inference  The most probable token sequence in Eq. (1) is estimated using the best path decoding algorithm (Graves et al., 2006). The algorithm first obtains the most probable alignment Â in a greedy manner, concatenating the most probable token at each frame: â_t = argmax_{a_t} p(a_t|O). The most probable token sequence Ŵ is then obtained by applying the collapsing function to Â as Ŵ = B_ctc(Â).
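Best path decoding thus reduces to a per-frame argmax followed by the collapsing function; a self-contained sketch, with the blank at index 0:

```python
def best_path_decode(probs, blank=0):
    """Greedy CTC decoding: argmax token per frame, then collapse."""
    path = [max(range(len(row)), key=row.__getitem__) for row in probs]
    out, prev = [], None
    for a in path:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return out
```

This is not guaranteed to find the globally most probable *token sequence* (which would require summing over alignments), only the most probable single alignment.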

RNN-Transducer
CTC estimates the distribution over alignments depending only on the speech input (Eq. (4)). Thus, by definition, CTC cannot consider output dependencies, preventing a model from properly capturing the multimodal distribution of target token sequences (Gu et al., 2018). RNN-T overcomes this problem by explicitly conditioning each token prediction on the previous non-blank output tokens (w_1, …, w_{n-1}). Let Z = (z_u ∈ V ∪ {ε} | u = 1, …, T + N) be an alignment used in RNN-T. RNN-T factorizes p(W|O) similarly to Eq. (3) as

p(W|O) ≈ \sum_{Z ∈ B_rnnt^{-1}(W)} p(Z|O),    (8)

where B_rnnt is the collapsing function of RNN-T (Graves, 2012) that maps Z to W. The joint probability p(Z|O) is factorized using the probabilistic chain rule without the conditional independence assumption (cf. Eq. (4)) as

p(Z|O) = \prod_{u=1}^{T+N} p(z_u | z_1, …, z_{u-1}, O)    (9)
       ≈ \prod_{u=1}^{T+N} p(z_u | w_1, …, w_{n_u - 1}, O),    (10)

where n_u is the number of tokens predicted up to index u. From Eq. (9) to Eq. (10), RNN-T assumes (z_1, …, z_{u-1}) ≈ (w_1, …, w_{n_u - 1}), which is reasonable since W can be determined uniquely by the collapsing function. The conditional probability p(z_u | w_1, …, w_{n_u - 1}, O) is computed as

p(z_u | w_1, …, w_{n_u - 1}, O) = Softmax(JointNet(h^ae_t, h^pn_{n_u})),    (11)
h^pn_{n_u} = PredictionNet(w_1, …, w_{n_u - 1}).    (12)

In Eq. (11), h^ae_t is obtained from the audio encoder (Eq. (6)), and JointNet(·) is a joint network that combines the audio and token representations, h^ae_t and h^pn_{n_u}, using a linear projection layer. In Eq. (12), PredictionNet(·) is a prediction network that encodes the previous non-blank output tokens into a hidden vector h^pn_{n_u}. The adoption of the prediction network, which explicitly captures causal dependencies in the output, is the main difference from CTC.
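Unlike B_ctc, the RNN-T collapsing function only removes blanks, so repeated tokens survive; a sketch under the convention above that a valid alignment has length T + N with exactly T blanks (one per frame advance):

```python
def rnnt_collapse(alignment, blank="ε"):
    """B_rnnt: drop blanks; repeated tokens are kept (unlike CTC)."""
    return [z for z in alignment if z != blank]

def is_valid_rnnt_alignment(alignment, T, N, blank="ε"):
    """A valid RNN-T alignment emits N tokens and consumes T frames."""
    return len(alignment) == T + N and alignment.count(blank) == T
```

The contrast with CTC is visible on a repeated word: the alignment ("a", "ε", "a", "ε") collapses to ("a", "a") under B_rnnt, whereas CTC would need the intervening blank to keep both tokens.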
Training The RNN-T loss L rnnt (O, W ) is defined by the negative log-likelihood of Eq. (10). Similar to the CTC objective in Eq. (7), the summation over alignments is efficiently computed using dynamic programming (Graves, 2012).
Inference RNN-T estimates the most probable token sequenceŴ using the beam search algorithm proposed in (Graves, 2012).

BERT-CTC
Overview  In Fig. 1, we compare our proposed E2E-ASR model, BERT-CTC, to CTC and RNN-T. BERT-CTC leverages powerful representations from BERT (Devlin et al., 2019) to make CTC training/inference explicitly conditioned on linguistic information (Fig. 1(a) vs. Fig. 1(c)). We use BERT as a feature extractor for a (masked) token sequence, whose parameters are frozen during training. BERT-CTC is similar to RNN-T in that audio and token representations are fused to estimate the distribution over alignments (Fig. 1(b) vs. Fig. 1(c)). However, BERT-CTC attends to the full contexts of the input and output sequences via the self-attention mechanism (Vaswani et al., 2017), which helps the model learn inner/inter-dependencies within/between the sequences.

Formulation
BERT-CTC is formulated by introducing a partially masked (or partially observed) sequence W̃ = (w̃_n ∈ V ∪ {[MASK]} | n = 1, …, N), which is obtained by replacing some tokens in an output sequence W with a special mask token [MASK]. Note that during inference, we apply masks to a hypothesized sequence Ŵ to obtain a masked sequence. Considering all possible W̃, the conditional probability p(W|O) is factorized as follows:

p(W|O) = \sum_{W̃ ∈ A(W)} p(W|W̃, O) p(W̃|O),    (13)

where A(W) covers W with all possible masking patterns. Here, we interpret p(W̃|O) as a prior distribution over sequences consisting of observed tokens that are easily recognized from speech input alone; the other masked tokens are difficult and require contextual information to be determined (e.g., homophones), which is modeled by p(W|W̃, O). We further describe this interpretation in the training (§3.1) and inference (§3.2) sections.
The conditional probability p(W|W̃, O) is further factorized by using the CTC alignment as

p(W|W̃, O) = \sum_{A ∈ B_ctc^{-1}(W)} p(W|A, W̃, O) p(A|W̃, O)    (14)
           ≈ \sum_{A ∈ B_ctc^{-1}(W)} p(W|W̃) p(A|W, O).    (15)

In Eq. (15), we make two conditional independence assumptions. The first is that, given W and O, W̃ is not required to determine A. This is reasonable because W already contains the tokens observed in W̃, and it helps avoid combining all possible masked sequences and alignments (i.e., A × B_ctc^{-1}). The second is that, given W̃, O is not required to determine W. We consider p(W|W̃) a strong prior modeled by a pre-trained MLM (i.e., BERT), which can be achieved without the observation of O. We empirically show that this assumption holds in §7.3. Similar to CTC, the joint probability p(A|W, O) is factorized using the probabilistic chain rule as

p(A|W, O) = \prod_{t=1}^{T} p(a_t | a_1, …, a_{t-1}, W, O) ≈ \prod_{t=1}^{T} p(a_t | W, O).    (16)

To obtain Eq. (16), we make the same conditional independence assumption as in CTC. However, compared to Eq. (4), Eq. (16) is conditioned on an output sequence W, enabling a model to explicitly use linguistic information to estimate the distribution over alignments. This is somewhat similar to RNN-T (Eq. (10)), but differs in that BERT-CTC attends to the whole context (w_1, …, w_N). We discuss this advantage in §7.1. Substituting Eq. (16) into Eq. (15), we model the product of p(a_t|W, O) and p(W|W̃) as

p(a_t|W, O) p(W|W̃) ≈ p(a_t | BERT(W̃), O),    (17)

where BERT(·) is the output of BERT representing the distribution of target sequences. This enables Eq. (17) to be realized with a single differentiable model, allowing the whole network to be trained end-to-end. The conditional probability p(a_t | BERT(W̃), O) is computed as

p(a_t | BERT(W̃), O) = Softmax(Linear(SelfAttn_t(H^ae ⊕ H^bert))),    (18)
H^bert = BERT(W̃).    (19)

In Eq. (18), SelfAttn_t(·) indicates the t-th output of stacked Transformer self-attention layers (Vaswani et al., 2017), which consume the concatenation (⊕) of H^ae (from Eq. (6)) and H^bert. In Eq. (19), BERT(·) embeds a masked sequence W̃ into a sequence of d_bert-dimensional hidden vectors H^bert = (h^bert_n ∈ R^{d_bert} | n = 1, …, N).
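To make the fusion in Eq. (18) concrete, here is a toy single-head self-attention over the concatenated audio and token embeddings, assuming both have already been projected to a shared width d; the real model stacks several multi-head Transformer layers with learned projections, so this is only a structural sketch:

```python
import numpy as np

def fused_self_attention(h_ae, h_bert):
    """Toy single-head self-attention over the concatenation of the
    audio embeddings h_ae (T x d) and token embeddings h_bert (N x d).
    Returns the T frame-level outputs that would feed the CTC softmax."""
    x = np.concatenate([h_ae, h_bert], axis=0)     # (T + N, d)
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                  # every position attends to all T + N
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    out = weights @ x
    return out[: h_ae.shape[0]]                    # frame-level slice for CTC
```

Because queries and keys span both segments, each frame-level output mixes information from the whole utterance and the whole (masked) token sequence, which is the inner/inter-dependency learning described above.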

Training
The BERT-CTC objective is defined by the negative log-likelihood of Eq. (13) expanded with Eq. (15):

L(O, W) = -\log \sum_{W̃ ∈ A(W)} \sum_{A ∈ B_ctc^{-1}(W)} p(W|W̃) p(A|W, O) p(W̃|O).    (20)

To deal with the intractable marginalization over W̃ in Eq. (20), we rewrite it as an expectation with respect to the sampling distribution A(W), whose upper bound can be derived using Jensen's inequality as

L_bc(O, W) = E_{W̃ ∼ A(W)} [ -\log \sum_{A ∈ B_ctc^{-1}(W)} \prod_{t=1}^{T} p(a_t | BERT(W̃), O) ],    (21)

where L_bc is the loss for BERT-CTC training. Compared with the CTC objective (Eq. (7)), each token prediction in Eq. (21) is explicitly conditioned on the contextual embedding from BERT. This relaxes the conditional independence assumption between outputs while retaining the same optimization strategy as in CTC. For sampling W̃ from A(W) in Eq. (21), we first draw the number of masked tokens from a uniform distribution as M ∼ Uniform(1, N). Then, M tokens in the ground-truth sequence W are randomly selected and replaced with [MASK], similar to (Ghazvininejad et al., 2019).
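The masking-pattern sampling described above can be sketched as follows (the function name and the [MASK] string are illustrative, not the paper's implementation):

```python
import random

def sample_masked_sequence(tokens, rng, mask="[MASK]"):
    """Sample W-tilde from A(W): draw M ~ Uniform(1, N), then replace M
    random positions of the ground-truth sequence with the mask token."""
    n = len(tokens)
    m = rng.randint(1, n)                 # number of tokens to mask (inclusive bounds)
    idx = set(rng.sample(range(n), m))    # distinct positions to mask
    return [mask if i in idx else w for i, w in enumerate(tokens)]
```

During training, a fresh W̃ is drawn for every example in every batch, so the model sees the full range of masking ratios from one token to the entire sequence.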

Hierarchical Loss
We apply an auxiliary CTC loss to the audio encoder output in a hierarchical multi-tasking manner (Fernández et al., 2007; Sanabria and Metze, 2018). As the vocabulary size of BERT is often too large for ASR training, we train the audio encoder to predict a sequence W′ = (w′_l ∈ V′ | l = 1, …, L) tokenized with a smaller vocabulary V′ (i.e., |V′| ≪ |V|). This has been shown effective for training ASR with a sparse word-level vocabulary (Higuchi et al., 2022). The BERT-CTC loss is combined with the hierarchical CTC loss as

L(O, W, W′) = L_bc(O, W) + λ_ctc L_ctc(O, W′),    (22)

where λ_ctc is a tunable parameter. We investigate the importance of the hierarchical loss in §7.1.

Inference
The most probable token sequence Ŵ is estimated by solving Eq. (1) for Eq. (13) as

Ŵ = argmax_{W ∈ V*} \sum_{W̃ ∈ A(W)} p(W|W̃, O) p(W̃|O)    (23)
  ≈ argmax_{W ∈ V*} max_{W̃ ∈ A(W)} p(W|W̃, O) p(W̃|O).    (24)

From Eq. (23) to Eq. (24), we make the Viterbi approximation to deal with the intractable summation over all possible masked sequences. To solve Eq. (24), we design a mask-predict algorithm (Ghazvininejad et al., 2019) assisted by CTC inference, inspired by (Chan et al., 2020; Higuchi et al., 2020). See Table 4 for a decoding example and Appendix A for pseudocode. The algorithm first initializes a target sequence with an estimated length, followed by k = 1, …, K iterations of token masking and prediction steps.
Initialization (k = 1)  BERT-CTC is non-autoregressive, and the length N̂ of a target sequence needs to be given in advance to start decoding (Gu et al., 2018). We determine the target length based on the auxiliary sequence Ŵ′ predicted from the audio encoder output H^ae, setting N̂ from |Ŵ′|. Given the estimated length, we initialize a masked sequence W̃^(k=1) by filling all N̂ positions with the mask token [MASK]. By feeding H^ae and H^bert (= BERT(W̃^(k=1))) to the self-attention module, a hypothesized sequence Ŵ^(k=1) is obtained via CTC inference. Here, Ŵ^(k=1) is predicted only from speech without any observation of output tokens, as they are all masked.

Token Masking and Prediction Steps (k > 1)  Given the hypothesis Ŵ^(k), the tokens with the lowest probability scores are masked to form W̃^(k+1) (see Appendix A), following Eq. (24). H^ae and H^bert (= BERT(W̃^(k+1))) are then fed to the self-attention module to generate the next hypothesis Ŵ^(k+1). Here, the prediction of Ŵ^(k+1) is conditioned on the contextual embedding obtained from BERT. Similar to (Chan et al., 2020; Chi et al., 2021), BERT-CTC inference repeatedly predicts a target sequence at the alignment level, which does not require an additional mechanism (Gu et al., 2018; Higuchi et al., 2021b) for adjusting the target length over iterations. Moreover, BERT-CTC considers output dependencies at the token level, making it better suited for capturing linguistic information.

BERT-CTC for End-to-End SLU
In addition to E2E-ASR, BERT-CTC can jointly model end-to-end SLU by extending Eq. (18) as

p(y | W̃, O) = Softmax(Linear(SelfAttn_{T+1}(H^ae ⊕ H^bert))),    (26)

where y ∈ Y is an intent label in a set of intents Y. Note that SelfAttn_{T+1}(·) indicates the (T+1)-th output of the self-attention module, which corresponds to the [CLS] token of BERT.
Training  The loss is defined by adding Eq. (22) and the negative log-likelihood of Eq. (26) as

L_slu(O, W, W′, y) = L(O, W, W′) - λ_slu \log p(y | W̃, O),    (27)

where λ_slu is a tunable parameter.
Inference  The most probable label ŷ can be estimated at any iteration of BERT-CTC inference by ŷ = argmax_y p(y|W̃, O). When k = 1, the label is predicted only from audio information; when k = K, it is predicted with full access to both audio and linguistic information.

Additional Related Work
End-to-End ASR with MLM  Inspired by the great success in non-autoregressive neural machine translation, the conditional masked language model (CMLM) (Ghazvininejad et al., 2019) has been adopted for E2E-ASR. Audio-CMLM (A-CMLM) (Chen et al., 2020) trains an E2E-ASR model with an MLM objective (Devlin et al., 2019), making token predictions conditioned on both the speech input and a partially masked target sequence. Imputer (Chan et al., 2020) and Mask-CTC (Higuchi et al., 2020, 2021b) have introduced CTC into CMLM-based modeling, where the mask-predict algorithm is used to refine a frame-level or token-level sequence predicted by CTC. Our method of combining CTC and MLM is related to the above studies but conceptually different in that BERT-CTC aims to relax the conditional independence assumption used in CTC by leveraging an external pre-trained MLM (i.e., BERT) for contextual embedding.
LM Integration for End-to-End ASR  There is a line of prior studies seeking to integrate an external LM into E2E-ASR. Shallow fusion has been the most widely used approach (Hannun et al., 2014; Gulcehre et al., 2015; Chorowski and Jaitly, 2017; Kannan et al., 2018), which linearly interpolates the output probabilities of an E2E-ASR model and an external LM. Deep fusion (Gulcehre et al., 2015) is a more structured approach, where an E2E-ASR model is jointly trained with an external LM to learn the optimal combination of audio and linguistic information in a latent space. Cold fusion (Sriram et al., 2018) and component fusion (Shan et al., 2019) have further improved deep fusion with a gating mechanism that learns a more sophisticated combination of the two models.
Our approach can be seen as a variant of cold fusion in that an external pre-trained MLM is fused with a CTC-based E2E-ASR model, selectively combining audio and linguistic representations via the self-attention mechanism. However, BERT-CTC is a novel direction in which we seek to integrate BERT into a CTC-based model in a theoretically sound manner.

Experiments
We used the ESPnet toolkit (Watanabe et al., 2018) for all the experiments. All the implementations and recipes are made publicly available (see §1).
Spoken Language Understanding We also evaluated our model on the SLURP dataset (Bastianelli et al., 2020). SLURP consists of English prompts of an in-home personal robot assistant, and we focused on the intent classification task.
We used the standard development and test sets for tuning hyper-parameters and evaluating performance for each dataset. Full dataset descriptions are in Appendix D.1.

End-to-End ASR Models
CTC (baseline): A model trained based on the CTC loss L_ctc (see §2.1). Given the recent advances in CTC-based modeling (Higuchi et al., 2021a), we built a strong baseline using the intermediate CTC technique (Tjandra et al., 2020), which applies an auxiliary CTC loss to intermediate outputs of the audio encoder. We used the intermediate loss in a hierarchical manner (Sanabria and Metze, 2018), where the loss is calculated using a target sequence tokenized with a smaller vocabulary (i.e., V′ in §3.1).

RNN-T (baseline):
A model trained based on the RNN-T loss L_rnnt (see §2.2). Considering the recent techniques developed upon multi-task learning (Boyer et al., 2021), we trained a strong model using an auxiliary CTC loss applied to the audio encoder output (Jeon and Kim, 2021). As with CTC, we enhanced the audio encoder with intermediate CTC. All the CTC losses were calculated using the smaller-vocabulary sequence.

BERT-CTC (ours): The proposed model trained based on the BERT-CTC loss (Eq. (22)). As in the other models, we adopted intermediate CTC for the audio encoder, and all the CTC losses were calculated using the smaller-vocabulary sequence.
See Appendices B and C for intermediate CTC and detailed model descriptions, respectively.

Experimental Settings
Model Configuration  For the audio encoder, we adopted the Conformer architecture (Gulati et al., 2020), which consisted of 12 encoder blocks. The prediction network in RNN-T was a single long short-term memory (LSTM) layer. The self-attention module in BERT-CTC had 6 Transformer encoder blocks, and we used a BERT_BASE model provided by HuggingFace (Wolf et al., 2020).
Tokenization  For each language, we used the same vocabulary as BERT for tokenizing target texts. We also constructed a smaller vocabulary V′ for the hierarchical losses, obtained by applying the byte pair encoding-based algorithm (Sennrich et al., 2016) to the transcriptions of each dataset.
Training We mostly followed ESPnet recipes provided for each dataset. For BERT-CTC, we set λ ctc (in Eq. (22)) to 0.3 for all the ASR tasks and λ slu (in Eq. (27)) to 1.0 for the SLU task.
Inference For CTC, we performed the best path decoding ( §2.1). For RNN-T, we used the beam search decoding ( §2.2) with a beam size of 20. For BERT-CTC, unless otherwise indicated, the number of iterations K was always set to 20 ( §3.2).
Detailed experimental settings for reproducibility are in Appendix D.

Results
Speech Recognition  Table 1 shows results on LibriSpeech-100h and TED-LIUM2 in word error rate (WER), and on AISHELL-1 in character error rate (CER). While RNN-T slightly outperformed CTC on several evaluation sets of LibriSpeech-100h and AISHELL-1, CTC performed better on TED-LIUM2. RNN-T was ineffective at training ASR with the BERT vocabulary, particularly when a severe mismatch exists with the target ASR domain (i.e., Wikipedia vs. lecture). BERT-CTC significantly outperformed the baselines, consistently achieving the best results on all datasets. BERT-CTC improved over RNN-T, and we attribute this not only to considering the whole context of the target sequence but also to using the powerful representations from BERT, which we analyze further later. In Appendix E, we compare our AISHELL-1 results to those from recent works and show that our approach is on par with the state-of-the-art (Zheng et al., 2021) with fewer parameters. Figure 2 illustrates the relationship between BERT-CTC results and the number of decoding iterations. When decoded with K = 1, the model only uses speech input to predict a token sequence. By increasing K, the model beneficially exploited the BERT knowledge for refining the output tokens.


Spoken Language Understanding
Table 2 lists the results of the SLURP intent classification task, evaluated in accuracy. We refer to the ESPnet-SLU (Arora et al., 2022) result as a baseline, which performs SLU along with ASR by prepending an intent label to the corresponding output sequence. We also refer to the ESPnet-SLU result obtained by stacking BERT on top of an ASR model, which was found to be less effective. BERT-CTC outperformed the baselines by effectively incorporating acoustic and linguistic information. By decoding in a single iteration (K = 1), BERT-CTC predicted an intent only from speech, and the accuracy was already higher than those of the baselines. We observed a slight but clear gain by increasing K, which improved both ASR and SLU performance thanks to BERT. We note that our result outperforms the state-of-the-art accuracy of 86.9% reported in (Seo et al., 2022).

Ablation Studies
To validate the effectiveness of our model design for BERT-CTC, we conduct ablation studies (Table 3) on the usage of hierarchical loss and BERT.
Hierarchical Loss  We observed that hierarchical CTC helped all the models improve their performance by a large margin. As the vocabulary of BERT is generally too large for E2E-ASR, hierarchical modeling was crucial for predicting the sparse word-level tokens. Moreover, the result indicates that the hierarchical loss is effective for training an ASR model with a vocabulary from a different domain, as there is a non-negligible domain mismatch between the BERT training text and the ASR transcriptions.
BERT  To ablate BERT from BERT-CTC, we replaced BERT(·) in Eq. (19) with a simple embedding layer with positional encoding. We found that removing BERT degraded BERT-CTC performance, which supports the importance of using BERT. Interestingly, however, the result was still better than the baselines, indicating the advantage over RNN-T that BERT-CTC is capable of considering the bi-directional context.


Error Analysis with Decoding Example
Table 4 shows the process of BERT-CTC inference, decoding an utterance in the LibriSpeech test set:

k = 1:     ... thou a gave meet any one afterter these hour recite aught of courtry whether he be ne'er ...
k = 10:    ... thou a again meet any one afterter these hour reciteiting aught of poetryry whether he be near'er ...
k = 15:    ... thou again meet any one after this hour reciteiting aught of poetryry whether he be near'or ...
k = 20:    ... thou again meet any one after this hour reciteiting aught of poetry whether he be near ...
w/o BERT:  ... thou a gag meet any one after this hour residing aught of boy whether he be near ...
Reference: ... thou again meet any one after this hour reciting aught of poetry whether he be near ...

Table 4: Decoding example from the LibriSpeech test-other set (2033-164914-0016). At each iteration, the highlighted tokens are masked and re-predicted in the next iteration; blue indicates refined tokens, and red indicates tokens that were not refined.

In the output sequence at k = 1, the model mistakenly predicted phonetically similar tokens (e.g., "again"→"a gave", "near"→"ne'er"). At the first iteration, the model was conditioned only on acoustic information, making it challenging to determine target tokens accurately. As the iterations proceeded, the model corrected most of the errors by considering the output dependency. Unlike the original mask-predict algorithm (Ghazvininejad et al., 2019), our approach permits flexible adjustment of the target length, enabling the model to resolve insertion and deletion errors (e.g., "afterter"→"after"). We also show an example obtained without BERT (from Table 3), which failed to recover tokens that were correctly recognized by BERT-CTC with BERT.

Conditional Independence of p(W|W̃, O)
We empirically validate the conditional independence assumption made in Eq. (15), where the output sequence W depends only on its masked sequence W̃ without the audio information O. To this end, we augmented the BERT module by inserting adaptive cross-attention layers, similar to Adapter-BERT Networks (Guo et al., 2020). These additional layers are trained to infuse the audio encoder output H^ae into each BERT layer, thereby allowing BERT-CTC to realize p(W|W̃, O). When evaluated on LibriSpeech, the modified BERT-CTC resulted in 7.2%/17.9% on the dev sets and 7.3%/18.0% on the test sets, which are worse than the results in Table 1. This indicates that BERT already captures sophisticated linguistic information and does not require extra parameters to adapt it to audio input.

Attention Visualization
Figure 3 depicts example attention weight matrices produced by the second self-attention layer of BERT-CTC. We observed two major attention patterns: weights aligning the audio and token sequences by capturing inter-dependencies between them (Fig. 3 left), and weights capturing inner-dependencies within each sequence (Fig. 3 right). These patterns support our motivation for the BERT-CTC design of learning inner/inter-dependencies within/between the audio and token representations.

Inference Speed Comparison
To see how the iterative decoding with BERT affects the inference speed of BERT-CTC, we evaluated each model in terms of the real-time factor (RTF). RTF was measured on the LibriSpeech test-other set using a single GPU with a batch size of 1, or a single CPU. RTFs for GPU / CPU inference were 7.91e-3 / 4.18e-2 for CTC, 4.81e-1 / 4.55 for RNN-T, and 9.72e-2 / 7.22e-1 for BERT-CTC. The semi-autoregressive characteristic of BERT-CTC enabled faster inference than autoregressive RNN-T and provided further speedup through parallel computing on a GPU.
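For reference, RTF is simply the ratio of decoding time to the duration of the audio decoded; a one-line helper makes the convention explicit:

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = time spent decoding / duration of the audio decoded.
    Values below 1.0 mean faster-than-real-time recognition."""
    return processing_seconds / audio_seconds
```

Under this convention, the CTC GPU figure of 7.91e-3 means the model decodes speech roughly 126 times faster than real time.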

Conclusion
We proposed BERT-CTC, which leverages BERT to relax the conditional independence assumption in CTC. BERT-CTC uses BERT as contextual embedding to explicitly condition CTC training/inference on linguistic information. Experimental results showed that BERT-CTC improved over conventional approaches. Moreover, we confirmed that BERT-CTC is applicable to end-to-end SLU.

Limitations
Vocabulary Constraint  The output unit of BERT-CTC is constrained to the vocabulary of BERT, which may not generalize well to an ASR domain and can be too sparse for ASR training. Table 5 shows results on LibriSpeech-100h with different vocabularies, where V_asr is an ASR vocabulary of size 300 constructed from LibriSpeech transcriptions, and V_bert is the BERT vocabulary of size 30522. We observed that, by using V_asr, the performance of CTC and RNN-T improved over the results using V_bert and closed the gap with the BERT-CTC results. We believe that using a BERT variant with a smaller vocabulary, e.g., CharacterBERT (El Boukkouri et al., 2020), would improve BERT-CTC further.
Computational Cost  BERT-CTC requires a high computational cost, especially during inference, due to the iterative forward calculations of BERT (i.e., K = 20 times) with the O(N^2) computational and memory complexities of the self-attention layers. Still, GPUs can greatly accelerate inference, and BERT-CTC can alternatively use lighter-weight pre-trained MLMs, e.g., ALBERT (Lan et al., 2019) and DistilBERT (Sanh et al., 2019).
Non-streaming BERT-CTC is not suited for online streaming scenarios, where output tokens are predicted synchronously with the incoming speech. This is not a significant problem for utterance-level ASR tasks, such as the end-to-end SLU task for which we demonstrated BERT-CTC's capability (Table 2). Otherwise, existing techniques can be adopted to make BERT-CTC streamable, e.g., causal masking (Vaswani et al., 2017), time-restricted attention (Povey et al., 2018), and block-wise processing (Tsunoo et al., 2019). Another solution is to apply the two-pass algorithm (Sainath et al., 2019), where BERT-CTC first performs streaming recognition at k = 1 and then refines the outputs using the full context information at k > 1.

A Inference Algorithm

Algorithm 1 BERT-CTC Inference
Input: the number of iterations K, audio encoder output H^ae
 1: … ▷ Obtain the most probable alignment from the audio encoder
 2: Ŵ′ = B_ctc(Â′)
 3: N̂ = |Ŵ′| ▷ Obtain the target length from the intermediate prediction
 4: Ŵ = (ŵ_n = [MASK] | n = 1, …, N̂) ▷ Initialize a masked sequence
 5: for k = 1, …, K do
 ⋮

Algorithm 1 describes the overall process of BERT-CTC inference. For estimating the target length in line 3, at the implementation level, we first decode Ŵ′ into a sentence, which is then tokenized using the BERT vocabulary, and the length of the resulting sequence is used as the target length. In lines 12-25, before the token masking step, we calculate a probability score p̂_n for each token ŵ_n in the estimated output sequence Ŵ. This score is simply the maximum of the frame-level token probabilities that correspond to the predicted token ŵ_n after the collapsing operation. In line 28, given the probability scores, MaskLowestProb(·) masks the tokens in Ŵ with the M lowest scores.
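The inference procedure described above can be sketched as follows; `predict_fn` is a hypothetical stand-in for the BERT-CTC forward pass (it fills masked positions and returns a confidence score per token), and the mask-count schedule is a simple linear anneal:

```python
import numpy as np

def ctc_collapse(alignment, blank=0):
    """B_ctc: merge consecutive repeats, then drop blank symbols."""
    out, prev = [], None
    for a in alignment:
        if a != prev and a != blank:
            out.append(a)
        prev = a
    return out

def mask_predict_decode(frame_probs, predict_fn, K=20, mask_id=-1, blank=0):
    """frame_probs: (T, V) frame-level token posteriors from the audio encoder.
    predict_fn(tokens) -> (tokens, scores): stand-in for the BERT-CTC model."""
    alignment = frame_probs.argmax(axis=1)   # line 1: most probable alignment
    n = len(ctc_collapse(alignment, blank))  # line 3: target length estimate
    tokens = [mask_id] * n                   # line 4: all-mask initialization
    for k in range(1, K + 1):
        tokens, scores = predict_fn(tokens)  # fill in masked positions
        num_mask = int(n * (K - k) / K)      # anneal how many tokens to re-mask
        for i in np.argsort(scores)[:num_mask]:
            tokens[i] = mask_id              # line 28: mask lowest-scoring tokens
    return tokens

# Toy demo: 3 frames, vocab {0: blank, 1, 2}; a hypothetical oracle predictor.
probs = np.array([[0.1, 0.8, 0.1],
                  [0.8, 0.1, 0.1],
                  [0.1, 0.1, 0.8]])
oracle = lambda toks: ([1, 2], np.array([0.9, 0.9]))
print(mask_predict_decode(probs, oracle, K=2))  # [1, 2]
```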

B Intermediate CTC
Intermediate CTC (Tjandra et al., 2020) attaches an auxiliary CTC branch to an intermediate layer of the audio encoder, where AudioEnc^(e)(·) indicates the e-th layer output of the audio encoder. Similar to Eq. (5), the token emission probabilities at each time frame are computed from this intermediate output as p^(e)(a_t|O).

C Model Details

C.1 CTC (baseline)

We applied the intermediate CTC loss to the 6th layer of the audio encoder, calculated using the smaller-vocabulary sequence W′ in a hierarchical multi-tasking manner. With the intermediate loss, the CTC loss L_ctc is extended as

    L_ctc ← (1 − λ_ic) L_ctc + λ_ic L_ic,    (31)

where λ_ic is a tunable weight for the intermediate loss; we weighted the two losses equally (λ_ic = 0.5) as in (Higuchi et al., 2022).
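The hierarchical multi-task weighting can be sketched as a convex combination of the two CTC terms (an assumed but standard form for intermediate CTC; λ_ic = 0.5 weights them equally):

```python
def combined_ctc_loss(loss_ctc, loss_inter, lam_ic=0.5):
    """Convex combination of the final-layer CTC loss and the intermediate
    CTC loss from the 6th encoder layer (lam_ic = 0.5 weights them equally)."""
    return (1.0 - lam_ic) * loss_ctc + lam_ic * loss_inter

print(combined_ctc_loss(2.0, 4.0))  # 3.0
```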

C.2 RNN-T (baseline)
We applied auxiliary CTC losses to the final and intermediate layers of the audio encoder. As in Eq. (31), the intermediate loss was applied to the 6th layer. With the additional CTC losses, the RNN-T loss L_rnnt is extended by adding the CTC terms weighted by a tunable weight λ_ctc, and we set λ_ctc = 0.3 as in (Boyer et al., 2021). Note that all the CTC losses were calculated using the smaller-vocabulary sequence W′ in a hierarchical multi-tasking manner.

C.3 BERT-CTC (ours)
We applied auxiliary CTC losses to the final and intermediate layers of the audio encoder. As in Eq. (31), the intermediate loss was applied to the 6th layer. With the additional CTC losses, the BERT-CTC loss L_bc is extended by adding the CTC terms weighted by a tunable weight λ_ctc, and we set λ_ctc = 0.3. Note that all the CTC losses were calculated using the smaller-vocabulary sequence W′ in a hierarchical multi-tasking manner (as explained in §3.1).

D.2 Model Configuration
For the audio encoder network, we used a Conformer-based encoder architecture (Gulati et al., 2020) as implemented in ESPnet (Guo et al., 2021). The audio encoder consisted of 2 or 3 convolutional neural network (CNN) layers followed by a stack of 12 encoder blocks.

D.3 Tokenization
We used the same subword vocabulary as BERT for tokenizing target texts, where the vocabulary size |V| was 30522 for English and 21128 for Mandarin. For the smaller vocabulary V′ used in hierarchical CTC, we used SentencePiece (Kudo, 2018) to construct subword vocabularies from the transcriptions in each training set. Following the ESPnet recipes, the vocabulary size was set to 300 for LibriSpeech-100h and 500 for TED-LIUM2 and SLURP. For AISHELL-1, we used character-level tokenization with 4231 Chinese characters.

D.4 Training
All the models were implemented and trained using ESPnet (Watanabe et al., 2018) and PyTorch (Paszke et al., 2019). In Tables 8, 9, and 10, we summarize the training configurations for the CTC, RNN-T, and BERT-CTC models, respectively. We augmented speech data using speed perturbation (Ko et al., 2015) with a factor of 3 and SpecAugment (Park et al., 2019). For the SpecAugment hyperparameters, we set the numbers of frequency and time masks to 2 and 5, and the sizes of the frequency and time masks to 27 and 0.05T; note that the maximum size of a time mask depends on the utterance length T. After training, model parameters were averaged over the 10 checkpoints with the best validation performance. Depending on the task and number of epochs, CTC models were trained on a single RTX 2080 Ti GPU for 1 to 3 days, RNN-T models on 4 V100 GPUs for 5 to 7 days, and BERT-CTC models on a single RTX 2080 Ti GPU for 3 to 5 days.
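A minimal sketch of the SpecAugment masking with the stated hyperparameters (2 frequency masks of width at most 27 bins, 5 time masks of width at most 0.05T frames); this is an illustrative helper, not the ESPnet implementation:

```python
import numpy as np

def spec_augment(spec, n_freq_masks=2, n_time_masks=5,
                 freq_width=27, time_ratio=0.05, rng=None):
    """Zero out random frequency and time bands of a (T, F) log-mel spectrogram."""
    if rng is None:
        rng = np.random.default_rng(0)
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(n_freq_masks):
        w = rng.integers(0, freq_width + 1)   # mask width in frequency bins
        f0 = rng.integers(0, F - w + 1)       # random start bin
        spec[:, f0:f0 + w] = 0.0
    max_t = int(time_ratio * T)               # time-mask width scales with length
    for _ in range(n_time_masks):
        w = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, T - w + 1)
        spec[t0:t0 + w, :] = 0.0
    return spec

x = np.ones((1000, 80))  # a dummy 1000-frame, 80-bin spectrogram
y = spec_augment(x)
print(y.shape)
```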

D.5 Inference
RTF was measured using a single V100 GPU (with a batch size of 1) or a single Intel Xeon Gold 6148 CPU @ 2.40 GHz.
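For reference, the real-time factor is simply processing time divided by audio duration (RTF below 1 means faster than real time):

```python
def real_time_factor(decode_seconds, audio_seconds):
    """RTF = processing time / audio duration; RTF < 1 is faster than real time."""
    return decode_seconds / audio_seconds

print(real_time_factor(2.5, 10.0))  # 0.25
```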

E Comparison to Prior Works
AISHELL-1 Table 11 lists results on AISHELL-1, comparing our BERT-CTC with recent approaches that use a pre-trained acoustic model (AM) and/or LM. BERT-CTC achieved performance comparable to the state-of-the-art approach, Wav-BERT (Zheng et al., 2021), without using a pre-trained AM. Moreover, BERT-CTC has far fewer trainable parameters than the other models because it uses BERT only as contextual embedding (without fine-tuning). We attribute this advantage of BERT-CTC to our well-defined formulation for conditioning CTC training/inference on BERT knowledge.
Non-autoregressive End-to-End ASR Table 12 compares our BERT-CTC with previous non-autoregressive E2E-ASR models on LibriSpeech-100h and TED-LIUM2. It should be noted that we refer to (Higuchi et al., 2021a) for the prior results, and the comparison is not necessarily in an equivalent setting; e.g., we conducted experiments using ESPnet2 while the previous work used ESPnet1. Overall, BERT-CTC achieved better results than the other non-autoregressive models, thanks to its use of BERT. In particular, we observed clear differences on the LibriSpeech "other" sets and TED-LIUM2. However, the performance on the LibriSpeech "clean" sets was on par with the other approaches, which we attribute to the vocabulary mismatch problem discussed in the limitations section.