Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition

Unifying acoustic and linguistic representation learning has become increasingly crucial to transfer the knowledge learned on the abundance of high-resource language data for low-resource speech recognition. Existing approaches simply cascade pre-trained acoustic and language models to learn the transfer from speech to text. However, how to solve the representation discrepancy of speech and text is unexplored, which hinders the utilization of acoustic and linguistic information. Moreover, previous works simply replace the embedding layer of the pre-trained language model with the acoustic features, which may cause the catastrophic forgetting problem. In this work, we introduce Wav-BERT, a cooperative acoustic and linguistic representation learning method to fuse and utilize the contextual information of speech and text. Specifically, we unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework. A Representation Aggregation Module is designed to aggregate acoustic and linguistic representation, and an Embedding Attention Module is introduced to incorporate acoustic information into BERT, which can effectively facilitate the cooperation of two pre-trained models and thus boost the representation learning. Extensive experiments show that our Wav-BERT significantly outperforms the existing approaches and achieves state-of-the-art performance on low-resource speech recognition.


Introduction
However, in practice, unlike commonly used languages (e.g. English and Chinese) with sufficient training data, many other languages (e.g. Swahili, Tamil) have only low-resource data due to the scarcity of audio and the huge labor cost of transcription. As a result, the aforementioned data-driven mechanism is impractical for low-resource languages and suffers from unsatisfactory performance.
To resolve this learning difficulty in the low-resource domain, many efforts have been devoted to leveraging unlabeled data. One mainstream research paradigm is unsupervised pre-training, or representation learning, which has achieved great success in natural language processing (Devlin et al., 2018; Peters et al., 2018) and received increasing attention in speech recognition (Oord et al., 2018; Schneider et al., 2019a). As representative methods in this line, wav2vec (Schneider et al., 2019a) and wav2vec 2.0 (Baevski et al., 2020) apply unsupervised contrastive pre-training and show promising results. To utilize linguistic information, some works (Chiu and Shin et al., 2019) also build language models to rescore the N-best hypotheses generated by acoustic models. The most recent approach even cascaded the pre-trained wav2vec 2.0 and BERT into a single model for low-resource ASR.
However, two critical challenges remain in integrating the acoustic model and the language model to utilize the contextual information of speech and text. 1) Representation discrepancy: the acoustic model focuses more on local dependencies of the speech sequence, while the language model aims at capturing long-term semantic information of texts. An effective model is needed to fuse and leverage the two kinds of representation. 2) Embedding inconsistency: the language model applies a token embedding layer during pre-training, but previous methods simply replace the embedding layer with the features generated by the acoustic model, which may result in the catastrophic forgetting problem (Goodfellow et al., 2013).
To tackle the above challenges, in this work, we make the first attempt to successfully integrate the well-trained acoustic model and language model for low-resource speech recognition. Towards this end, we introduce a new framework that incorporates the two kinds of pre-trained models for cooperative acoustic and linguistic representation learning by exploiting complementary contextual information of both speech and text.
First, to solve representation discrepancy, unlike the previous works (Yu and Chen, 2021) that simply connect the acoustic model and the language model by treating them as an encoder and a decoder, we consider them as two encoders that provide two different representations. Specifically, we propose a Representation Aggregation Module, a plug-in component to better exploit and fuse the acoustic and linguistic information. We design and evaluate several representation aggregation mechanisms, including Gated Acoustic-Guided Attention, Gated Linguistic-Guided Attention, and Gated Cross-Modal Attention. The experimental results show that the proposed Gated Cross-Modal Attention is the most effective method for representation aggregation.
Second, to fill the gap of embedding inconsistency, we introduce an Embedding Attention Module to incorporate the acoustic features into BERT by a gated attention process, which not only preserves the capability of BERT but also takes advantage of acoustic information. Moreover, as BERT requires audio transcripts as input to create word embedding, it may be easy to overfit when using ground truth transcripts. On the other hand, it is also hard to converge when using transcripts predicted by the acoustic model. To facilitate the cooperation of the two encoders, we propose a sampling strategy with decay to randomly select the ground truth and generated transcripts for smooth training.
We adopt pre-trained wav2vec 2.0 (Baevski et al., 2020) and BERT (Devlin et al., 2018) as the encoders to provide acoustic and linguistic representations respectively for their flexible pre-training then fine-tuning paradigm as well as excellent local contextual modeling ability. Accordingly, we denominate our method as Wav-BERT.
We evaluate our method on several datasets with diverse languages from the public IARPA BABEL dataset (Gales et al., 2014) and the AISHELL-1 corpus (Bu et al., 2017). The experimental results demonstrate that our Wav-BERT significantly outperforms the existing approaches on low-resource ASR. Furthermore, our ablation studies demonstrate the effectiveness of the proposed mechanisms for cooperative acoustic and linguistic representation learning. We hope this work will be useful for the community in exploring different pre-trained models for low-resource ASR.
Related Work

Low resource speech recognition
To tackle the low-resource ASR task, transfer learning ASR (Kunze et al., 2017) and multilingual transfer learning ASR (Dalmia et al., 2018; Watanabe et al., 2017a; Toshniwal et al., 2018) have been explored, using different source languages to improve the performance of low-resource languages. Meta-learning approaches (Finn et al., 2017; Nichol et al., 2018) are also adopted for low-resource ASR (Hsu et al., 2020; Xiao et al., 2021) to obtain fast adaptation to new tasks with only a small amount of data by meta-learning a model initialization from training tasks. In addition, recent works utilize unsupervised pre-training (Schneider et al., 2019b; Chung and Glass, 2020) and semi-supervised learning (Kahn et al., 2020; Li et al., 2019) to exploit a large amount of unlabeled data to learn general representations for low-resource adaptation. Among them, wav2vec 2.0 (Baevski et al., 2020) achieved excellent results through self-supervised learning, which learns powerful contextual acoustic representations from a large speech audio corpus by solving contrastive tasks that require identifying the true quantized latent speech representations for masked time steps. It also demonstrates the feasibility of ultra-low-resource speech recognition with as little as 10 minutes of labeled data.

Speech recognition with BERT
To use the linguistic information from BERT (Devlin et al., 2018) for improving ASR performance, some works (Chiu and Shin et al., 2019; Wang and Cho, 2019) use BERT to rerank the N-best hypotheses generated by the ASR model. Besides, knowledge distillation (Futami et al., 2020) is explored to use BERT as a teacher model to guide ASR model training. Moreover, some recent works (Yu and Chen, 2021)

Figure 1: Comparison of the architectures of different approaches to fuse BERT into the ASR model. (a) Rescoring methods use BERT to rescore N-best hypotheses generated by wav2vec 2.0 ASR (Shin et al., 2019). (b) Cascade methods directly cascade the BERT decoder on top of the wav2vec 2.0 encoder through a Length Alignment module. (c) Adapter-BERT inserts adapter modules in each BERT layer.

Preliminaries
Here we briefly introduce the architectures of the acoustic and linguistic encoders in our framework.

Wav2vec 2.0. We adopt wav2vec 2.0 (Baevski et al., 2020) as our acoustic encoder because of its effectiveness and efficiency. It has two stages: (i) contrastive pre-training to learn representations of speech and (ii) fine-tuning to adapt the learned representations on labeled data with connectionist temporal classification (CTC) loss (Graves et al., 2006b) for downstream speech recognition tasks. In this work, we aim to utilize the public pre-trained model and mainly focus on the fine-tuning stage. The architecture of wav2vec 2.0 contains a feature encoder, a context network with a transformer, and a quantization module. During fine-tuning, the quantization module is removed and a randomly initialized linear projection layer is attached on top of the context network.

BERT. BERT (Devlin et al., 2018) is employed as our linguistic encoder since it is one of the most popular text pre-training approaches and has shown remarkable performance in many downstream natural language processing tasks. It also consists of two steps: (i) self-supervised pre-training to learn deep bidirectional linguistic representations from a large text corpus and (ii) fine-tuning to adapt to downstream tasks using labeled data. BERT consists of an embedding table, a multi-layer bidirectional Transformer encoder, and an additional output layer for fine-tuning.

Motivation
To transfer the knowledge learned on the abundance of high-resource language data to low-resource speech recognition, many efforts have been devoted to unifying acoustic and linguistic representation learning. We first categorize previous methods and then introduce our solution.
As shown in Figure 1 (a), the simplest way to fuse BERT into an acoustic model in speech recognition is rescoring (Chiu and Shin et al., 2019). It uses BERT as a language model to calculate the pseudo-log-likelihood scores of text sentences for reranking the N-best hypotheses generated by the acoustic model. However, this process is time-consuming, as it needs to iteratively mask each word in the sentence for inference and then sum up the scores of all masked words. It also requires tuning many hyper-parameters through repeated experiments, e.g. the beam size and the balancing weights of the language and acoustic models.
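The rescoring procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: `masked_logprob` is a hypothetical callable standing in for a BERT forward pass, the toy unigram table replaces a real masked language model, and the single interpolation weight `lm_weight` is one possible way to balance the acoustic and language-model scores. The loop makes the cost visible: one masked-LM evaluation per token per hypothesis.

```python
import math
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"
# Hypothetical scorer signature (an assumption for this sketch):
# (masked tokens, masked position, original token) -> log-probability.
MaskedScorer = Callable[[List[str], int, str], float]


def pseudo_log_likelihood(tokens: Sequence[str], masked_logprob: MaskedScorer) -> float:
    """Mask each token once and sum the log-probabilities of the original tokens."""
    total = 0.0
    for position, target in enumerate(tokens):
        masked = list(tokens)
        masked[position] = MASK
        total += masked_logprob(masked, position, target)
    return total


def rerank(hypotheses: Sequence[Tuple[List[str], float]],
           masked_logprob: MaskedScorer,
           lm_weight: float = 0.5) -> List[str]:
    """Return the hypothesis with the best combined acoustic + language-model score."""
    def combined(item: Tuple[List[str], float]) -> float:
        tokens, acoustic_logprob = item
        return acoustic_logprob + lm_weight * pseudo_log_likelihood(tokens, masked_logprob)
    return max(hypotheses, key=combined)[0]


# Toy stand-in for a masked language model: a context-independent unigram table.
TOY_UNIGRAM = {"the": 0.5, "cat": 0.3, "sat": 0.2}


def toy_masked_logprob(masked: List[str], position: int, target: str) -> float:
    return math.log(TOY_UNIGRAM.get(target, 0.01))


# Each entry is (tokens, acoustic log-probability); the values are made up.
n_best = [(["the", "cap", "sat"], -2.9), (["the", "cat", "sat"], -3.0)]
best = rerank(n_best, toy_masked_logprob)
```

In this toy run the language-model term outweighs the small acoustic advantage of the first hypothesis, so the second one is selected.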
Recently, some works (Yu and Chen, 2021) directly cascade the BERT decoder on top of the acoustic encoder, as illustrated in Figure 1 (b). However, such a simple cascade often fails to fuse the contextual information of speech and text well.
Inspired by AB-Net, we design Adapter-BERT that inserts cross-attention adapters in each BERT layer with the Mask-Predict algorithm (Ghazvininejad et al., 2019) to fully utilize the bidirectional information of the input sequence, as shown in Figure 1 (c). Nevertheless, the adapters in each layer of BERT will affect the pre-trained parameters of BERT, causing catastrophic forgetting. Moreover, the Mask-Predict decoding suffers from low inference speed.

[Figure 2: architecture of Wav-BERT. Block labels recovered from the diagram: Sampling with decay, Multi-Head Attention, Feed forward, Gate & Add.]
To solve the representation discrepancy and embedding inconsistency between speech and text, in this work, we introduce Wav-BERT, a cooperative acoustic and linguistic learning framework that fuses and leverages the contextual information of speech and text from the representation level to the embedding level, as shown in Figure 1 (d). We first present an independent Representation Aggregation Module for acoustic and linguistic representation aggregation, which is not inserted into any pre-trained model so as to avoid destroying the pre-trained parameters. Then, an Embedding Attention Module is introduced to better combine the acoustic and linguistic embeddings instead of simply replacing one with the other.

Our Wav-BERT
The architecture of our Wav-BERT is illustrated in Figure 2. Specifically, the wav2vec 2.0 encoder takes the raw waveform $X$ as input and outputs the acoustic representation $H_A$, which is then fed into both a linear projection layer with CTC loss (Graves et al., 2006b) ($L_{ctc_1}$) and the Representation Aggregation Module. For the input of the BERT encoder, we employ the "Sampling with Decay" mechanism to sample from the masked ground truth $Y_r$ or the wav2vec 2.0 CTC output $Y_{CTC_1}$ with probability $p$ and $1 - p$ respectively, so as to narrow the gap between training and inference. Next, the word embedding $E$ and the acoustic embedding $H_A$ are fed into the Gated Attention to model the conditional information from the wav2vec 2.0 encoder side. Through the subsequent BERT transformer layers, we get the linguistic representation $H_L$. Finally, the Representation Aggregation Module takes the linguistic representation $H_L$ as well as the acoustic representation $H_A$ as input, generating the CTC output $Y_{CTC_2}$ and the cross-entropy (CE) output $Y_{CE}$, supervised by the CTC ($L_{ctc_2}$) and CE ($L_{ce}$) criteria respectively. Simultaneously, the conditional masked language model (CMLM) objective ($L_{cmlm}$) is also attached to the BERT encoder, followed by a feed-forward layer, to supervise the BERT output $Y_m$. Overall, the objective of our framework is defined as:

$$L = \mu_1 L_{ctc_1} + \mu_2 L_{ctc_2} + \mu_3 L_{ce} + \mu_4 L_{cmlm}$$

where $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ are the corresponding loss weights.

Representation Aggregation Module
To solve representation discrepancy, we first design several representation aggregation mechanisms, such as Gated Acoustic-Guided Attention and Gated Linguistic-Guided Attention. In our Representation Aggregation Module, we combine a Gated Acoustic-Guided Attention (left) and a Gated Linguistic-Guided Attention (right) to construct a Gated Cross-Modal Attention for better exploiting and aggregating the acoustic and linguistic representations.
Specifically, the Gated Cross-Modal Attention Module takes the acoustic representation $H_A$ generated by wav2vec 2.0 and the linguistic representation $H_L$ generated by BERT as input, and feeds them as the query, key, and value vectors to multi-head attention in both directions, which can be formulated as:

$$C_A = \mathrm{MultiHead}(Q_{H_A}, K_{H_L}, V_{H_L})$$
$$C_L = \mathrm{MultiHead}(Q_{H_L}, K_{H_A}, V_{H_A})$$

where $Q_{H_A}$ means passing $H_A$ as the query vector, and $K_{H_L}$ and $V_{H_L}$ mean passing $H_L$ as the key and value vectors respectively. $C_A$ is the acoustic-guided context feature generated by attention, which tends to focus on the values in the linguistic representation $H_L$ related to the acoustic representation $H_A$. Vice versa, $C_L$ is the linguistic-guided context feature, which focuses on the values in $H_A$ related to $H_L$.

Next, the context feature $C_A$ and the acoustic representation $H_A$ are fed into a gated weighting layer, with model parameters $W_1$ and $B_1$ and gated weight $\Phi_A$, which automatically captures the most important information between the context and the acoustic representation and generates the acoustic-guided linguistic representation $H_{AGL}$. Similarly, the context feature $C_L$ and the linguistic representation $H_L$ are fed into another gated weighting layer, with model parameters $W_2$ and $B_2$, which weighs their expected importance through the gated weight $\Phi_L$ and generates the linguistic-guided acoustic representation $H_{LGA}$.

We then feed $H_{AGL}$ and $H_{LGA}$ each to a feed-forward layer followed by a residual connection to obtain the aggregated acoustic and linguistic representations. Finally, two linear projection layers are attached on top of the Representation Aggregation Module to get $Y_{CTC_2}$ and $Y_{CE}$. As the sequence length of $Y_{CTC_2}$ is determined by the acoustic representation $H_A$, we use the CTC criterion to align the acoustic frames of $Y_{CTC_2}$ to the ground truth tokens. On the other hand, the sequence length of $Y_{CE}$ is determined by the linguistic representation $H_L$, so we use the CE criterion to align the text sequence of $Y_{CE}$ to the ground truth transcript.
The different aggregation mechanisms, including Gated Acoustic-Guided Attention, Gated Linguistic-Guided Attention and Gated Cross-Modal Attention, are evaluated and compared in Table 3.
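To make the data flow of the Gated Cross-Modal Attention concrete, the following pure-Python sketch traces both attention directions and the two gates. Several parts are assumptions rather than the paper's exact formulation: attention is single-head and without learned projections, and the gate is taken to be a sigmoid over the concatenation of context and hidden vectors that interpolates between them. The feed-forward layers, residual connections and output projections are omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Matrix, key: Matrix, value: Matrix) -> Matrix:
    """Single-head scaled dot-product attention (no learned projections)."""
    dim = len(query[0])
    output = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in key]
        weights = softmax(scores)
        output.append([sum(w * v[c] for w, v in zip(weights, value))
                       for c in range(len(value[0]))])
    return output


def gated_fuse(context: Matrix, hidden: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Assumed gate: phi = sigmoid(W [c; h] + b); output = phi * h + (1 - phi) * c."""
    fused = []
    for c_row, h_row in zip(context, hidden):
        concat = c_row + h_row
        gate = [1.0 / (1.0 + math.exp(-(sum(w * x for w, x in zip(w_row, concat)) + b)))
                for w_row, b in zip(weight, bias)]
        fused.append([g * h + (1.0 - g) * c for g, h, c in zip(gate, h_row, c_row)])
    return fused


def gated_cross_modal_attention(h_a: Matrix, h_l: Matrix,
                                w1: Matrix, b1: List[float],
                                w2: Matrix, b2: List[float]) -> Tuple[Matrix, Matrix]:
    c_a = attention(h_a, h_l, h_l)  # acoustic queries over linguistic keys/values
    c_l = attention(h_l, h_a, h_a)  # linguistic queries over acoustic keys/values
    h_agl = gated_fuse(c_a, h_a, w1, b1)  # one row per acoustic frame
    h_lga = gated_fuse(c_l, h_l, w2, b2)  # one row per text token
    return h_agl, h_lga


# Toy inputs: 4 acoustic frames and 3 text tokens, both of dimension 2.
acoustic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
linguistic = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zero_weight = [[0.0] * 4 for _ in range(2)]
agl, lga = gated_cross_modal_attention(acoustic, linguistic,
                                       zero_weight, [0.0, 0.0],
                                       zero_weight, [0.0, 0.0])
```

The output lengths mirror the text: the acoustic-side output keeps one row per acoustic frame (hence the CTC criterion), while the linguistic-side output keeps one row per text token (hence the CE criterion).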

Embedding Attention Module
Recent works (Yu and Chen, 2021) directly connect BERT on top of the acoustic encoder and simply replace the embedding layer with the acoustic features generated by the acoustic encoder, causing the catastrophic forgetting problem.
To fill the gap of embedding inconsistency, we propose the Embedding Attention Module and insert it after the embedding layer of BERT to incorporate the acoustic information into the word embedding instead of simply replacing it. We first introduce a Gated Attention operation in this module. As shown in Figure 2, the word embedding $E$ generated by the embedding layer is fed to a self-attention layer followed by a feed-forward layer to capture a higher-level linguistic embedding $E_L$. Then, a multi-head attention followed by a gated weighting layer takes $E_L$ as the query vector and the acoustic embedding $H_A$ generated by wav2vec 2.0 as the key and value vectors to fuse the linguistic embedding and the acoustic embedding. Thus, as a conditional masked language model, BERT can learn to predict the masked word conditioned on the acoustic information and provide an enhanced linguistic representation.
Furthermore, for the input of the embedding layer of BERT, it is easy to overfit when using ground truth transcripts, while it is hard to converge when using transcripts predicted by the wav2vec 2.0 encoder. To solve this issue, we propose a "Sampling with Decay" mechanism that feeds BERT either the masked ground truth transcript $Y_r$ or the predicted CTC result $Y_{CTC_1}$ with a certain probability during training. The probability $p$ of selecting $Y_r$ decreases linearly as the number of training steps increases. Through the Embedding Attention Module with the "Sampling with Decay" mechanism, we further integrate the acoustic and linguistic information at the embedding level to facilitate better fusion between the wav2vec 2.0 encoder and the BERT encoder. Table 4 verifies the effectiveness of each component of our proposed Embedding Attention Module.
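The "Sampling with Decay" schedule can be sketched as follows. The linear decay and the 90% to 10% range come from the text and the implementation details; holding $p$ constant before the decay window starts and after it ends, and drawing one sample per utterance, are assumptions of this sketch. The function and parameter names are illustrative only.

```python
import random
from typing import List, Optional


def ground_truth_probability(step: int, start_step: int, end_step: int,
                             p_start: float = 0.9, p_end: float = 0.1) -> float:
    """Probability p of feeding BERT the masked ground truth at a given training step."""
    if step <= start_step:   # assumption: p is held at p_start before the decay window
        return p_start
    if step >= end_step:     # assumption: p is held at p_end after the decay window
        return p_end
    fraction = (step - start_step) / (end_step - start_step)
    return p_start + fraction * (p_end - p_start)


def choose_bert_input(masked_ground_truth: List[str], ctc_prediction: List[str],
                      step: int, start_step: int, end_step: int,
                      rng: Optional[random.Random] = None) -> List[str]:
    """Pick the masked ground truth with probability p, otherwise the CTC prediction."""
    rng = rng or random.Random()
    p = ground_truth_probability(step, start_step, end_step)
    return masked_ground_truth if rng.random() < p else ctc_prediction


# Example with the BABEL decay window reported in the implementation details.
p_mid = ground_truth_probability(150_000, 100_000, 200_000)
```

Early in training BERT mostly sees (masked) reference text, and by the end of the window it mostly sees the acoustic model's own predictions, which is the condition it faces at inference.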

Inference
For inference, we first feed the result $Y_{CTC_1}$ into the BERT encoder, and then select the output with higher confidence from the two outputs $Y_{CTC_2}$ and $Y_{CE}$ as our final output.
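A minimal sketch of this selection step is given below. The text does not specify how confidence is measured, so the mean of the per-position maximum probabilities is used here purely as an assumption; greedy decoding and the choice of index 0 as the CTC blank are likewise illustrative.

```python
from typing import List, Tuple

Probs = List[List[float]]  # one probability distribution over the vocabulary per position


def _argmax(row: List[float]) -> int:
    return max(range(len(row)), key=row.__getitem__)


def _mean_max(probs: Probs) -> float:
    """Assumed confidence measure: average of the highest probability at each position."""
    return sum(max(row) for row in probs) / len(probs) if probs else 0.0


def ctc_greedy_decode(frame_probs: Probs, blank: int = 0) -> Tuple[List[int], float]:
    """Collapse repeated frame labels, drop blanks, and report the confidence."""
    tokens: List[int] = []
    previous = None
    for row in frame_probs:
        index = _argmax(row)
        if index != previous and index != blank:
            tokens.append(index)
        previous = index
    return tokens, _mean_max(frame_probs)


def ce_greedy_decode(token_probs: Probs) -> Tuple[List[int], float]:
    """Take the most likely token at each text position and report the confidence."""
    return [_argmax(row) for row in token_probs], _mean_max(token_probs)


def select_output(ctc_probs: Probs, ce_probs: Probs, blank: int = 0) -> List[int]:
    """Return whichever of the two decoded outputs has the higher confidence."""
    ctc_tokens, ctc_confidence = ctc_greedy_decode(ctc_probs, blank)
    ce_tokens, ce_confidence = ce_greedy_decode(ce_probs)
    return ctc_tokens if ctc_confidence >= ce_confidence else ce_tokens


# Toy distributions over a vocabulary of size 3 (index 0 is the blank).
ctc_probs = [[0.1, 0.9, 0.0], [0.1, 0.9, 0.0], [0.8, 0.1, 0.1], [0.2, 0.1, 0.7]]
ce_probs = [[0.0, 0.6, 0.4], [0.0, 0.7, 0.3]]
final = select_output(ctc_probs, ce_probs)
```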

Experiments
In this section, we first describe the implementation details of our Wav-BERT. Then we introduce two low-resource speech recognition datasets containing several languages, as well as the comparison results between our approach and baseline methods. Furthermore, we conduct ablation studies to validate the effectiveness of each main component of our Wav-BERT and present some case studies for perceptual comparison.

Implementation Details. For our proposed Representation Aggregation Module and Embedding Attention Module, the number of heads and the embedding dimension of all multi-head attention layers are set to 8 and 768 respectively. Meanwhile, the inner-layer dimension of the position-wise feed-forward layer is set to 2048. Regarding optimization details, we train our model as well as the baselines based on wav2vec 2.0 Base for 200K steps with one GeForce RTX 3090 GPU, setting max tokens and update frequency to 640000 and 4 respectively. As for experiments using XLSR-53 (Conneau et al., 2020), three GeForce RTX 3090 GPUs are used, with max tokens set to 480000 and update frequency set to 4. We use the three-stage learning rate policy with an initial learning rate of 5e-5, and set the stage ratios to 0.05, 0.45 and 0.5. Besides, we set the weights $\mu_1$, $\mu_2$, $\mu_3$ and $\mu_4$ for each loss to 0.5 for training. Other optimizer settings are the same as in wav2vec 2.0 (Baevski et al., 2020). In terms of the "Sampling with Decay" policy, the decay runs from step 100K to step 200K for the languages in IARPA BABEL and from step 40K to step 100K for AISHELL-1, in both cases with $p$ decreasing from 90% to 10%.

Datasets. IARPA BABEL (Gales et al., 2014) is an open-source multilingual corpus of conversational telephone speech. For low-resource evaluation, we randomly select three languages with little data: Swahili (Sw), Tamil (Ta) and Vietnamese (Vi). We adopt the same setup as Conneau et al. (2020) and use the dev folder of the BABEL dataset as our test set since the "eval" data are not released. We re-sample the audio of all languages to 16kHz. AISHELL-1 (Bu et al., 2017) is an open-source, high-quality Mandarin speech corpus that is widely used in the speech community and contains 178 hours of Mandarin speech data. Although the data is in Chinese, a commonly used language, the quantity is small. Thus, it can also be used to verify our Wav-BERT on low-resource data. Moreover, many recent state-of-the-art methods report results on this dataset and can be compared against.
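The three-stage learning rate policy with stage ratios 0.05, 0.45 and 0.5 can be sketched as a warm-up, hold and decay schedule. This is only one plausible reading: treating 5e-5 as the plateau value, using a linear warm-up and an exponential decay, and the two scale factors `init_scale` and `final_scale` are all assumptions of this sketch, not values given in the text.

```python
import math
from typing import Tuple


def tri_stage_lr(step: int, total_steps: int, peak_lr: float = 5e-5,
                 ratios: Tuple[float, float, float] = (0.05, 0.45, 0.5),
                 init_scale: float = 0.01, final_scale: float = 0.05) -> float:
    """Warm up linearly, hold at peak_lr, then decay exponentially to final_scale * peak_lr."""
    warmup_steps = round(total_steps * ratios[0])
    hold_steps = round(total_steps * ratios[1])
    decay_steps = total_steps - warmup_steps - hold_steps

    if step < warmup_steps:
        init_lr = init_scale * peak_lr  # hypothetical starting value
        return init_lr + (peak_lr - init_lr) * step / warmup_steps
    if step < warmup_steps + hold_steps:
        return peak_lr
    progress = min(step - warmup_steps - hold_steps, decay_steps) / decay_steps
    return peak_lr * math.exp(math.log(final_scale) * progress)


# With 200K total steps the stages last 10K, 90K and 100K steps respectively.
schedule = [tri_stage_lr(s, 200_000) for s in (0, 10_000, 100_000, 200_000)]
```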
For a fair comparison, we use the official wav2vec 2.0 (Base/Large) model, XLSR-53, and mBERT models as the initial encoders. All model checkpoint download links are described in the appendix. Table 1 reports the results on IARPA BABEL in terms of character error rate (CER), where our Wav-BERT achieves state-of-the-art performance on all low-resource languages. Comparing the results, we make several observations. First, the methods without pre-training perform poorly, which indicates that conventional end-to-end models are impractical for low-resource languages due to the limited data. Second, pre-trained models like wav2vec 2.0 and XLSR largely improve the recognition accuracy thanks to the powerful acoustic representations learned from the huge amount of high-resource language data. Third, other methods additionally utilize a pre-trained language model like mBERT, yet their results change only slightly or even become worse. One of the reasons is that the methods that construct adapters in BERT (Adapter-BERT) or simply combine BERT with wav2vec 2.0 (w2v-cif-bert) inevitably suffer from the embedding inconsistency problem and fail to make the best use of the pre-trained linguistic representation.

Table 2: Comparison results on AISHELL-1 in terms of CER (%).

Method                                               Pre-trained model      Dev   Test
(Karita et al., 2019)                                -                      6.0   6.7
SA-T (Tian et al., 2019)                             -                      8.3   9.3
SAN-M (Gao et al., 2020)                             -                      5.7   6.5
CAT (An et al., 2019)                                -                      -     6.3
LFML                                                 -                      6.2   6.7
LASO (Bai et al., 2021)                              -                      5.9   6.9
NAR-Transformer (Song et al., 2020)                  -                      5.6   6.3
Wenet                                                -                      -     4.7
LASO with BERT (Bai et al., 2021)                    BERT                   5.3   6.1
NAR-BERT-ASR (Yu and Chen, 2021)                     BERT                   4.9   5.5
wav2vec 2.0 (Baevski et al., 2020)                   wav2vec 2.0            7.9   8.4
wav2vec 2.0 (cn) (Baevski et al., 2020)              wav2vec 2.0            5.2   5.8
wav2vec 2.0 (cn) w/ 4-gram (Baevski et al., 2020)    wav2vec 2.0            4.5   4.9
BERT rescoring (Shin et al., 2019)                   wav2vec 2.0 w/ BERT    4.2   4.5
Adapter-BERT                                         wav2vec 2.0 w/ BERT    6.9   7.3
w2v-cif-bert                                         wav2vec 2.0 w/ BERT    5.6   6.3
our Wav-BERT w/ wav2vec 2.0                          wav2vec 2.0 w/ BERT    3.8   4.0
our Wav-BERT w/ wav2vec 2.0 (cn)                     wav2vec 2.0 w/ BERT    3.6   3.8
As for our Wav-BERT, it effectively facilitates the cooperation of the pre-trained acoustic and language models through the proposed fusion modules from the representation level to the embedding level. As a result, it consistently improves the ASR results for different low-resource languages. Moreover, when the pre-trained model (e.g. wav2vec 2.0) becomes larger, the performance of our Wav-BERT also improves, although tuning the whole model then requires more GPU resources.

Table 2 reports the comparison results on AISHELL-1. In addition to the baselines mentioned above, we also report more recent works for comparison. The data quantity of this dataset is larger than that of IARPA BABEL, so all the methods perform much better. This also explains why the performance gap between the methods with pre-trained models and those without becomes small. Among the methods without pre-trained models, Wenet achieves the best results due to its advanced CTC-Conformer (Graves et al., 2006a; Gulati et al., 2020) architecture, better attention rescoring decoding strategy and larger number of training epochs. With the pre-trained BERT language model, NAR-BERT-ASR (Yu and Chen, 2021) stacks a decoder initialized by a pre-trained BERT model on top of the transformer encoder and achieves competitive results on AISHELL-1. Regarding methods using a pre-trained acoustic model, the official wav2vec 2.0 Base model, pre-trained on 960 hours of the Librispeech corpus, achieves strong results since the model has learned good representations of speech. Furthermore, we also collect and use 1960 hours of public Mandarin speech data to pre-train a wav2vec 2.0 (cn) model, which obtains better performance on the AISHELL-1 evaluation. In conclusion, our Wav-BERT not only improves the performance of both the wav2vec 2.0 and wav2vec 2.0 (cn) models, but also outperforms other state-of-the-art methods unifying wav2vec 2.0 and BERT. This further demonstrates the generalization of Wav-BERT across low-resource ASR datasets with different data sizes.
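For reference, the character error rate used in these comparisons is the character-level edit distance between hypothesis and reference divided by the number of reference characters. The sketch below is a generic implementation under the assumption that every character of the given strings counts (no text normalization); it is not the authors' evaluation script.

```python
from typing import Sequence


def edit_distance(reference: Sequence, hypothesis: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                             # deletion
                current[j - 1] + 1,                          # insertion
                previous[j - 1] + (ref_item != hyp_item),    # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total character errors over total reference characters."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total = sum(len(r) for r in references)
    return errors / total


# One substituted character out of five reference characters.
example_cer = cer(["abc", "de"], ["abc", "dx"])
```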

Comparison of model fusion methods
As illustrated in Section 4.1, there are many different methods to fuse the pre-trained wav2vec 2.0 and BERT. We compare our Wav-BERT with these methods and report the results in Table 1 and Table 2. First, by using BERT to rescore the N-best hypotheses generated by wav2vec 2.0 with CTC beam search, rescoring (Shin et al., 2019) (Figure 1 (a)) is slightly better than wav2vec 2.0, but its inference process is time-consuming. Second, w2v-cif-bert uses CIF to connect wav2vec 2.0 and BERT in a cascade and replaces the word embedding with the acoustic embedding as the input of BERT. It is better than wav2vec 2.0 on AISHELL-1 but worse on BABEL, because mBERT is not as well trained as the bert-base-chinese model, resulting in a more severe catastrophic forgetting problem after replacing its input. Third, Adapter-BERT, which inserts adapter modules into each BERT layer and tunes them on the training data, shows only a marginal improvement or even a performance degradation, since the insertion of adapters affects the pre-trained representation of BERT. Finally, our Wav-BERT significantly surpasses the other methods, which indicates that our model can effectively exploit the acoustic and linguistic information through multi-level hierarchical fusion. Besides, our cooperative learning method also helps the pre-trained encoders avoid catastrophic forgetting of pre-training information, so that the whole model can converge faster and better.

Table 5: Predicted examples on the AISHELL-1 test set generated by wav2vec 2.0, BERT rescoring, w2v-cif-bert and our Wav-BERT. The differing words are marked with their pronunciation. The wrong words are marked in red. The translations of the sentences are also provided.

Method: Predicted example (translation)
wav2vec 2.0 (Baevski et al., 2020): Wenzhou aunt Nian and banpai pretended to be their daughter and got married successfully.
BERT rescoring (Shin et al., 2019): More than half of Wenzhou's old aunt pretended to be her daughter and successfully cheated many young people into marriage.
w2v-cif-bert: Wenzhou aunt year and half a hundred pretending to be daughters have successfully cheated into marriage, and there are many young people.
our Wav-BERT: Wenzhou aunt is more than half a hundred years old, pretending to be her daughter, and has successfully cheated many young people into marriage.

Representation Aggregation Module
To investigate the effectiveness of our Representation Aggregation Module, we present results for Gated Linguistic-Guided Attention, Gated Acoustic-Guided Attention, and the removal of gated weighting in Table 3. We find that the effect of gated weighting, while small, is present: it automatically measures the importance of the acoustic and linguistic representations while aggregating them. Compared with Gated Cross-Modal Attention, Gated Acoustic-Guided Attention and Gated Linguistic-Guided Attention increase the average CER by 0.6% and 3.5% respectively. This indicates that the attention in each direction plays an important role in our Representation Aggregation Module, while Gated Acoustic-Guided Attention makes the greater contribution since the speech recognition task is more dependent on acoustic information.

Embedding Attention Module
The results in Table 4 further verify the effectiveness of our Embedding Attention Module. First, we report the result of Embedding Replacement, which simply replaces the original word embedding with the acoustic embedding as the input of BERT, like previous works (Yu and Chen, 2021). As expected, the performance is poor, especially on AISHELL-1, which indicates that such simple replacement methods are affected by the embedding inconsistency problem. In contrast, we address this challenge with the proposed Embedding Attention Module, including the sampling mechanism and Gated Attention, so that the performance is largely improved. Second, when turning off "Sampling with Decay" or Gated Attention, the average CER increases by 1.9% and 0.6% respectively. This demonstrates that the "Sampling with Decay" mechanism effectively alleviates the embedding inconsistency of BERT between inference and training. Moreover, the Gated Attention effectively provides additional acoustic information to the input of BERT, helping it capture a more reliable linguistic representation.

Case Studies
We further present some case studies in Table 5 to illustrate the importance of acoustic and linguistic information for speech recognition. We provide some transcript examples obtained from the baseline methods and our Wav-BERT with the same input from the AISHELL-1 test set. The pronunciations of the keywords and the English translation of the whole sentence are also provided. As can be observed, all the baseline methods predict one or two wrong words whose pronunciation is similar to that of the correct words, which leads to an unreasonable sentence. In contrast, thanks to the cooperative learning of acoustic and linguistic information, our Wav-BERT successfully recognizes the whole sentence without any word error.

Conclusion
In this work, based on the powerful wav2vec 2.0 and BERT models, we introduce cooperative acoustic and linguistic representation learning for low-resource speech recognition. To solve the representation discrepancy and embedding inconsistency challenges, we design a Representation Aggregation Module and an Embedding Attention Module to facilitate the cooperation of the two pre-trained models and thus boost the representation learning.
Extensive experimental results demonstrate that our proposed Wav-BERT can significantly improve low-resource ASR performance in different languages. In future work, we will investigate more effective modules to infuse more types of knowledge, and apply our framework to more pre-trained models to promote the development of low-resource speech tasks.

A Datasets
Both the IARPA BABEL dataset (Gales et al., 2014) and AISHELL-1 (Bu et al., 2017) are open-source, high-quality speech datasets that are widely used in the speech community. Among them, AISHELL-1 can be downloaded for free here 1. For each speaker in it, around 360 utterances (about 26 minutes of speech) are released. Table 6 provides a summary of all subsets in the corpus. As for IARPA BABEL, it can be purchased through LDC 2 (e.g. Vietnamese Language Pack 3). Table 7 summarizes the amount of data in hours for the languages used in our experiments under the "Full Language Pack" (FLP) condition. Researchers can easily reproduce or compare against our results with the same languages.

C Pre-trained Models
We use different pre-trained acoustic and language models in our experiments described in Sec 5. All of them are open-source except the wav2vec 2.0 (Baevski et al., 2020) model that we pre-trained on Chinese ourselves. For pre-trained language models, the bert-base-chinese model can be downloaded here 4, and the multilingual mBERT can be downloaded here 5. For pre-trained acoustic models, the official wav2vec 2.0 pre-trained on English can be downloaded here 6, and the XLSR-53 (Conneau et al., 2020)

D Baselines
We describe some baseline methods below, which are reproduced by ourselves or experimented with the open-source code.
gram model is trained on the transcripts in the training set of each language, using the KenLM (Heafield, 2011) framework. The beam size for beam search is set to 50.
2. BERT rescoring (Chiu and Shin et al., 2019): For each language, the results from the trained wav2vec 2.0 model with beam search are rescored by the fine-tuned language model (mBERT or the bert-base-chinese model). Specifically, the linguistic decoder is fine-tuned on the transcripts in the training set of each language using the masked language model (MLM) objective (Devlin et al., 2018) of BERT. In the rescoring stage, we mask each word in the sentence one at a time, then sum the log-likelihoods of the masked words over all masked input instances. Finally, we rescore the sentence with the likelihoods from both the acoustic and language models. Besides, since this procedure is time-consuming, the beam size for beam search is set to 5.
3. Adapter-BERT: This method is inspired by AB-Net. Cross-attention adapters are inserted into each BERT layer to unify the wav2vec 2.0 and BERT models. The output of the feed-forward layer at the end of BERT is supervised by the cross-entropy criterion. In inference, the Mask-Predict algorithm (Ghazvininejad et al., 2019) is adopted.
4. Embedding Replacement: Inspired by previous work (Yu and Chen, 2021), we use a similar architecture but replace the acoustic encoder with wav2vec 2.0 and keep our Representation Aggregation Module. We use position embeddings as the query vector and the acoustic representation from wav2vec 2.0 as the key and value vectors of an attention block followed by 3 self-attention blocks, the same as in (Yu and Chen, 2021), generating the aligned acoustic representation $H_{pos}$. Then $H_{pos}$ is used as the input of BERT, replacing the word embedding. Finally, the Representation Aggregation Module takes both $H_{pos}$ and the linguistic representation from BERT as input, just as in our Wav-BERT. It is worth mentioning that the length of the position embedding is set to 60, since a larger value costs too much GPU memory.