Multilingual Speech Translation from Efficient Finetuning of Pretrained Models

We present a simple yet effective approach to build multilingual speech-to-text (ST) translation through efficient transfer learning from a pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer by finetuning only 10 ~ 50% of the pretrained parameters. This effectively leverages large pretrained models, such as wav2vec 2.0 for acoustic modeling and mBART for multilingual text generation, at low training cost. Our approach sets a new state-of-the-art for 36 translation directions (surpassing cascaded ST on 26 of them) on the large-scale multilingual ST benchmark CoVoST 2 (+6.4 BLEU on average for En-X directions and +6.7 BLEU for X-En directions). It also demonstrates strong zero-shot performance in a many-to-many multilingual model (+5.6 BLEU on average across 28 non-English directions), making it an appealing approach for attaining high-quality speech translation with improved parameter and data efficiency.


Introduction
Recent advances in pretraining over unlabeled data and then finetuning on labeled data have led to significant performance improvements in text understanding and generation tasks (Devlin et al., 2019; Radford, 2018). Lately, such pretraining and finetuning paradigms have been extended to other modalities: audio (Schneider et al., 2019; Baevski et al., 2020), images (Su et al., 2019), and video (Sun et al., 2019). At the same time, pretraining and finetuning techniques have significantly improved multitasking applications such as multilingual translation, cross-lingual representation learning, and question answering (Raffel et al., 2020). In this paper, we advance the one-model-for-all paradigm further by adapting audio and multilingual text pretraining and finetuning to improve multilingual speech-to-text translation. Our contributions are as follows:
• We propose a simple and effective approach to combine pretrained single-modality modules to perform speech-to-text translation. With minimal architecture change, we add a cross-modal adaptor to bridge the length discrepancy between audio encoder output and text decoder input. Our approach can also perform multi-task finetuning with both speech-to-text and text-to-text translation tasks, where we find that joint training with the latter brings further gains.
• We present an efficient transfer learning strategy of finetuning only the LayerNorm and Attention (LNA) parameters of pretrained models. This approach is not only parameter- and data-efficient but also effective for zero-shot crosslingual transfer to unseen languages (train on A → B, test on A → C and C → B).
• Our approach is also effective for zero-shot multilingual translation (train on A → B and B → C, test on A → C), which provides an efficient path to many-to-many speech-to-text translation without requiring parallel data for every direction.
• Using a pretrained audio encoder (wav2vec 2.0 (Baevski et al., 2020)) and a pretrained multilingual text decoder (mBART), this approach sets a new state-of-the-art (SOTA) on two large-scale speech translation benchmarks. On CoVoST 2 (Wang et al., 2020b), we push the end-to-end SOTA for all 21 X-En directions (+6.7 BLEU on average) and 15 En-X directions (+6.4 BLEU on average) by finetuning only 10 ∼ 50% of parameters. Similarly, on Europarl (Iranzo-Sánchez et al., 2020), our zero-shot multilingual many-to-many model is not only data-efficient but also brings +5.7 BLEU (on average) when translating 18 non-English directions, compared to a many-to-many model trained on 1.6× the training data covering all pairwise (both to/from English and non-English) directions.
We describe our approach in Section 2, namely the pretrained models, length adaptor, LNA finetuning, and joint speech-text finetuning, as illustrated in Figure 1. Experimental setup and results are elaborated in Section 3 and Section 4. Section 5 provides ablation studies of the proposed finetuning strategy.

Pretrained Modules
Our model leverages a pretrained wav2vec 2.0 (Baevski et al., 2020) as the encoder for acoustic modeling and a pretrained multilingual BART (mBART) as the decoder for language modeling. Both models are pretrained on unlabelled data via self-supervised learning. We provide an overview of the pretraining procedures in Appendix A.1.

Length Adaptor
We add a lightweight adaptor module between the encoder and decoder to better align the two modules pretrained on different modalities. The adaptor performs projection and downsampling to alleviate the length inconsistency between audio and text sequences. Specifically, we use a stack of n 1-dimensional convolutional layers with stride m to shrink the speech sequence (encoder output) by a factor of m^n.
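As a rough sketch of the adaptor's effect on sequence length (the kernel size and padding below are our assumptions for illustration, not values specified in the paper), the output length of a stack of strided 1-D convolutions can be computed as:

```python
def adaptor_output_length(seq_len, n_layers=3, stride=2, kernel=3, pad=1):
    """Output length after a stack of 1-D convolutional layers.

    Each layer applies the standard conv length formula; stacking
    n_layers layers of stride `stride` shrinks the sequence by roughly
    a factor of stride ** n_layers.
    """
    for _ in range(n_layers):
        seq_len = (seq_len + 2 * pad - kernel) // stride + 1
    return seq_len

# A 1024-frame encoder output shrinks by ~2**3 = 8x.
print(adaptor_output_length(1024))  # → 128
```

With n = 3 and m = 2 (the configuration used in our experiments), this gives the 8× downsampling described in Section 3.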

LNA Finetuning
Instead of finetuning all parameters of the pretrained models, we propose a parameter-efficient finetuning strategy (LNA) that finetunes only the layer normalization (LayerNorm) and multi-head attention (MHA) parameters. LNA is motivated by the need to bridge the discrepancy between pretraining and the downstream (ST) task, which we hypothesize is accounted for by the following parameters. LayerNorm: LayerNorm parameters of pretrained models were trained on the statistics of the pretraining data and thus need to be adapted to downstream tasks during finetuning. The importance of finetuning LayerNorm has been observed in multilingual (text-only) translation (Stickland et al., 2020). Attention: Encoder-attention (EA, attention to encoder outputs) parameters of the pretrained MT decoder were trained on the text-to-text MT task, so we hypothesize that it is crucial to adapt them to the speech encoder output. Combined with the LayerNorm parameters, this forms the proposed LNA-Minimalist finetuning. In addition, we also investigate the role of self-attention (SA) parameters in facilitating crosslingual transfer.
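To make the parameter selection concrete, here is a minimal sketch of choosing which decoder parameters stay trainable under LNA. The fairseq-style substrings (`layer_norm`, `encoder_attn`, `self_attn`) are our assumption about naming; the actual implementation may match parameters differently.

```python
def lna_trainable(param_names, include_self_attn=False):
    """Select parameters left trainable under LNA finetuning.

    LNA-Minimalist keeps LayerNorm and encoder-attention (EA)
    parameters; the +SA variant additionally unfreezes self-attention.
    Everything else (embeddings, feed-forward layers) stays frozen.
    """
    keep = []
    for name in param_names:
        if "layer_norm" in name or "encoder_attn" in name:
            keep.append(name)
        elif include_self_attn and "self_attn" in name:
            keep.append(name)
    return keep

names = [
    "decoder.layers.0.self_attn.q_proj.weight",
    "decoder.layers.0.encoder_attn.q_proj.weight",
    "decoder.layers.0.final_layer_norm.weight",
    "decoder.layers.0.fc1.weight",  # feed-forward: stays frozen
]
print(lna_trainable(names))
```

In a real training loop, one would set `requires_grad = False` on every parameter not returned by such a selector before constructing the optimizer.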

Joint Speech-text Finetuning
Multi-task learning has been shown to be an effective approach to improving speech translation with related tasks such as MT and ASR (Weiss et al., 2017; Anastasopoulos and Chiang, 2018; Bahar et al., 2019; Tang et al., 2021a,b). We jointly train the MT and ST tasks when finetuning the pretrained models: the speech transcripts serve as input for the MT task and the corresponding speech as input for the ST task. As a result, we can leverage abundant parallel text data to further improve performance.

Datasets
We evaluate our proposed models on two large-scale multilingual speech translation benchmarks.
Statistics of the datasets and implementation details are reported in Appendix A.2 and A.3. CoVoST 2 (Wang et al., 2020b) is a multilingual speech-to-text translation corpus covering English into 15 languages (En-X) and 21 languages into English (X-En). It provides a comprehensive test bed for low-resource scenarios, with 4 X-En directions having between 10 and 20 hours of training data and 11 X-En directions having less than 4 hours. Europarl ST (Iranzo-Sánchez et al., 2020) has both English-centric and non-English directions, which allows us to evaluate the proposed method's effectiveness for multilingual translation between any language pair, especially zero-shot performance. We experiment on all 6 languages (de, en, es, fr, it, pt) and compare to a multilingual baseline trained with all pairwise parallel data.

Training
We evaluate the following instantiation of the proposed method, which we refer to as XMEF (Cross-Modal Efficient Finetuning).
Encoder. We initialize the encoder with the open-sourced wav2vec 2.0 large architecture pretrained on unlabelled English-only audio from LibriVox (Baevski et al., 2020) (XMEF-En). For many-to-one experiments, we also experiment with a multilingual wav2vec 2.0 (XMEF-X) pretrained on raw audio from 53 languages (Conneau et al., 2020). The encoder output is followed by 3 1-D convolutional layers with stride 2, achieving 8× downsampling of the audio encoder outputs.
Decoder. We initialize the decoder with the open-sourced mBART50 models and the same vocabulary. We use mBART50 N→1 (49 languages to English) for X-En ST directions and mBART50 1→N (English to 49 languages) for En-X ST directions.
LNA Finetuning. We study the parameter efficiency and crosslingual transfer ability of LNA finetuning in the bilingual setting, without the additional effect of multilingual training. Drawing on those findings, we then evaluate applying LNA finetuning to the encoder only (LNA-E), the decoder only (LNA-D), and both (LNA-E,D). For multilingual finetuning on CoVoST 2, we use all X-En training data (except in zero-shot crosslingual transfer experiments) to evaluate X-En performance, and En-X data from all directions to evaluate En-X performance. For evaluating multilingual zero-shot performance on Europarl, we finetune only on X-En and En-X data and evaluate on all (X-X) pairs.
Joint Training. Two encoders, used for text and speech input respectively, are initialized with the pretrained mBART encoder and wav2vec 2.0 encoder mentioned above. The last 12 transformer layers in the wav2vec encoder are replaced with 12 mBART encoder layers, whose parameters are shared between the two encoders during joint training (Tang et al., 2021b).
The decoder is also shared between the two tasks and is initialized with the pretrained mBART decoder. We also experimented with adding the bitext used in ML50 as additional training data for the MT task; only the language pairs present in the CoVoST 2 dataset are chosen, covering all pairs except English to and from "Ca" and "Cy". We finetune all parameters in this experiment due to the large mismatch with the pretrained model (the mBART encoder serving as part of the speech encoder) and the larger amount of available training data.
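The layer-sharing scheme above can be illustrated with a toy sketch (list elements stand in for transformer layer modules; the layer counts follow the pretrained checkpoints, but the names are placeholders):

```python
def build_joint_encoders(wav2vec_layers, mbart_encoder_layers, n_shared=12):
    """Replace the last n_shared wav2vec transformer layers with mBART
    encoder layers. Because the shared layers are the same objects in
    both encoders, gradients from the ST and MT tasks update them jointly.
    """
    speech_encoder = wav2vec_layers[:-n_shared] + mbart_encoder_layers
    text_encoder = list(mbart_encoder_layers)
    return speech_encoder, text_encoder

wav2vec = [f"w2v_layer_{i}" for i in range(24)]        # 24-layer wav2vec 2.0 large
mbart = [f"mbart_enc_layer_{i}" for i in range(12)]    # 12-layer mBART encoder
speech_enc, text_enc = build_joint_encoders(wav2vec, mbart)
print(len(speech_enc))  # 12 remaining wav2vec layers + 12 shared mBART layers
```

In the actual model these would be `nn.Module` instances rather than strings, and sharing is achieved by reusing the same module objects in both forward passes.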

Baselines
From scratch: The first baseline trains a sequence-to-sequence model with the Transformer architecture without any pretraining. For CoVoST 2 experiments, we use the same model configuration as provided by Wang et al. (2020b).
ASR PT+Multi: Pretraining the encoder on an ASR task has been shown to be effective for improving speech translation and accelerating convergence (Bansal et al., 2019). We compare our results to a strong baseline provided by Wang et al. (2020b): a multilingual Transformer model trained on CoVoST 2 with multilingual ASR pretraining. For the Europarl ST many-to-many baseline, we use a Transformer architecture with a 12-layer encoder and 6-layer decoder, trained on all 30 directions. To provide the strongest baseline, its encoder was pretrained on LibriSpeech English ASR.
XMEF-BL: One-to-many (En-X) multilingual models usually face more interference, as they have been found to underperform their bilingual counterparts (Arivazhagan et al., 2019). Therefore, we also compare to applying our method (XMEF, LNA) in bilingual (BL) finetuning, i.e., finetuning on parallel data from a single language pair.
Previous SOTAs: We compare to the best end-to-end (E2E) model from the previous literature (Wang et al., 2020b; Iranzo-Sánchez et al., 2020) on each translation direction, which is usually the best-performing multilingual model trained with parallel data from all directions (both X-En and En-X) and also pretrained with ASR. Although the focus of the proposed method is E2E models, we also compare to the best-performing cascaded approach (Cascade SOTA), composed of a Transformer-large encoder from ASR pretraining and a multilingual MT model trained on all X-En and En-X data.

Parameter Efficiency
First, we evaluate the transfer learning performance of finetuning the entire pretrained model as well as the proposed efficient finetuning (LNA). To separate out the additional crosslingual transfer from multilingual finetuning, we evaluate on bilingual ST tasks (En-De and De-En in CoVoST 2). We first evaluate LNA-Minimalist (69M parameters), comparing to finetuning all parameters and finetuning only the top layers, which has been found effective for transfer learning in NLP tasks with pretrained BERT (Wu and Dredze, 2019; Kovaleva et al., 2019). Figure 2 shows that in both low- and high-data regimes, the proposed LNA-Minimalist generalizes better (lower perplexity on the dev set) and substantially improves training efficiency (only 10% of parameters to train, leading to lower memory cost and faster training).
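The parameter-efficiency figures quoted throughout (e.g., roughly 10% of parameters for LNA-Minimalist) are simply the trainable share of all pretrained weights. A sketch of the bookkeeping, with illustrative parameter counts (in millions) that are not the model's actual breakdown:

```python
def trainable_fraction(param_counts, trainable):
    """Fraction of pretrained parameters that are actually finetuned."""
    total = sum(param_counts.values())
    trained = sum(v for k, v in param_counts.items() if k in trainable)
    return trained / total

# Hypothetical per-group counts in millions, chosen so that
# LayerNorm + encoder attention sum to 69M out of a 700M model.
counts = {"layer_norm": 1, "encoder_attn": 68, "self_attn": 150, "ffn": 300, "embed": 181}
print(round(trainable_fraction(counts, {"layer_norm", "encoder_attn"}), 2))  # → 0.1
```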

Transfer from Pretraining
To assess the transfer ability of the English-pretrained encoder to other (speech) input languages, we evaluate XMEF-En on the CoVoST 2 De-En ST task. We investigate the role of finetuning encoder self-attention (LNA-ESA) in facilitating crosslingual transfer. We compare to the baselines of finetuning the entire encoder (All) and finetuning the feature extractor, which is commonly used for adaptation in ASR (Rivière et al., 2020).
Results are summarized in Figure 3. LNA again demonstrates better generalization than alternative finetuning approaches, with finetuning encoder self-attention (LNA-ESA) being crucial for adapting the pretrained English encoder to other languages.

Zero-shot Crosslingual Transfer
Next, we evaluate XMEF's crosslingual transfer performance from multilingual finetuning. To precisely measure the transfer capability, we evaluate the zero-shot setting, i.e., we finetune XMEF-En with parallel ST data from multiple languages and evaluate on an unseen language. We study transfer performance on the source (speech) and target (text) sides separately.
Source-side (speech) transfer. We evaluate in Table 1 whether the proposed approach enables positive crosslingual transfer when translating speech from unseen languages. We finetune on labelled data for 5 to-English language pairs and evaluate the finetuned model's zero-shot performance on speech input from an unseen language (Pt). First, we find that compared to finetuning more parameters (LNA-D and All), LNA finetuning (LNA-E,D) not only trains more than 2× faster but also generalizes better for both seen and unseen languages. In particular, it attains remarkable performance as unsupervised speech translation for Portuguese-English, achieving 8.2 BLEU (compared to 0.5 BLEU for the supervised bilingual baseline in Table 3) and even beating (+1.9 BLEU) the previous state-of-the-art for this direction, a supervised multilingual model.
Target-side (text) transfer. Table 2 shows that the proposed approach also achieves zero-shot transfer when translating into new languages, with unsupervised English-Japanese translation only 1.3 BLEU behind the best supervised result. Furthermore, an interesting finding is that applying LNA finetuning to the decoder is crucial for zero-shot transfer to unseen languages (Ja), as finetuning the entire decoder tends to overfit the model to the target languages seen during training.

Multilingual Speech Translation
We evaluate the performance of XMEF with multilingual finetuning on all 36 translation directions in CoVoST 2: all 21 languages into English (many-to-one) and English into 15 languages (one-to-many).
Many to one. Consistent with the observation of source-side crosslingual transfer in Sec 4.1, XMEF-En performs very well on Romance, Germanic and Slavic language families in both high-resource (≥ 100 hours of training data) and low-resource (7 ∼ 44 hours of training data) directions, as summarized in Table 3, even surpassing the best cascaded results on 8 languages. Our multilingual model also improves on distant (from English) and extremely low-resource (mostly ≤ 5 hours of training data) languages, as shown in the second panel of Table 3. For crosslingual adaptation of XMEF-En to speech input in other languages, LNA-E,D (finetuning only 21.5% of pretrained parameters) outperforms finetuning the entire model (Finetune All) by 0.7 BLEU (averaged across 21 directions), while finetuning the entire encoder (LNA-D) brings +1.2 BLEU. Finetuning XMEF-X achieves the best average BLEU score; however, the major improvement comes from finetuning the encoder (LNA-D).
One to many. Table 4 summarizes performance on translating from English into 15 languages, where multilingual models from XMEF-En improve the previous state-of-the-art (both E2E and cascaded) on all directions (+6.4 BLEU on average). Applying LNA finetuning to the encoder only (LNA-E) comes very close (24.2 vs. 24.5 averaged BLEU) to finetuning the entire model (Finetune All) while having 40% fewer parameters to train. Applying LNA to both encoder and decoder (LNA-Min, LNA-E,D) further reduces the number of trained parameters to only 8 ∼ 20% of all parameters in the pretrained models, yet still maintains strong performance compared to strong baselines such as ASR pretraining with multilingual finetuning (ASR PT+Multi) as well as the best cascaded models. The only two languages (Ca, Cy) that did not improve with LNA finetuning of the decoder were never seen during mBART pretraining.

Table 3: Performance of the X → En multilingual model. We report BLEU scores on the test set. For each XMEF method, we report the number of trained parameters in brackets. Previous E2E SOTA is the best-performing end-to-end multilingual model (with ASR pretraining) from Wang et al. (2020b). Results in bold are where the proposed approach improves the previous E2E SOTA; new SOTA results are underlined. * means our new E2E SOTA also beats the previous cascade SOTA.

Joint Training
In the many-to-one case (Table 3), language pairs with a reasonable amount of speech training data (18+ hours) and a large amount of parallel text data (1+ million sentences) ("Fr-En", "De-En", "Es-En", "It-En", "Ru-En" and "Fa-En") outperform the corresponding single-task models and achieve state-of-the-art results. However, if the amount of speech data is too small (10 hours or less), joint training is ineffective and may even hurt performance. In the one-to-many ("En-X") case, where there are 364 hours of English audio training data, joint training improves the results by another 0.6 BLEU (Table 4).

Zero-shot Many-to-Many Speech to Text Translation
Finally, we evaluate how the proposed approach performs in zero-shot multilingual translation (translating X → Y after training on X → En and En → Y). We apply LNA-D multilingual finetuning using only the En-X and X-En training data from the Europarl corpus. Table 5 reports both the supervised performance on to- and from-English directions and the zero-shot performance when translating between non-English languages without training on their parallel data. We compare to the strong baseline of a many-to-many multilingual model trained from scratch on all parallel data from both non-English and English-centric directions. Our approach improves both to- and from-English directions (+6.8 BLEU and +8.2 BLEU on average, respectively), and our zero-shot results also beat the supervised many-to-many model (+5.6 BLEU) on 28 pairwise translation directions (all except It-Pt and Pt-Es).

Table 4: Performance on En → X multilingual ST. We report BLEU scores on the test set. For each XMEF method, we report the number of trained parameters in brackets. 'BL' refers to using the same XMEF and LNA-E,D finetuning but only on a bilingual corpus. Results in bold are where the proposed approach improves the previous E2E SOTA; new SOTA results are underlined. * means our new E2E SOTA also beats the previous cascade SOTA. For multilingual models (i.e., the same model evaluated on multiple directions), we also report the average (Avg.) BLEU score across all 15 directions.

Ablation Studies
Ablation on LNA Finetuning. In Table 6 we analyze how the individual components of LNA contribute to generalization performance and training efficiency. Specifically, we examine the key components of LNA-Minimalist (LNA-Min) finetuning. We find that finetuning the LayerNorm parameters (far fewer than the multi-head attention parameters) is important for training stability: without them (-LN), training diverges. Finetuning the encoder-attention (EA) parameters is important for adapting the pretrained text decoder to the ST task. For adapting to a single-pair downstream ST task (English-German), we find that finetuning the self-attention (+SA) parameters in the decoder did not bring further improvement while significantly increasing the number of parameters to train.
Ablation on Length Adaptor. We study whether performance is sensitive to the downsampling ratio in the adaptor module. We conduct experiments in the CoVoST 2 many-to-one setting and report dev-set perplexity on three directions with diverse input languages: German-English (De-En), Chinese-English (Zh-En) and Estonian-English (Et-En). Table 7 shows our approach is not sensitive to common downsampling ratios (4 or 8), while extreme downsampling (27) hurts performance.

Related Work

Speech Translation. Sequence-to-sequence speech translation has shown great potential over the traditional cascaded approach (Berard et al., 2016; Goldwater et al., 2017; Weiss et al., 2017), with end-to-end approaches surpassing cascaded systems for the first time in a shared-task setting at IWSLT (Ansari et al., 2020). However, previous work also indicates that its success relies heavily on large amounts of labelled training data, which are difficult to acquire.

Table 5: Zero-shot performance (baseline/XMEF) on Europarl. The baseline is a many-to-many multilingual model trained on parallel data from all 30 directions. For our approach (XMEF), only to- and from-English directions (shaded) were used in multilingual finetuning; the rest are results of zero-shot translation. Bold marks where our model (supervised for English directions, zero-shot for the rest) outperforms the supervised many-to-many model. * means our zero-shot model also beats the supervised cascaded model of Iranzo-Sánchez et al. (2020).

Table 7: Ablation on the length adaptor with different downsampling ratios of speech input. The experiment was conducted with CoVoST 2 X-English multilingual finetuning; we report dev-set perplexity (PPL ↓) for three distinct languages.

In order
to mitigate the data scarcity issue, recent research focuses on multi-task learning (Weiss et al., 2017; Anastasopoulos and Chiang, 2018; Bahar et al., 2019; Wang et al., 2020c,d; Indurthi et al., 2020; Di Gangi et al., 2019), pretraining different components of the model (Bérard et al., 2018; Bansal et al., 2019), transfer learning (Gaido et al., 2020), and generating synthetic data (Jia et al., 2018).
Pretraining and Finetuning. Our work is motivated by the recent success of self-supervised learning for NLP and speech processing (Radford, 2018; Devlin et al., 2019; Clark et al., 2019; Lewis et al., 2019; Lample and Conneau, 2019; Dong et al., 2019; Rivière et al., 2020; Kawakami et al., 2020; Chung and Glass, 2020; Baevski et al., 2020), which has achieved state-of-the-art results when finetuned on downstream tasks (Devlin et al., 2019; Raffel et al., 2020). Our work attempts to leverage pretrained components from different modalities (text and speech) to perform the ST task. How to efficiently adapt large pretrained models has gained growing interest. Houlsby et al. (2019) and Pfeiffer et al. (2020) represent the stream of work that adds "adaptor modules" to achieve fast adaptation to downstream tasks. Another category of solutions focuses on selective finetuning of only the subset of parameters suitable for downstream tasks.
Our work belongs to the second category of efficient finetuning without adding extra parameters (e.g., adaptor modules). Empirical studies show that finetuning the final layers of BERT accounts for most of the quality gains on downstream tasks (Kovaleva et al., 2019). Finetuning LayerNorm parameters was also found effective for adapting pretrained BART or mBART to machine translation (Stickland et al., 2020). A more general approach is to automatically learn which layers/parameters of a large pretrained model to finetune or freeze (Guo et al., 2019), which we consider an exciting direction for future work.

Conclusion
We proposed a simple and effective approach to leverage pretrained single-modality models (such as wav2vec 2.0 and mBART) to perform speech-to-text translation. On two large-scale multilingual speech translation benchmarks, our approach advances the state-of-the-art (+6.6 BLEU on average for 36 translation directions in CoVoST 2, and +5.6 BLEU for 28 translation directions in Europarl).
We provide an efficient finetuning strategy that is not only data- and parameter-efficient, but also demonstrates crosslingual transfer ability while finetuning only 10 ∼ 50% of the parameters of large pretrained models.

A Appendix
A.1 Description of Pretrained Models
wav2vec 2.0 is a simple and powerful framework for learning high-quality speech representations from unlabelled audio data. It mainly consists of two components: a feature encoder and a context encoder. The feature encoder, built from temporal convolution layers, takes the raw audio signal O as input and generates latent speech representations Z = [z_1, ..., z_T]. These are fed to the Transformer-based context encoder to generate context representations C = [c_1, ..., c_T] with sequence-level information. During pretraining, spans of the input to the context encoder are masked, and the model is optimized with a contrastive task to distinguish the true latent from distractors. The latent speech representations Z are discretized to Q = [q_1, ..., q_T] and used as targets for the frames in the masked spans.
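A toy version of the contrastive scoring (the 2-D vectors and temperature value here are illustrative only; wav2vec 2.0 compares the context output against quantized targets via cosine similarity):

```python
import math

def prob_true_latent(context, true_q, distractors, temp=0.1):
    """Softmax probability that `context` selects the true quantized
    latent over distractors, using temperature-scaled cosine similarity.
    The contrastive loss is the negative log of this probability."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
    sims = [cos(context, true_q)] + [cos(context, d) for d in distractors]
    exps = [math.exp(s / temp) for s in sims]
    return exps[0] / sum(exps)

c = [1.0, 0.0]
p = prob_true_latent(c, true_q=[0.9, 0.1], distractors=[[0.0, 1.0], [-1.0, 0.0]])
print(p > 0.99)  # a well-aligned target dominates the distractors
```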
mBART is a sequence-to-sequence generative pretraining scheme, specifically a denoising autoencoder (DAE) trained to predict the original text x given g(x), where g is a noising function that corrupts text, e.g., by random span masking and order permutation. The model is trained with monolingual data of N languages, D = {D_1, ..., D_N}, where each D_i is a collection of documents in language i. The pretraining objective maximizes

L_θ = Σ_{D_i ∈ D} Σ_{x ∈ D_i} log P(x | g(x); θ),

where x is an instance in language i and the distribution P is parameterized by the sequence-to-sequence model.
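A toy version of the noising function g (mBART actually samples span lengths from a Poisson distribution and additionally permutes sentence order; the fixed span length and mask symbol below are simplifications for illustration):

```python
import random

def span_mask(tokens, span_len=3, mask_token="<mask>", seed=0):
    """Replace one random contiguous span with a single mask token,
    so the model must regenerate the original text x from g(x)."""
    if len(tokens) <= span_len:
        return [mask_token]
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span_len + 1)
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

x = "the quick brown fox jumps over the lazy dog".split()
print(span_mask(x))
```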

A.2 Data
We provide statistics of the CoVoST 2 dataset (Wang et al., 2020b) together with the list of languages used in our experiments and their ISO codes (e.g., Ta: Tamil, Tr: Turkish, Zh: Chinese (Simplified)).

A.3 Implementation Details
Preprocessing. When using the wav2vec 2.0 encoder, we use 16-bit 16kHz mono-channel audio as input. When using a traditional speech recognition (ASR) encoder, we extract 80-channel log mel-filterbank features (25ms window size and 10ms shift) with utterance-level cepstral mean and variance normalization applied. We remove training samples with more than 3,000 frames for GPU memory efficiency. For preprocessing the target (text) data, we use the same vocabulary as the pretrained mBART model.
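For context, the 3,000-frame cap corresponds to roughly 30 seconds of audio under the 25ms/10ms filterbank setting. A quick check (the no-padding rounding convention is our assumption):

```python
def num_frames(duration_s, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of log mel-filterbank frames for an utterance, using a
    sliding window of win_ms with a hop_ms shift and no padding."""
    samples = int(duration_s * sample_rate)
    win = sample_rate * win_ms // 1000   # 400 samples at 16kHz
    hop = sample_rate * hop_ms // 1000   # 160 samples at 16kHz
    if samples < win:
        return 0
    return 1 + (samples - win) // hop

print(num_frames(30.0))  # → 2998, just under the 3,000-frame cap
```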

Pretrained models.
We use the open-sourced wav2vec 2.0 models and the mBART50 models pretrained with multilingual parallel text data. These models can be downloaded from https://github.com/pytorch/ For XMEF-En, we use the 960-hour Wav2Vec 2.0 Large (LV-60) model. For XMEF-X, we use the 56K-hour XLSR-53 Large model. For the decoder, we use the pretrained "mBART 50 finetuned many-to-one" model for many-to-one experiments and "mBART 50 finetuned one-to-many" for one-to-many experiments.
Training. We implement all our experiments using fairseq S2T (Wang et al., 2020a). Our experiments are run on 32 Nvidia V100 GPUs (32GB) with a batch size of 256k tokens. We use the FP16 training implemented in fairseq. We apply the same regularization as the baseline models, e.g., label smoothing 0.3 and attention dropout probability 0.3. We choose the learning rate among {1e-5, 5e-5, 1e-4} based on validation accuracy (measured on the dev set). For multilingual wav2vec 2.0, we enable the normalization flag to be consistent with pretraining. We do not apply any temperature adjustment when sampling language pairs during training, but simply train on the empirical distribution of training data volumes.
Evaluation. We use the best checkpoint (without checkpoint averaging) according to validation loss and a beam size of 5 for decoding. We report case-sensitive detokenized BLEU using sacreBLEU (Post, 2018), except for Japanese and Chinese translations (no word segmentation), where we report character-level BLEU.