DUB: Discrete Unit Back-translation for Speech Translation

How can speech-to-text translation (ST) perform as well as machine translation (MT)? The key is to bridge the modality gap between speech and text so that useful MT techniques can be applied to ST. Recently, representing speech with unsupervised discrete units has offered a new way to ease this modality problem. This motivates us to propose Discrete Unit Back-translation (DUB), answering two questions: (1) Is it better to represent speech with discrete units than with continuous features in direct ST? (2) How much benefit can useful MT techniques bring to ST? With DUB, the back-translation technique can be successfully applied to direct ST, yielding an average boost of 5.5 BLEU on MuST-C En-De/Fr/Es. In the low-resource language scenario, our method achieves performance comparable to existing methods that rely on large-scale external data. Code and models are available at https://github.com/0nutation/DUB.


Introduction
Speech-to-text translation (ST) converts spoken source-language input into written target-language text, a task closely related to machine translation (MT). In recent years, direct ST that does not rely on intermediate transcription has received considerable attention due to its potential applications in unwritten-language scenarios and various domains (Bérard et al., 2016; Sung et al., 2019; Han et al., 2021; Papi et al., 2021; Fang et al., 2022; Ye et al., 2022; Cheng et al., 2023). One of the major challenges faced by ST is data scarcity, similar to the low-resource scenarios encountered in MT. Intuitively, techniques developed for low-resource MT (Imamura et al., 2018; Xia et al., 2019; Chen et al., 2020; Liu et al., 2020) should be useful for improving ST performance. However, these techniques are hard to transfer to ST due to the modality gap between speech and text: ST takes continuous speech as input, while MT takes discrete tokens. Generally speaking, if the modality gap can be removed efficiently, a large number of useful NLP techniques can be applied to facilitate ST.
Recently, representing speech with unsupervised discrete units has become popular and successful in speech processing (Baevski et al., 2019, 2020; Hsu et al., 2021; Lakhotia et al., 2021). Rather than losing relevant information, discretizing continuous speech has been found to filter out extraneous signals (Sicherman and Adi, 2023; Lakhotia et al., 2021), leading to significant improvements in speech tasks such as automatic speech recognition (Meng et al., 2022), text-to-speech (Dunbar et al., 2019), and speech-to-speech translation (Zhang et al., 2021; Lee et al., 2022; Inaguma et al., 2022). Based on this observation, we are motivated to explore the following two questions: (1) Is it better to represent speech with discrete units as model input than with continuous features for direct ST? (2) By narrowing the modality gap with discrete speech units, how much benefit can useful MT techniques bring to direct ST?
In this paper, we propose Discrete Unit Back-translation (DUB), which migrates the useful back-translation technique from MT to ST by discretizing speech signals into unit sequences. In our method, we first convert speech into discrete units using the clustering indices of HuBERT (Hsu et al., 2021) representations. To perform the translation task, we feed the discrete units into a Unit-to-Text Translation (U2TT) model. For the back-translation training strategy, DUB employs a text-to-unit translation model that learns to predict the source discrete units from the target text. By leveraging additional, easily accessible text in the target language, we use the synthetic parallel data generated by the text-to-unit model, together with the original parallel data, to update the final unit-to-text model.
Our contributions include the following.
• We design a novel unit-to-text translation (U2TT) framework for direct ST by discretizing the speech features in an unsupervised manner. Our analysis shows that in such a framework, the units retain the semantic information needed for translation and can serve as model input.

Related Work
Speech translation. Translating audio directly into the target language, without using textual transcriptions during inference or training, is very meaningful for languages that have no written form. Bérard et al. (2016) first proposed an end-to-end encoder-decoder architecture for such direct speech-to-text translation. Later, novel models (Di Gangi et al., 2019b; Dong et al., 2021a,b; Zheng et al., 2021) and training techniques, such as multi-task learning (Indurthi et al., 2021; Tang et al., 2021; Ye et al., 2021), knowledge distillation (Liu et al., 2019; Dong et al., 2021b), and pre-training (Zheng et al., 2021; Zhang et al., 2022c), were developed to improve end-to-end performance. However, these training methods often rely on source text or on knowledge from pre-trained models. Without using transcripts or pre-training, Zhang et al. (2022a) proposed a parameterized distance penalty to better model speech locality in the self-attention structure and reported results on ST benchmarks covering 23 languages.

Back-translation. Back-translation (BT) is a widely used method for improving machine translation systems by training a target-to-source model and creating synthetic parallel data from monolingual target text. It has been shown to be effective in both statistical (Bertoldi and Federico, 2009; Bojar and Tamchyna, 2011) and neural machine translation models (Sennrich et al., 2016; Edunov et al., 2018; Hoang et al., 2018), and is frequently used to improve translation performance in WMT competitions (Farhad et al., 2021; Wenzek et al., 2021; Adelani et al., 2022). A similar data-augmentation idea, synthesizing speech from utterances, can be applied to automatic speech recognition (ASR) (Tjandra et al., 2017; Hayashi et al., 2018; Ueno et al., 2021). However, applying BT to ST is not trivial. Zhang et al. (2022b) augmented the triplet data with TTS generation from transcriptions, but their experiments showed that such scaling yields minimal improvement to the final ST model.
Discrete speech representation. Discrete speech representations are often studied in work on self-supervised speech representation learning (Van Den Oord et al., 2017; Baevski et al., 2019, 2020; Hsu et al., 2021; Meng et al., 2022). For example, Van Den Oord et al. (2017) proposed the Vector Quantised-Variational AutoEncoder (VQ-VAE) to map continuous signals, such as speech or images, into a discrete sequence space. Hsu et al. (2021) proposed HuBERT, which learns self-supervised speech representations by extracting speech features, clustering them offline, and iteratively training to predict the cluster indices of features at masked locations. Although the clustered discrete representations are only a by-product of HuBERT, they have been used to build generative spoken language models (Lakhotia et al., 2021; Kharitonov et al., 2022), enhance speech representations (Chung et al., 2021; Meng et al., 2022; Chen et al., 2022; Wu et al., 2022; Zhang et al., 2022c), and model direct speech-to-speech translation (Lee et al., 2022; Inaguma et al., 2022). In the prior literature, probably the most similar task to ours is textless speech-to-speech translation (Lee et al., 2022; Nguyen et al., 2022); the difference is that they discretize the target-side speech and convert speech-to-speech generation into speech-to-discrete-unit generation, while we discretize the speech on the source side. Zhang et al. (2022c) leveraged discrete units as an interface to align speech and text and proposed a unified-modal encoder-decoder pre-training model, SpeechUT. SpeechUT aims to improve speech representation via the units, while we use the units to construct a unified framework for ST and MT and to explore transferable training methods from NLP to speech.
3 Our Approach

Problem Formulation
Unlike cascade systems or existing end-to-end ST work that utilizes speech-transcription-translation triplets $(s, x, y)$, we aim to build and train a model that translates speech directly into text in another language without using the transcription $x$. The training dataset is denoted as $D_{s,y} = \{(s, y)\}$. We also introduce a monolingual corpus of the target language, $D'_y = \{y'\}$, and enhance the model via the discrete unit back-translation (DUB) method described in Section 3.3.

Model Structure
As illustrated in Figure 1(a), our model consists of three main components: a discrete unit extractor, a unit-to-text translation model, and a text-to-unit translation model.

Discrete Unit Extractor. The discrete unit extractor converts continuous speech signals into a sequence of discrete units, for which we use Hidden-unit BERT (HuBERT) (Hsu et al., 2021). HuBERT is a self-supervised model learned by predicting discrete labels of masked audio segments, obtained from k-means clustering on the model's intermediate representations. It consists of a stack of 1-D convolutional layers and a Transformer encoder that encode speech into continuous intermediate representations, and a k-means model that converts these representations into a sequence of cluster indices. We then remove adjacent duplicate indices to obtain the discrete unit sequence, denoted as $u = (u_1, u_2, \ldots, u_T)$, $u_i \in \{0, 1, \ldots, K-1\}$, $\forall 1 \le i \le T$, where $K$ is the number of clusters. Note that the discrete unit extractor is used offline during the pre-processing stage before translation, and can be regarded as a feature extractor.
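For concreteness, the sketch below shows one way to implement the extractor with a HuggingFace HuBERT checkpoint and a scikit-learn k-means model; the checkpoint name and the offline k-means fitting are illustrative assumptions, not details fixed by our released code.

```python
# A minimal sketch of the discrete unit extractor, assuming the checkpoint
# "facebook/hubert-base-ls960" and a k-means model fitted offline on
# layer-9 features (we use K=500 clusters for MuST-C).
import itertools

import torch
from sklearn.cluster import MiniBatchKMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960").eval()

def layer9_features(waveform_16khz: torch.Tensor) -> torch.Tensor:
    """Return HuBERT layer-9 hidden states, shape (frames, 768)."""
    inputs = extractor(waveform_16khz, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        out = hubert(inputs.input_values, output_hidden_states=True)
    return out.hidden_states[9].squeeze(0)  # index 0 is the pre-Transformer input

def speech_to_units(waveform_16khz: torch.Tensor, kmeans: MiniBatchKMeans) -> list[int]:
    feats = layer9_features(waveform_16khz).numpy()
    indices = kmeans.predict(feats)                        # frame-level cluster ids
    return [int(u) for u, _ in itertools.groupby(indices)] # drop adjacent duplicates
```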
Unit-to-Text Translation (U2TT) Model. The U2TT model $\theta_{u\to y}$ performs the forward translation. It consists of a discrete unit embedding layer and a Transformer. The embedding layer converts the discrete units $u$ into embeddings $e = (e_1, e_2, \ldots, e_T)$. To retain more contextual and textual information from HuBERT, we adopt the intermediate representations of HuBERT's k-means cluster centroids as prior knowledge to initialize the unit embedding. This initialization is referred to as pre-trained embedding in the later analysis (Section 6.1). The Transformer follows the vanilla architecture (Vaswani et al., 2017), consisting of a Transformer encoder and decoder. The encoder takes the unit embeddings $e$ plus sinusoidal positional embeddings as input and outputs a semantic representation. The decoder generates the translation sequence $y = (y_1, y_2, \ldots, y_{|y|})$ autoregressively based on this semantic representation.
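The pre-trained embedding initialization can be sketched as follows; the linear projection for mismatched dimensions and the number of reserved special-token rows are illustrative assumptions rather than confirmed implementation details.

```python
# A sketch of the "pre-trained embedding" initialization, assuming the
# k-means centroids (K x 768) are projected to the model's embedding size
# when the dimensions differ.
import torch
import torch.nn as nn

def init_unit_embedding(centroids: torch.Tensor, embed_dim: int,
                        n_special: int = 4) -> nn.Embedding:
    """Build a unit embedding whose rows start from the k-means centroids.

    `centroids` has shape (K, feat_dim); the first `n_special` vocabulary rows
    (e.g. <pad>, <s>, </s>, <unk>) are left randomly initialized.
    """
    K, feat_dim = centroids.shape
    emb = nn.Embedding(K + n_special, embed_dim, padding_idx=0)
    with torch.no_grad():
        if feat_dim != embed_dim:
            proj = nn.Linear(feat_dim, embed_dim, bias=False)
            centroids = proj(centroids)
        emb.weight[n_special:] = centroids
    return emb
```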
Text-to-Unit Translation (T2UT) Model. The T2UT model $\theta_{y\to u}$ has the same structure as the U2TT model, but with a randomly initialized text embedding layer. It is added to perform text-to-unit translation and to enable DUB training.

Discrete Unit Back-Translation (DUB) Training Steps. Given the ST parallel dataset $D_{s,y} = \{(s, y)\}$, the extra target-language corpus $D'_y = \{y'\}$, and the discrete unit extractor $E$, the DUB training steps are as follows (see Figure 1(b)):
1. Extract units for each speech input, $u = E(s)$, to obtain unit-translation pairs $D_{u,y} = \{(u, y)\}$;
2. Train the T2UT model on $D_{u,y}$ with the cross-entropy loss in Eq. (2);
3. For each text $y' \in D'_y$, generate the corresponding synthetic units $\hat{u}'$ through the BT model (generation methods are discussed in Section 3.4), then add a special <BT> indicator at the beginning of $\hat{u}'$ (Caswell et al., 2019). The synthetic unit-translation set is denoted as $D'_{u,y} = \{(\hat{u}', y')\}$;
4. Up-sample the original training data by a rate of $r$ and train the U2TT model on $D_{u,y} \cup D'_{u,y}$ with the loss in Eq. (1).

Training Objective. The training objectives for the U2TT and T2UT models are the negative log-likelihood losses over the unit-translation pairs:
$$\mathcal{L}_{u\to y} = -\sum_{(u,y)\in D} \log p(y \mid u; \theta_{u\to y}), \quad (1)$$
$$\mathcal{L}_{y\to u} = -\sum_{(u,y)\in D_{u,y}} \log p(u \mid y; \theta_{y\to u}), \quad (2)$$
where $D$ refers to $D_{u,y} \cup D'_{u,y}$ for DUB training, and when $D = D_{u,y}$, Eq. (1) is the loss for training U2TT from scratch.
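A minimal sketch of steps 3-4 above follows; `t2ut_generate`, `bt_token`, and the data-structure choices are placeholders for illustration, not our actual implementation.

```python
# One round of DUB data augmentation: back-translate target text into
# pseudo-units, prepend the <BT> tag (Caswell et al., 2019), and up-sample
# the original pairs by rate r before training U2TT.
import random
from typing import Callable

def build_dub_training_set(
    original_pairs: list[tuple[list[int], str]],    # D_{u,y} = {(u, y)}
    extra_text: list[str],                          # D'_y = {y'}
    t2ut_generate: Callable[[str], list[int]],      # trained T2UT (BT) model
    bt_token: int,                                  # id of the <BT> tag
    upsample_rate: int = 32,
) -> list[tuple[list[int], str]]:
    synthetic = [([bt_token] + t2ut_generate(y), y) for y in extra_text]
    mixed = original_pairs * upsample_rate + synthetic
    random.shuffle(mixed)
    return mixed
```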

Generation methods for back-translated units
We explore the following generation methods for producing synthetic units: beam search, sampling, and top-k sampling. We also apply a speech normalization method to remove speaker information when generating units.
Beam search tries to identify the maximum a posteriori (MAP) output, generating the sentence with the largest estimated probability given the input. Sampling draws randomly from the distribution at each step, which produces diverse outputs. Top-k sampling is a middle ground between beam search and sampling: at each time step, we select the k most likely tokens from the output distribution, re-normalize, and then sample from this restricted set.
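A sketch of a single top-k sampling step (we use k=10 in the experiments) over the T2UT decoder's next-unit distribution:

```python
# Keep the k most likely units, re-normalize, and sample from the
# restricted set.
import torch

def topk_sample_step(logits: torch.Tensor, k: int = 10) -> int:
    """logits: (vocab_size,) unnormalized scores for the next unit."""
    top_logits, top_ids = torch.topk(logits, k)
    probs = torch.softmax(top_logits, dim=-1)        # re-normalize over top-k
    choice = torch.multinomial(probs, num_samples=1)
    return int(top_ids[choice])
```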
The discrete unit extractor produces different unit sequences for speech with the same content when it is delivered by different speakers (Lee et al., 2022). These variations pose a challenge for training the BT model. To address this issue, we adopt the speech normalization module from Lee et al. (2022), which removes speaker information from the discrete units and produces norm units. Specifically, it is an off-the-shelf HuBERT-CTC model trained on VoxPopuli (Wang et al., 2021a) that normalizes variable speech input to a single speaker to eliminate such influence (denoted as Speech Norm). We implement back-translation with norm units and use the resulting BT model to generate pseudo norm units.

Datasets
MuST-C. MuST-C (Di Gangi et al., 2019a), one of the most widely used ST benchmarks, contains translations from English to 8 languages, collected from TED talks. We train and validate our approach in three ST directions: English-German (En-De), English-French (En-Fr), and English-Spanish (En-Es).
CoVoST-2 X-En. CoVoST-2 (Wang et al., 2021b) is a multilingual ST corpus derived from the Common Voice project, offering translations from English to 15 languages and from 21 languages to English. We conduct experiments on X-En, including high-resource languages (≥ 10 hours of speech) such as French (Fr) and German (De), and low-resource languages (≤ 2 hours of speech) such as Arabic (Ar), Swedish (Sv), and Japanese (Ja). Since no transcription is needed, the evaluation focuses on the capability of our method to generalize to low-resource unwritten languages.

Monolingual text corpus. Monolingual target-language text corpora are introduced for back-translation. For MuST-C, we include 48M German, 79M French, and 64M Spanish sentences sampled from the TED (Duh, 2018), WMT (Bojar et al., 2016), and CCMatrix (Schwenk et al., 2019) datasets for En-De/Fr/Es, respectively. For the CoVoST-2 X-En experiments, we introduce 1M extra English sentences sampled from the transcriptions of Common Voice Corpus 11.0 (Ardila et al., 2020).
All statistics of the datasets are in Appendix A.

Experimental setups
Pre-processing. The model accepts 16-bit 16 kHz mono-channel raw waveform speech and discretizes it into units. We denote the discrete unit clusters by numbers (e.g. #1, #2) and, combining them with the target-language sentences, learn a joint vocabulary via SentencePiece (Kudo and Richardson, 2018). We set the joint vocabulary size to 8000 for both MuST-C and CoVoST-2.
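A sketch of this pre-processing, assuming the units are serialized as `#<id>` tokens and written into one training file together with the target sentences; the file and model names are placeholders.

```python
# Joint unit/text vocabulary learning with SentencePiece.
import sentencepiece as spm

def units_to_line(units: list[int]) -> str:
    return " ".join(f"#{u}" for u in units)   # e.g. "#17 #4 #388 ..."

# units_and_text.txt holds one unit sequence or one target sentence per line.
spm.SentencePieceTrainer.train(
    input="units_and_text.txt",
    model_prefix="joint_spm",
    vocab_size=8000,                          # 8k for MuST-C and CoVoST-2
    character_coverage=1.0,
)
```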
Model Configuration. For the MuST-C experiments, we use HuBERT-base (pre-trained on LibriSpeech without fine-tuning) with a 500-cluster k-means quantizer over the 9th-layer representations as the discrete unit extractor. For CoVoST-2, we employ mHuBERT with a 1000-cluster k-means quantizer over the 11th-layer representations, pre-trained on VoxPopuli (Wang et al., 2021a) speech in English, Spanish, and French.

Baseline models
We compare our method with the baselines listed in Tables 1-3 (see Appendix C for details). In particular, we explain the following baselines that do not involve transcriptions during training. Revisit ST (Zhang et al., 2022a) is a direct speech-to-translation model with a parameterized distance penalty (PDP) and CTC regularization; its framework and training objectives differ substantially from ours. Unit-to-Text Translation (U2TT) has the structure described in Section 3.2 and is trained from scratch using only the speech-translation supervision from the ST dataset, without applying DUB; comparison with this model helps to isolate the influence of DUB. Transformer-ST stands for training the Speech-Transformer (Dong et al., 2018) from scratch, but without the ASR pre-training used in previous work.


Main results on Speech-to-text Translation
MuST-C. As shown in Table 1, compared with methods that do not involve transcribed text, our method, U2TT (LARGE) with DUB, achieves the best ST results by introducing extra target-language text, and DUB obtains an average boost of 5.5 BLEU over U2TT in the three En-X directions. Encouragingly, our method achieves performance comparable to previous models that utilize transcriptions through multi-task learning or pre-training. As for the baselines, U2TT outperforms Transformer-ST, which suggests that the discrete units retain the semantic information of the audio features (e.g. log Mel-filter bank, abbr. Fbank) needed for translation. As for the gap between our method and the SoTA system, we note that SpeechUT (Zhang et al., 2022c) performed various mask-predict pre-training tasks using an extra 1.4k hours of speech and parallel MT data from WMT, which our approach does not use.

Table 3: Test BLEU scores on CoVoST-2 low-resource X-En language pairs with less than 2 hours of speech. †: results from (Wang et al., 2021b). *: results from (Wu et al., 2022). ♢: results from (Babu et al., 2021). ♠: results from (Li et al., 2021). The numbers in parentheses are their parameter sizes. Random sampling is the decoding strategy for DUB.
CoVoST-2. Our method behaves similarly on the high-resource CoVoST-2 X-En pairs as on MuST-C (Table 2). Without auxiliary data or pre-training, adding only 1M additional English sentences, DUB improves by an average of 2.1 BLEU over U2TT across 7 language pairs, and by an average of 3.4 BLEU over the cascaded ST system. In the low-resource setting, our method brings improvements on almost every language pair and achieves better performance than large-scale multilingual speech or text pre-training models, such as the XLS-R+mBART-50 model (Babu et al., 2021), with far fewer parameters. Since the discrete unit extractor is unsupervised, our method requires no transcriptions, which is particularly advantageous for unwritten-language ST; this experiment mimics the low-resource nature of unwritten languages in practice. The results also show that the U2TT model and DUB training have the potential to translate low-resource unwritten languages. The key benefit of DUB is its ability to exploit large amounts of monolingual text. Here, alternative techniques such as pseudo-labeling and pre-training (implemented as Cascaded BT and Bi-modal BART, described below) are also evaluated on MuST-C En-De translation, by introducing an equivalent corpus of 10 million German sentences.
• Cascaded BT builds a target-to-source MT-TTS pipeline to construct pseudo speech-translation pairs as augmented training data. Specifically, we use the transcription-translation pairs of MuST-C to train a back-translation MT model and use the released FastSpeech 2 (Ren et al., 2020) with a HiFi-GAN (Kong et al., 2020) vocoder for TTS generation.
• Bi-modal BART has the same structure as U2TT and is pre-trained by denoising large-scale corrupted discrete units and monolingual text, following the recipe of mBART (Liu et al., 2020). We combine the 10M additional sentences with 7M discrete unit sequences extracted from 10k hours of speech in GigaSpeech (Chen et al., 2021) to pre-train the model, and fine-tune it on MuST-C unit-translation pairs. See Appendix B for training details.
As shown in Table 4, when introducing an equivalent amount of raw text, DUB is superior to the above two approaches and has greater potential to exploit monolingual raw text. We find that the gain from cascaded BT-synthesized speech is limited because the synthetic speech is robotic and monotonic, making it easy for the model to overfit to the synthetic pairs. Although bi-modal BART pre-training brings about a 2 BLEU improvement, it is still inferior to DUB. We attribute this to the gap between the denoising pre-training task and the downstream generation task, a gap that DUB does not have. Meanwhile, we observe that combining bi-modal BART and DUB brings further performance improvements, indicating that they are complementary.

5.2 The better the pseudo-units, the more effective the DUB method?
In Section 3.4, we presented four generation methods for creating synthetic pseudo-units with the BT model: beam search, sampling, top-k sampling, and speech normalization. In the experiments, we set a beam size of 5 for beam search, k=10 for top-k sampling, and use the off-the-shelf speech normalizer from Lee et al. (2022) for Speech Norm. Does the forward model gain more from synthesized pairs when the synthesized units are of higher quality? We calculate the Unit Error Rate (UER) on the MuST-C validation set to assess synthesis quality; a lower UER indicates that the generated units are closer to the directly extracted units, i.e. of higher quality. We systematically vary the back-translated data from 1M to 10M and present the BLEU scores and UERs of the generation methods in Table 5 and Figure 2. The Speech Norm module produces the highest-quality synthesized units, while the sampling-based methods produce lower quality. Interestingly, the sampling method, with the lowest synthesis quality, yields the most significant improvement in the forward model.
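Concretely, UER can be computed like WER: edit distance between the synthesized and directly extracted unit sequences, normalized by the reference length. A minimal sketch, assuming this standard definition:

```python
# Levenshtein distance between hypothesis and reference unit sequences,
# normalized by reference length.
def unit_error_rate(hyp: list[int], ref: list[int]) -> float:
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = d[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(hyp)][len(ref)] / max(len(ref), 1)
```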
We conjecture that the richness and irregularity of the synthesized data better improve the forward ST model, while regular pseudo-units, e.g. those generated by MAP-based beam search, are more predictable and less conducive to performance improvement. This is consistent with previous findings on BT techniques in machine translation (Edunov et al., 2018). In addition, Speech Norm, which normalizes speech to a single speaker, is not necessary for our DUB method. Although this operation makes the ST model easier to learn and the UER smaller, it compromises the diversity of the synthetic data, which does not help performance; the model's generalization ability weakens as these single-speaker synthesis units increase.

6.1 Are discrete units suitable features for ST input?
We examine how much semantic information is retained by different input forms by comparing results on the downstream ST task (Table 6). Training from scratch, we find that the U2TT model translates better than Transformer-ST (19.9 vs. 18.0), indicating that, compared with spectral speech features such as Fbank, the discrete unit is a better input choice and loses none of the semantic information required for translation. We assume this is strongly correlated with the HuBERT-based discrete unit extractor, since HuBERT is designed to learn a combined acoustic and language model over continuous speech input, which preserves much textual information. Strictly speaking, however, compared with the continuous HuBERT representation, the discretization procedure does incur semantic information loss: comparing Lines II and III, there is a gap of 2.9 BLEU between U2TT and HuBERT-Transformer (where the frozen HuBERT layer-9 continuous representation is taken as ST input). Fortunately, the gap can be compensated by a) initializing the unit embedding with the corresponding k-means cluster centroids of the continuous HuBERT representations, as described in Section 3.2 (denoted as pre-trained embedding, Line IV), which closes the gap slightly, by 0.5 BLEU; and b) simply introducing only 1M additional sentences and applying DUB, which achieves a 2.7 BLEU improvement (Line V vs. IV).
6.2 Can we recover faithful speech from pseudo-units?
Do the back-translated units capture the semantics of the target-language text? Since it is difficult to directly evaluate the correctness of the pseudo-units generated by the back-translation model, we concatenate a unit-based HiFi-GAN vocoder with our back-translation model to recover speech from the generated pseudo-units, thereby completing the text-to-speech (TTS) translation task. TTS generation quality is measured by ASR-BLEURT: we transcribe the speech output using a high-quality open-source ASR model and calculate BLEURT against the reference transcription. As shown in Table 7, the ASR-BLEURT of beam search and sampling is 0.60 and 0.47 respectively, indicating that a unit sequence back-translated from a given target-language text conveys its general semantic meaning, which underpins the success of DUB. We also conduct a listening test, checking 30 randomly sampled BT-recovered speech samples for semantic consistency with the ground truth. 22 of the 30 sentences matched the ground-truth speech; the remaining 8 had minor issues, with only 1 of low quality and the other 7 missing or repeating 1-2 details. We also provide some generated audio samples in Appendix F to illustrate the degree of speech restoration.
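The evaluation chain can be sketched as follows; `vocoder`, `asr`, and `bleurt_score` are placeholders standing in for the released unit HiFi-GAN vocoder, the open-source ASR model, and the BLEURT scorer, not concrete APIs.

```python
# High-level ASR-BLEURT evaluation: pseudo-units -> waveform -> transcript,
# scored against the reference transcription.
def asr_bleurt(pseudo_units: list[list[int]], references: list[str],
               vocoder, asr, bleurt_score) -> float:
    scores = []
    for units, ref in zip(pseudo_units, references):
        wav = vocoder(units)            # units -> waveform
        hyp = asr(wav)                  # waveform -> transcript
        scores.append(bleurt_score(ref, hyp))
    return sum(scores) / len(scores)
```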

Conclusion
In this paper, we propose Discrete Unit Back-translation (DUB), together with the Unit-to-Text Translation (U2TT) framework for direct speech translation, which discretizes speech into units so that the back-translation technique from MT can be applied to ST.

Table 7: ASR-BLEURT for speech recovered from back-translated units: beam search 0.60, sampling 0.47.

Broader Impact
Our proposed model structure, with a discrete unit extractor for speech and a unit-to-text translation model, does not need any transcriptions during training, and is therefore particularly relevant for speech translation for the more than 3,000 languages and dialects in the world that cannot be transcribed. Since these unwritten languages are typically low-resource, we emphasize that boosting ST performance via text-to-unit back-translation data augmentation, i.e. DUB, is very promising. Meanwhile, as a by-product of DUB, TTS translation has significant implications for assisting visually impaired or dyslexic people in understanding the world, as well as for preserving low-resource unwritten spoken languages. However, as exploratory work, we focus on investigating the potential of using BT to enhance ST performance, and popular large-scale pre-training methods are not employed in this paper, which may leave our method slightly behind them in performance. Promisingly, in terms of structure, the model is more general across modalities and has more potential to integrate with methods from the NLP area, which might be a topic of future research. Also, the models are still far from real industrial applications: the data used for training is much smaller than real-world scale, and real speech is noisier and more complex than open-source datasets, which may require front-end processing. Moreover, the success of our method is partly attributable to the HuBERT representation, which contains certain textual information about the speech, and our experiments show that the quality of the discrete units influences translation performance. Nevertheless, learning meaningful discrete units is not the primary goal of HuBERT pre-training, and how to learn discrete units or representations with more contextual semantic information can be explored in the future.

B Training Details

We use the Adam optimizer with β1 = 0.9, β2 = 0.98, and 4k warm-up updates to optimize the parameters of our model. We train with a batch size of 5k tokens. The learning rate is 7e-4 with an inverse square root schedule. Label smoothing is set to 0.1. The up-sampling rate r in DUB is set to 32, given the large volume difference between the BT data and the original data. For the MuST-C experiments, we train U2TT and T2UT models for each translation direction in a bilingual setting. For the CoVoST-2 X-En experiments, we train a multilingual X-En model covering 21 translation directions, distinguished by the language tags of the units in different languages. We implement our models on the Fairseq (Ott et al., 2019) codebase. All models are trained on 8 Nvidia Tesla V100 GPUs and take about 400k steps to converge. For inference, we save the checkpoint with the best BLEU on the validation set and average the last 10 checkpoints, using beam search with a beam size of 5 for each translation direction.
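The learning-rate schedule can be sketched as follows; the warm-up initial value is illustrative, not a reported hyper-parameter.

```python
# Inverse square root schedule with 4k warm-up steps and a 7e-4 peak
# learning rate, following the usual fairseq formulation.
import math

def inverse_sqrt_lr(step: int, peak_lr: float = 7e-4,
                    warmup: int = 4000, init_lr: float = 1e-7) -> float:
    if step < warmup:
        return init_lr + step * (peak_lr - init_lr) / warmup   # linear warm-up
    return peak_lr * math.sqrt(warmup / step)                  # inverse-sqrt decay
```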
Training details for Bi-modal BART The training of bi-modal BART follows the recipe of mBART (Liu et al., 2020).We implemented a mask rate of 0.3, with the replacement of the masked tokens by random tokens at a probability of 0.1.Additionally, the mask length was determined through sampling from a Poisson distribution, with a lambda parameter of 3.5.

D Scalability
How does model size affect the results of our method? How much improvement does raw target-language text bring? To answer these questions, we take MuST-C English-German translation as an example, set the model size to 73M, 176M, and 260M parameters respectively (the specific hyper-parameter settings are shown in Table 11), and introduce an extra 1M, 10M, and 48M German sentences.
Figure 3 shows the BLEU scores of different model sizes with different amounts of monolingual back-translation data added. In general, regardless of model size, introducing more text brings better performance. When a large amount of back-translated data is introduced, the larger model achieves significantly better performance. We also find that with little or no back-translated data, the larger model is not optimal: it is prone to overfitting when the original training data is small. As monolingual data is gradually introduced, the advantage of the larger model becomes obvious: without relying on transcriptions, introducing 48M back-translated pairs boosts the 260M-parameter model by up to 6.1 BLEU on En-De.

E Comparison With Cascaded System
It could be argued that our model employs a cascaded architecture, comprising a unit extractor and a unit-to-text translation model. The traditional cascaded ST system (ASR+MT) can also be enhanced by applying back-translation to improve its MT model.
In Table 12, we compare the performance of DUB with the BT-enhanced cascaded ST system, both utilizing 10M unpaired sentences. The BLEU score of the U2TT model is inferior to that of the cascaded system, which can likely be attributed to the higher baseline performance of the cascaded system. However, DUB demonstrates a larger relative improvement in BLEU than the cascaded system. Moreover, the discrete unit extractor is obtained through unsupervised training on unlabeled speech, requiring no transcriptions, whereas the ASR system is trained on speech-transcription pairs.

F Cases of Text-to-Speech Translation
In Table 13, we show two cases of German-English text-to-speech translation on the MuST-C En-De tst-COM set. In CASE 1, our text-to-speech translation system generates speech with the same content and a similar spectrogram as the reference speech. In CASE 2, the synthetic speech deviates slightly from the reference, but the translation is correct: "release" has the same meaning as "shoveling out", and "all the time" means "all along". Samples of the generated audio are included at https://anonymous.4open.science/r/DUB/ttss_samples.

Figure 1: Left: the model structure of our approach. The offline discrete unit extractor converts speech into discrete units; the unit-to-text translation (U2TT) model translates the discrete units into the translation, and the text-to-unit translation (T2UT) model does the opposite. Right: an illustration of the discrete unit back-translation (DUB) training procedure.

Figure 3: MuST-C En-De tst-COM BLEU scores as different amounts of target-language text are introduced, under different model sizes.

Table 1: BLEU scores on the MuST-C En-X tst-COM set. * marks the state-of-the-art system, SpeechUT (Zhang et al., 2022c), which designed various mask-predict pre-training tasks and was trained using an extra 1.4k hours of speech and parallel MT data from WMT (De 30.1, Fr 41.4, Es 33.6, Avg. 35.0). Random sampling is the decoding strategy for DUB.

Table 2: Test BLEU scores on CoVoST-2 X-En language pairs with more than 10 hours of speech. Auxiliary data refers to all data used in training excluding <speech, translation> pairs. †: results from (Wang et al., 2021b). Random sampling is the decoding strategy for DUB.

Table 5: Quality of the pseudo-units produced by different generation methods, measured by Unit Error Rate (UER, lower is better) on the MuST-C En-De dev set, and the BLEU gains they yield from 10M extra sentences.

Table 11: Hyper-parameter settings for the models in Figure 3.

Table 12: MuST-C En-De tst-COM BLEU for U2TT and the cascaded system. MT-BT refers to enhancing the MT model of the cascaded ST system through back-translation. The cascaded ST system is trained by ourselves on the MuST-C En-De training set. The same extra text corpus of 10 million German sentences is used for both methods.