SegAugment: Maximizing the Utility of Speech Translation Data with Segmentation-based Augmentations

End-to-end Speech Translation is hindered by a lack of available data resources. While most of them are based on documents, only a single, static sentence-level version is typically available, potentially limiting the usefulness of the data. We propose a new data augmentation strategy, SegAugment, to address this issue by generating multiple alternative sentence-level versions of a dataset. Our method utilizes an Audio Segmentation system, which re-segments the speech of each document with different length constraints, after which we obtain the target text via alignment methods. Experiments demonstrate consistent gains across eight language pairs in MuST-C, with an average increase of 2.5 BLEU points, and up to 5 BLEU for low-resource scenarios in mTEDx. Furthermore, when combined with a strong system, SegAugment establishes new state-of-the-art results in MuST-C. Finally, we show that the proposed method can also successfully augment sentence-level datasets, and that it enables Speech Translation models to close the gap between the manual and automatic segmentation at inference time.


Introduction
The conventional approach for Speech Translation (ST) involves cascading two separate systems: an Automatic Speech Recognition (ASR) model followed by a Machine Translation (MT) model. However, recent advances in deep learning (Vaswani et al., 2017), coupled with an increased availability of ST corpora (Di Gangi et al., 2019a; Wang et al., 2020a), have enabled the use of end-to-end models (Weiss et al., 2017). Although end-to-end models can address several shortcomings of cascaded models, such as slow inference times, error propagation, and information loss, they are limited by a data bottleneck (Sperber and Paulik, 2020). This bottleneck arises from the inability of end-to-end models to directly leverage data from the more resourceful tasks of ASR and MT, which restricts them from consistently matching the performance of cascaded models (Bentivogli et al., 2021; Anastasopoulos et al., 2021, 2022).
The majority of ST corpora are based on document-level speech data, such as MuST-C (Di Gangi et al., 2019a) and mTEDx (Salesky et al., 2021), which are derived from TED talks with durations of 10 to 20 minutes. These document-based data are processed into shorter, sentence-level examples through a process called manual segmentation, which relies on grammatical features in the text. Still, this sentence-level version is single and static, potentially limiting the utility of the already scarce ST datasets.
To address this limitation, we propose SEGAUGMENT, a segmentation-based data augmentation method that generates multiple alternative sentence-level versions of document-level speech data (Fig. 1). SEGAUGMENT employs SHAS (Tsiamas et al., 2022b), an Audio Segmentation method that we tune to yield different re-segmentations of a speech document based on duration constraints. For each new segmentation of a document, the corresponding transcript is retrieved via CTC-based forced alignment (Kürzinger et al., 2020), and the target text is obtained with an MT model.
Our contributions are as follows:
• We present SEGAUGMENT, a novel data augmentation method for Speech Translation.
• We demonstrate its effectiveness across eight language pairs in MuST-C, with average gains of 2.5 BLEU points, and on three low-resource pairs in mTEDx, with gains of up to 5 BLEU.
• When utilized with a strong baseline that combines WAV2VEC 2.0 (Baevski et al., 2020) and MBART50 (Tang et al., 2020), it obtains state-of-the-art results in MuST-C.
• We also show its applicability to data not based on documents, providing an increase of 1.9 BLEU in CoVoST2 (Wang et al., 2021).
• SEGAUGMENT also enables ST models to close the gap between the manual and automatic test set segmentations at inference time.
• Finally, along with our code, we open-source all the synthetic data created with the proposed method.

Relevant Research
SpecAugment (Park et al., 2019), which directly modifies the speech features by warping or masking them, is a standard approach for data augmentation in speech tasks, including ST (Bahar et al., 2019; Di Gangi et al., 2019b). WavAugment (Kharitonov et al., 2021) is a similar technique that modifies the speech wave by introducing effects such as pitch, tempo, and echo (Gállego et al., 2021). Instead of altering the speech input, our proposed method generates more synthetic data by altering the points of segmentation, and is thus complementary to techniques such as SpecAugment.
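For illustration, a minimal sketch of such feature masking is shown below; it is not the implementation from the cited works, and the number and widths of the masks are arbitrary assumptions:

    import numpy as np

    def spec_augment(features, num_freq_masks=2, freq_width=27,
                     num_time_masks=2, time_width=100, rng=None):
        """Zero out random frequency and time bands of a (time, freq) feature matrix."""
        rng = rng or np.random.default_rng()
        feats = features.copy()
        T, F = feats.shape
        for _ in range(num_freq_masks):
            w = int(rng.integers(0, freq_width + 1))
            f0 = int(rng.integers(0, max(F - w, 1)))
            feats[:, f0:f0 + w] = 0.0        # frequency mask
        for _ in range(num_time_masks):
            w = int(rng.integers(0, min(time_width, T) + 1))
            t0 = int(rng.integers(0, max(T - w, 1)))
            feats[t0:t0 + w, :] = 0.0        # time mask
        return feats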
An effective way to address data scarcity in ST is to generate synthetic data from external sources. This can be achieved by using an MT model to translate the transcript of an ASR dataset, or a Text-to-Speech (TTS) model to generate speech for the source text of an MT dataset (Jia et al., 2019; Pino et al., 2019; McCarthy et al., 2020). In contrast, SEGAUGMENT generates synthetic data internally, without relying on external datasets.
Previous research has established the benefits of generating synthetic examples by cropping or merging the original ones, with sub-sequence sampling for ASR (Nguyen et al., 2020), and concatenation for MT (Nguyen et al., 2021; Wu et al., 2021; Kondo et al., 2021), as well as for ASR and ST (Lam et al., 2022a). Our approach, however, segments documents at arbitrary points, thus providing access to a greater number of synthetic examples. An alternative approach by Lam et al. (2022b) involves recombining training data in a linguistically-motivated way, by sampling pivot tokens, retrieving possible continuations from a suffix memory, combining them to obtain new speech-transcription pairs, and finally using an MT model to generate the translations. Our method is similar in that it also leverages audio alignments and MT, but instead of mixing speech, it segments at alternative points.
Context-aware ST models have been shown to be robust towards error-prone automatic segmentations of the test set at inference time (Zhang et al., 2021a). Our method bears similarities with Gaido et al. (2020b) and Papi et al. (2021) in that it re-segments the train set to create synthetic data. However, unlike their approach, where the transcript is split at random words, we use a specialized Audio Segmentation method (Tsiamas et al., 2022b) to directly split the audio into segments resembling proper sentences. Furthermore, instead of using word alignment algorithms to obtain the target text (Dyer et al., 2013), we learn the alignment with an MT model. We thus create high-quality data that can be generally useful, and not only for error-prone test set segmentations. Finally, recent work has demonstrated that context-aware ST models evaluated on fixed-length automatic segmentations can be competitive with the manual segmentation (Amrhein and Haddow, 2022). Here, we find that utilizing data from SEGAUGMENT yields high translation quality for ST models evaluated on automatic segmentations, even surpassing the translation quality of the manual segmentation, and without explicitly making the models context-aware.

ST Corpora and Manual Segmentation
A document-level speech translation corpus D (Di Gangi et al., 2019a; Salesky et al., 2021) consists of n triplets that represent the speech wave X, the transcription Z, and the translation Y of each document:

D = {(X_i, Z_i, Y_i)}_{i=1}^n    (1)

In order for the data to be useful for traditional sentence-level ST, the document-level corpus D is processed into a sentence-level corpus S, with m = Σ_{i=1}^n m_i examples:

S = {(x_i, z_i, y_i)}_{i=1}^n
where x_i = (x_{i,1}, ..., x_{i,m_i}) are the sentence-level speech waves for the i-th document, z_i = (z_{i,1}, ..., z_{i,m_i}) are the sentence-level transcriptions, and y_i = (y_{i,1}, ..., y_{i,m_i}) are the sentence-level translations. Usually, S is obtained by a process of manual segmentation (Fig. 2), where the document-level transcription and translation are split on strong punctuation, and then aligned with cross-lingual sentence alignment (Braune and Fraser, 2010). Finally, the corresponding sentence-level speech waves are obtained by audio-text alignment (McAuliffe et al., 2017). Since speech is continuous, instead of defining the sentence-level speech wave x_{i,j}, it is common to define the start and end points s_{i,j}, e_{i,j} ∈ R that correspond to the document speech X_i. Thus, S can be re-defined as:

S = {(X_i, b_i, z_i, y_i)}_{i=1}^n

where b_i = (b_{i,1}, ..., b_{i,m_i}) is the segmentation for the i-th speech wave, and b_{i,j} = (s_{i,j}, e_{i,j}) is the tuple of segment boundaries for the j-th segment, for which x_{i,j} = X_i[s_{i,j} : e_{i,j}]. Note that oftentimes there is a gap between consecutive segments (e_{i,j}, s_{i,j+1}) due to silent periods.
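For concreteness, a sentence-level view of a document can be stored as segment boundaries over the document audio rather than as separate waveforms; the sketch below uses hypothetical field names that mirror the notation above.

    import numpy as np
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class SentenceLevelDoc:
        """Sentence-level view of one document; boundaries index into the document audio."""
        audio: np.ndarray                       # document speech wave X_i
        boundaries: List[Tuple[float, float]]   # b_i = ((s_{i,1}, e_{i,1}), ...), in seconds
        transcripts: List[str]                  # z_i
        translations: List[str]                 # y_i

        def segment_audio(self, j: int, sample_rate: int = 16000) -> np.ndarray:
            # x_{i,j} = X_i[s_{i,j} : e_{i,j}]
            s, e = self.boundaries[j]
            return self.audio[int(s * sample_rate):int(e * sample_rate)]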

Audio Segmentation and SHAS
In end-to-end ST, Audio Segmentation methods aim to find a segmentation b′ for a speech document X′, without making use of its transcription. They are crucial in real-world scenarios, where a test set segmentation is not available and an automatic one has to be inferred, as simulated by recent IWSLT evaluations (Anastasopoulos et al., 2021, 2022). They usually rely on acoustic features (pause-based), length criteria (length-based), or a combination of both (hybrid). One such hybrid approach is SHAS (Tsiamas et al., 2022b,a), which uses a supervised classification model C and a hybrid segmentation algorithm A. The classifier C is a Transformer encoder (Vaswani et al., 2017) with a frozen WAV2VEC 2.0 (Baevski et al., 2020; Babu et al., 2021) as a backbone. It is trained on the speech documents and segment boundaries of a manually-segmented speech corpus, S_SGM = {(X_i, b_i)}_{i=1}^n, by predicting whether an audio frame belongs to any of the manually-segmented examples. At inference time, a sequence of binary probabilities p′ is obtained by applying the classifier C on X′ (eq. 4):

p′ = C(X′)    (4)

Next, parameterized with thr to control the classification threshold, and ℓ = (min, max) to control the length of the resulting segments, A produces the automatic segmentation b′ according to p′ (eq. 5):

b′ = A(p′; thr, ℓ)    (5)

There are two possible choices for A. The Divide-and-Conquer (PDAC) approach progressively splits the audio at the point κ of lowest probability p′_κ > thr, until all resulting segments are within ℓ (Tsiamas et al., 2022b; Potapczyk and Przybysz, 2020). Alternatively, the Streaming (PSTRM) approach takes streams of length max and splits them at a point κ with p′_κ > thr that satisfies the length bounds ℓ, or uses the whole stream if no such point exists (Tsiamas et al., 2022b; Gaido et al., 2021).
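The following is a simplified sketch of the Divide-and-Conquer idea: it recursively splits a region at its lowest-probability frame until all segments fit the maximum length, while respecting the minimum length. The real PDAC additionally handles the threshold thr and trimming, and the 50 Hz frame rate is an assumption.

    import numpy as np

    def pdac(probs, min_len, max_len, frame_hz=50):
        """Recursively split the document into segments of at most max_len seconds,
        splitting at the frame with the lowest 'inside-segment' probability."""
        min_f, max_f = int(min_len * frame_hz), int(max_len * frame_hz)
        segments = []

        def split(start, end):
            if end - start <= max_f:
                segments.append((start / frame_hz, end / frame_hz))
                return
            lo, hi = start + min_f, end - min_f   # keep both halves above min_len
            if lo >= hi:                          # cannot satisfy the minimum length
                segments.append((start / frame_hz, end / frame_hz))
                return
            k = lo + int(np.argmin(probs[lo:hi]))
            split(start, k)
            split(k, end)

        split(0, len(probs))
        return segments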
4 Proposed Methodology

The proposed data augmentation method SEGAUGMENT (Fig. 3) aims to increase the utility of the training data S by generating synthetic sentence-level corpora Ŝ_ℓ, which are based on alternative segmentations of the speech documents in D (eq. 1). Whereas the manual segmentation (Fig. 2) relies on grammatical features in the text, here we propose to split on acoustic features present in the audio, by utilizing SHAS (§3.2). For the i-th speech document X_i, SEGAUGMENT creates alternative segmentation boundaries b̂_i with SHAS (§4.1), obtains the corresponding transcriptions ẑ_i via CTC-based forced alignment (§4.2), and finally generates the translations ŷ_i with an MT model (§4.3). By repeating this process with different parameterizations ℓ of the segmentation algorithm A, multiple synthetic sentence-level speech corpora can be generated (§4.4). A synthetic speech corpus Ŝ_ℓ with k = Σ_{i=1}^n k_i examples can be defined as:

Ŝ_ℓ = {(X_i, b̂_i, ẑ_i, ŷ_i)}_{i=1}^n

where X_i is the original speech for the i-th document (eq. 1), b̂_i = (b̂_{i,1}, ..., b̂_{i,k_i}) are its alternative segmentations, ẑ_i = (ẑ_{i,1}, ..., ẑ_{i,k_i}) are its sentence-level transcriptions, and ŷ_i = (ŷ_{i,1}, ..., ŷ_{i,k_i}) are its synthetic sentence-level translations.
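Putting the three stages together, one pass of the pipeline for a single configuration ℓ could look as follows. This is only a sketch that reuses the pdac function from §3.2; align_fn and translate_fn are hypothetical stand-ins for the forced alignment of §4.2 and the MT model of §4.3.

    def segaugment_one_config(frame_probs, align_fn, translate_fn, min_len, max_len):
        """Create one synthetic sentence-level version of a document.
        frame_probs: cached SHAS classifier probabilities for the document."""
        boundaries = pdac(frame_probs, min_len, max_len)        # alternative segmentation (§4.1)
        transcripts = align_fn(boundaries)                      # CTC forced alignment (§4.2)
        translations = [translate_fn(z) for z in transcripts]   # MT text alignment (§4.3)
        return list(zip(boundaries, transcripts, translations))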
In total, three different models are utilized for creating a synthetic corpus: a classifier C (§3.2) for segmentation, a CTC encoder E (Graves et al., 2006) for forced alignment, and an MT model M for text alignment. We can use pre-trained models, or optionally learn them from the manually segmented examples of S (Fig. 4). The classifier C can be learned from S_SGM = {(X_i, b_i)}_{i=1}^n, and the model M from S_MT = {(z_i, y_i)}_{i=1}^n. Next, we describe the proposed method in detail.

Segmentation
We follow the process described for SHAS (§3.2) and obtain the alternative segmentations b̂_i for each X_i in the training corpus, by doing inference with C and applying the segmentation algorithm A. In contrast to its original use, we use arbitrary values for min to have more control over the length ranges of the segments, and we prioritize the classification threshold requirements thr over the segment length requirements ℓ in the constrained optimization procedure of A, to ensure good data quality.

Audio Alignment
To create the transcriptions ẑ_i of the i-th document for the segments b̂_i (§4.1), we use CTC-based forced alignment (Kürzinger et al., 2020). We first do inference on the sentence-level speech of the manual segmentation x_i = (x_{i,1}, ..., x_{i,m_i}) with a CTC encoder E, thus obtaining character-level probabilities u_i for the whole audio. We apply a text cleaning process to the transcriptions z_i, which includes spelling out numbers, removing unvoiced text such as events and speaker names, removing all remaining characters that are not included in the vocabulary, and finally upper-casing. The forced alignment algorithm uses the probabilities u_i and the cleaned text z_i to find the character segmentation, along with the starting and ending timestamps of each entry. Next, we merge characters into words using the special token for the word boundaries, and reverse the cleaning step to recover the original text that corresponds to each segment. For each example j, we obtain the source text ẑ_{i,j} by joining the corresponding words that are within the segment boundary b̂_{i,j}, and apply a post-editing step to fix the casing and punctuation (Alg. 1).
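As a sketch of the last step, assuming the forced alignment has produced word-level timestamps, the transcript of each new segment can be assembled by collecting the words that fall inside its boundaries (function and variable names are hypothetical):

    def transcripts_from_words(word_times, boundaries):
        """word_times: list of (word, start_sec, end_sec) from CTC forced alignment.
        boundaries: list of (start_sec, end_sec) segment boundaries from §4.1.
        Returns one (still uncased, unpunctuated) transcript per segment."""
        transcripts = []
        for seg_start, seg_end in boundaries:
            words = [w for w, s, e in word_times if s >= seg_start and e <= seg_end]
            transcripts.append(" ".join(words))
        return transcripts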

Text Alignment
Unlike in the case of manual segmentation (Fig. 2), cross-lingual sentence alignment (Braune and Fraser, 2010) is not applicable, and additionally, word alignment tools (Dyer et al., 2013) yielded sub-optimal results. Thus, we learn the alignment with an MT model M, which is trained on the manually segmented sentence-level data S_MT. The training data is modified by concatenating examples to reflect the length of the examples that will be translated, thus learning a model M_ℓ from S_MT_ℓ, where ℓ are the length parameters used in SHAS. To accurately learn the training set alignment, we use very little regularization, practically overfitting the training data S_MT_ℓ (§A.7). Since there are no sentence-level references available for the synthetic data, we monitor the document-level BLEU (Papineni et al., 2002) on a small number of training documents, and only end the training when it stops increasing. Finally, we obtain the synthetic sentence-level translations ŷ_i with the trained M_ℓ.
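A minimal sketch of how consecutive sentence pairs of a document could be concatenated so that the MT training examples resemble the lengths that will later be translated; the word-based length budget is an arbitrary assumption, not the criterion used in the paper.

    def build_concatenated_mt_data(sentence_pairs, max_src_words=60):
        """sentence_pairs: list of (transcript, translation) pairs of one document, in order.
        Greedily merge consecutive pairs until the source side reaches the length budget."""
        examples, src, tgt = [], [], []
        for z, y in sentence_pairs:
            if src and len(" ".join(src + [z]).split()) > max_src_words:
                examples.append((" ".join(src), " ".join(tgt)))
                src, tgt = [], []
            src.append(z)
            tgt.append(y)
        if src:
            examples.append((" ".join(src), " ".join(tgt)))
        return examples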

Multiple Sentence-level Versions
The parameters ℓ = (min, max) of the segmentation algorithm A allow us to have fine-grained control over the length of the produced segments. Different, non-overlapping tuples of ℓ result in different segmentations, providing access to multiple synthetic sentence-level versions of each document. Moreover, the additional cost of creating more than one synthetic corpus is relatively low, as the results of the classification with C and the forced alignment can be cached and reused (Fig. 3).
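In code, generating several versions amounts to looping over length configurations while reusing the cached classifier output (and, in practice, the cached word-level alignments). The sketch below reuses segaugment_one_config from §4; the configuration values are placeholders, not the ones used in the paper.

    LENGTH_CONFIGS = {"short": (3, 10), "medium": (10, 20), "long": (20, 30)}  # seconds, hypothetical

    def segaugment_all_configs(classifier, document_audio, align_fn, translate_fns):
        """Create one synthetic corpus per length configuration for a single document."""
        frame_probs = classifier(document_audio)   # computed once, cached and reused
        corpora = {}
        for name, (min_len, max_len) in LENGTH_CONFIGS.items():
            corpora[name] = segaugment_one_config(
                frame_probs, align_fn, translate_fns[name], min_len, max_len)
        return corpora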
For Speech Translation, we train Speech-to-Text Transformer baselines (Wang et al., 2020b). Unless stated otherwise, we use the small architecture (s2t_transformer_s) with 12 encoder layers, 6 decoder layers, and a dimensionality of 256, with ASR pre-training using only the original data. The full details of the models and the training procedures are available in §A.1.1. For inference, we average the 10 best checkpoints on the validation set and generate with a beam of 5. We evaluate with BLEU (Papineni et al., 2002) and chrF2 (Popović, 2017) using SACREBLEU (Post, 2018), and perform statistical significance testing using paired bootstrap resampling (Koehn, 2004) to ensure valid comparisons. To evaluate on an automatic segmentation (§6.5), the hypotheses are aligned to the references of the manual segmentation with MWERSEGMENTER (Matusov et al., 2005).
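Checkpoint averaging, as mentioned above, can be implemented as a simple parameter-wise mean over saved state dicts; a PyTorch sketch is shown below (it assumes, as in fairseq, that the weights are stored under a "model" key, and the file paths are placeholders).

    import torch

    def average_checkpoints(paths):
        """Average the parameters of several saved checkpoints."""
        avg = None
        for p in paths:
            state = torch.load(p, map_location="cpu")["model"]
            if avg is None:
                avg = {k: v.clone().float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    avg[k] += v.float()
        return {k: v / len(paths) for k, v in avg.items()}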

Main Results in MuST-C
We compare ST models trained with and without SEGAUGMENT on the eight language pairs of MuST-C v1.0, and include results from Wang et al. (2020b), which use the same model architecture. In Table 2, we observe that models leveraging SEGAUGMENT achieve significant and consistent improvements in all language pairs, thus confirming that the proposed method allows us to better utilize the available ST data. More specifically, the improvements range from 1.5 to 3.1 BLEU, with an average gain of 2.4 points. We also investigate the application of SEGAUGMENT during ASR pre-training, which brings further gains in four language pairs, but the average improvement is only marginal over just using the original ASR data.

Results with SOTA methods
Here we study the impact of the proposed method when combined with a strong ST model. We use a model with 24 encoder layers, 12 decoder layers, and a dimensionality of 1024, whose encoder is initialized from WAV2VEC 2.0 (Baevski et al., 2020) and whose decoder is initialized from MBART50 (Tang et al., 2020) (§A.1.2). We fine-tune this model end-to-end for 10 epochs, and average the 10 best checkpoints according to validation BLEU; results are presented in Table 3.

Low-Resource Scenarios
We explore the application of SEGAUGMENT in the low-resource and non-English speech settings of mTEDx. In Table 4, we present results of the baseline with and without SEGAUGMENT for Es-En, Pt-En, and the extremely low-resource pair of Es-Fr (6 hours). We furthermore provide the BLEU scores from Salesky et al. (2021), which use the extra-small model configuration (10M parameters). We use the extra-small configuration for Es-Fr, while the others use the small one (31M parameters). SEGAUGMENT provides significant improvements in all pairs, with even larger ones when it is also utilized during the ASR pre-training, improving the BLEU scores by 2.6 to 5 points. Our results here concerning ASR pre-training are more conclusive than in MuST-C (§6.1), possibly due to the better ASR models obtained with SEGAUGMENT (§A.3).

Application on Sentence-level Data
We consider the application of the method to CoVoST, whose data do not originate from documents. We treat the sentence-level data as "documents" and apply SEGAUGMENT as before. Due to the relatively short duration of the examples, we only apply SEGAUGMENT with the short and medium configurations. In Table 5, we provide our results for En-De, with and without SEGAUGMENT, a bilingual baseline from Wang et al. (2020a), and the recently proposed Sample-Translate-Recombine (STR) augmentation method (Lam et al., 2022b), which uses the same model architecture as our experiments. Although designed for document-level data, SEGAUGMENT brings significant improvements to the baseline, even outperforming the STR augmentation method by 0.5 BLEU points.

Automatic Segmentations of the test set
Unlike most research settings, in real-world scenarios the manual segmentation is typically not available, and ST models must rely on automatic segmentation methods. However, evaluating on an automatic segmentation is considered sub-optimal, decreasing BLEU scores by 5-10% (Tsiamas et al., 2022b; Gaido et al., 2021) as compared to evaluating on the manual (gold) segmentation.
Since the synthetic data of SEGAUGMENT originate from an automatic segmentation, we expect them to be useful in bridging the training-inference segmentation mismatch (Papi et al., 2021). We evaluate our baselines with and without SEGAUGMENT on MuST-C tst-COMMON, on both the manual segmentation provided with the dataset and an automatic one obtained by SHAS. In Table 6, we present results with SHAS-long, which we found to be the best; extended results can be found in §A.4. For the purpose of this experiment, we also train another ST model with SEGAUGMENT, where we prepend a special token to each translation, indicating the dataset origin of the example. When generating with such a model, we prompt it with the special token that corresponds to the segmentation of the test set. The results of Table 6 show that the baseline experiences a drop of 1.6 BLEU points (or 6%) on average when evaluated on the automatic segmentation, confirming previous research (Tsiamas et al., 2022b). Applying SEGAUGMENT validates our hypothesis, since the average increase of 3.5 BLEU (23.4 → 26.9) observed on the automatic segmentation is larger than the increase of 2.4 BLEU on the manual one (25.0 → 27.4). Finally, using SEGAUGMENT with special tokens enables ST models to reach an average score of 27.3 BLEU points, closing the gap with the manual segmentation (27.4), while being better in three language pairs. To the best of our knowledge, this is the first time that ST models can match (or surpass) the performance of the manual segmentation, demonstrating the usefulness of the proposed method in real-world scenarios. Our results also raise an interesting question: whether we should continue to consider the manual segmentation as an upper bound of performance for our ST models. (In Table 6, the second column under Manual and SHAS-long is the best score among the three models, and bold marks the best score overall for each language pair.)
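A sketch of the special-token variant: each training target is prefixed with a token indicating whether the example comes from the manual or an automatic segmentation, and at inference the decoder is prompted with the token matching the test-set segmentation (the token strings are hypothetical).

    MANUAL_TOKEN, AUTO_TOKEN = "<manual>", "<auto>"   # hypothetical token names

    def tag_target(translation: str, is_synthetic: bool) -> str:
        """Prepend a dataset-origin token to a target-side training example."""
        return f"{AUTO_TOKEN if is_synthetic else MANUAL_TOKEN} {translation}"

    # At inference, decoding a SHAS-segmented test set would be prompted with AUTO_TOKEN,
    # and decoding the manual segmentation with MANUAL_TOKEN.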

ST without ASR pre-training
Next, we investigate the importance of the ASR pre-training phase, a standard practice (Wang et al., 2020b) that is usually also costly. In Table 7, we present the results of ST models on MuST-C En-Es and En-Fr trained with and without SEGAUGMENT, when skipping the ASR pre-training. We also include the results of the Revisit-ST system proposed by Zhang et al. (2022a). We find that models with SEGAUGMENT are competitive even without ASR pre-training, surpassing both the baseline with pre-training and the Revisit-ST system. In general, ASR pre-training could be skipped in favor of using SEGAUGMENT, but including both is the best choice.

Training Costs
In this section, we discuss the computational costs involved with SEGAUGMENT during ST model training. We analyze the performance of models with and without SEGAUGMENT from Table 2, at different training steps on MuST-C En-De dev. In Figure 5, we observe that models with our proposed method not only converge to a better BLEU score, but also consistently surpass the baseline during training. Thus, although utilizing the synthetic data from SEGAUGMENT naturally results in longer training times, it is still better even when we constrain the available resources.

Analysis
Here we discuss four potential reasons behind the effectiveness of SEGAUGMENT.

Contextual diversity.
The synthetic examples are based on alternative segmentations and are thus presented within different contextual windows, as compared to the original ones (§A.8). We speculate that this helps the model generalize, since phrases and sub-words are seen with more or less context, which may or may not be essential for their translation. Adding additional context that is irrelevant was previously found to be beneficial in low-resource MT, by providing negative context to the attention layers (Nguyen et al., 2021).
Positional diversity. With SEGAUGMENT, speech and text units are presented at many more distinct absolute positions in the speech or target text sequences (§A.11). This is important due to the absolute positional embeddings in the Transformer, which are prone to overfitting (Sinha et al., 2022).
We hypothesize that the synthetic data create a diversification effect on the position of each unit, which can be seen as a form of regularization, especially relevant for rare units. This is also supported for the simpler case of example concatenation (Nguyen et al., 2021), while in our case the diversification effect is richer due to the arbitrary document segmentation.

Length Specialization. Synthetic datasets created by SEGAUGMENT supply an abundance of examples of extremely long and short lengths, which are relatively infrequent in the original data. This creates a specialization effect, enabling ST models trained on the synthetic data to better translate sequences of extreme lengths in the test set (§A.9).

Knowledge Distillation. As the translations of the synthetic data are generated by MT models, there is an effect similar to that of Knowledge Distillation (KD) (Liu et al., 2019; Gaido et al., 2020a).
To quantify this effect, we re-translate the train set of MuST-C En-De four times, with the same MT models employed in SEGAUGMENT. Subsequently, an ST model is trained with the original and re-translated data, referred to as in-data-KD, as the MT models did not leverage any external knowledge. In Table 8, we compare in-data-KD with SEGAUGMENT, and find that although in-data-KD provides an increase over the baseline, it exhibits a significant difference of 1 BLEU point with SEGAUGMENT. Our findings confirm the existence of the KD effect, but suggest that SEGAUGMENT is more general, as it not only formulates different targets, but also diverse inputs (through re-segmentation), thereby amplifying the positive effects of the source-side contextual and positional diversity. In contrast, KD only provides diversity on the target side.

Conclusions
We introduced SEGAUGMENT, a novel data augmentation method that generates synthetic data based on alternative audio segmentations. Through extensive experimentation across multiple datasets, language pairs, and data conditions, we demonstrated the effectiveness of our method in consistently improving translation quality by 1.5 to 5 BLEU points, and in reaching state-of-the-art results when utilized by a strong ST model. Our method was also able to completely close the gap between the automatic and manual segmentations of the test set. Finally, we analyzed the reasons that contribute to our method's improved performance. Future work will investigate the extension of our method to ST for spoken-only languages and to Speech-to-Speech translation, by bypassing the transcription stage.

Limitations
General Applicability. The proposed method requires three steps: Audio Segmentation, CTC-based forced alignment, and Machine Translation.
For Audio Segmentation we used SHAS (Tsiamas et al., 2022b), which requires a classifier that is trained on manually segmented data. Although we demonstrated the method's applicability in CoVoST En-De, which does include a manual segmentation, we used an English classifier that was trained on MuST-C En-De. Therefore, we cannot be certain of the method's effectiveness without manually segmented data for the source language. A possible alternative would be to use a classifier trained on a different source language, since Tsiamas et al. (2022b) showed that SHAS has very high zero-shot capabilities, provided the zero-shot language was also included in the pre-training set of XLS-R (Babu et al., 2021), which serves as a backbone to the classifier. Additionally, we tested our method on several language pairs, including an extremely low-resource one, Spanish-French (Es-Fr) in mTEDx, with only 4,000 training examples. Although we showed an improvement of 50% in that particular language pair, the two languages involved, Spanish and French, are not by any means considered low-resource. Thus, we cannot be sure about the applicability of the method to truly low-resource languages, such as many African and Native American languages. Furthermore, the current version of the method does not support non-written languages, since the target text is obtained by training an MT model that translates the transcription of each audio segment.
Biases. The synthetic data are heavily based on the original data, which may result in inheriting any biases present in them. We did not observe any signs of this effect during our research, but we did not conduct a proper investigation to assess the degree to which the synthetic data are biased in any way.
Computational and Memory costs. The synthetic data have to be created offline, with a pipeline that involves three different models, resulting in increased computational costs. To reduce these costs, we used pre-trained models for the Audio Segmentation and the CTC encoders, and cached the inference results to be re-used. Thus, the computational cost of creating the synthetic datasets for a given language pair involves a single inference pass with the classifier and the CTC encoder, and multiple training/inference phases with MT models. This process can take around 24-36 hours to create four new synthetic datasets for a pair in MuST-C, using a single GPU. We acknowledge the computational costs but believe the results justify them. The process could be made much lighter by using non-parametric algorithms in the three steps instead of supervised models, which can be investigated in future work. Finally, despite the computational costs, there is a very small memory cost involved, since each synthetic dataset is basically a txt file containing the new target text and a yaml file containing the new segmentation, only requiring 100-200MB of storage.
A.1.1 Speech Translation models

The encoders of the ST models are pre-trained on the task of ASR using the same architecture. The only differences are the vocabulary size, which is 5,000, and the learning rate, which is 0.001. For the models trained without ASR pre-training (§7) we also used a learning rate of 0.001. For MuST-C, we pre-train using the ASR data of En-De; for mTEDx, we use the Es-Es and Pt-Pt data, respectively; and for CoVoST, the ASR data of En-De.
For mTEDx Es-Fr we use an extra-small architecture (s2t_transformer_xs). It has 6 encoder layers and 3 decoder layers, a dimensionality of 256, a feed-forward dimension of 1024, 4 heads in the multi-head attention, and a vocabulary size of 3,000, for a total of 10M parameters. The learning rate is set to 0.001 (warm-up of 500), the batch size to 180 thousand tokens, and the dropout to 0.2. We also share the weights of the embedding layer and the output projection in the decoder. The same model is pre-trained on ASR, but with a dropout of 0.1. All other hyperparameters are the same as for the small models described before.
All our experiments were run on a cluster with 8 NVIDIA GeForce RTX 2080 Ti GPUs. The running time of each experiment on a single GPU ranged from 12 to 36 hours.

A.1.2 w2v-mBART models
For the experiments of §6.2, we use a strong baseline utilizing pre-trained models and a length adaptor (Tsiamas et al., 2022a). The encoder is composed of a 7-layer convolutional feature extractor and a 24-layer Transformer encoder, while the decoder has 12 layers and a vocabulary of size 250k, with 770M parameters in total. All the layers have an embedding dimensionality of 1024, a feed-forward dimensionality of 4098, GELU activations (Hendrycks and Gimpel, 2016), 16 attention heads, and a pre-layer normalization configuration (Xiong et al., 2020b). A strided 1-d convolutional layer sub-samples the output of the encoder by a factor of 2. The encoder is initialized from WAV2VEC 2.0 (Baevski et al., 2020), which is pre-trained with 60k hours of non-transcribed speech from Libri-Light (Kahn et al., 2020), and fine-tuned for ASR with 960 hours of labeled data from Librispeech (Panayotov et al., 2015). The decoder is initialized from MBART50 (Tang et al., 2020), which is fine-tuned for En-Xx multilingual machine translation. We fine-tune all the parameters of the model, apart from the feature extractor of the encoder and the embedding layer in the decoder. The inputs to the model are raw waveforms sampled at 16kHz, which are normalized to zero mean and unit variance. We train with AdamW using a base learning rate of 0.0005, with a warm-up for 2,000 steps and an inverse square root scheduler. In the encoder we use 0.1 activation dropout, time masking with a probability of 0.2, and channel masking with a probability of 0.1 (Baevski et al., 2020). In the decoder we use a dropout of 0.3 and an attention dropout of 0.1 (Tang et al., 2020). All other dropouts are not active. The loss function is a standard cross-entropy with label smoothing of 0.2. We use gradient accumulation to have an effective batch size of 32M tokens, evaluate every 250 steps, and stop the training when the performance on the validation set does not improve for 20 evaluations. We average the 10 best checkpoints according to the validation BLEU, and generate with a beam search of 5.
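As an illustration of the length adaptor described above, a strided 1-D convolution that halves the encoder output length can be written as follows (a sketch; the kernel size and padding are assumptions):

    import torch.nn as nn

    class LengthAdaptor(nn.Module):
        """Strided 1-D convolution that sub-samples the encoder output by a factor of 2."""
        def __init__(self, dim: int = 1024, kernel_size: int = 3):
            super().__init__()
            self.conv = nn.Conv1d(dim, dim, kernel_size, stride=2, padding=kernel_size // 2)

        def forward(self, x):            # x: (batch, time, dim)
            x = x.transpose(1, 2)        # -> (batch, dim, time)
            x = self.conv(x)             # time dimension reduced by ~2x
            return x.transpose(1, 2)     # -> (batch, time/2, dim)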

A.1.3 Machine Translation models
For the MT models used for text alignment in SEGAUGMENT (§4.3), we used medium-sized Transformers, with 6 encoder and 6 decoder layers, a dimensionality of 512, a feed-forward dimension of 2048, and 8 heads in the multi-head attention. We train with AdamW using a learning rate of 0.002, with a warm-up for the first 2,500 updates, and a batch size of 14 thousand tokens. The vocabulary size is 8,000, and for regularization we use a small dropout of 0.1, applied only at the outputs of each layer, and label smoothing of 0.1. We stop training when the document-level BLEU does not increase for 20 epochs, average the 10 best checkpoints according to the same metric, and then generate with a beam of size 8.
For the MT models used in the experiments of §A.6, we use a small architecture with 6 encoder and 6 decoder layers, a dimensionality of 256, a feed-forward dimension of 1024, and 4 heads. We share the weights of the embedding and output projection in the decoder, and use a dropout of 0.1 (applied to attention, activation, and output). We stop training when the validation loss does not decrease for 10 epochs, average the 10 best checkpoints according to validation BLEU, and generate with a beam of size 5.

A.8 Contextual Windows of SEGAUGMENT Data
We can categorize the new context of the synthetic data of SEGAUGMENT into four types, depending on the type of overlap between the segmentation boundaries b for each document: (1) isolated, when a synthetic segment is a subset of an original one; (2) expanded, when it is a superset of an original one; (3) mixed, when it partially overlaps with an original one; and (4) equal, when it coincides exactly with an original one.
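A small sketch of how a synthetic segment could be assigned to one of these four categories, given the original boundaries of the same document (the tolerance value is arbitrary):

    def overlap_type(new_seg, orig_segs, eps=1e-3):
        """Classify a synthetic (start, end) segment against the original boundaries."""
        s, e = new_seg
        for os_, oe in orig_segs:
            if abs(s - os_) < eps and abs(e - oe) < eps:
                return "equal"      # same boundaries as an original segment
            if s >= os_ - eps and e <= oe + eps:
                return "isolated"   # subset of an original segment
            if s <= os_ + eps and e >= oe - eps:
                return "expanded"   # superset of an original segment
        return "mixed"              # partial overlap with one or more originals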

A.11 Sub-word Positional Frequency
In Figure 7 we present examples of semi-rare sub-words in the target languages of the MuST-C v2.0 En-De training set. We count their frequency in terms of the absolute position in the examples in which they appear. We can observe that when using SEGAUGMENT, the positional frequency of these sub-words is much more diverse, covering more of the possible positions in the target sequence. We hypothesize that this effect could be regularizing the model, aiding the generalization of these sub-words across different positions (§6.8).

Figure 2: Manual Segmentation of an example i from a document-level corpus D into m_i examples of a sentence-level corpus S.

Figure 3: The SEGAUGMENT methodology. Given the i-th document-level example of D, with m_i sentence-level examples (S), it creates k_i synthetic sentence-level examples by alternative segmentations with SHAS. Changing the segmentation parameters ℓ results in several different synthetic corpora Ŝ_ℓ.

Table 2: BLEU(↑) scores on MuST-C v1.0 tst-COMMON. In bold is the best score. All results with SEGAUGMENT and ASR SEGAUGMENT are statistically different from the Baseline with p < 0.001. All models use the same architecture, and the results of Fairseq ST are from Wang et al. (2020b).

Table 3: BLEU(↑) / chrF2(↑) scores on MuST-C v1.0 tst-COMMON. In bold is the best score. Results with SEGAUGMENT are statistically different from the w2v-mBART baseline model with p < 0.001, apart from the BLEU score for En-Es (33.7), which is with p < 0.005.

Table 5: BLEU(↑) / chrF2(↑) scores on the CoVoST2 test set. In bold is the best score. All results with SEGAUGMENT and ASR SEGAUGMENT are statistically different from the Baseline with p < 0.001.

Table 7: BLEU(↑) scores on MuST-C tst-COMMON. In bold is the best score among the models without ASR pre-training (✗). Results of Revisit-ST are from Zhang et al. (2022a). #p stands for the number of parameters.

Table 12: WER scores for ASR models and BLEU scores for MT models on MuST-C v2.0 En-De tst-COMMON.

A.7 Results of MT models used in SEGAUGMENT

In Table 13 we present results for the MT models trained to generate the new translations for the alternative data of SEGAUGMENT for MuST-C v2.0 En-De. For each parameterization ℓ = (min, max) of the segmentation algorithm A, we train a specialized model M_ℓ, as described in §4.3. We evaluate on the original training and development sets from S, as well as on the synthetic training set (Ŝ_ℓ), which essentially indicates the quality of the target text in the synthetic data. We present both sentence- and document-level BLEU scores for the original data, and only document-level scores for the synthetic ones (since no sentence-level references are available). We notice that the MT models obtain very high scores on the original train set, as compared to the development set, indicating that the models have indeed overfitted. In any other setting this would be undesirable, but here we willingly overfit the model, since our goal is to learn the training set text alignment rather than to obtain a good, generalizable MT model. By looking at the document-level BLEU on the synthetic training set, we can confirm that the MT models have indeed accurately learned the alignments and have thus generated high-quality translations that can be utilized during ST training.

Table 15: Number of examples in sentence-level versions for the manual and SEGAUGMENT processes.

Unlike the manual segmentation, the segments produced by SEGAUGMENT are split on acoustic rather than grammatical features. Due to this, they might not always resemble proper sentences, but at least they are diverse enough to be useful during training.

Table 16: Average duration (in seconds) of examples in sentence-level versions for the manual and SEGAUGMENT processes.