Cascade versus Direct Speech Translation: Do the Differences Still Make a Difference?

Five years after the first published proofs of concept, direct approaches to speech translation (ST) are now competing with traditional cascade solutions. In light of this steady progress, can we claim that the performance gap between the two is closed? Starting from this question, we present a systematic comparison between state-of-the-art systems representative of the two paradigms. Focusing on three language directions (English-German/Italian/Spanish), we conduct automatic and manual evaluations, exploiting high-quality professional post-edits and annotations. Our multi-faceted analysis on one of the few publicly available ST benchmarks attests for the first time that: i) the gap between the two paradigms is now closed, and ii) the subtle differences observed in their behavior are not sufficient for humans either to distinguish them or to prefer one over the other.


Introduction
Speech translation (ST) is the task of automatically translating a speech signal in a given language into a text in another language. Research on ST dates back to the late eighties and its evolution followed the development of the closely related fields of speech recognition (ASR) and machine translation (MT) that, since the very beginning, provided the main pillars for building the so-called cascade architectures. With the advent of deep learning, the neural networks widely used in ASR and MT have been adapted to develop a new direct ST paradigm. This approach aims to overcome known limitations of the cascade one (e.g. architectural complexity, error propagation) with a single encoder-decoder architecture that directly translates the source signal bypassing intermediate representations.
Until now, the consolidated underlying technologies and the richness of available data have upheld the supremacy of cascade solutions in industrial applications. However, architectural simplicity, reduced information loss and error propagation are the ace up the sleeve of the direct approach, which has rapidly gained popularity within the research community in spite of the critical bottleneck represented by data paucity.
Within a few years after the first proofs of concept (Bérard et al., 2016; Weiss et al., 2017), the performance gap between the two paradigms has gradually decreased. This trend is mirrored by the findings of the International Workshop on Spoken Language Translation (IWSLT), a yearly evaluation campaign where direct systems made their first appearance in 2018. On English-German, for instance, the BLEU difference between the best cascade and direct models dropped from 7.4 points in 2018 to 1.6 points in 2019 (Niehues et al., 2019b). In 2020, participants were allowed to choose between processing a presegmented version of the test set or the one produced by their own segmentation algorithm. As reported by Ansari et al. (2020), the distance between the two paradigms further decreased to 1.0 BLEU point in the first condition and, for the first time, it was slightly in favor of the best direct model in the second condition, with a small but nonetheless meaningful 0.24 difference.
So, quoting Ansari et al. (2020), is the cascade solution still the dominant technology in ST? Has the direct approach closed the huge initial performance gap? Are there systematic differences in the outputs of the two technologies? Are they distinguishable? Answering these questions is more than running an evaluation exercise. It implies pushing research towards a deeper investigation of direct ST, finding a path towards its wider adoption in industrial settings and motivating higher engagement in data exploitation and resource creation to train the data-hungry end-to-end neural systems.
For all these reasons, while Ansari et al. (2020) were cautious in drawing firm conclusions, in this paper we delve deeper into the problem with the first thorough comparison between the two paradigms. Working on three language directions (en-de/es/it), we train state-of-the-art cascade and direct models (§3), running them on test data drawn from the MuST-C corpus (Cattoni et al., 2020).
Systems' behavior is analysed from different perspectives, by exploiting high-quality post-edits and annotations by professionals. After discussing overall systems' performance (§4), we move to more fine-grained automatic and manual analyses covering two main aspects: the relation between systems' performance and specific characteristics of the input audio (§5), and the possible differences in terms of lexical, morphological and word ordering errors (§6). We finally explore whether, due to latent characteristics overlooked by all previous investigations, the output of cascade and direct systems can be distinguished either by a human or by an automatic classifier (§7). Together with a comparative study attesting the parity of the two paradigms on our test data, another contribution of this paper is the release of the manual post-edits that made our investigation possible. The data is available at: https://ict.fbk.eu/mustc-post-edits.

Background
Cascade ST. By concatenating ASR and MT components (Stentiford and Steer, 1988; Waibel et al., 1991), cascade ST architectures represent an intuitive solution to achieve reasonable performance and high adaptability across languages and domains. At the same time, however, they suffer from well-known problems related to the concatenation of multiple systems. First, they require ad-hoc training and maintenance procedures for the ASR and MT modules; second, they suffer from error propagation and from the loss of speech information (e.g. prosody) that might be useful to improve final translations. Research has focused on mitigating error propagation by: i) feeding the MT system with ASR data structures (e.g. ASR n-best, lattices or confusion networks) which are more informative than the 1-best output (Lavie et al., 1996; Matusov et al., 2005; Bertoldi and Federico, 2005; Beck et al., 2019; Sperber et al., 2019), and ii) making the MT robust to ASR errors, for instance by training it on parallel data incorporating real or emulated ASR errors as in (Peitz et al., 2012; Ruiz et al., 2015; Sperber et al., 2017; Di Gangi et al., 2019a). Although the former solutions are effective to some extent, state-of-the-art cascade architectures (Pham et al., 2019; Bahar et al., 2020) prefer the latter, as they are simpler to implement and maintain.
Direct ST. To overcome the limitations of cascade models, Bérard et al. (2016) and Weiss et al. (2017) proposed the first direct solutions bypassing intermediate representations by means of encoder-decoder architectures based on recurrent neural networks. Currently, more effective solutions (Potapczyk and Przybysz, 2020; Bahar et al., 2020; Gaido et al., 2020) rely on ST-oriented adaptations of Transformer (Vaswani et al., 2017) integrating the encoder with: i) convolutional layers to reduce input length, and ii) penalties biasing attention to local context in the encoder self-attention layers (Povey et al., 2018; Di Gangi et al., 2019b). Though effective, these architectures have to contend with training data paucity, a critical bottleneck for neural solutions. The problem has been mainly tackled with data augmentation and knowledge transfer techniques. Data augmentation consists in producing artificial training corpora by altering existing datasets or by generating (audio, translation) pairs through speech synthesis or MT (Bahar et al., 2019b; Ko et al., 2015; Jia et al., 2019). Knowledge transfer (Gutstein et al., 2008) consists in passing (here to ST) the knowledge learnt by a neural network trained on closely related tasks (here, ASR and MT). Existing ASR models have been used for encoder pre-training (Bérard et al., 2018; Bansal et al., 2019; Bahar et al., 2019a) and multi-task learning (Weiss et al., 2017; Anastasopoulos and Chiang, 2018; Indurthi et al., 2020). Existing neural MT models have been used for decoder pre-training (Bahar et al., 2019a; Inaguma et al., 2020), joint learning (Indurthi et al., 2020; Liu et al., 2020) and knowledge distillation (Liu et al., 2019).
Previous comparisons. Most of the works on direct ST also evaluate the proposed solutions against a cascade counterpart. The conclusions, however, are discordant. Looking at recent works, Pino et al. (2019) show similar scores, Indurthi et al. (2020) report higher results for their direct model, while Inaguma et al. (2020) end up with the opposite finding. The main problems of these comparisons are that: i) not all the architectures are equally optimized, ii) for the sake of fairness in terms of training data, cascade systems are restricted to unrealistic settings with small training corpora that penalize their performance, and iii) evaluation always relies only on automatic metrics computed on single references. The IWSLT campaigns (Niehues et al., 2019a; Ansari et al., 2020) set up a shared evaluation framework where systems built on a large set of training data are optimized to achieve the best performance, independently of the underlying architecture. In the last round, direct models approached, and in one case (Potapczyk and Przybysz, 2020) outperformed, the cascade ones. However, the evaluation was run only on one language pair, by solely relying on automatic metrics and single references. In this paper, we overcome these limitations by comparing the two paradigms on three language pairs, using different metrics, multiple references (including professional post-edits) as well as fine-grained automatic and manual analysis procedures.

ST Systems
To maximize the cross-language comparability of our analyses, we built the cascade and direct ST systems for en-de/es/it with the same core technology, based on Transformer. Their good quality is attested by the comparison with the winning system at the IWSLT-20 offline ST task (Bahar et al., 2020), which consists of an ensemble of two cascade models scoring 28.8 BLEU on the en-de portion of the MuST-C Common test set. On the same data, our cascade and direct models achieve similar BLEU scores, respectively 28.9 and 29.1 (see Table 1). On en-es and en-it, identical architectures perform similarly or better (up to 32.9 BLEU on en-es). Although BLEU scores are not strictly comparable across languages, we can safely consider all our models as state-of-the-art.
For the sake of reproducibility, we provide complete details about data, architectures and training setup in Appendix A.

Evaluation Methodology
Data. Our evaluation data is drawn from the TED-based MuST-C corpus (Cattoni et al., 2020), the largest freely available multilingual corpus for ST. It covers 14 language directions, with English audio segments automatically aligned with their corresponding manual transcriptions and translations. The en-de/es/it MuST-C Common test sets contain the same 27 TED talks, for a total of around 2,500 segments largely overlapping across languages. For all the three language pairs, we selected subsets of MuST-C Common containing the same English audio portions from each talk, in order to obtain representative groups of contiguous segments that are comparable across languages. Furthermore, to ensure high data quality, we manually checked the selected samples and kept only those segments for which the audio-transcript-translation alignment was correct. Each of the three resulting test sets (henceforth PE-sets) is composed of 550 segments, corresponding to about 10,000 English source words.

Post-editing. A key element of our multi-faceted analysis is human post-editing (PE), which consists in manually correcting systems' output according to the input (the source audio in our case). In PE-based evaluation, the original output is compared against its post-edited version using distance-based metrics like TER (Snover et al., 2006). This allows for counting only the true errors made by a system, without penalising differences due to linguistic variation as happens when exploiting independent references. This makes PE-based evaluation one of the most prominent methodologies used for translation quality assessment (Snover et al., 2006, 2009; Denkowski and Lavie, 2010; Cettolo et al., 2013; Bojar et al., 2015; Graham et al., 2016; Bentivogli et al., 2018b).
To collect the post-edits for our study, we strictly followed the methodology of the IWSLT 2013-2017 evaluation campaigns (Cettolo et al., 2013), which offered us a consolidated framework and best practices to draw upon. Our cascade and direct systems were both run on the PE-sets to be post-edited. To guarantee high-quality post-edits, for each language we hired two professional translators with experience in subtitling and post-editing. Moreover, in order to cope with translators' variability (i.e. more/less aggressive editing strategies), the outputs of the two ST systems were randomly assigned ensuring that each translator worked on all the 550 segments, post-editing an equal number of outputs from both systems. The task was performed with a CAT tool that displays the manual transcript of the audio together with the ST output to be edited. However, since ST systems take as input an audio signal, we also provided translators with the audio file of each segment, asking them to post-edit strictly according to it. For each language pair, the final PE-set used in our study consists of the 550 MuST-C original audio-transcript-translation triplets plus two additional sets of reference translations, i.e. the post-edited versions of the two systems' outputs.
Analyses. The collected post-edits are exploited to assess overall systems' performance (§4) as well as to carry out deeper quantitative and qualitative analyses aimed at shedding light on possible systematic differences in systems' behavior (§5.1 and §6.1). Focusing on specific aspects of the ST problem, the inquiry is also performed by means of manual annotation of systems' outputs (§5.2, §6.2 and §7.1). Due to the linguistic nature of this task, centred on fine-grained aspects requiring a variety of skills in both evaluation and ST technology, for such analyses we relied on three researchers in translation technology (one per language pair) with a strong background in linguistics, excellent knowledge of the addressed languages (C2 or native), as well as strong expertise in systems' evaluation.

Overall Systems' Performance
We compute overall performance results both on the PE-sets and on the MuST-C Common test sets. Our primary evaluation is based on the collected post-edits. We consider two TER-based metrics: i) human-targeted TER (HTER) computed between the automatic translation and its human post-edited version, and ii) multi-reference TER (mTER) computed against the closest reference among the three available ones (two post-edits and the official reference from MuST-C). The latter metric better accounts for post-editors' variability, making the evaluation more reliable and informative. For the sake of completeness, in Table 1 we also report SacreBLEU (Post, 2018) and TER scores computed only on the official MuST-C Common references.
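To make the two metrics concrete, the following is a minimal sketch of how segment-level HTER and mTER relate to each other. It is an illustration only and makes a loud simplifying assumption: the edit count is a plain word-level Levenshtein distance, whereas actual TER also allows block shifts, so the values below are an upper-bound approximation and not the scores reported in Table 1. All function names are hypothetical.

```python
from typing import Sequence


def word_edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hyp: str, ref: str) -> float:
    """Edits divided by reference length. ASSUMPTION: block shifts are not modelled."""
    ref_tokens = ref.split()
    if not ref_tokens:
        raise ValueError("empty reference")
    return word_edit_distance(hyp.split(), ref_tokens) / len(ref_tokens)


def approx_hter(hyp: str, post_edit: str) -> float:
    """HTER: the reference is the post-edited version of this very hypothesis."""
    return approx_ter(hyp, post_edit)


def approx_mter(hyp: str, references: Sequence[str]) -> float:
    """mTER: score against the closest of the available references."""
    return min(approx_ter(hyp, ref) for ref in references)


if __name__ == "__main__":
    hyp = "das ist ein gutes Beispiel"
    post_edit = "das ist ein sehr gutes Beispiel"
    official = "dies ist ein schönes Beispiel"
    print(approx_hter(hyp, post_edit))              # 1 edit over 6 reference words
    print(approx_mter(hyp, [post_edit, official]))  # the post-edit is the closest reference
```

In this toy case the independent reference penalises legitimate lexical variation (two substitutions), while the post-edit counts only the one missing word, which is the rationale for treating the post-edit-based scores as primary.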

Table 1: Performance of (C)ascade and (D)irect systems on the PE-sets and MuST-C Common test sets. Statistically significant differences (*) are computed with Paired Bootstrap Resampling (Koehn, 2004).
A bird's-eye view of the results shows that, in more than half of the cases, performance differences between cascade and direct systems are not statistically significant. When they are, the raw count of wins for the two approaches is the same (4), attesting their substantial parity.
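The significance test named in the caption of Table 1 can be sketched as follows. This is a minimal illustration of paired bootstrap resampling on a TER-like metric, not the exact setup used for Table 1: the per-segment statistics, the number of resamples and the function names are all assumptions introduced here.

```python
import random
from typing import List, Tuple

Stat = Tuple[int, int]  # (edit count, reference length) for one segment


def corpus_ter(stats: List[Stat]) -> float:
    """Corpus-level score: total edits divided by total reference length."""
    edits = sum(e for e, _ in stats)
    length = sum(n for _, n in stats)
    return edits / length


def paired_bootstrap(stats_a: List[Stat], stats_b: List[Stat],
                     n_samples: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A has a lower score than B.

    The same segment indices are drawn for both systems, which is what makes
    the test 'paired'. n_samples and seed are illustrative defaults.
    """
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(stats_a)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        ter_a = corpus_ter([stats_a[i] for i in idx])
        ter_b = corpus_ter([stats_b[i] for i in idx])
        if ter_a < ter_b:
            wins_a += 1
    return wins_a / n_samples
```

Under the usual reading of the test, system A would be considered significantly better than B at p < 0.05 when the returned fraction is at least 0.95.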
Looking at our primary metrics (HTER and mTER), systems are on par on en-it and en-de, while for en-es the direct approach significantly outperforms the cascade one. This difference, however, does not emerge with the other metrics. Indeed, BLEU and TER scores computed against the official references are less coherent across metrics and test sets. For instance, on the en-it PE-set the cascade system significantly outperforms the direct one in terms of BLEU score, while TER shows the opposite on MuST-C Common. Interestingly, the scores obtained using independent references can also disagree with those computed with post-edits. This is the case of en-es, where significant HTER and mTER reductions attest the superiority of the direct system, while most BLEU and TER scores are still in favor of the cascade.
On the one hand, primary evaluation scores suggest that the rapidly advancing direct technology has finally caught up with the traditional cascade approach. On the other, the highlighted incongruities confirm widespread concerns about the reliability of fully automatic metrics based on independent references to properly evaluate neural systems (Way, 2018). This calls for deeper quantitative and qualitative analyses. Those presented in the next sections investigate performance differences focusing on two main aspects: the impact of specific input audio properties (§5), and the linguistic errors made by the systems (§6).

Automatic Analysis
The two ST approaches handle the input audio differently: the cascade one by means of a dedicated ASR component that produces intermediate transcripts; the direct one by extracting all the relevant information to translate in an end-to-end fashion. Is it therefore possible that some audio properties have a different impact on their results? Overall performance being equal, answering this question would help to understand if one approach is preferable over the other under specific audio conditions.
Among other possible factors (e.g. noise, recording conditions, overlapping speakers) we tried to shed light on this aspect by focusing on two common factors: audio duration and speech rate. To this aim, we grouped the sentences in the PE-set according to the sentence-wise HTER percentage difference, i.e. the difference between the cascade and direct HTER scores divided by their average.
The threshold for considering performance differences as significant was set to 10%. The resulting groups contain sentences where: i) cascade is significantly better than direct, ii) direct is significantly better than cascade, iii) the difference between the two is not significant, and iv) both systems have HTER=0. For each group, we calculated the average audio duration and the corresponding speech rate in terms of phonemes 9 per second.
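The grouping procedure described above can be sketched as follows. This is a minimal illustration: whether the 10% threshold is inclusive is not stated in the text, so the strict comparison below is an assumption, and the function and group names are hypothetical.

```python
def hter_pct_diff(hter_cascade: float, hter_direct: float) -> float:
    """Difference between cascade and direct HTER divided by their average."""
    mean = (hter_cascade + hter_direct) / 2.0
    return (hter_cascade - hter_direct) / mean


def assign_group(hter_cascade: float, hter_direct: float,
                 threshold: float = 0.10) -> str:
    """Assign a sentence to one of the four groups used in the analysis."""
    if hter_cascade == 0 and hter_direct == 0:
        return "both_zero"            # group iv: both outputs left untouched
    diff = hter_pct_diff(hter_cascade, hter_direct)
    if diff > threshold:              # ASSUMPTION: strict inequality
        return "direct_better"        # cascade HTER is clearly higher
    if diff < -threshold:
        return "cascade_better"       # direct HTER is clearly higher
    return "not_significant"


def speech_rate(n_phonemes: int, duration_sec: float) -> float:
    """Speech rate in phonemes per second for one audio segment."""
    return n_phonemes / duration_sec
```

For instance, HTER scores of 20 (cascade) and 30 (direct) give a percentage difference of -10 / 25 = -40%, which places the sentence in the group where the cascade is better.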
Results are shown in Table 2, where, for the sake of completeness, the length of the reference audio transcript is also given, together with the average HTER of the systems.
As we can see, results are coherent across languages: audio duration and speech rate averages do not differ, either when one system performs significantly better than the other or when the HTER differences are not significant. We can hence conclude that, if audio duration and speech rate have any influence on systems' performance, our analysis does not highlight specific conditions that are more favorable to one approach than to the other. Both are equally robust with respect to the audio properties here considered.

Manual Analysis
Handling the input audio differently, the two approaches have inherent strengths and weaknesses. In particular, although suffering from the well-known scarcity of sizeable training corpora, direct solutions come with the promise (Sperber and Paulik, 2020) of: i) higher robustness to error propagation, and ii) reduced loss of speech information (e.g. prosody). Our next qualitative analysis tries to delve into these aspects by looking at audio understanding and prosody issues.
9 Obtained by processing the transcripts with eSpeak (espeak.sourceforge.net).
Audio understanding. Errors due to wrong audio understanding are easy to identify for cascade systems, since they are evident in the intermediate ASR transcripts, but harder to spot for direct systems, whose internal representations are far less accessible. In this case, errors can still be identified in mistranslations corresponding to words that are phonetically similar to parts of the input audio, e.g. "nice voice" mistranslated in German as "nette Jungen" (nice boys). To spot such errors, our annotators carefully inspected the PE-set by comparing the audio, the reference transcripts and systems' output translations for both the cascade and direct models, as well as the ASR transcripts for the cascade one. Some interesting examples of the identified errors are reported in Table 3.
As shown in Table 4, audio understanding errors are quite common for both systems in all language pairs. However, both the number of errors and the number of sentences they affect are significantly lower for the direct one. We observed that this is the case especially for "more difficult" sentences, such as sentences with poor audio quality and overlapping or disfluent speech.
Though far from being conclusive (we acknowledge that, due to the "opacity" of direct models, their error counts might be slightly underestimated), this analysis seems to confirm the theoretical advantages of direct ST. This finding advocates for more thorough future investigations on neural networks' interpretability, targeting its empirical verification on larger and diverse benchmarks.
Prosody. Prosody is central to disambiguating utterances, as it reflects language elements which may not be encoded by grammar and vocabulary choices. While prosody is directly encoded by the direct system, it is lost in the unpunctuated input received by the MT component of a cascade. Besides a few interrogative sentences, our annotators were able to isolate only a handful of utterances whose prosodic markers result in different interpretations by the two models. Concerning interrogatives, both systems managed to translate them correctly in most cases (24 for cascade and 25 for direct out of 31). This is not surprising given the syntactic structure of English questions, which is explicit and does not rely solely on prosody (e.g. compared to Italian). In all other cases (examples in Table 5), the direct model's higher sensitivity to prosody seems to give it an edge on cascade in disambiguating and correctly rendering the utterance meaning. This finding, too, calls for future inquiries aimed at checking the regularity of these differences on larger datasets.

Linguistic Errors

Automatic Analysis
For this analysis, we rely on the publicly available tool used by Bentivogli et al. (2018a) to analyse what linguistic phenomena are best modeled by MT systems. The tool exploits manual post-edits and HTER-based computations to detect and classify translation errors according to three linguistic categories: lexicon, morphology and word order. Table 6 presents their distribution. As expected from the HTER scores in Table 1, results vary across language pairs. On en-it, systems show roughly the same number of errors, with a slight percentage gain (+1.1) in favor of the cascade. For the other two pairs, differences are more marked and opposite, with an overall error reduction for the direct system on en-es (-6.7) and in favor of the cascade on en-de (+6.7).
Looking at the distribution of errors across categories, while for en-es the direct system is always better and the percentage reduction is homogeneously distributed, for en-de the better performance of the cascade is concentrated in the morphology and word order categories. Since English and German are the most different languages in terms of morphology and word order, this result suggests that cascade systems still have an edge on the direct ones in their ability to handle morphology and word reordering. This is further supported by en-it: the only difference, in favor of the cascade, is indeed observed in the morphology category.

Manual Analysis
Since lexical errors represent by far the most frequent category for both approaches in all language pairs, we complement the automatic analysis with a more fine-grained manual inspection, further distinguishing among lexical errors due to missing words, extra words, or wrong lexical choice. 11 The analysis was carried out on subsets of the PE-set, created so as to be suitable for manual annotation. Namely, we removed sentences for which the output of the two systems is: i) identical, ii) judged correct by post-editors (HTER=0), or iii) too poor to be reliably annotated for errors (HTER>40%). The resulting sets contain 207 sentences for en-de, 238 for en-es, and 285 for en-it.
This analysis reveals that, for all language pairs, wrong lexical choice is the most frequent error type (∼65% of lexical errors on average) followed by missing words (∼30%), and extra words (∼5%).
While errors due to lexical choice and superfluous words vary across languages, we observe a systematic behavior with respect to missing words (words that are present in the audio but are not translated). As we can see in Table 7, direct systems lose more information from the source input than their cascade counterparts, in terms of both single words and contiguous word sequences. It is particularly interesting to notice that the issue is still evident, although to a lesser extent, also for en-es, where the direct system is significantly stronger than the cascade.
Finally, we report that a non-negligible amount of missing words (between 10% and 20%) is represented by discourse markers, i.e. words or phrases used to connect and manage what is being said (e.g. "you know", "well", "now"). Although this is a frequent phenomenon in speech, not translating discourse markers cannot be properly considered as an error, since markers i) do not carry semantic information, and ii) can be intentionally dropped in some use cases, such as in subtitling.
11 Various error taxonomies covering different levels of granularity have been developed, and the distinction between these types of lexical errors is widely adopted, including in the DQF-MQM framework (https://info.taus.net/dqf-mqm-error-typology-templ).

Classifiers' Verdict
So far, our inquiry has been entirely driven by predefined assumptions (the importance of certain audio properties) and linguistic criteria (the focus on specific error types). This top-down approach, however, might fail to disclose important differences, which were not specifically sought after when analysing the two paradigms. This consideration motivates the adoption of the complementary bottom-up approach that concludes our comparative study by answering the question: is the output of cascade and direct systems distinguishable? Understanding if and why discriminating between the two is possible would not only suggest new issues to look at. It would also highlight possible output regularities that, despite the similar overall performance, make one paradigm preferable over the other in specific application scenarios. To this aim, we set up a classification experiment, comparing the ability of humans to correctly identify the output of the two systems with the performance of an automatic text classifier.

Human Classification
After getting acquainted with systems' output through the previous manual analyses, our assessors were instructed to perform a classification task. The classification had to be performed on 10 blocks of items comprising a set of unseen English contiguous sentences (gold transcripts) from the MuST-C Common test set, and two sets of anonymized translations, one produced by the cascade and one by the direct model. For each block, the assessors had to assign each set of translations to the correct system, or label them as indistinguishable. To investigate whether more context helps in the assignment, we set up two experiments with respectively 10 and 20 contiguous sentences per block.
The results in Table 9 show that en-es and en-it systems are not distinguishable, since only a maximum of 4 blocks out of 10 were correctly classified, while most en-de blocks were correctly classified. According to the en-de assessor, this is due to the fact that the structure of the sentences generated by the direct system is very similar to that of the corresponding English sources. This characteristic stands out in German, which differs from English in terms of word order more than Italian and Spanish. This type of behavior does not necessarily imply the presence of errors but, like a fingerprint, makes the en-de direct system more recognizable by a human. Furthermore, being sub-optimal for German, this structure can cause preferential edits by the post-editors, which would be in line with the concentration of errors in the word order category observed in Table 6 (+19.6%).
Assessing the importance of context, the ability of humans to distinguish the systems does not improve when passing from 10 to 20 sentences per block. This suggests that the behavioral differences between cascade and direct systems are so subtle that, on larger samples, they mix up and balance making their fingerprints less traceable.

Automatic Classification
As a complement to the human classification experiment, we check whether an automatic tool is able to accomplish a similar task. Our classifier combines n-gram language models with the Naive Bayes algorithm, as proposed in (Peng and Schuurmans, 2003). We trained two 5-gram models, respectively using translations by the cascade and the direct systems. At classification time, given a translated text, the classifier computes the perplexity of the two models and assigns the cascade or direct label based on the model with the lowest perplexity. These experiments were also carried out on the MuST-C Common set. The classifier was tested via k-fold cross-validation, for different values of k, i.e. different sizes of text to classify.
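The decision rule described above (train one language model per system, then pick the label of the model with the lowest perplexity) can be sketched in a few lines. This is an illustration under stated assumptions, not the classifier actually used: the add-k smoothing, the whitespace tokenization and all class and function names are hypothetical, and a real experiment would rely on a dedicated language modeling toolkit with stronger smoothing.

```python
import math
from collections import Counter
from typing import Iterable, List


class NgramLM:
    """Tiny n-gram language model with add-k smoothing (ASSUMPTION, for illustration)."""

    def __init__(self, order: int = 5, k: float = 0.1):
        self.order = order
        self.k = k
        self.ngrams: Counter = Counter()
        self.contexts: Counter = Counter()
        self.vocab: set = set()

    def _tokens(self, sentence: str) -> List[str]:
        # Pad so that the first real word has a full-length context.
        return ["<s>"] * (self.order - 1) + sentence.split() + ["</s>"]

    def fit(self, sentences: Iterable[str]) -> "NgramLM":
        for sentence in sentences:
            toks = self._tokens(sentence)
            self.vocab.update(toks)
            for i in range(self.order - 1, len(toks)):
                ctx = tuple(toks[i - self.order + 1:i])
                self.ngrams[ctx + (toks[i],)] += 1
                self.contexts[ctx] += 1
        return self

    def perplexity(self, sentences: Iterable[str]) -> float:
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen words
        log_prob, n_tokens = 0.0, 0
        for sentence in sentences:
            toks = self._tokens(sentence)
            for i in range(self.order - 1, len(toks)):
                ctx = tuple(toks[i - self.order + 1:i])
                num = self.ngrams[ctx + (toks[i],)] + self.k
                den = self.contexts[ctx] + self.k * vocab_size
                log_prob += math.log(num / den)
                n_tokens += 1
        return math.exp(-log_prob / n_tokens)


def classify(text: List[str], cascade_lm: NgramLM, direct_lm: NgramLM) -> str:
    """Label a block of translated sentences with the lower-perplexity model."""
    if cascade_lm.perplexity(text) < direct_lm.perplexity(text):
        return "cascade"
    return "direct"
```

In a cross-validation setup like the one described, each fold would be held out in turn, the two models trained on the remaining translations of the respective system, and `classify` applied to the held-out block.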
As shown in Figure 1, contrary to humans, the more data the classifier receives, the higher its accuracy in discriminating between systems. Already at a size of 20 sentences, accuracy is always ∼80%. This suggests that systems have their own "language", a fluency-related fingerprint.
Table 10 shows that the cascade output exhibits higher lexical diversity for all languages, with smaller differences on en-de and en-es compared to en-it. A plausible conclusion is that the cascade produces richer output, whose variety does not necessarily result in better translations, nor is it appreciated by humans. Indeed, annotators were able to correctly distinguish the output only for en-de, where lexical diversity is similar (see §7.1).

Conclusion and Final Remarks
There is a time when the possible transition from consolidated technological frameworks to new emerging paradigms depends on answering fundamental questions about their potential, strengths and weaknesses. A time when technology developers are faced with the choice of where to direct their future investments. Five years after its appearance on the scene, the direct approach to ST confronts the community with similar questions in relation to the traditional cascade paradigm that it aims to overtake. Our investigation showed that, in spite of the known data paucity conditions still penalizing the direct approach, the two technologies now perform substantially on par. Subtle differences in their behavior exist: overall performance being equal, the cascade still seems to have an edge in terms of morphology, word ordering and lexical diversity, which is balanced by the advantages of direct models in audio understanding and in capturing prosody. However, they do not seem sufficient and consistent enough across languages to make the output of the two approaches easily distinguishable, nor to make one model preferable to the other. Back to our title, they no longer make a difference.
We are aware that the generalizability of these results depends on several factors such as the considered languages, systems and benchmarks, as well as the human workforce deployed for the inquiry. Here, with the help of professionals, we proposed multi-faceted quantitative and qualitative analyses, run on the output of state-of-the-art systems on three language pairs, though for now covering only the most-explored and data-favorable condition, which has English as source. Although our findings hold for a specific scenario, in which free data were at our disposal (and to which we contribute back by releasing high-quality post-edits), they might not be generalizable to other (e.g. difficult, distant) languages and other (e.g. highly specialized) domains. Nevertheless, we present them as a timely contribution towards answering a burning question within the ST community.

A Systems' Description
In this section we describe the ST models created for our study (see Section 3.1). The details of each training run are given below; the validation set, the MuST-C dev set, was common to all of them.
The source code for the ASR and the direct ST models is available at: https://github.com/mgaido91/FBK-fairseq-ST.
The source code for the MT component of the cascade model can be found at: https://github.com/modernmt/modernmt.

A.1 Cascade approach
The Cascade system is composed of a pipeline of automatic speech recognition (ASR) and machine translation (MT) models.
The ASR model is a slightly revisited version (Gaido et al., 2020) of the S-Transformer (Di Gangi et al., 2019b), where the two 2D self-attention layers are replaced with two Transformer encoder layers (for a total of 8 layers), while the decoder is the same (with 6 layers). Hence, the model processes the input with two 3x3 2D CNNs (having 64 filters), whose output is first projected into a higher-dimensional space and then summed with positional embeddings before being fed to the Transformer encoder layers; the Transformer encoder layers use a logarithmic distance penalty. The attention mechanism consists of 8 attention heads. The dimensionality of input and output is 512, while the inner layers have dimensionality 2048. The resulting number of parameters is 63M.
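To illustrate the logarithmic distance penalty mentioned above, the following minimal Python sketch biases attention scores against distant positions. The exact form used here (a penalty of ln|i - j| subtracted from each score before the softmax, with no penalty at distance zero) is an assumption based on the S-Transformer description, and the function names are hypothetical rather than taken from the released code.

```python
import math


def log_distance_penalty(length):
    """Build a length x length penalty matrix.

    Assumed form: 0 on the diagonal, ln|i - j| elsewhere, so that
    adjacent positions (distance 1) are not penalized either.
    """
    return [
        [math.log(abs(i - j)) if i != j else 0.0 for j in range(length)]
        for i in range(length)
    ]


def penalized_attention_weights(scores, penalties):
    """Subtract the penalties from raw attention scores, then apply a row-wise softmax."""
    weights = []
    for score_row, penalty_row in zip(scores, penalties):
        shifted = [s - p for s, p in zip(score_row, penalty_row)]
        top = max(shifted)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in shifted]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# With uniform raw scores, the penalty alone shapes the attention distribution.
uniform_scores = [[0.0, 0.0, 0.0] for _ in range(3)]
attention = penalized_attention_weights(uniform_scores, log_distance_penalty(3))
```

With uniform scores, the first position attends equally to itself and its neighbour and less to the position two steps away, which is the locality bias the penalty is meant to introduce.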
The ASR model was trained with the goal of achieving state-of-the-art performance. To this aim, we relied on two data augmentation techniques [...] the maximum value (5 × 10⁻⁴), after which it follows an inverse square root decay; dropout is set to 0.3. Minibatches consist of 3072 tokens and update frequency is set to 4; the total number of iterations is 200k; the last 10 saved checkpoints (one out of 1k iterations) are averaged. The model uses label smoothing with a uniform prior distribution (0.1) over the vocabulary; source and target languages share a BPE vocabulary of 32k sub-words.
The training data, whose statistics are reported in Table 11, are collected from the OPUS repository. For English-Italian, they resulted in almost 70M segment pairs and about 800M English words; after deduplication and the internal ModernMT cleaning, the actual training data is reduced to 45M pairs and 550M English words. For English-{German,Spanish} pairs, the OPUS data were filtered through well-known data selection methods (Axelrod et al., 2011) using a general-domain seed; the resulting training data consist of, respectively, 17M and 19M segment pairs, for 270M and 330M English words. Trainings were performed on RTX 2080 Ti GPUs; for English-Italian, it was run on 7 GPUs and lasted 3 days, while for each of the other two directions, on a single GPU, it took 6 days.
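The learning-rate schedule referred to above (a maximum value of 5 × 10⁻⁴ followed by inverse square root decay) can be sketched as follows. The warm-up phase is not specified in the text, so the linear warm-up and the `warmup_steps=4000` default are assumptions made only for illustration; the function name is hypothetical.

```python
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Learning rate at a given update step.

    Assumption: the rate grows linearly from 0 to peak_lr over warmup_steps
    updates; afterwards it decays proportionally to 1 / sqrt(step).
    """
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5


# Learning rate at a few points of a 200k-iteration training run.
schedule = {step: inverse_sqrt_lr(step) for step in (2000, 4000, 16000, 200000)}
```

Under these assumptions the rate reaches its peak at the end of warm-up and is halved once the step count is four times the warm-up length.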
The three models are then fine-tuned on MuST-C training data (∼250K pairs, 4-5M English words) by continuing the training for 4k iterations on the adaptation data, with a learning rate reduced by a factor of 5. To mitigate error propagation and make the MT system more robust to ASR errors, similarly to Di Gangi et al. (2019a), fine-tuning is run on the concatenation of human and automatic transcripts of MuST-C, both paired with manual translations.
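The construction of the fine-tuning set described above can be sketched in a few lines of Python. The dictionary keys and the function name are hypothetical; the only point illustrated is that each manual translation appears twice, once paired with the human transcript and once with the ASR output.

```python
def build_finetuning_pairs(examples):
    """Concatenate (human transcript, translation) and (ASR transcript, translation) pairs.

    Each example is a dict with the hypothetical keys
    'human_transcript', 'asr_transcript' and 'translation'.
    """
    human_pairs = [(ex["human_transcript"], ex["translation"]) for ex in examples]
    asr_pairs = [(ex["asr_transcript"], ex["translation"]) for ex in examples]
    return human_pairs + asr_pairs


examples = [
    {
        "human_transcript": "Thank you so much.",
        "asr_transcript": "thank you so march",
        "translation": "Grazie mille.",
    }
]
pairs = build_finetuning_pairs(examples)
```

Exposing the MT model to noisy source text with a clean target is what lets it learn to recover from recognition errors at inference time.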

A.2 Direct approach
Our direct model (Gaido et al., 2020) uses the same architecture as the English ASR model described in §A.1, but it has 11 Transformer encoder layers (instead of 8) and 4 Transformer decoder layers (instead of 6) for a total of 64M parameters. The ST model's encoder is initialized with the encoder of the ASR model (Bansal et al., 2019), with the missing layers initialized randomly. The ST decoder is also initialized randomly.
The training settings and the data augmentation methods employed for the direct ST model are the same as those described in Section A.1 for the ASR component of the cascade system. In addition, we performed synthetic data generation, by automatically translating the English transcripts of the ASR training corpora (Jia et al., 2019). Furthermore, we transfer knowledge from MT through knowledge distillation (Hinton et al., 2015). Knowledge distillation is performed from a teacher MT model by optimizing the KL divergence between the distributions produced by the teacher and the student ST model being trained (Liu et al., 2019). The teacher MT model is trained on the OPUS datasets (Tiedemann, 2016) and is a plain Transformer with 16 attention heads and 1024 features in encoder/decoder embeddings, resulting in 212M parameters. The direct ST model is trained in two consecutive steps. First, it is optimized using KD. Then, the resulting model is fine-tuned on label-smoothed cross entropy (Szegedy et al., 2016). The training set is composed of the same corpora used for the ASR model, more precisely: i) the ST corpora and ii) the synthetic datasets derived from the ASR corpora.
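The two training objectives named above can be illustrated with a minimal, dependency-free Python sketch for a single target position. It is not the released implementation: the function names are hypothetical, and the smoothing value of 0.1 is borrowed from the MT settings in Section A.1 as an assumption, since no value is stated for the ST model.

```python
import math


def softmax(logits):
    """Turn a list of logits into a probability distribution."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student): the quantity minimized in the distillation step."""
    return sum(
        p * (math.log(p + eps) - math.log(q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def label_smoothed_ce(student_probs, gold_index, smoothing=0.1, eps=1e-12):
    """Cross entropy against a target that mixes the gold label with a uniform prior."""
    nll = -math.log(student_probs[gold_index] + eps)
    uniform = -sum(math.log(q + eps) for q in student_probs) / len(student_probs)
    return (1.0 - smoothing) * nll + smoothing * uniform


teacher = softmax([2.0, 0.5, -1.0])   # MT teacher distribution over a toy vocabulary
student = softmax([1.0, 1.0, 0.0])    # ST student distribution for the same position
step_one_loss = kd_loss(teacher, student)
step_two_loss = label_smoothed_ce(student, gold_index=0)
```

In the first step the student is pulled towards the full teacher distribution, while in the second it is trained against the reference token only, softened by the uniform prior.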
The ST model is fed with the input utterance and a token representing the type of the target data, which can be: i) human reference translations (for the ST corpora), ii) translations generated by the MT model fed with true-case transcriptions with punctuation, or iii) translations generated by the MT model fed with lower-cased transcriptions without punctuation (the latter two for the ASR corpora). At inference time, the token "human reference" is always used to generate the translations. The token is added to the features extracted from the audio before they are passed to the encoder (Di Gangi et al., 2019c).
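One way to picture the target-type token is as a small embedding summed to every audio feature frame before encoding. The sketch below makes that reading explicit; whether the token is summed element-wise (as assumed here) or combined in another way is not stated in the text, and the type labels, embedding values and function name are all hypothetical.

```python
# Hypothetical labels for the three kinds of target data.
TARGET_TYPES = ("human_reference", "mt_cased_punct", "mt_lower_nopunct")


def add_type_token(features, type_embeddings, target_type):
    """Sum the embedding of the target type to every feature frame.

    features: list of frames, each a list of floats.
    type_embeddings: dict mapping a target type to a vector of the same size as a frame.
    """
    if target_type not in TARGET_TYPES:
        raise ValueError(f"unknown target type: {target_type}")
    embedding = type_embeddings[target_type]
    return [[x + e for x, e in zip(frame, embedding)] for frame in features]


toy_embeddings = {
    "human_reference": [0.5, -0.5],
    "mt_cased_punct": [0.1, 0.1],
    "mt_lower_nopunct": [-0.2, 0.3],
}
frames = [[1.0, 2.0], [3.0, 4.0]]
# At inference time the "human reference" type is always selected.
encoder_input = add_type_token(frames, toy_embeddings, "human_reference")
```

Training with all three types while decoding only with the human-reference one lets the model exploit synthetic data without imitating its style at test time.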
All trainings were performed on 8 K80 GPUs. The training of each direct model lasted 10 days, while the ASR and MT pre-trainings took 6 days each.
The source code implemented to build these models is based on Fairseq (Ott et al., 2019).

B Post-Editing Guidelines
In this task you are presented with (i) 550 audio segments that are recordings of portions of different English TED Talks, (ii) their transcripts, and (iii) corresponding automatic translations.
Starting from the original audio recording and its corresponding transcript (done by TED volunteer translators), you are asked to post-edit each given automatic translation by applying the minimal edits required to transform the system output into a fluent sentence with the same meaning as the audio/transcript.
While post-editing, remember the following guidelines:
• We noticed that some audio player software applications cut the beginning or the end of the audio segments. If you notice some audio-transcript out-of-sync, please try another audio player or inform us about the problem.
• The audio should be your first source of information, while transcripts are given for your convenience. It could happen that the transcript is not faithful to the spoken original: in these cases you should not consider the transcript and refer to the audio only.
• Some transcripts contain the name or initials of the speaker (typically followed by colons). Please don't add this information into the sentence you are post-editing. In general, don't include in your post-edit any text that is not present in the audio (e.g. explanation of acronyms, disambiguation of pronouns), even though this information could ease the understanding of the sentence.
• The post-edited sentence is intended as a translation of spoken language. Also, depending on the style of the source language talk, you can use the corresponding style in the target language (e.g. if the talk uses a friendly/colloquial style you can use informal words too).
• The focus is the correctness of the single sentence within the given context, not the consistency of a group of sentences. Hence, surrounding segments should be used to understand the context but not to enforce consistency on the use of terms. In particular, different but correct translations of terms across segments should not be corrected.

C Tool for Automatic Error Classification
The tool used for the automatic analysis of linguistic errors (Section 6.1) is downloadable at wit3.fbk.eu/2016-02. It is a modified version of the tercom script, which requires the lemmatized versions of both systems' outputs and post-edits.
To lemmatize the data we used the TreeTagger.
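To give an intuition of why lemmatized data are needed, the following Python sketch shows how lemmas can separate morphological from lexical mismatches between a system output and its post-edit. It is not the tool itself: it assumes tokens are already aligned one-to-one (the real script derives alignments from TER edit operations), and the category names and function names are hypothetical.

```python
from collections import Counter


def classify_token_pair(hyp, ref):
    """Classify one aligned token pair; each token is a (surface, lemma) tuple.

    Assumed rule: same surface form means a match, same lemma with a different
    surface form is a morphological error, a different lemma is a lexical error.
    """
    if hyp[0] == ref[0]:
        return "match"
    if hyp[1] == ref[1]:
        return "morphology"
    return "lexical"


def count_errors(hyp_tokens, ref_tokens):
    """Count categories over two equally long, already aligned token sequences."""
    return Counter(classify_token_pair(h, r) for h, r in zip(hyp_tokens, ref_tokens))


system_output = [("i", "il"), ("gatto", "gatto"), ("corre", "correre")]
post_edit = [("i", "il"), ("gatti", "gatto"), ("dormono", "dormire")]
error_counts = count_errors(system_output, post_edit)
```

In this toy example "gatto" versus "gatti" counts as a morphological error because the lemma is shared, whereas "corre" versus "dormono" counts as a lexical one.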