End-to-End Single-Channel Speaker-Turn Aware Conversational Speech Translation

Conventional speech-to-text translation (ST) systems are trained on single-speaker utterances, and they may not generalize to real-life scenarios where the audio contains conversations between multiple speakers. In this paper, we tackle single-channel multi-speaker conversational ST with an end-to-end and multi-task training model, named Speaker-Turn Aware Conversational Speech Translation, that combines automatic speech recognition, speech translation and speaker turn detection using special tokens in a serialized labeling format. We run experiments on the Fisher-CALLHOME corpus, which we adapted by merging the two single-speaker channels into one multi-speaker channel, thus representing the more realistic and challenging scenario with multi-speaker turns and cross-talk. Experimental results across single- and multi-speaker conditions and against conventional ST systems show that our model outperforms the reference systems on the multi-speaker condition, while attaining comparable performance on the single-speaker condition. We release scripts for data processing and model training at https://github.com/amazon-science/stac-speech-translation.


Introduction
Speech translation (ST) has seen wide adoption in commercial products and the research community (Anastasopoulos et al., 2021, 2022) due to its effectiveness in bridging language barriers. ST aims to translate audio in a source language into text in a target language. Over the last few decades, this problem was tackled by a cascaded approach that pipelines Automatic Speech Recognition (ASR) and Machine Translation (MT) (Waibel et al., 1991; Vidal, 1997; Casacuberta et al., 2008, inter alia). However, end-to-end speech translation (E2E-ST) systems (Berard et al., 2016; Weiss et al., 2017, inter alia) have recently gained increasing interest and popularity thanks to their simple architecture, reduced error propagation (Etchegoyhen et al., 2022), efficient training process, and competitive performance (Inaguma et al., 2019). Despite significant recent advances in E2E-ST (Gheini et al., 2023; Wang et al., 2023), most ST systems to date have focused on translating isolated speech utterances from monologue speech (Di Gangi et al., 2019), read speech (Kocabiyikoglu et al., 2018), or prompted speech (Wang et al., 2021). Being trained on single-turn utterances, these systems may lack the ability to handle real-life scenarios in which multiple speakers converse, and sometimes overlap, in the same audio channel (Post et al., 2013).
In this work, we tackle the more challenging task of multi-speaker conversational ST. We refer to it as multi-turn & multi-speaker (MT-MS), as opposed to single-turn, which most ST systems implicitly assume. This is illustrated in Figure 1, where a "conversation" between two speakers recorded on separate channels (top) becomes more difficult to translate if the channels are merged (bottom), due to the introduction of speaker-turns and cross-talks. In particular, ST with cross-talks and speaker-turns is difficult because the speech content of different sentences is mixed up or switched. While MT-MS speech has been studied in ASR (Raj et al., 2022), to the best of our knowledge, this is the first paper that investigates it in end-to-end ST. We tackle MT-MS ST with an approach we name Speaker-Turn Aware Conversational Speech Translation (STAC-ST). STAC-ST is a multi-task training framework that combines ASR, ST and speaker-turn detection using special tokens in a serialized labeling format. It is inspired by a recent speech foundation model, Whisper (Radford et al., 2023), which jointly trains ASR, X-to-English ST, voice activity detection, and language identification on 680k hours of speech data using labeling-based multi-task learning. Our contributions are as follows:
1. We introduce the task of multi-turn & multi-speaker ST, including cross-talks and speaker-turns, which expands the realm of ST beyond single-speaker utterances.
2. We propose an end-to-end model (STAC-ST) which achieves state-of-the-art BLEU scores on Fisher-CALLHOME, a corpus that allows us to target MT-MS, without degradation on single-turn ST.
3. We explore a zero-shot scenario where MT-MS ST data is not available for training. We show that STAC-ST improves ST by up to 8 BLEU by leveraging MT-MS ASR targets, mitigating the need for parallel data, which is scarce within the community.
4. Besides serializing transcripts and translations at cross-talks, the STAC-ST model is also shown to learn the task of time-aligned speaker change detection.
5. We conduct extensive ablation studies on important aspects of STAC-ST, including joint modeling of ASR & ST, the impact of model size (up to 300M parameters) and data size, and the integration of task tokens. We thus shed light on best practices for building conversational MT-MS ST systems.

Related Work
Joint ST & ASR Modeling Recent works in ST have leveraged ASR training data to improve translation quality. In principle, joint ASR and ST modeling (Gheini et al., 2023; Soky et al., 2022) requires 3-way parallel data for each training example, i.e., audio, transcript, and translation, as can be found, in limited amounts, in the CoVoST (Wang et al., 2020, 2021) and MuST-C (Di Gangi et al., 2019) corpora. Prior work proposed to overcome the 3-way parallel data bottleneck by pseudo-labeling ST data (Gheini et al., 2023), or by pre-training an ASR model (van den Oord et al., 2018) on large multilingual data (Bapna et al., 2022; Zhang et al., 2023b) before training the joint ASR & ST model (Babu et al., 2022). Recently, the Whisper model (Radford et al., 2023) introduced an effective annotation format for jointly training ASR & ST with independent targets.
In this work, we report results on the Fisher-CALLHOME corpus (Post et al., 2013) which, similarly to the MSLT corpus (Federmann and Lewis, 2016), offers the opportunity to run contrastive experiments on single-speaker ST versus MT-MS ST, both without reference segmentation.
Speaker-Turn and Cross-Talk in ASR Speaker-turns and cross-talks have been explored in the ASR field, commonly termed multi-talker ASR. Kanda et al. (2020) proposed a serialized output training (SOT) strategy for multi-speaker overlapped speech recognition with special tokens. At inference time, word and speaker tags are output in a serialized manner for an unlimited number of speakers. SOT was later ported to the streaming scenario (Kanda et al., 2022). However, SOT may produce frequent speaker changes, which can degrade the overall performance. Thus, Liang et al. (2023) proposed to explicitly incorporate boundary knowledge with a separate block for the speaker change detection task and a boundary constraint loss. Multi-talker ASR has also been explored in the non-streaming (Huang et al., 2023) and streaming (Raj et al., 2022) setups. Multi-turn ASR has been explored in automatic dubbing (Virkar et al., 2023) of scripted content, a challenging case due to the high number of speakers and short segments (Brannon et al., 2023), but improvements have come from aligning (Thompson and Koehn, 2019, 2020) automatic transcripts with available production scripts. Another branch of research targets cross-talk & multi-talker ASR (Yang et al., 2023) using speech separation of long-form conversational speech (Paturi et al., 2022), but these techniques have difficulty handling a variable number of speakers and are not optimized end-to-end for ASR improvements. However, how to effectively deal with multi-speaker conversational ST has been neglected.
Speaker-Turn Aware Conversational Speech Translation (STAC-ST)

This section describes our end-to-end multi-task learning model for multi-turn & multi-speaker conversational ST.

System Diagram
Figure 2 illustrates the proposed STAC-ST multi-task learning framework for MT-MS ST. The model is an encoder-decoder Transformer architecture inspired by Vaswani et al. (2017). The multi-task training format using special tokens (§3.2) was inspired by Whisper (Radford et al., 2023), while the integration of the Connectionist Temporal Classification (CTC) loss (§3.3) was inspired by Watanabe et al. (2017).
STAC-ST has a standard front-end module. First, frame-level 80-dimensional filterbank features are extracted from the audio every 40 ms. Second, we apply SpecAugment (Park et al., 2019) to the input audio features, an effective data augmentation technique that masks out certain regions of the input filterbank features. Then, the augmented audio features are passed to a 2-layer CNN that outputs a 5120-dim vector (flattened 2D→1D output tensor from the CNN layer). Finally, this vector feeds a linear layer that generates the input to the encoder model. The decoder takes the encoder outputs and generates a sequence of text. Formally, for each speech segment, the filterbank features can be represented as $X = \{x_t \in \mathbb{R}^{F}\}_{t=1}^{T}$ and the reference transcription or translation as $Y = \{y_n \in \mathcal{V}\}_{n=1}^{N}$, where $F$ is the feature dimension, $T$ is the number of speech frames, $N$ is the number of text tokens, and $\mathcal{V}$ is the vocabulary. During training of STAC-ST, we concatenate independent datasets $D_{ASR} = (X, Y_{ASR})$ and $D_{ST} = (X, Y_{ST})$ for ASR and ST, respectively. Samples in training mini-batches are jointly drawn from $D_{ASR}$ and $D_{ST}$.
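To make the shape flow concrete, the following PyTorch sketch mirrors the front-end described above. The channel counts, kernel sizes, and strides are our assumptions (the paper only specifies a 2-layer CNN whose flattened per-frame output is 5120-dimensional, followed by a linear projection), and we keep the time axis unsubsampled; the actual implementation may differ.

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Minimal sketch of the STAC-ST front-end: a 2-layer CNN over 80-dim
    filterbank frames, flattened and projected to the encoder dimension.
    Channel counts and strides are assumptions chosen so that the flattened
    per-frame vector is 5120-dim, as stated in the paper."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        # Two Conv2d blocks; stride 2 on the frequency axis gives
        # 80 -> 40 -> 20 mel bins, so 256 channels * 20 bins = 5120.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 128, kernel_size=3, stride=(1, 2), padding=1),
            nn.GELU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=(1, 2), padding=1),
            nn.GELU(),
        )
        self.proj = nn.Linear(256 * (n_mels // 4), d_model)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, n_mels) filterbank features
        x = self.cnn(feats.unsqueeze(1))      # (B, C, T, F')
        x = x.permute(0, 2, 1, 3).flatten(2)  # (B, T, C * F') = (B, T, 5120)
        return self.proj(x)                    # (B, T, d_model), fed to the encoder

frontend = ConvFrontEnd()
print(frontend(torch.randn(2, 100, 80)).shape)  # torch.Size([2, 100, 256])
```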

Serialized Labeling Based on Task Tokens
A key component of the model is the serialized multi-task labeling framework based on special tokens. As shown in Figure 2, besides the text tokens, special tokens are used to specify the task. There are four types of task tokens, i.e., [SL] (source language), [TL] (target language), [TURN] (speaker-turn), and [XT] (cross-talk).
The first two tokens are language tokens that define the task: ST when the source and target language tokens differ, and ASR when they are identical. At inference time, both language tokens are preset to specify the desired task.
[TURN] and [XT] specify the auxiliary tasks of detecting speaker-turn changes and cross-talks, which are critical for MT-MS speech processing and more aligned to acoustic tasks. Note that cross-talks always occur during speaker-turn changes, so [XT] always follows [TURN].
We concatenate transcripts or translations sequentially, inserting [TURN] and [XT] tokens when needed. If utterances u_t and u_t+1 overlap in time, we append the targets of utterance u_t+1 after utterance u_t. The order of utterances is determined by their start time. A demonstration of such serialization is shown below:

WORD1 ... [TURN] [XT] WORD2 WORD3 ...
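Below is a minimal sketch of how such serialized targets could be built from time-aligned, speaker-labeled utterances. The utterance tuple format and the helper name are hypothetical; this is an illustrative reconstruction rather than the released preprocessing scripts.

```python
from typing import List, Tuple

# Each utterance: (start_sec, end_sec, speaker, text). The text field is
# either a transcript (ASR target) or a translation (ST target).
Utterance = Tuple[float, float, str, str]

def serialize_targets(utterances: List[Utterance]) -> str:
    """Order utterances by start time, insert [TURN] at every speaker change,
    and add [XT] when consecutive utterances overlap in time (cross-talk)."""
    utts = sorted(utterances, key=lambda u: u[0])
    pieces = [utts[0][3]]
    for prev, curr in zip(utts, utts[1:]):
        if curr[2] != prev[2]:          # speaker change
            pieces.append("[TURN]")
            if curr[0] < prev[1]:       # next utterance starts before the previous ends
                pieces.append("[XT]")   # cross-talk always follows a speaker turn
        pieces.append(curr[3])
    return " ".join(pieces)

# Example: the second utterance overlaps the first, the third does not.
print(serialize_targets([
    (0.0, 3.1, "A", "hola como estas"),
    (2.6, 4.0, "B", "bien y tu"),
    (4.5, 6.0, "A", "todo bien"),
]))
# -> "hola como estas [TURN] [XT] bien y tu [TURN] todo bien"
```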

Joint CTC and NLL Loss
STAC-ST jointly models ASR and ST by balancing CTC (Graves et al., 2006) and Negative Log-Likelihood (NLL) losses (Chan et al., 2016), according to:

$\mathcal{L} = \lambda \, \mathcal{L}_{CTC} + (1 - \lambda) \, \mathcal{L}_{NLL}$

$\mathcal{L}_{CTC}$ and $\mathcal{L}_{NLL}$ are computed by appending linear layers with dimension $|\mathcal{V}|$ on top of the encoder and decoder, respectively. Figure 2 shows the proposed joint CTC/NLL loss training scheme (Watanabe et al., 2017). In practice, the CTC loss models a probabilistic distribution by marginalizing over all possible mappings between the input (audio features, sampled at 40 ms) and the output sequence (transcription or translation). We refer readers to the original implementation by Graves et al. (2006) for more details. Moreover, the CTC loss has been shown to aid ST by helping to stabilize encoder representations at early stages of training, i.e., allowing the decoder to learn soft alignment patterns faster (Yan et al., 2023). Note that we do not include the language tokens [SL] and [TL] in the $\mathcal{L}_{CTC}$ computation because they do not correspond to acoustic features. Following previous work (Zhang et al., 2022, 2023a), we set the weight λ of the CTC loss to 0.3.
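A compact PyTorch sketch of this joint objective is shown below. The tensor shapes, padding conventions, and the use of torch.nn.functional.ctc_loss/cross_entropy are our assumptions; in practice the CTC and NLL targets differ slightly (e.g., language tokens are excluded from the CTC targets).

```python
import torch
import torch.nn.functional as F

def joint_loss(enc_logits, enc_lens, dec_logits, targets, target_lens,
               blank_id=0, lam=0.3):
    """Sketch of L = lam * L_CTC + (1 - lam) * L_NLL, with lam = 0.3 as in the paper.
    enc_logits: (T, B, V) frame-level logits from the encoder head.
    dec_logits: (B, N, V) token-level logits from the decoder head.
    targets:    (B, N) reference token ids (must not contain blank_id);
                padded positions are assumed to be -100 for the NLL term."""
    ctc = F.ctc_loss(enc_logits.log_softmax(-1), targets,
                     enc_lens, target_lens, blank=blank_id, zero_infinity=True)
    nll = F.cross_entropy(dec_logits.transpose(1, 2), targets,
                          ignore_index=-100)
    return lam * ctc + (1.0 - lam) * nll
```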

Experimental Setup
This section introduces the datasets and metrics we used for evaluation, as well as architecture and training details of STAC-ST.

Conversational Multi-Turn & Multi-Speaker ST
We use the Fisher and CALLHOME corpora, which respectively comprise 186 hr and 20 hr of audio and transcripts of telephone conversations in Spanish (LDC2010S01, LDC2010T04, LDC96S35, LDC96T17). The Spanish-to-English translations are available from Post et al. (2013). We refer to them as Fisher-CALLHOME and summarize the data statistics in Table 1. This corpus is well suited for MT-MS ST, as it contains a significant amount of labeled data and non-segmented (audio) long conversations between speakers. We merged Fisher and CALLHOME for training and up-sampled the audio to 16 kHz.
Segmentation. Each conversation on Fisher-CALLHOME occurred between two speakers with multiple turns over two channels (one speaker per channel). For MT-MS ST experiments, we merge the two channels into one, which creates natural speaker changes and cross-talks as illustrated in Figure 1. Human annotations in Fisher-CALLHOME provide time-aligned audio utterances, transcripts and translations, and have been used to segment each channel into single-turn utterances in prior work (e.g., Inaguma et al., 2019). Figure 3 plots the distributions of segment duration in the corpus. We observe that the majority of single-turn segments are less than 5 seconds long.
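A minimal sketch of this channel-merging and up-sampling step is shown below, assuming torchaudio for I/O and resampling; averaging the two channels is our assumption for the downmix.

```python
import torchaudio
import torchaudio.functional as AF

def merge_and_resample(path: str, target_sr: int = 16000):
    """Merge the two single-speaker telephone channels into one multi-speaker
    channel and up-sample to 16 kHz. Averaging the channels is an assumption;
    any downmix that preserves both speakers would do."""
    wav, sr = torchaudio.load(path)        # (channels, samples), telephone audio
    mono = wav.mean(dim=0, keepdim=True)   # cross-talk now lives in a single channel
    return AF.resample(mono, orig_freq=sr, new_freq=target_sr), target_sr
```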
To build models with manageable size and computation, following Radford et al. (2023), we segment the merged-channel conversations into chunks of up to 30 seconds. For this step, we first used an off-the-shelf VAD-based segmentation tool, SHAS (Tsiamas et al., 2022), but we realized that the resulting duration histogram is almost uniform and far from the natural segmentation. Hence, we decided to rely on the manual time annotations as follows. Starting from the first utterance start, we find the farthest utterance end such that end − start is at most 30 seconds. We extract the audio within this span as one segment and repeat this procedure until the last utterance end is reached. Note that one segment may stretch over multiple utterance starts and ends, so it may include silences, noise, speaker changes and cross-talks. We use this as the primary MT-MS segmentation strategy for both training and test data throughout the paper unless otherwise stated. More discussion can be found in Section 5.3.1.
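The greedy chunking procedure described above can be sketched as follows; handling of utterances longer than 30 seconds and the exact bookkeeping of overlaps are simplified.

```python
def mtms_segments(utterances, max_len=30.0):
    """Greedy MT-MS segmentation: starting at the first utterance start,
    take the farthest utterance end such that end - start <= max_len,
    emit that span as one segment, and repeat from the next utterance."""
    utts = sorted(utterances, key=lambda u: u[0])  # (start_sec, end_sec, ...)
    segments, i = [], 0
    while i < len(utts):
        start = utts[i][0]
        j = i
        # extend while the next utterance still ends within max_len of start
        while j + 1 < len(utts) and utts[j + 1][1] - start <= max_len:
            j += 1
        segments.append((start, utts[j][1]))
        i = j + 1
    return segments

print(mtms_segments([(0.0, 4.2), (3.9, 9.8), (12.0, 19.5), (20.0, 33.0)]))
# -> [(0.0, 19.5), (20.0, 33.0)]
```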

Additional ASR & ST Corpora
Fisher-CALLHOME has a limited training data size, so we explore additional corpora to improve our model and to evaluate its generalization ability. We also use the official CoVoST 2 and Common Voice (CV) corpora, both of which are composed of single-turn pre-segmented utterances. To generate data consistent with our MT-MS segmentation, we randomly concatenate audio utterances and yield segments of up to 30 seconds. Note that these synthetic MT-MS segments contain no silences and cross-talks, but still have speaker-turn changes (labeled by [TURN]).
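The following sketch illustrates how such synthetic MT-MS segments could be assembled from a pool of pre-segmented utterances; the back-to-back concatenation without silence follows the description above, while the pool format and everything else are assumptions.

```python
import random
import torch

def make_synthetic_segment(pool, max_len=30.0, sample_rate=16000):
    """Randomly drawn single-turn utterances (CV / CoVoST 2) are concatenated
    back-to-back, with no silence and no cross-talk, until the 30-second budget
    is reached; every boundary becomes a [TURN]. `pool` is assumed to be a list
    of (waveform_tensor, target_text) pairs."""
    random.shuffle(pool)
    waves, texts, total = [], [], 0.0
    for wav, text in pool:
        dur = wav.shape[-1] / sample_rate
        if waves and total + dur > max_len:
            break
        waves.append(wav)
        texts.append(text)
        total += dur
    return torch.cat(waves, dim=-1), " [TURN] ".join(texts)

pool = [(torch.zeros(1, 16000 * 12), f"utt {i}") for i in range(5)]  # 12 s dummy clips
wav, target = make_synthetic_segment(pool)
print(wav.shape, "|", target)  # two 12 s utterances fit within the 30 s budget
```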

Evaluation Metrics
We report case-insensitive BLEU using SacreBLEU (Post, 2018) for translation and Word Error Rate (WER) for ASR. Note that we (1) remove all special task tokens before computing each metric and (2) evaluate on the MT-MS segmentation unless otherwise stated.
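As a minimal illustration of this scoring protocol (the token stripping follows the description above; the toy hypothesis and reference strings are hypothetical, and the lowercasing stands in for case-insensitive BLEU):

```python
import re
import sacrebleu

TASK_TOKENS = re.compile(r"\[(?:SL|TL|TURN|XT)\]", flags=re.IGNORECASE)

def strip_task_tokens(text: str) -> str:
    """Remove [SL]/[TL]/[TURN]/[XT] before scoring, as described above."""
    return " ".join(TASK_TOKENS.sub(" ", text).split())

# Hypothetical hypothesis/reference pair, just for illustration.
hyps = [strip_task_tokens("hello [TURN] [XT] how are you")]
refs = [[strip_task_tokens("hello [TURN] how are you doing")]]

bleu = sacrebleu.corpus_bleu([h.lower() for h in hyps],
                             [[r.lower() for r in stream] for stream in refs])
print(round(bleu.score, 1))
```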

Hyper-Parameters
We experiment with three model sizes, S(mall), M(edium), and L(arge), with increasing model dimension (256, 512, 1024), number of encoder layers (12, 14, 16), and number of attention heads (4, 8, 16), with the same number of decoder layers (6) and an FFN dimension set to 4x the model dimension. Their numbers of parameters are 21M, 86M, and 298M, respectively. We use the S-size model by default and scale up to larger sizes when out-of-domain training data are added. We apply BPE sub-words (Sennrich et al., 2016) to both translations and transcripts with 5K operations. We create a joint BPE model for the language pair, or across all languages when we add the CV+CoVoST 2 corpora (only §5.3.2 and §5.3.3).
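A sketch of how such a joint BPE tokenizer could be trained is shown below, assuming SentencePiece in BPE mode; the paper only states 5K BPE operations, so the exact tooling, vocabulary size, and file paths here are assumptions.

```python
import sentencepiece as spm

# Train a joint BPE model over Spanish transcripts and English translations.
# vocab_size only approximates "5K operations" (merges plus base characters),
# and the input paths are hypothetical.
spm.SentencePieceTrainer.train(
    input="train.es-transcripts.txt,train.en-translations.txt",
    model_prefix="stac_bpe",
    model_type="bpe",
    vocab_size=5000,
    character_coverage=1.0,
    user_defined_symbols=["[SL]", "[TL]", "[TURN]", "[XT]"],  # keep task tokens atomic
)

sp = spm.SentencePieceProcessor(model_file="stac_bpe.model")
print(sp.encode("hola [TURN] [XT] como estas", out_type=str))
```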
We train the S-size models for 100k steps and the M- and L-size models for 200k steps. We use the AdamW (Kingma and Ba, 2015) optimizer with a peak learning rate of 5e-3 for the S model and 1e-3 for the M and L models. The learning rate scheduler has warmup and cooldown phases, each taking 10% of the total training steps (Zhai et al., 2022). We set dropout (Srivastava et al., 2014) to 0.1 for the attention and hidden layers, and use GELU (Gaussian Error Linear Units) as the activation function (Hendrycks and Gimpel, 2016). We use gradient norm clipping (Pascanu et al., 2013) and SpecAugment (Park et al., 2019) for data augmentation. The training configuration and architecture are based on a LibriSpeech recipe for Transformer-based ASR from the SpeechBrain toolkit (Ravanelli et al., 2021).
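The warmup/cooldown schedule can be sketched as follows; the linear ramps and the constant plateau between the two phases are our assumptions, as the paper only states that each phase takes 10% of the total steps.

```python
def lr_at_step(step, total_steps, peak_lr=5e-3, warmup_frac=0.1, cooldown_frac=0.1):
    """Warmup/cooldown learning-rate schedule with 10% warmup and 10% cooldown
    of the total steps. Linear ramps and a constant peak in between are assumed."""
    warmup = int(total_steps * warmup_frac)
    cooldown = int(total_steps * cooldown_frac)
    if step < warmup:                                 # linear warmup to the peak
        return peak_lr * step / max(1, warmup)
    if step > total_steps - cooldown:                 # linear cooldown to zero
        return peak_lr * (total_steps - step) / max(1, cooldown)
    return peak_lr                                    # constant in between

# Example for the S-size model trained for 100k steps.
for s in (0, 5_000, 50_000, 95_000, 100_000):
    print(s, round(lr_at_step(s, 100_000), 5))
```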

Results
Our experimental results document three properties of the STAC-ST model: (1) robustness to the MT-MS ST condition with no degradation in the single-turn ST condition; (2) ability to leverage speaker-turn and cross-talk information, which translates into improved WER and BLEU scores; (3) ability to perform time-aligned speaker change detection.

Multi-Task Learning
We explored various training data configurations for multi-task learning (see Table 2). Row-0 in Table 2 represents how a conventional ST system (i.e., trained on only single-turn ST data) performs under the challenging multi-turn multi-speaker scenario. The other systems in Table 2 yield insights into how to boost performance by augmenting the training data with auxiliary tasks.
Joint training of single-turn and multi-turn tasks is beneficial. Adding multi-turn ST data for training gives marginal improvements (Row-1 vs. Row-0); this suggests that simply adding limited multi-turn data will not suffice for the MT-MS cases. When either single-turn or multi-turn data has reasonable size (i.e., augmenting ASR data), combining them yields more pronounced improvements (Row-4 vs. Row-2/Row-3). Although single-turn and multi-turn data share the same utterances, split/concatenation-based data augmentation is known to be effective in the low-resource training regime (Nguyen et al., 2021; Lupo et al., 2022).

Joint training of ST and ASR is beneficial. Interestingly, training a model with only multi-turn ST data failed to converge, but adding multi-turn ASR data stabilizes the training (Row-3). Moreover, by adding both single-turn and multi-turn ASR data for joint training on top of Row-1, both BLEU and WER are improved by a significant margin (Row-4).

Multi-turn ASR data helps multi-turn ST. In our training data, there are more labeled single-turn ST data and multi-turn ASR data than multi-turn ST data. We tested a zero-shot setting where the multi-turn condition is only covered by ASR training data (Row-5). Compared to training with single-turn ST+ASR data only (Row-2), the resulting model brings 3-8 BLEU gains. We hypothesize that, since the encoder is target-language-agnostic, the acoustic representations and the turn detection capacity learned from multi-turn ASR data partially transfer to the ST task.
Multi-turn ST does not seem to help multi-turn ASR. This can be seen by comparing the WER scores in Row-2 and Row-6. We hypothesize that the non-monotonicity of the multi-turn ST task disrupts multi-turn ASR performance (Yan et al., 2023). However, this can be fixed by adding back multi-turn ASR data (Row-4). Note that we use the Row-4 data configuration for the rest of the paper.

Speaker-Turn and Cross-Talk Detection
The STAC-ST multi-task learning framework also encodes speaker-turn and cross-talk information with the task tokens [TURN] and [XT]. We run experiments to study how these task labels impact ASR and ST performance in the MT-MS setting and how they even enable speaker change detection.
Modeling speaker-turn and cross-talk detection helps multi-speaker ST and ASR. We run experiments by ablating the two task tokens. Evaluation results in Table 3 show that incrementally adding the speaker-turn and cross-talk detection tasks improves translation and transcription quality as measured by BLEU and WER. These results support the hypothesis that explicitly learning the two tasks helps the model better handle MT-MS scenarios.
Modeling speaker-turn and cross-talk detection enables the model to perform speaker change detection. The CTC loss helps the encoder to align input audio to text tokens per acoustic frame, including the two task tokens. We trace speaker-turns and cross-talks in the timeline by locating the time-aligned CTC spikes of the [TURN] and [XT] tokens. We measure speaker change detection with the False Alarm Rate (FAR), the Missed Detection Rate (MDR), and the F1-score. The FAR computes the rate at which STAC-ST generates [TURN] tokens when there are actually no speaker changes. The MDR computes the rate at which STAC-ST misses generating [TURN] tokens at speaker changes. While the former two are widely used in speaker segmentation research (Bredin et al., 2020), the F1-score provides an overall assessment of the performance.
To compute these metrics, we first prepare Rich Transcription Time Marked (RTTM) files for each test set from the time-aligned CTC [TURN] spikes. We compare the performance of two STAC-ST models (S and L) against a reference system, the speaker segmentation pipeline of the popular PyAnnote toolkit (Bredin and Laurent, 2021). From the results listed in Table 4, STAC-ST gets an on-par F1-score vs. the reference system on the Fisher-CALLHOME test sets. Using the stronger STAC-ST (L) model improves the F1-score by 2.5 points absolute. These results corroborate the importance of the [TURN] task tokens for improving ASR and ST quality.
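For illustration, a tolerance-based scoring sketch for speaker change detection is given below; the greedy one-to-one matching and the normalization of FAR/MDR are our assumptions, as the paper defers to standard speaker segmentation practice (Bredin et al., 2020).

```python
def change_detection_scores(pred_times, ref_times, tolerance=0.25):
    """A predicted [TURN] time is a hit if a reference speaker change lies
    within `tolerance` seconds; unmatched predictions are false alarms and
    unmatched references are misses."""
    ref_left = sorted(ref_times)
    hits = 0
    for p in sorted(pred_times):
        match = next((r for r in ref_left if abs(r - p) <= tolerance), None)
        if match is not None:
            ref_left.remove(match)
            hits += 1
    fa, miss = len(pred_times) - hits, len(ref_left)
    far = fa / max(1, len(pred_times))     # false alarm rate
    mdr = miss / max(1, len(ref_times))    # missed detection rate
    precision, recall = 1.0 - far, 1.0 - mdr
    f1 = 2 * precision * recall / max(1e-9, precision + recall)
    return {"FAR": far, "MDR": mdr, "F1": f1}

print(change_detection_scores([1.0, 3.2, 7.9], [1.1, 3.0, 5.5, 8.0]))
```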

Benchmarking STAC-ST
We run extensive benchmarks to compare STAC-ST with related work in various settings, including (1) different audio segmentation strategies, (2) model size, and (3) evaluation on single-turn ST.

MT-MS vs. VAD Segmentation
A common practice for translating long-form audio files is to first segment them into smaller chunks based on voice activity detection (VAD). We compare our MT-MS segmentation approach with two popular VAD-based audio segmenters, i.e., WebRTC (Blum et al., 2021) and SHAS (Tsiamas et al., 2022), on the channel-merged Fisher-CALLHOME test sets. When the audio and reference translation segments are not aligned, as in the case of VAD-based segmentation, the standard process is to first concatenate translation hypotheses and then align and re-segment the conversation-level translation based on the segmented reference translation. However, our preliminary results show that this process yields poor BLEU scores, partially because VAD treats noise as speech, which leads to noisy translations and misalignment. Therefore, we calculate BLEU scores on concatenated hypotheses and references for the whole conversation. BLEU scores in this section are not comparable with the ones reported elsewhere.
As shown in Figure 5, for both the Fisher and CALLHOME test sets, BLEU scores obtained using VAD-based tools (either WebRTC or SHAS) for test data segmentation are below those using our MT-MS segmentation. Despite being popular in conventional speech translation, segmenting long-form audio with VAD-based tools is not the best choice for handling multi-speaker conversations with speaker-turns. Thus, we resort to using MT-MS segmentation based on human annotations for preparing the test data. This highlights a potential direction for future work: producing robust segmentation of noisy long-form conversational audio.

Scaled STAC-ST vs. Whisper
Given the lack of prior work on MT-MS ST, we compare STAC-ST against a strong multi-task model, i.e., Whisper (Radford et al., 2023). Whisper is trained on over 2,000 times more speech data than our model (although Fisher-CALLHOME is not included among them), and its smallest version is larger than STAC-ST S. To enable a fairer comparison, we added more speech training data (cf. §4.2) to STAC-ST with sizes M and L.
Results in Table 5 demonstrate that when we add out-of-domain training data and scale the model accordingly (Kaplan et al., 2020; Bapna et al., 2022; Zhai et al., 2022), STAC-ST achieves better BLEU and WER scores than Whisper with comparable model sizes, although our training data is still three orders of magnitude smaller.

STAC-ST for Single-Turn ST
To position STAC-ST against previous work on ST, we also run experiments under the conventional single-turn ST condition. These experiments enable us to (1) see how our end-to-end multi-task learning approach performs on a specific input condition, and (2) compare STAC-ST against four previous models trained and evaluated on the same task. To allow comparison of results across the single-turn and MT-MS conditions, we also report performance with three Whisper systems. Results of these experiments are reported in Table 6. We observe that all our STAC-ST models are competitive with the previous models, which were also optimized on the Fisher-CALLHOME task. Comparison against the Whisper models confirms the trends observed in Table 5 under the MT-MS condition. Overall, STAC-ST L yields the best BLEU scores on both Fisher and CALLHOME.

Conclusions
In this work, we present STAC-ST, an end-to-end system designed for single-channel multi-turn & multi-speaker speech translation that uses a multi-task training framework to leverage both ASR and ST datasets. We demonstrate that STAC-ST generalizes to both standard pre-segmented ST benchmarks and multi-turn conversational ST, the latter being a more challenging scenario. STAC-ST is also shown to learn the task of speaker change detection, which helps multi-speaker ST and ASR. We investigate different aspects of STAC-ST, including the impact of model and data size, automatic segmentation for long-form conversational ST, and zero-shot multi-turn & multi-speaker ST without task-specific training data. Overall, this work sheds light on future work towards more robust conversational ST systems that can handle speaker-turns and cross-talks.
Limitations

1. Our primary test sets, Fisher and CALLHOME, cover only one translation direction (Spanish→English). The only other public conversational ST dataset we are aware of is MSLT (Federmann and Lewis, 2016), but it only contains independent utterances, which is far from representing a realistic MT-MS use case. We call for more publicly available long-form conversational ST data under a friendly license.
2. Due to the same limitation of publicly available datasets, we only explore conversations between two speakers.
3. We segment the test sets based on human annotations. Despite being the best choice for the MT-MS data in our study (§5.3.1), it is not a realistic scenario for testing. We leave improving segmentation of noisy long-form conversational audio as future work.
4. We segment long-form audio files into pieces of up to 30 s following Radford et al. (2023), but we do not use the preceding segments as context. We focus on improving translation quality of conversations via speaker-turn and cross-talk detection, yet using the context information could also help. In addition, within each MT-MS segment, the inter-utterance context could have already been leveraged (Zhang et al., 2021). We leave analysis of the inter- and intra-segment context as future work.
5. We only test the Transformer architecture, as we focus on solving a challenging MT-MS ST task with multi-task learning, which is orthogonal to the architecture choice. We leave exploring other architecture options, such as Conformer, HyperConformer (Mai et al., 2023) or Conmer (Radfar et al., 2023), as future work.

A Evaluating Different CTC Weights
In this section, we evaluate different CTC weights for joint ASR & ST training under the STAC-ST framework. Figure 6 shows the results for different S-size models trained on the Fisher-CALLHOME corpora. We confirm that BLEU and WER scores are best with λ = 0.3, in line with previous work (Zhang et al., 2022).

B Complete Main Evaluation Results on Fisher-CALLHOME
We list the complete main results on the Fisher-CALLHOME corpora for all the official subsets.
Multi-Turn Segments. Table 9 lists BLEU scores for all subsets of Fisher-CALLHOME, while Table 10 lists WER scores.
Single-Turn Segments. For the sake of completeness, we also report the performance of STAC-ST on each subset of Fisher-CALLHOME with the default utterance segmentation (single-turn). Table 11 lists the BLEU scores, while Table 12 lists the WER scores.

C Impact of Speech Overlap Ratio
In MT-MS data, each segment contains a different degree of overlap. We calculate the overlap ratio for each segment in Fisher and CALLHOME, group the segment-level overlap ratios into 4 bins, and report BLEU scores for each bin in Table 7.

D Evaluating Different Tolerance Values for Speaker Change Detection

In Table 8 we evaluate different tolerance values when computing the speaker change detection metrics on both Fisher-CALLHOME test sets. The tolerance (in seconds) allows us to relax the granularity that we expect in speaker change detection. Given the fact that STAC-ST is not directly optimized for this task, we note that a value of at least 0.25 is critical to reach acceptable scores: by increasing the tolerance from 0.1 to 0.25 seconds, we see a 22% relative increase in F1-score. Setting it to 0.5 seconds further brings a 10% relative improvement.

E Complete Ablation Results for [TURN] & [XT] Task Tokens
We provide complete ablation results of adding the [TURN] & [XT] task tokens on all the official development and test sets of Fisher-CALLHOME, as listed in Table 13.

F More Details of VAD-Based Segmentation
With WebRTC, audio is split when 90% of consecutive frames do not include speech. We set the frame length to 30 ms and the aggressiveness parameter to 1 as in Tsiamas et al. (2022). With SHAS, we set 1-30 as the min-max sequence length. SHAS was trained on monologue corpora with MuST-C (Di Gangi et al., 2019). Thus, we perform an additional pre-processing step to minimize the domain mismatch between SHAS and Fisher-CALLHOME.
(1) We extract the speech activity boundaries for each audio file from the original metadata.
(2) We modify each audio file by masking with 0 all the regions in the signal where there is no speech activity, i.e., setting all the non-speech regions to silence. (3) We then use the masked long-form audio files with SHAS. This step decreases the false alarm rate that SHAS can produce on noisy segments or between contiguous utterances where there are close-talks. Close-talks are areas where two utterances are too close and the segmentation tools might not generalize well. In order to keep the experimental and evaluation setup comparable, we perform the same pre-processing step when using WebRTC. Besides SHAS (Figure 3), we also plot the segmentation distribution of WebRTC on the Fisher test set.

Our models with smaller sizes sometimes outperform larger Whisper models, such as STAC-ST 21M vs. Whisper 244M on French→English. These results, along with our main paper, demonstrate that our proposed approach is well-suited for both the novel single-channel multi-speaker speech translation task and the conventional pre-segmented speech translation.
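A sketch of the masking step in (2), assuming torchaudio for I/O and metadata-derived (start, end) speech spans, is shown below.

```python
import torch
import torchaudio

def mask_non_speech(path, speech_spans, out_path):
    """Zero out every region of the long-form audio that lies outside the
    annotated speech activity spans, then save the masked file for
    segmentation with SHAS/WebRTC. `speech_spans` is assumed to be a list
    of (start_sec, end_sec) pairs taken from the original metadata."""
    wav, sr = torchaudio.load(path)
    keep = torch.zeros_like(wav)
    for start, end in speech_spans:
        s, e = int(start * sr), int(end * sr)
        keep[:, s:e] = 1.0                 # keep annotated speech, silence the rest
    torchaudio.save(out_path, wav * keep, sr)
```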
Figure 1: A two-speaker multi-turn conversational segment. Previous work focuses on separated channels without considering cross-talks and speaker-turns (top). STAC-ST targets a more challenging scenario where multiple speakers converse with occasional cross-talks due to merged channels (bottom).

Figure 2 :
Figure 2: Proposed model architecture of STAC-ST for multi-turn & multi-speaker ST.

Figure 3 :
Figure 3: Fisher-CALLHOME test set distribution of segment length with three different segmentation approaches: single-turn, MT-MS, and SHAS.

Figure 5 :
Figure 5: ST performance on Fisher-CALLHOME test data using different segmentation techniques for long-form audio: MT-MS (ours), WebRTC, and SHAS. BLEU scores of using VAD-based tools (either WebRTC or SHAS) for test data segmentation are lower than BLEU computed using our MT-MS segmentation.

Figure 6 :
Figure 6: Ablation of the CTC weight in the overall loss computation and its impact on BLEU and WERs for the Fisher and CALLHOME development & evaluation sets. Error bars show the standard deviation between dev/dev2/test sets for Fisher and devset/evlset for CALLHOME. Single-turn and MT-MS results are shown with solid and dashed lines, respectively.

Figure 7 :
Figure 7: Ground-truth speaker activities and CTC spikes of [TURN] and [XT] task tokens on three randomly selected Fisher samples. The title lists the ID (recording, file number, start and end time), the ground-truth transcript and translation.

Figure 8 :
Figure 8: Data distribution for Fisher test set with different segmentation approaches.

Figure 9 :
Figure 9: We compare different segmentation techniques with two training data configurations: only single-turn data, and both single-turn and multi-turn data. The bars denote different segmentation techniques for long-form audio, including MT-MS segmentation (proposed in this work), VAD via WebRTC (Blum et al., 2021), or SHAS (Tsiamas et al., 2022).

Table 2 :
ASR and ST performance of STAC-ST with different training data configurations. Joint training with single-turn and multi-turn data of both ASR and ST tasks achieves the best scores.

Table 3 :
ASR and ST performance of STAC-ST with the incremental addition of task tokens. Modeling speaker-turn and cross-talk detection with [TURN] and [XT] tokens enhances ASR and ST accuracy.

Figure 4 :
Figure 4: Speaker activity on a Fisher corpus sample. On the top, ground-truth human annotation on two audio channels. On the bottom, CTC spikes of turn and cross-talk tokens detected by STAC-ST in the merged channel.

Table 4 :
Speaker change detection performance measured by F1, MDR and FAR. We compare STAC-ST with PyAnnote. The strongest L-size STAC-ST model (from Table 5) shows an on-par F1-score with PyAnnote. Tolerance is set to 0.25 s.

Table 5 :
ASR and ST performance with increasing model size of STAC-ST and Whisper. STAC-ST achieves better BLEU and WER scores than Whisper with comparable model sizes.

Table 6 :
ASR and ST performance with the official single-speaker manual segmentation. Previous work results and Whisper baselines are provided. Our strongest model, STAC-ST L, yields the best scores.

Table 7 :
We calculate the overlap ratio for each segment in Fisher and CALLHOME and then group the segment-level overlap ratios into 4 bins. We report the BLEU score and the number of words in the reference within each bin.
ID: 20051115_212123_516_fsp-0-042565-045054 TRANSCRIPTION: yo creo que la tecnología del teléfono han echo avances también porque ya puedo hacer llamadas de largas distancias y no me valen nada porque uno paga ah una cuota mensual [turn] ajá [turn] [xt] y puede hacer todas las llamadas que uno quiera [turn] oh pero acá en [turn] [xt] y eso no era as eso no era así hace cinco o diez veinte años [turn] [xt] claro o sea pero aquí en estados unidos [turn] [xt] aquí en estados unidos sí TRANSLATION: i think phone technology has made progress because i can also make long distance phone calls and i do not have to pay ah a monthly fee [turn] yeah [turn] [xt] and you can make all the calls you want [turn] oh but here in [turn] [xt] and that wasn't that wasn't like that in five or ten twenty years [turn] [xt] ofcourse but here in the united states [turn] [xt] here in the united states yes que [turn] [xt] pero pero y qué opinas de que osea de que no va a tener como compañeros de escuela eso eso [turn] [xt] bueno [turn] eso es una experiencia también no osea [turn] sí tienen muchos aquí en miami programas para la gente que que enseñan sus hijos en la casa [turn] [xt] ajá [turn] entonces eh normalmente una vez a la semana ellos se se juntan [turn] ah okey TRANSLATION: so [turn] [xt] but what do you think about her not having school mates [turn] [xt] well [turn] that's also not an experience bone [turn] yes there are many programs here in miami for people who teach their children at home [turn] [xt] aha [turn] then usually once a week they will be together [turn] ah okay me gusta la música con ritmo también me gusta bailar [turn] okay [turn] te gusta bailar [turn] sí me gusta para hablar también [turn] oh que bien [turn] yo bailaba más cuando era joven pero ahora ya no bailo mucho se paró [turn] [xt] oh yo también ahora bailo cuando estoy sola limpiando la casa eres casada [turn] sí soy casada [turn] ah y hijos [turn] no no no tengo hijos TRANSLATION: eh i like music with rythm i also like to dance [turn] ok [turn] do you like to dance [turn] yes i also like to talk as well [turn] oh that is good [turn] i danced more when i was young but now i don't dance as much it stopped [turn] [xt] oh me too now i dance when i'm alone cleaning my house are you married [turn] yes i'm married [turn] ah and children [turn] no no i don't have children

Table 9 :
BLEU scores on each multi-turn dataset for all the official Fisher-CALLHOME development and test subsets. AVG lists the average between dev and test sets.

Table 10 :
WERs on each multi-turn dataset for all the official Fisher-CALLHOME development and test subsets. AVG lists the average between dev and test sets.

Table 11 :
BLEU scores on each single-turn dataset for all the official Fisher-CALLHOME development and test subsets. AVG lists the average between dev and test sets.

Table 12 :
WERs on each single-turn dataset for all the official Fisher-CALLHOME development and test subsets. AVG lists the average between dev and test sets.

Table 13 :
Ablation of the impact of encoding speaker-turn and cross-talk information with [TURN] and [XT]. BLEU scores and WERs are listed for each multi-turn dataset for all the official Fisher-CALLHOME development and test sets. AVG lists the average between dev and test sets.
Table 17: BLEU scores on three language directions (DE→EN, FR→EN, ES→EN) of the CoVoST 2 corpus test set (Wang et al., 2021). The results show that (1) our multilingual large model outperforms Whisper and XLS-R multilingual models with comparable sizes, even though Whisper and XLS-R were trained on data two orders of magnitude larger; (2) our models with smaller sizes sometimes outperform larger Whisper models, such as STAC-ST 21M vs. Whisper 244M on French→English.