Segmenting Subtitles for Correcting ASR Segmentation Errors

Typical ASR systems segment the input audio into utterances using purely acoustic information, which may not resemble the sentence-like units that are expected by conventional machine translation (MT) systems for Spoken Language Translation. In this work, we propose a model for correcting the acoustic segmentation of ASR models for low-resource languages to improve performance on downstream tasks. We propose the use of subtitles as a proxy dataset for correcting ASR acoustic segmentation, creating synthetic acoustic utterances by modeling common error modes. We train a neural tagging model for correcting ASR acoustic segmentation and show that it improves downstream performance on MT and audio-document cross-language information retrieval (CLIR).


Introduction
Typical ASR systems segment the input audio into utterances using purely acoustic information, i.e., pauses in speaking or other dips in the audio signal, which may not resemble the sentence-like units that are expected by conventional MT systems for spoken language translation (SLT) (Cho et al., 2017). Longer utterances may span multiple sentences, while shorter utterances may be sentence fragments containing only a few words (see Figure 1 for examples). Both can be problematic for downstream MT systems. In this work, we propose a model for correcting the acoustic segmentation of an ASR model to improve performance on downstream tasks, focusing on the challenges inherent to SLT pipelines for low-resource languages.
While prior work has trained intermediate components to segment ASR output into sentence-like units (Matusov et al., 2007; Rao et al., 2007), these have primarily focused on highly resourced language pairs such as Arabic and Chinese. When the source language is low-resource, suitable training data may be very limited for ASR and MT, and even nonexistent for segmentation. Since typical low-resource language ASR datasets crawled from the web do not have hand-annotated segments, we propose deriving proxy segmentation datasets from film and television subtitles. Subtitles typically contain segment boundary information like sentence-final punctuation, and while they are not exact transcriptions, they are closer to transcribed speech than many other large text corpora.
Our proposed model takes as input a sequence of tokens and segmentation boundaries produced by the acoustic segmentation of the ASR system and returns a corrected segmentation. While subtitles are often similar to speech transcripts, they lack an existing acoustic segmentation for our model to correct. To account for this, we generate synthetic acoustic segmentation by explicitly modeling two common error modes of ASR acoustic segmentation: under- and over-segmentation.
We evaluate the downstream MT performance in a larger SLT pipeline, and show improvements in translation quality when using our segmentation model to correct the acoustic segmentation provided by ASR. We also extrinsically evaluate our improved SLT pipeline as part of a document-level cross-lingual information retrieval (CLIR) task, where we show that improvements in ASR segmentation also lead to improved relevance of search results. We report results for nine translation settings: Bulgarian (BG) to English, Lithuanian (LT) to English, and Farsi (FA) to English, each using either phrase-based statistical MT (SMT) or one of two neural MT (NMT) models. Finally, we perform an ablation study to examine the effects of our synthetic acoustic boundaries and our over- and under-segmentation noise.
This paper makes the following contributions. (i) We propose the use of subtitles as a proxy dataset for correcting ASR acoustic segmentation and (ii) a method for adding synthetic acoustic utterance segmentations to a subtitle dataset, as well as (iii) a simple neural tagging model for correcting ASR acoustic segmentation before use in an MT pipeline.
(iv) Finally, we show downstream performance increases on MT and document-level CLIR tasks, especially for more syntactically complex segments.

Related Work
Segmentation in SLT has been studied quite extensively in high-resource settings. Early work used kernel-based SVM models to predict sentence boundaries using language model probabilities along with prosodic features such as pause duration (Matusov et al., 2007; Rao et al., 2007) and part-of-speech features derived from a fixed window size (Rangarajan Sridhar et al., 2013). Other work has modeled the problem using hidden Markov models (Shriberg et al., 2000; Gotoh and Renals, 2000; Christensen et al., 2001; Kim and Woodland, 2001) and conditional random fields (Liu et al., 2005; Lu and Ng, 2010).
More recent segmentation work uses neural architectures, such as LSTM (Sperber et al., 2018) and Transformer models (Pham et al., 2019). These models benefit from the large amounts of training data available for high-resource languages. For example, the TED corpus (Cettolo et al., 2012) for SLT from English to German includes about 340 hours of well-transcribed data. To our knowledge, such datasets do not exist for the languages we are interested in. Wan et al. (2020) develop a segmentation model in our setting using subtitles; however, they do not explicitly model segmentation errors and show only minimal and intermittent improvements in downstream tasks.
Recent work has increasingly focused on end-to-end models of SLT in high-resource settings, since these systems reduce error propagation and latency when compared to cascaded approaches (Weiss et al., 2017; Cross Vila et al., 2018; Sperber et al., 2019; Gaido et al., 2020; Bahar et al., 2020; Lakumarapu et al., 2020). In spite of these advantages, end-to-end systems have only very recently achieved competitive results due to the limited amount of parallel data for speech translation as compared to the data available to train ASR systems and translation systems separately (Gaido et al., 2020; Ansari et al., 2020).

Problem Definition
We treat the ASR acoustic segmentation problem as a sequence tagging problem (Stolcke and Shriberg, 1996). Unlike a typical tagging problem, which aims to tag a single input sequence, our input is a pair of aligned sequences of n items, x = [x_1, ..., x_n] and γ = [γ_1, ..., γ_n], where x and γ are the ASR tokens and acoustic segmentation respectively. The tokens x_i belong to a finite vocabulary V, while the acoustic segmentation boundary tags are binary, i.e., γ_i ∈ {0, 1}, where γ_i = 1 indicates that the ASR acoustic segmentation placed a boundary between tokens x_i and x_{i+1}. The goal is to predict a corrected segment boundary tag sequence y = [y_1, ..., y_n] ∈ {0, 1}^n from x and γ.
We do this by learning a probabilistic mapping from token/segmentation sequences to corrected segmentations, p(·|x, γ; θ) : {0, 1}^n → (0, 1), where p is a neural tagging model with parameters θ. While γ is produced solely from acoustic cues, p can take advantage of both the acoustic information (via γ) and the syntactic/semantic cues implicit in x.

Figure 2: Example of synthetic acoustic segmentation (γ) creation for the input tokens "YEAH THE HOLIDAY MARKET IS TOO BUSY YES". For each training datapoint, we have as model input the tokens x and the corresponding output sentence boundary labels y. To generate the synthetic acoustic segmentation γ, we apply under-segmentation (γ̌) and over-segmentation (γ̂) noise to y. Dashes indicate tokens where the particular noise is not applicable; bold indicates labels changed by the noise. We generate the additional input γ by combining both γ̌ and γ̂.

Generating Training Data from Subtitles
One of our primary contributions is a method for converting subtitle data into suitable training data for an ASR segmentation correction model. The subtitle data contains speech-like utterances of dialogue between characters in film and television shows. For the purposes of this paper, we do not use information about speaker identity, only the text and its segmentation. We obtain the ground truth output label sequence y by segmenting the subtitle text on sentence-final punctuation.1 We then remove the punctuation, keeping the implied boundary labels, to obtain the input token sequence x. However, we do not have acoustic segmentation available for the x, y pairs derived from subtitle data, which we need as additional input if our model is to learn to correct the acoustic segmentation provided by an ASR component. We therefore create a synthetic acoustic segmentation sequence γ as input by adding two types of noise to y. Specifically, we imitate two common ASR system errors, under-segmentation and over-segmentation, so that at test time the model can correct those errors.
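The label-extraction step above can be illustrated with a minimal sketch (the function name and the exact punctuation handling are our own simplifications, not the paper's implementation):

```python
import re

def subtitles_to_example(text):
    """Convert raw subtitle text into a token sequence x and a binary
    boundary label sequence y, where y[i] = 1 iff token i ends a sentence.
    Sentence-final punctuation (. ! ?) marks a boundary and is removed."""
    x, y = [], []
    for tok in text.split():
        stripped = re.sub(r"[.!?]+$", "", tok)
        if stripped:                 # token survives punctuation stripping
            x.append(stripped)
            y.append(1 if stripped != tok else 0)
        elif y:                      # punctuation-only token closes the previous one
            y[-1] = 1
    return x, y
```

For example, "Yeah. The holiday market is too busy. Yes." yields three boundary labels set to 1, one per sentence-final token.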
Under-segmentation Noise In the ASR model, under-segmentation occurs when pauses between words are brief, and the resulting ASR output is an utterance that could ideally be split into multiple sentence-like segments. We simulate this with under-segmentation noise, which converts ground truth segmentation boundaries y_i = 1 to y_i = 0 with probability α̌ and leaves y_i = 0 unchanged.
Over-segmentation Noise Over-segmentation occurs in an ASR model when a speaker takes a longer pause in the middle of what could be interpreted as a contiguous sentence-like utterance.
Over-segmentation noise is simulated by inserting random segment boundaries within an utterance. That is, with probability α̂ we convert a non-boundary tag y_i = 0 to y_i = 1, while leaving all y_i = 1 unchanged.

Synthetic Segmentation Input Generation
We can then sample a synthetic acoustic segmentation sequence γ from the following distribution: p(γ_i = 1 | y_i = 1) = 1 − α̌ and p(γ_i = 1 | y_i = 0) = α̂. This can be thought of as dropout applied to the correct label sequence y. See Figure 2 for an example. Our proposed segmentation correction model will learn to denoise the input segmentation sequence γ and produce the corrected sequence y.
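The noise model above can be sketched as follows (a minimal sketch; parameter names are ours):

```python
import random

def synthetic_acoustic_segmentation(y, alpha_under=0.25, alpha_over=0.25, seed=None):
    """Sample a synthetic acoustic boundary sequence gamma from gold labels y:
    each true boundary (y_i = 1) is dropped with probability alpha_under
    (under-segmentation), and each non-boundary (y_i = 0) becomes a spurious
    boundary with probability alpha_over (over-segmentation)."""
    rng = random.Random(seed)
    gamma = []
    for y_i in y:
        if y_i == 1:
            gamma.append(0 if rng.random() < alpha_under else 1)
        else:
            gamma.append(1 if rng.random() < alpha_over else 0)
    return gamma
```

Setting both noise rates to zero recovers y exactly; setting both to one flips every label.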

Model
We employ a Long Short-Term Memory (LSTM)-based model architecture for this task (Hochreiter and Schmidhuber, 1997). Given an input sequence of ASR tokens x = [x_1, ..., x_n] along with the corresponding ASR segmentation sequence γ = [γ_1, ..., γ_n], we first obtain an embedding representation e_i ∈ R^316 for each token as e_i = G(x_i) ⊕ F(γ_i), where G ∈ R^{|V|×300} and F ∈ R^{2×16} are embedding lookup tables, and ⊕ is the concatenation operator. We initialize G with FastText embeddings pre-trained on Common Crawl data (Mikolov et al., 2018); F is randomly initialized. We pass the embedding sequence through a two-layer bi-directional LSTM with 512 hidden units per direction to obtain a contextual representation h_i ∈ R^1024 for each token, h_i = LSTM_fwd(e_1, ..., e_i) ⊕ LSTM_bwd(e_i, ..., e_n), where LSTM_fwd and LSTM_bwd are the forward- and backward-direction LSTMs respectively.
Each output state h_i is then passed through a linear projection layer with a logistic sigmoid to compute the probability of a segment boundary, p(y_i = 1 | h_i; θ). The log-likelihood of a corrected segmentation boundary sequence is log p(y | x, γ; θ) = Σ_{i=1}^{n} log p(y_i | h_i; θ). We fit the parameters θ by approximately minimizing the negative log-likelihood on the training set D, L(θ) = −(1/|D|) Σ_{(x,γ,y)∈D} log p(y | x, γ; θ), using mini-batch stochastic gradient descent.
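The input representation e_i = G(x_i) ⊕ F(γ_i) can be sketched in plain Python, with randomly filled tables standing in for the trained lookup tables (the dimensions follow the text; everything else is illustrative):

```python
import random

random.seed(0)
V_SIZE, D_TOK, D_TAG = 1000, 300, 16   # vocabulary size is illustrative
# G: token embedding table (FastText-initialized in the paper);
# F: boundary-tag embedding table (randomly initialized).
G = [[random.gauss(0, 1) for _ in range(D_TOK)] for _ in range(V_SIZE)]
F = [[random.gauss(0, 1) for _ in range(D_TAG)] for _ in range(2)]

def embed(x_ids, gamma):
    """e_i = G(x_i) + F(gamma_i) concatenation: each token's 300-d embedding
    is joined with the 16-d embedding of its acoustic boundary tag, giving
    the 316-d input vectors consumed by the bidirectional LSTM."""
    return [G[x] + F[g] for x, g in zip(x_ids, gamma)]
```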

Subtitles Dataset
We obtain monolingual subtitle data from the OpenSubtitles 2018 corpus (Lison and Tiedemann, 2016). OpenSubtitles contains monolingual subtitles for 62 languages drawn from movies and television. The number of subtitle documents varies considerably from language to language. LT has only 1,976 documents, while BG and FA have 107,923 and 12,185 respectively. We randomly down-sample from the larger collection to 2,000 documents to ensure our segmentation correction models are all trained with similar amounts of data.
Treating the subtitles for a complete television episode or movie as the source of a single training instance (x, γ, y) introduces some complications, because such documents are usually quite long relative to typical SLT system input. To better match our evaluation conditions, we arbitrarily split each document into M instances, where the length l in tokens of each instance m is sampled from L ∼ U(1, 100), i.e., uniformly from 1 to 100 tokens. This range was chosen to approximate the length distribution of our evaluation datasets.
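The document-splitting procedure can be sketched as follows (a sketch; the function name and default bounds are ours):

```python
import random

def split_document(tokens, labels, min_len=1, max_len=100, seed=None):
    """Split one subtitle document into training instances whose lengths are
    drawn uniformly from [min_len, max_len] tokens, approximating the length
    distribution of the evaluation data. Returns (token_slice, label_slice)
    pairs covering the document in order."""
    rng = random.Random(seed)
    instances, i = [], 0
    while i < len(tokens):
        l = rng.randint(min_len, max_len)
        instances.append((tokens[i:i + l], labels[i:i + l]))
        i += l
    return instances
```

Note that short instances may contain no boundary at all, which, as discussed above, discourages degenerate always-insert-a-boundary solutions.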
See Table 1 for statistics on the number of training instances created as well as the average number of sentence segments per instance. Note that even though the numbers of subtitle documents are close to equal, the documents can vary considerably in length. As a result, the BG dataset has more than twice the training instances of FA or LT. In some cases, an instance may contain only a few words that do not constitute a sentence, and such instances have no segment boundaries; this helps prevent the model from learning pathological solutions such as always inserting a segment boundary at the end of the sequence.
Since we do not evaluate the segmentation directly on OpenSubtitles, we split the available data into training and development partitions, with 90% of the instances in the training set.

Speech Retrieval Dataset
For extrinsic evaluation of ASR segments, we use the speech retrieval dataset from the MATERIAL 2 program. The goal of MATERIAL is to develop systems that can retrieve text and speech documents in low-resource languages that are relevant to a given query in English. To bootstrap speech retrieval systems in low-resource languages, MATERIAL collects BG, FA, and LT speech training data for ASR systems, as well as additional separate collections of BG, FA, and LT speech documents along with their relevance judgements for a set of English-language queries. Since the retrieval of speech documents requires a cascade of ASR, MT, and CLIR systems, the MATERIAL data allows us to measure the impact of ASR segmentation on both the translation quality and the downstream retrieval system. The data partitions in MATERIAL are numerous and, to avoid confusion, we briefly describe them here.
The BUILD partition contains a small amount of ASR training and development data for BG, FA, and LT, i.e., audio files paired with reference transcripts. We use the BUILD data for fine-tuning our subtitle-trained model, applying the same synthetic acoustic segmentation generation procedure to this collection as we do to the subtitle data. See Table 1 for statistics. The Test (Small) partition contains a collection of audio documents and a set of English-language queries with relevance judgements for those queries. At test time, we use the acoustic segmentation provided by the ASR system as the input γ instead of generating synthetic label sequences. Additionally, roughly half of the audio documents in this collection include ground-truth transcriptions and translations to English, which allows us to evaluate MT. The Test (Large) partition is similar to the Test (Small) partition but much bigger; it has no transcripts or translations, so it can be used only to evaluate CLIR. The Test (Large) partition is available only for LT.
We use the translated portion of Test (Small) as a test set for MT and both Test (Small) and Test (Large) as extrinsic test sets for CLIR. The statistics of the MATERIAL partitions can be found in Table 2. 3 The speech retrieval datasets come from three domains: news broadcast, topical broadcast such as podcasts, and conversational speech from multiple low-resource languages. Some speech documents have two speakers, with each speaker on a separate channel, i.e., completely isolated from the other speaker. When performing segmentation we treat each channel independently, creating a separate (resegmented) ASR output for each channel. To create the document transcript for MT, we merge the two output sequences by sorting the token segments based on their wall-clock start time.
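The two-channel merging step above can be sketched as a simple sort on segment start times (a sketch; the segment representation is our own simplification):

```python
def merge_channels(channel_a, channel_b):
    """Merge the re-segmented ASR output of two isolated speaker channels
    into one document transcript by sorting token segments on their
    wall-clock start time. Each segment is a (start_time, tokens) pair."""
    merged = sorted(channel_a + channel_b, key=lambda seg: seg[0])
    return [tokens for _, tokens in merged]
```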

Segmentation Model Training
We tokenize all data with Moses (Koehn et al., 2007). To improve performance on out-of-vocabulary words, we use Byte-Pair Encoding (Sennrich et al., 2016) with 32,000 merge operations to create subwords for each language.
We then train the segmentation model on the subtitle dataset. When creating γ sequences on the subtitle data, we set the under- and over-segmentation noise to α̌ = 0.25 and α̂ = 0.25 respectively.4 We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. We use early stopping on the validation loss of the OpenSubtitles validation set to select the best stopping epoch for the segmentation model.
We further fine-tune this model on the BUILD partition to expose the model to some in-domain training data. The data is prepared in the same way as the OpenSubtitles data. We use early stopping on the development loss of this partition.

ASR-Segmentation-MT-CLIR Pipeline
We evaluate our segmentation correction model in the context of a CLIR pipeline for retrieving audio documents in BG, FA, or LT that are relevant to English queries. We refer to the three languages BG, FA, and LT as source languages. This pipeline uses ASR to convert source language audio documents to source language text transcripts, and MT to translate the source language transcripts into English transcripts. A monolingual English IR component is then used to return source language documents that are relevant to the issued English queries. We insert our segmentation correction model into this pipeline between the ASR and MT components, i.e., (i) ASR, (ii) segmentation correction, (iii) MT, (iv) IR. To reiterate: the segmentation model takes as input a source language transcript and returns the same transcript with corrected segmentation.
To implement the ASR, MT, and IR components, we use implementations developed by MATERIAL program participants (Oard et al., 2019).

ASR System
We use the ASR systems developed jointly by the University of Cambridge and the University of Edinburgh (Ragni and Gales, 2018; Carmantini et al., 2019). The ASR system uses a neural-network-based acoustic model, trained in a semi-supervised manner on web-scraped audio data, to compensate for the small amount of training data in the BUILD partition. Separate models are trained for narrow-band audio (i.e., conversational speech) and wide-band audio (i.e., news and topical broadcast).

Segmentation Correction
At test time, given a speech document, the ASR system produces a series of acoustically derived utterances x^(1), ..., x^(m). In our setting, the corresponding acoustic label sequence γ^(i) for each utterance is zero everywhere except the final position, i.e., γ^(i) = [0, 0, ..., 0, 1]. If we were to process each utterance (x^(i), γ^(i)) individually, the model might not have enough context to correct under-segmentation at the end of an utterance. For example, when correcting the final token position, which by definition precedes a long audio pause, the model sees only the left-hand context. To avoid this, we run our segmentation correction model on consecutive pairs of ASR output utterances, i.e., (x^(i) ⊕ x^(i+1), γ^(i) ⊕ γ^(i+1)) for i = 1, ..., m − 1. Under this formulation each ASR output utterance is corrected twice (except for the first and last utterances, which are corrected only once), so we have two predictions ŷ_j^(i,L) and ŷ_j^(i,R) for the j-th segment boundary of utterance i. We resolve these with the logical-OR operation to obtain the final segmentation correction, i.e., y_j^(i) = ŷ_j^(i,L) ∨ ŷ_j^(i,R). Based on the segmentation corrections produced by our model, we re-segment the ASR output tokens and hand the resulting segments off to the MT component, where they are individually translated.
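The pairwise inference scheme with logical-OR resolution can be sketched as follows (a sketch; `correct_pair` stands in for the trained tagging model and is an assumption of this example):

```python
def correct_document(utterances, correct_pair):
    """Run a segmentation corrector over consecutive pairs of ASR utterances
    and combine the two predictions for each position with logical OR.
    `utterances` is a list of token lists; `correct_pair` maps a token
    sequence to a 0/1 boundary sequence of the same length."""
    n = len(utterances)
    if n == 1:
        return [correct_pair(utterances[0])]
    preds = [[0] * len(u) for u in utterances]
    for i in range(n - 1):
        y = correct_pair(utterances[i] + utterances[i + 1])
        left, right = y[:len(utterances[i])], y[len(utterances[i]):]
        for j, v in enumerate(left):
            preds[i][j] |= v          # logical OR of the two passes
        for j, v in enumerate(right):
            preds[i + 1][j] |= v
    return preds
```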

MT Systems
We evaluate with three different MT systems. We use the neural MT model developed by the University of Edinburgh (EDI-NMT) and the neural and phrase-based statistical MT systems from the University of Maryland (UMD-NMT and UMD-SMT, respectively). The EDI-NMT and UMD-NMT systems are Transformer-based models (Vaswani et al., 2017) trained using the Marian Toolkit (Junczys-Dowmunt et al., 2018) and Sockeye (Hieber et al., 2018), respectively. UMD-NMT trains a single model for both directions of a language pair (Niu et al., 2018), while EDI-NMT has a separate model for each direction. UMD-SMT is trained using the Moses SMT Toolkit (Koehn et al., 2003), where the weights were optimized using MERT (Och, 2003).

IR System
For the IR system, we use the bag-of-words language model implemented in Indri (Strohman et al., 2005). Documents and queries are both tokenized and normalized at the character level to avoid potential mismatch in the vocabulary. The queries are relatively short, typically consisting of only a few words, and they define two types of relevance: conceptual queries require relevant documents to be topically relevant to the query, while simple queries require a relevant document to contain a translation of the query. However, no specific processing is used for these two relevance types in our experiments.

MT Evaluation
Our first extrinsic evaluation measures the BLEU (Papineni et al., 2002) score of the MT output on the Test (Small) sets after running our segmentation correction model, where we have ground truth reference English translations. We refer to our model trained only on the BUILD data as Seg, and our subtitle-trained model as Seg + Sub. As our baseline, we compare the same pipeline using the segmentation produced by the acoustic model of the ASR system, denoted Acous.
Since each segmentation model produces segments with different boundaries, we are unable to use BLEU directly to compare to the reference sentences. Therefore, we concatenate all segments of a document and treat them as one segment, which we refer to as "document-level" BLEU score. We use SacreBLEU 5 (Post, 2018) with the lowercase option due to the different casing for the reference English translation and MT output.
We also provide BLEU scores for the MT output using the reference transcriptions (Ref) to show the maximum score the system can achieve when there is no ASR or segmentation error. This represents the theoretical upper bound for our pipeline with a perfect ASR system.
Segmentation errors (i.e., the acoustic model incorrectly segmented an utterance) and word errors (i.e., the ASR system produces an incorrect word) can both affect the downstream MT performance. To isolate the segmentation errors from word errors, we align the ASR output tokens to the reference transcriptions by timecode in order to obtain a reference system that has no segmentation errors, but does have transcription errors. This represents a more realistic ceiling for our model because while we can correct segmentation, we cannot correct word errors. We refer to this system in the results section as Align.

Document-Level CLIR Evaluation
Our second extrinsic evaluation is done on the MATERIAL CLIR task. We are given English queries and asked to retrieve audio documents in either BG, FA, or LT. In our setup, we only search over the English translations of the segmented transcripts produced by our pipeline, i.e., we do not translate the English query into the other languages or search the audio signal directly. We evaluate the performance of CLIR using the Maximum Query Weighted Value (MQWV) computed from the ground-truth query-relevance judgements for documents in the Test (Small & Large) collections. MQWV, a variant of the official MATERIAL program metric, the Actual Query Weighted Value (NIST, 2017, AQWV), is a recall-oriented rank metric that measures how well we order the retrieval collection with respect to query relevance.
AQWV is calculated as the average over queries of 1 − (P_m + β · P_fa), where P_m is the probability of a miss, P_fa is the probability of a false alarm, and β is a hyperparameter; in our experiments, β is set to 40. The maximum possible value is 1 and the minimum value is −β. AQWV thus depends not only on the ranking of the documents but also on β.
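The metric above reduces to a few lines of arithmetic (a sketch; the per-query miss and false-alarm probabilities are assumed to be given):

```python
def aqwv(per_query_stats, beta=40.0):
    """Average of 1 - (P_miss + beta * P_fa) over queries.
    per_query_stats: list of (p_miss, p_fa) pairs, one per query.
    beta = 40 matches the MATERIAL setting described in the text."""
    return sum(1.0 - (pm + beta * pfa) for pm, pfa in per_query_stats) / len(per_query_stats)
```

A perfect system (no misses, no false alarms) scores 1; missing everything while raising no false alarms scores 0, and false alarms are penalized β times as heavily as misses.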
Additionally, AQWV is sensitive to the threshold used by the IR system to determine document relevance. To avoid tuning thresholds, we report MQWV, which is calculated at the optimal threshold; in our experiments this threshold is estimated over the ranks of the documents. Thus, MQWV does not depend on the ability to estimate the threshold, only on the quality of the document ranking for a given query.

Results

MT

Table 3 shows the results of the MT evaluation. The best non-reference system for each language and MT system is in bold. We compute statistical significance against the acoustic segmentation (Acous.) baseline using Welch's t-test (Welch, 1947). Our subtitle-based segmentation model (Seg + Sub) consistently improves the BLEU scores of NMT models for BG and FA, while not making significant differences for SMT. This echoes prior work (Khayrallah and Koehn, 2018; Rosales Núñez et al., 2019) suggesting SMT models are more robust to noisy inputs than neural models. In 6 out of 9 cases, we see that adding the subtitle data improves over using only the BUILD data. In the remaining cases, the scores remain similar (i.e., it does not hurt the model). Training on the BUILD data alone improves BG NMT models, but for SMT and the other languages, it either makes no difference or is worse than the acoustic segmentation.
Comparing Seg + Sub with Align across all languages, we see only a small gap between the two. This suggests that our model is nearing the ceiling on what correcting segmentation can do to improve downstream MT. Furthermore, on LT, where our model offers only small or no improvement, the original acoustic segmentation performs almost as well as Align. This suggests there is relatively little room to improve LT MT by correcting sentence boundaries alone.

Document-Level CLIR
MQWV on the Test (Small) and Test (Large) partitions are shown in Table 4 and Table 5 respectively. On the Test (Small) partition, we see that our segmentation model improves the CLIR performance over the acoustic segmentation in 7 out of 9 cases. On the Test (Large) partition, we see that our segmentation model improves downstream retrieval performance consistently across all three MT systems. We note that while we measure the downstream retrieval performance separately for each MT system, a real-world CLIR system could perform IR over the union of multiple MT systems, which could yield even further improvements in retrieval performance (Zhang et al., 2020).

Complexity Analysis
We hypothesize that the effects of improved segmentation should be more pronounced for more complex utterances, which present more opportunities to misplace boundaries. We therefore calculate a measure of sentence complexity, the Automated Readability Index (ARI) (Senter and Smith, 1967), for all documents in Test (Small)6 and examine the performance of our Seg + Sub model on MT. We separate the documents into quartiles based on their calculated ARI, where a higher ARI (and thus a higher quartile) indicates a more complex document, and present the average document-level BLEU score for each quartile in Table 6. In the interest of space, we present results for Bulgarian and for NMT, and defer the other languages and SMT to Appendix A. We see that the most dramatic gains in BLEU occur for documents in the third and fourth quartiles, which matches our intuition. In other words, our segmentation model most improves the translation quality of more syntactically complex segments.
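The ARI itself is a simple formula over character, word, and sentence counts; a minimal sketch (our own whitespace tokenization and punctuation counting, which may differ from the paper's preprocessing):

```python
def automated_readability_index(text):
    """Automated Readability Index (Senter and Smith, 1967):
    4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43.
    Characters exclude punctuation; sentences are counted from
    sentence-final punctuation."""
    words = text.split()
    n_words = len(words)
    n_chars = sum(len(w.strip(".!?,;:")) for w in words)
    n_sents = max(1, sum(1 for w in words if w[-1] in ".!?"))
    return 4.71 * (n_chars / n_words) + 0.5 * (n_words / n_sents) - 21.43
```

Longer words and longer sentences both push the index (and thus the quartile) higher.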

Ablation Study
We perform an ablation study on two components of our proposed model: (i) the use of acoustic segmentation boundary labels γ as input and (ii) training with a combination of over- and under-segmentation noise. We use the same training and evaluation process and only modify the affected component. We perform our ablation on the BG MT task, since it had a wider range of improvements than the other languages.

Table 7: Document-level BLEU scores for the models in the ablation studies. We provide the acoustic model (Acous.) and our proposed model (Full) for comparison.

Use of Acoustic Segmentation Boundaries.
We train a segmentation model using only the ASR output tokens x as input, without the ASR segmentation sequence γ. For this model, we modify the embedding representation e_i so that we do not use F, i.e., e_i = G(x_i). This model, which we refer to as Lex., must rely exclusively on the lexical information of the ASR token sequence x to make predictions.
Over-segmentation and Under-segmentation.
The two segmentation error modes may have different impacts on the MT system. To see their individual effects, we train two models in which the synthetic acoustic segmentation boundary sequence γ is created using only under-segmentation or only over-segmentation noise. We refer to these models as Lex. + Under and Lex. + Over respectively.

Results

Table 7 shows the effects of the model ablations on MT BLEU scores. On both NMT systems, we see a roughly 1-point improvement in BLEU when the ASR segmentation boundaries are included as input.
For both NMT models we also find that over-segmentation noise helps slightly more than under-segmentation noise, but the two are complementary, i.e., the full model does best overall. For SMT, we surprisingly find that the model without acoustic segmentation boundary input does best. The overall difference between the acoustic (Acous.) baseline and any of the segmentation correction models is small compared to the gains seen with NMT. This again suggests that SMT is more robust to changes in segmentation.

Conclusion
We propose an ASR segmentation correction model for improving SLT pipelines. Our model makes use of subtitle data as well as a simple model of acoustic segmentation error to train an improved ASR segmentation model. We demonstrate downstream improvements on MT and CLIR tasks. In future work, we would like to find a better segmentation error model that works well in conjunction with SMT systems in addition to NMT systems.

A Full Complexity Analysis
We present the full results of our complexity analysis as described in subsection 8.3. Bulgarian (Table 8), Lithuanian (Table 9), and Farsi (Table 10) results are shown for all three MT models as well as both the acoustic segmentation and our Seg + Sub segmentation correction model. The best score for each MT system and quartile is bolded.