Joint Generation of Captions and Subtitles with Dual Decoding

As the amount of audio-visual content increases, developing automatic captioning and subtitling solutions to match the expectations of a growing international audience appears to be the only viable way to boost throughput and lower the related post-production costs. Automatic captioning and subtitling often need to be tightly intertwined to achieve an appropriate level of consistency and synchronization with each other and with the video signal. In this work, we assess a dual decoding scheme to achieve a strong coupling between these two tasks and show how adequacy and consistency are increased, with virtually no additional cost in terms of model size and training complexity.


Introduction
As the amount of online audio-visual content continues to grow, the need for captions and subtitles in multiple languages also steadily increases, as they widen the potential audience of this content. Both activities are closely related: human subtitle translators often generate subtitles directly from the original captions, without viewing or listening to the original audio/video file. This strategy, however, runs the risk of amplifying, in the subtitles, approximations, simplifications, or errors present in the captions. It may even happen that both texts need to be simultaneously displayed on screen: for instance, in countries with several official languages, or to help foreign language learners. This means that captions and subtitles need to be consistent not only with the video content, but also with each other. It also implies that they should be synchronized (Karakanta et al., 2021). Finally, even in scenarios where only subtitles are needed, generating captions at the same time may still help to verify the correctness of the subtitles.
Early approaches to automatic subtitling (e.g. Piperidis et al., 2004) also assumed a pipeline architecture (Figure 1 (b)), where subtitles are translated from captions derived from automatic speech transcripts. A recent alternative (Figure 1 (a)), which mitigates cascading errors, is to independently perform captioning and subtitling in an end-to-end manner (Liu et al., 2020; Karakanta et al., 2020a); the risk, however, is to generate inconsistencies (both in alignment and content) between the two textual streams. This approach might also be limited by the lack of appropriate training resources (Sperber and Paulik, 2020). Various ways to further strengthen the interactions between these tasks by sharing parameters or loss terms are evaluated by Sperber et al. (2020). Figure 1 (c) illustrates these approaches.
In this work, we explore an even tighter integration consisting of simultaneously generating both captions and subtitles from automatic speech recognition (ASR) transcripts using one single dual decoding process (Zhou et al., 2019; Wang et al., 2019; Le et al., 2020; He et al., 2021; Xu and Yvon, 2021), illustrated in Figure 1 (d). Generally speaking, automatically turning ASR transcripts into full-fledged captions involves multiple changes, depending on the specification of the captioning task. In our case, this transformation comprises four main aspects: segmentation for display (via tag insertion), removal of certain features of spoken language (e.g. fillers, repetitions, or hesitations), ASR error correction, and punctuation prediction.
The transcript-to-subtitle task involves the same transformations, with an additional translation step to produce text in another language. Table 1 illustrates the various transformations that occur between input transcripts and the corresponding output segments.
As our experiments suggest, a tighter integration not only improves the quality and the consistency of captions and subtitles, but also enables a better use of all available data, with hardly any impact on model size or training complexity. Our main contributions are the following: (i) we show that simultaneously generating captions and subtitles can improve performance in both languages, reporting significant improvements in BLEU score with respect to several baselines; (ii) we initialize the dual decoder from a standard encoder-decoder model trained with large-scale data, thereby mitigating the data scarcity problem; (iii) we explore a new parameter sharing scheme, where the two decoders share all their parameters, and achieve comparable performance at a much reduced model size in our experimental conditions; (iv) using 2-round decoding, we show how to alleviate the exposure bias problem observed in dual decoding, leading to a clear boost in performance.

Model
In a nutshell, dual decoding aims to generate two output sentences e^1 and e^2 for each input sentence f. This means that instead of having two independent models (Eq. (1)), the generation of each target is influenced by the other output (Eq. (2)):

P(e^1, e^2 | f) = P(e^1 | f) P(e^2 | f),   (1)

P(e^1, e^2 | f) = ∏_t P(e^1_t | e^1_{<t}, e^2_{<t}, f) P(e^2_t | e^2_{<t}, e^1_{<t}, f),   (2)

where e^k_{<t} denotes the first t−1 tokens of e^k. In our experiments, ASR transcripts are considered as the source language, while captions and subtitles are the two target languages (Wang et al., 2019; He et al., 2021; Xu and Yvon, 2021). The dual decoder model has also been proposed in several application scenarios other than multi-target translation, such as bi-directional translation (Zhou et al., 2019; Zhang et al., 2020a; He et al., 2021), and to simultaneously generate transcripts and translations from the audio source (Le et al., 2020).
To implement the interaction between the two decoders, we mostly follow Le et al. (2020) and Xu and Yvon (2021), who add a decoder cross-attention layer in each decoder block, so that the hidden states H^1_l and H^2_l computed in the previous layers of each decoder can attend to each other; these decoder cross-attention layers take the same form as the standard encoder-decoder attention, with queries coming from one decoder and keys and values from the other. Both decoders are thus fully synchronous, since each requires the hidden states of the other to compute its own hidden states.
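The synchronous interaction can be sketched as follows. This is a deliberately simplified single-head NumPy illustration (no projection matrices, causal masking, layer normalization, or feed-forward sublayers, and hypothetical tensor shapes), not the fairseq implementation used in the paper:

```python
import numpy as np

def attention(q, k, v):
    """Scaled dot-product attention (single head, no masking)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def dual_decoder_layer(h1, h2):
    """One synchronous dual-decoder interaction: each stream attends
    to the other's hidden states (residual connection only, for brevity).

    h1, h2: hidden states of the two decoders, shape (seq_len, d_model).
    Both inputs are needed to compute either output, which is why the
    two decoders must run in lockstep.
    """
    new_h1 = h1 + attention(h1, h2, h2)  # caption stream attends to subtitle states
    new_h2 = h2 + attention(h2, h1, h1)  # subtitle stream attends to caption states
    return new_h1, new_h2
```

In the real model this sublayer sits inside each Transformer decoder block, alongside the usual self-attention and encoder-decoder attention.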

Sharing Decoders
One weakness of the dual decoder model is that it contains two separate decoders, yielding an increased number of parameters (×1.6 in our models w.r.t. standard translation models). Inspired by the idea of tying parameters in embedding matrices (Inan et al., 2017; Press and Wolf, 2017), we extend the dual decoder model by sharing all the parameter matrices of the two decoders: in this way, the total number of parameters remains close to that of a standard translation model (×1.1), since the only increase comes from the additional decoder cross-attention layer. When performing inference with this multilingual shared decoder, we prefix each target sentence with a tag indicating the intended output (captioning or subtitling).
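The tagging scheme can be illustrated with a minimal sketch, in the spirit of multilingual NMT; the tag strings below are our own illustrative choices, not necessarily those used in the paper:

```python
def add_target_tag(tokens, task):
    """Prefix a target token sequence with a task tag so a single shared
    decoder knows which output stream (caption or subtitle) to produce.
    The tag names are illustrative, not those used in the paper."""
    tags = {"caption": "<2cap>", "subtitle": "<2sub>"}
    return [tags[task]] + tokens
```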

Training and Fine-tuning
The dual decoder model is trained using a joint loss combining the log-likelihoods of the two targets:

L(θ) = − ∑_t log P(e^1_t | e^1_{<t}, e^2_{<t}, f; θ) − ∑_t log P(e^2_t | e^2_{<t}, e^1_{<t}, f; θ),

where θ represents the set of parameters. Training this model requires triplets of instances associating one source with two targets. Such resources are difficult to find, and the largest tri-parallel open source corpus we know of is the MuST-Cinema dataset (Karakanta et al., 2020b), which is clearly smaller than what exists to separately train automatic transcription or translation systems.
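The joint loss above is simply the sum of the two per-target negative log-likelihoods. A toy sketch, assuming we already have the model probabilities assigned to the reference tokens of each stream (the values passed in are illustrative, not real model outputs):

```python
import math

def joint_nll(probs_caption, probs_subtitle):
    """Joint dual-decoding training loss: the sum of the negative
    log-likelihoods of the reference tokens in both target streams.
    Each argument lists the model's probability for every reference
    token of one stream."""
    nll = lambda probs: -sum(math.log(p) for p in probs)
    return nll(probs_caption) + nll(probs_subtitle)
```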
In order to leverage large-scale parallel translation data for English-French, we adopt a fine-tuning strategy where we initially pre-train a standard (encoder-decoder) translation model using all available resources, which serves to initialize the parameters of our dual decoder model. As the dual decoder network employs two decoders with shared parameters, we also use the decoder of the pre-trained model to initialize this subnetwork. Fine-tuning is performed on a tri-parallel corpus. We discuss the effect of decoder initialization in Section 3.4.1. Finally, for all fine-tuned models, the decoder cross-attention layer which binds the two decoders together is always randomly initialized. Subword segmentation is performed with subword-nmt.

Experimental Settings
We implement the dual decoder model based on the Transformer (Vaswani et al., 2017) model using fairseq (Ott et al., 2019). All models are trained until no improvement is found for 4 consecutive checkpoints on the development set, except for the EN→FR pre-trained translation model, which is trained for 300k iterations (further details in Appendix B). We mainly measure performance with SacreBLEU (Post, 2018); TER and BERTScores (Zhang et al., 2020b) are also reported in Appendix D. Segmentation tags in subtitles are taken into account, and BLEU scores are computed over full sentences. In addition to BLEU, measuring the consistency between captions and subtitles is also important. We reuse the structural and lexical consistency scores proposed by Karakanta et al. (2021). Structural consistency measures the percentage of utterances having the same number of blocks in both languages, while lexical consistency counts the proportion of words in the two languages that are aligned in the same block (refer to Appendix C for additional details).
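The structural consistency score described above can be sketched by counting [eob] block delimiters; this is a simplified reading of the metric, and the official implementation may differ in details:

```python
def n_blocks(text):
    """Number of subtitle blocks, delimited by [eob] tags."""
    return text.count("[eob]")

def structural_consistency(captions, subtitles):
    """Percentage of utterance pairs whose caption and subtitle contain
    the same number of blocks (a simplified sketch of the metric of
    Karakanta et al., 2021)."""
    same = sum(n_blocks(c) == n_blocks(s) for c, s in zip(captions, subtitles))
    return 100.0 * same / len(captions)
```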
We call the dual decoder model dual. Baseline translation models trained separately on each direction (T_en→C_en, T_en→S_fr) are denoted by base.
To study the effectiveness of dual decoding, we mainly compare dual with a pipeline system. The latter uses the base model to produce captions, which are then translated into subtitles using an independent system trained to translate from captions to subtitles (T_en→C_en→S_fr).
Like the dual model, the base and pipeline systems also benefit from pre-training. For the former, we pre-train the direct transcript-to-subtitle translation model (T_en→S_fr); for pipeline, the caption-to-subtitle model (C_en→S_fr) is pre-trained, while the first step (T_en→C_en) remains as in the base system. Note that all fine-tuned systems start with the same model pre-trained on WMT EN-FR data. We only report in Table 2 the performance of the two baselines and fine-tuned (+FT) models, as our preliminary experiments showed that training the dual decoder model with only tri-parallel data was not optimal. The BLEU score of the do nothing baseline, which copies the source ASR transcripts to the output, is 28.0, which suggests that the captioning task actually involves much more than simply inserting segmentation tags. We see that fine-tuning improves subtitles generated by the base and pipeline systems by ∼1 BLEU. Our dual decoder model, after fine-tuning with synthetic tri-parallel data, outperforms base+FT by 0.7 BLEU and pipeline+FT by 1.4 BLEU. Sharing all parameters of both decoders yields a further increase of 0.2 BLEU, with about one third fewer parameters.

We also measure the structural and lexical consistency between the captions and subtitles generated by our systems (see Table 2). As expected, pipeline settings always generate very consistent pairs of captions and subtitles, as subtitles are direct translations of the captions; all other methods generate both outputs from the ASR transcripts. dual models do not perform as well, but are still able to generate captions and subtitles with a much higher structural and lexical consistency between the two outputs than the base systems. Xu and Yvon (2021) show that dual decoder models generate translations that are more consistent in content. We further show here that our dual models generate hypotheses that are also more consistent in structure. Example output captions and subtitles are given in Appendix E.

The Effect of Fine-tuning
As the pre-trained uni-directional translation model has never seen sentences in the source language on the target side, we first only use it to initialize the subtitling decoder, and use a random initialization for the captioning decoder. To study the effect of initialization, we conduct an ablation study comparing three settings: initializing only the subtitling decoder, both decoders, or the shared decoder (see Table 3). Initializing both decoders brings improvements in both directions, with a gain of 1.6 BLEU for captioning and 0.3 BLEU for subtitling. Moreover, sharing parameters between decoders further boosts the subtitling performance by 0.2 BLEU. It thus seems that the captioning decoder also benefits from a decoder pre-trained in another language.

Exposure Bias
Due to error accumulation in both decoders, the exposure bias problem seems more severe for the dual decoder model than for regular translation models (Zhou et al., 2019; Zhang et al., 2020a; Xu and Yvon, 2021). These authors propose to use pseudo tri-parallel data with synthetic references to alleviate this problem. We analyze the influence of this exposure bias issue in our application scenario.
To this end, we compare fine-tuning the dual model with original vs. artificial tri-parallel data. For simplicity, we only report in Table 4 the average BLEU scores of captioning and subtitling. Results show that fine-tuning with the original data (w. real) strongly degrades the automatic metrics for the generated text, resulting in performance that is worse than the baseline. In another set of experiments, we follow Xu and Yvon (2021) and perform asynchronous 2-round decoding. We first decode the dual models to obtain hypotheses in both languages, e^1 and e^2. During the second decoding round, we use the output English caption e^1 as a forced prefix when generating the French subtitles; the final English caption is obtained similarly. Note that when generating the t-th token of e^2, the decoder cross-attention module only attends to the first t tokens of e^1, even though the full e^1 is actually known. The 2-round scores for e^1 and e^2 are in Table 4, and are compared with the optimal situation where we use references instead of model predictions as the forced prefix in the second round (column 'Ref').
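The forced-prefix mechanism of 2-round decoding can be sketched generically as follows; here step_fn is a hypothetical stand-in for the model's next-token prediction, and in the actual dual setup the forced prefix during round two is the other stream's first-round hypothesis:

```python
def greedy_decode(step_fn, forced_prefix=(), max_len=20, eos="</s>"):
    """Greedy decoding with an optional forced prefix: the first tokens
    are copied from the prefix (as in the second round of 2-round
    decoding), after which the model (step_fn, a stand-in for the dual
    decoder) takes over until it emits the end-of-sequence token."""
    out = []
    for t in range(max_len):
        tok = forced_prefix[t] if t < len(forced_prefix) else step_fn(out)
        out.append(tok)
        if tok == eos:
            break
    return out
```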
Results in Table 4 suggest that dual decoder models fine-tuned with original data (w. real) are quite sensitive to exposure bias, which can be mitigated with artificial tri-parallel data. Their performance can however be improved by ∼1.5 BLEU when using 2-round decoding, thereby almost closing the initial gap with models using synthetic data. The latter approach is overall slightly better and also more stable across decoding configurations.

Conclusion
In this paper, we have explored dual decoding to jointly generate captions and subtitles from ASR transcripts. Experimentally, we found that dual decoding improves translation quality for both captioning and subtitling, while delivering more consistent output pairs. Additionally, we showed that (a) model sharing on the decoder side is viable and effective, at least for related languages; (b) initializing with pre-trained models vastly improves performance; (c) 2-round decoding allows us to mitigate the exposure bias problem in our model. In the future, we would like to experiment with more distant language pairs to validate our approach in a more general scenario.

A Data Processing Details
For the English to French language pair, MuST-Cinema (license: CC BY-NC-ND 4.0) (Karakanta et al., 2020b) contains 275k sentences for training, and 1079 and 544 lines for development and testing, respectively. The ASR system used by Karakanta et al. (2020a) to produce transcripts was based on the KALDI toolkit (Povey et al., 2011), and had been trained on the clean portion of LibriSpeech (Panayotov et al., 2015) (∼460h) and a subset of . In order to emulate a real production scenario, we segment these transcripts as if they came from an ASR system performing segmentation based on prosody. As this kind of system tends to produce longer sequences than typical written text (Cho et al., 2012), we randomly concatenate the English captions into longer sequences, to which we align the ASR transcripts using the conventional edit distance, thus adding a sub-segmentation aspect to the translation task. Edit distance computations are based on a Weighted Finite-State Transducer (WFST), implemented with Pynini (Gorman, 2016), which represents editing operations (match, insertion, deletion, replacement) at the character level, with weights depending on the characters and the previous operation context. After composing the edit WFST with the transcript string and the caption string, the optimal operation sequence is computed using a shortest-distance algorithm (Mohri, 2002). The number of sentences to be concatenated is sampled from a normal distribution with a mean of around 2. This process results in 133k, 499, and 255 lines for training, development, and testing, respectively.
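The alignment step can be illustrated with a plain dynamic-programming edit distance; this is a simplified stand-in for the Pynini WFST described above, using uniform costs rather than character- and context-dependent weights:

```python
def edit_ops(src, tgt):
    """Character-level edit-distance alignment with uniform costs, a
    simplified stand-in for the context-weighted WFST used in the paper.
    Returns an optimal sequence of (op, src_char, tgt_char) operations."""
    n, m = len(src), len(tgt)
    # dp[i][j] = minimal cost of aligning src[:i] with tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    # Backtrace the optimal operation sequence.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (src[i - 1] != tgt[j - 1]):
            ops.append(("match" if src[i - 1] == tgt[j - 1] else "replace",
                        src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("delete", src[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, tgt[j - 1]))
            j -= 1
    return ops[::-1]
```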
For pre-training, we use all available WMT14 EN-FR data, from which we discard sentence pairs with an invalid language label as computed by the fastText language identification model (Bojanowski et al., 2017). This pre-training data contains 33.9M sentence pairs.

B Experimental Details
We build our dual decoder model with a hidden size of 512 and a feed-forward size of 2048. We optimize with Adam, set up with a maximum learning rate of 0.0007, an inverse square root decay schedule, and 4000 warmup steps. For fine-tuning, we use Adam with a fixed learning rate of 8e−5. For all models, we share lexical embeddings between the encoder and the input and output decoder matrices. All models are trained with mixed precision and a batch size of 8192 tokens on 4 V100 GPUs.
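The learning-rate schedule (linear warmup to the maximum rate over 4000 steps, then inverse square-root decay) can be written as follows, mirroring fairseq's inverse_sqrt scheduler under the assumption that warmup starts from a learning rate of 0:

```python
def inverse_sqrt_lr(step, max_lr=7e-4, warmup=4000):
    """Inverse square-root learning-rate schedule with linear warmup,
    in the style of fairseq's `inverse_sqrt` scheduler (assuming the
    warmup ramps up from lr = 0)."""
    if step < warmup:
        return max_lr * step / warmup          # linear warmup
    return max_lr * (warmup / step) ** 0.5     # inverse sqrt decay
```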
The two models of the base setting are trained separately using transcript→caption and transcript→subtitle data. The second model of the pipeline setting is trained using caption→subtitle data. When performing fine-tuning, we first pre-train an EN→FR translation model, denoted pre-train, using WMT EN-FR data. For the base+FT setting, the transcript→subtitle model is fine-tuned from pre-train, while the transcript→caption model is the same as in base, since both the source and target sides are in English. For pipeline+FT, the caption→subtitle model is fine-tuned from pre-train. For dual+FT, the encoder and the two decoders are fine-tuned from the same pre-train model. The decoder cross-attention layers cannot be fine-tuned and are randomly initialized. Due to computational limits, we were not able to conduct multiple runs for our models. However, all results are obtained using parameters averaged over the last 5 checkpoints.
As defined by Karakanta et al. (2021), for structural consistency, both the captions (EN) and the subtitles (FR) in this example have the same number of blocks (3). For lexical consistency, there are 6 tokens of the subtitle which are not aligned to the captions in the same block: "le capitalisme ," and "au même titre". Lex_C→S is the percentage of aligned words, normalized by the number of words in the caption: Lex_C→S = 20/22 = 90.9%. The computation is identical in the other direction, yielding Lex_S→C = 17/23 = 73.9%; the average lexical consistency of this segment is thus Lex_pair = (Lex_C→S + Lex_S→C)/2 = 82.4%. When computing the lexical consistency between captions and subtitles, we use the WMT14 EN-FR data to train an alignment model with fast_align (Dyer et al., 2013) in both directions, and use it to predict word alignments for model outputs. Note that for BERTScores (Zhang et al., 2020b), we remove segmentation tokens ([eob] and [eol]) from hypotheses and references, as these special tokens are out-of-vocabulary for pre-trained BERT models.
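The directional and averaged lexical consistency scores from the worked example above can be computed as:

```python
def lex_score(n_aligned, n_words):
    """Lexical consistency in one direction: the percentage of words
    aligned within the same block, normalized by the word count of
    that side."""
    return 100.0 * n_aligned / n_words

def lex_pair(aligned_c, n_c, aligned_s, n_s):
    """Average of the two directional lexical consistency scores."""
    return (lex_score(aligned_c, n_c) + lex_score(aligned_s, n_s)) / 2
```

With the counts from the example (20 of 22 caption words and 17 of 23 subtitle words aligned), this reproduces Lex_C→S = 90.9%, Lex_S→C = 73.9%, and Lex_pair = 82.4%.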

E Examples
Some examples of dual decoding improving the quality of both captioning and subtitling compared to the pipeline system are in Table 6.

Figure 1: A graphical view of various captioning and subtitling strategies. T refers to transcripts. C and S respectively denote captions and subtitles.

Table 1:
Transcript: i 'm combining specific types of signals the mimic how our body response to in an injury to help us regenerate
Caption: I'm combining specific types of signals [eob] that mimic how our body responds to injury [eol] to help us regenerate. [eob]
Subtitle: Je combine différents types de signaux [eob] qui imitent la réponse du corps [eol] aux blessures pour nous aider à guérir. [eob]
Example of a triplet (transcript, caption, subtitle) from our tri-parallel data. Differences between transcript and caption are in bold.

Table 2:
BLEU scores for captions (EN) and subtitles (FR), with measures of structural and lexical consistency between the two hypotheses. All scores are percentages (higher is better). The base and pipeline settings are trained from scratch with original data. share refers to tying all decoder parameters.

Table 3:
BLEU scores for multiple initializations.

Table 4:
Performance of various decoding methods. All BLEU scores are averaged over the two outputs. 2-round (resp. Ref) refers to decoding with model predictions (resp. references) as the forced prefix in one direction.

Table 5 reports TER and BERTScores.

Table 5:
TER, BERTScore, and BLEU scores for captions (EN) and subtitles (FR), with measures of structural and lexical consistency between the two hypotheses. The base and pipeline settings are trained from scratch with original data. share refers to tying all decoder parameters. Signature of BERTScore (EN): microsoft/deberta-xlarge-mnli_L40_no-idf_version=0.3.11(hug_trans=4.10.3)-rescaled_fast-tokenizer. Signature of BERTScore (FR): bert-base-multilingual-cased_L9_no-idf_version=0.3.11(hug_trans=4.10.3)-rescaled_fast-tokenizer.

Source: take time to write down your values your objectives and your key results do it today
EN pipeline+FT: Take time to write down [eol] your values, your objectives, [eob] and your key results do it today. [eob]
EN share+FT: Take time to write down your values, [eol] your objectives, [eob] and your key results do it today. [eob]
EN ref: Take time to write down your values, [eob] your objectives and your key results. [eob] Do it today. [eob]
FR pipeline+FT: Prenez le temps d'écrire vos valeurs, [eol] vos objectifs, [eob] et vos principaux résultats [eol] le font aujourd'hui. [eob]
FR share+FT: Prenez le temps d'écrire vos valeurs, [eob] vos objectifs et vos résultats clés. [eob] Faites-le aujourd'hui. [eob]
FR ref: Prenez le temps d'écrire vos valeurs, [eob] vos objectifs et vos résultats clés. [eob] Faites-le aujourd'hui. [eob]

Source: and as it turns out what are you willing to give up is exactly the right question to ask
EN pipeline+FT: And as it turns out, what are you willing [eol] to give up is exactly [eob] the right question to ask? [eob]
EN share+FT: And as it turns out, what are you willing [eol] to give up [eob] is exactly the right question to ask? [eob]
EN ref: And as it turns out, [eob] "What are you willing to give up?" [eob] is exactly the right question to ask. [eob]
FR pipeline+FT: Et il s'avère que ce que vous voulez abandonner [eol] est exactement [eob] la bonne question à poser ? [eob]
FR share+FT: Et il s'avère que ce que vous voulez abandonner [eob] est exactement la bonne question à poser. [eob]
FR ref: Et il s'avère que [eob] « Qu'êtes-vous prêts à abandonner ? » [eob] est exactement la question à poser. [eob]

Table 6:
Examples of dual decoding improving both captioning and subtitling. Major improvements are marked in bold.