Between Flexibility and Consistency: Joint Generation of Captions and Subtitles

Alina Karakanta, Marco Gaido, Matteo Negri, Marco Turchi


Abstract
Speech translation (ST) has lately received growing interest for the generation of subtitles without the need for an intermediate source language transcription and timing (i.e. captions). However, the joint generation of source captions and target subtitles not only brings potential output quality advantages when the two decoding processes inform each other, but is also often required in multilingual scenarios. In this work, we focus on ST models which generate captions and subtitles that are consistent in terms of structure and lexical content. We further introduce new metrics for evaluating subtitling consistency. Our findings show that joint decoding leads to increased performance and consistency between the generated captions and subtitles while still allowing for sufficient flexibility to produce subtitles conforming to language-specific needs and norms.
Anthology ID:
2021.iwslt-1.26
Volume:
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
Month:
August
Year:
2021
Address:
Bangkok, Thailand (online)
Venues:
ACL | IJCNLP | IWSLT
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
215–225
URL:
https://aclanthology.org/2021.iwslt-1.26
DOI:
10.18653/v1/2021.iwslt-1.26
Cite (ACL):
Alina Karakanta, Marco Gaido, Matteo Negri, and Marco Turchi. 2021. Between Flexibility and Consistency: Joint Generation of Captions and Subtitles. In Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021), pages 215–225, Bangkok, Thailand (online). Association for Computational Linguistics.
Cite (Informal):
Between Flexibility and Consistency: Joint Generation of Captions and Subtitles (Karakanta et al., IWSLT 2021)
PDF:
https://aclanthology.org/2021.iwslt-1.26.pdf
Code:
mgaido91/FBK-fairseq-ST
Data:
MuST-Cinema