Unsupervised Subtitle Segmentation with Masked Language Models

David Ponce, Thierry Etchegoyhen, Victor Ruiz


Abstract
We describe a novel unsupervised approach to subtitle segmentation, based on pretrained masked language models, where line endings and subtitle breaks are predicted according to the likelihood of punctuation to occur at candidate segmentation points. Our approach obtained competitive results in terms of segmentation accuracy across metrics, while also fully preserving the original text and complying with length constraints. Although supervised models trained on in-domain data and with access to source audio information can provide better segmentation accuracy, our approach is highly portable across languages and domains and may constitute a robust off-the-shelf solution for subtitle segmentation.
Anthology ID:
2023.acl-short.67
Volume:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
771–781
URL:
https://aclanthology.org/2023.acl-short.67
DOI:
10.18653/v1/2023.acl-short.67
Cite (ACL):
David Ponce, Thierry Etchegoyhen, and Victor Ruiz. 2023. Unsupervised Subtitle Segmentation with Masked Language Models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 771–781, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
Unsupervised Subtitle Segmentation with Masked Language Models (Ponce et al., ACL 2023)
PDF:
https://aclanthology.org/2023.acl-short.67.pdf
Video:
https://aclanthology.org/2023.acl-short.67.mp4