Cross-Lingual Sentence Compression for Length-Constrained Subtitles in Low-Resource Settings

Tollef Emil Jørgensen; Ole Jakob Mengshoel

Abstract

This paper explores the joint task of machine translation and sentence compression, with an emphasis on subtitle generation for broadcast and live media in settings where both language resources and hardware are limited. We develop CLSC (Cross-Lingual Sentence Compression), a system trained on openly available parallel corpora organized by compression ratio, where the target length is constrained to a fraction of the source sentence length. We present two training methods: 1) Multiple Models (MM), where a separate model is trained for each compression ratio, and 2) a Controllable Model (CM), a single model per language that uses a compression token to encode the length constraint. We evaluate on both subtitle data and transcriptions from the EuroParl corpus. To accommodate low-resource settings, we constrain data sampling for training and report results for transcriptions in French, Hungarian, Lithuanian, and Polish and for subtitles in Albanian, Basque, Malay, and Norwegian. Our models retain much of the source meaning and score well on evaluation metrics even when the output is compressed.
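
The data organization described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bucket values in `RATIO_BUCKETS`, the character-based length measure, and the `<ratio_…>` token format are all assumptions made for illustration, since the abstract does not specify them. The sketch groups parallel pairs by compression ratio (as the MM setup would require) and prepends a control token to the source (as the CM setup might).

```python
from collections import defaultdict

# Hypothetical ratio buckets; the abstract does not list the values used.
RATIO_BUCKETS = (0.5, 0.7, 0.9)


def compression_ratio(source: str, target: str) -> float:
    """Target length as a fraction of source length.

    Lengths are counted in characters, which is an assumption.
    """
    if not source:
        raise ValueError("source sentence must be non-empty")
    return len(target) / len(source)


def assign_bucket(ratio: float, buckets=RATIO_BUCKETS):
    """Return the smallest bucket bound that the ratio does not exceed, else None."""
    for bound in sorted(buckets):
        if ratio <= bound:
            return bound
    return None


def group_by_ratio(pairs, buckets=RATIO_BUCKETS):
    """MM-style organization: one training subset per compression ratio."""
    grouped = defaultdict(list)
    for source, target in pairs:
        bucket = assign_bucket(compression_ratio(source, target), buckets)
        if bucket is not None:  # pairs longer than the largest bound are dropped
            grouped[bucket].append((source, target))
    return dict(grouped)


def tag_source(source: str, bucket: float) -> str:
    """CM-style input: prepend a control token (hypothetical format) to the source."""
    return f"<ratio_{bucket:.1f}> {source}"
```

With these assumptions, `group_by_ratio` yields one subset per bucket for training separate models, while `tag_source` lets a single model see the desired ratio as part of its input.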

Anthology ID: 2025.coling-main.429
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 6447–6458
URL: https://aclanthology.org/2025.coling-main.429/
Cite (ACL): Tollef Emil Jørgensen and Ole Jakob Mengshoel. 2025. Cross-Lingual Sentence Compression for Length-Constrained Subtitles in Low-Resource Settings. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6447–6458, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Cross-Lingual Sentence Compression for Length-Constrained Subtitles in Low-Resource Settings (Jørgensen & Mengshoel, COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.429.pdf
