Towards Dynamic Attention Masking for Simultaneous Speech Translation

Benjamin Pong


Abstract
We present a proof-of-concept system for simultaneous speech translation based on dynamic attention masking. Our approach builds on SeamlessM4T by injecting lightweight per-layer schedulers into the Conformer encoder and training each scheduler to predict the number of future frames needed for translation. The schedulers are trained jointly with LoRA adapters across three language directions: English to German, Italian, and Chinese. At inference time, we evaluate our system using a sliding-window retranslation inference regime (Sen et al., 2022) and an adapted version of StreamAtt (Papi et al., 2024) that replaces the fixed cutoff with a content-aware threshold derived from the learnt representations of the scheduler outputs.
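As a rough illustration of dynamic attention masking, the sketch below turns a per-frame lookahead prediction into a boolean attention mask. It is not the paper's implementation: the function name `dynamic_lookahead_mask`, the per-frame integer lookahead values, and the rule "frame i may attend to every frame up to i + lookahead[i]" are all assumptions made for this example, since the abstract does not specify how scheduler outputs are converted into a mask.

```python
from typing import List


def dynamic_lookahead_mask(lookahead: List[int]) -> List[List[bool]]:
    """Build a frame-by-frame attention mask from predicted lookaheads.

    Assumption (not from the paper): lookahead[i] is the number of future
    frames that frame i is allowed to see. mask[i][j] is True when frame i
    may attend to frame j.
    """
    n = len(lookahead)
    mask = []
    for i, k in enumerate(lookahead):
        if k < 0:
            raise ValueError("lookahead values must be non-negative")
        # All past frames are visible; future frames are visible only
        # up to the predicted lookahead, clipped to the sequence length.
        last_visible = min(n - 1, i + k)
        mask.append([j <= last_visible for j in range(n)])
    return mask


# Hypothetical scheduler output for a 5-frame input.
example = dynamic_lookahead_mask([0, 2, 1, 0, 3])
for row in example:
    print("".join("1" if allowed else "0" for allowed in row))
```

A fixed-lookahead (chunked or wait-k style) mask is the special case where every entry of `lookahead` is the same constant; letting the values vary per frame is what would make the mask content-dependent under these assumptions.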
Anthology ID:
2026.iwslt-1.20
Volume:
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
Month:
July
Year:
2026
Address:
San Diego, USA (in-person and online)
Editors:
Elizabeth Salesky, Antonios Anastasopoulos, Matteo Negri, Marcello Federico
Venues:
IWSLT | WS
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
183–188
URL:
https://aclanthology.org/2026.iwslt-1.20/
Cite (ACL):
Benjamin Pong. 2026. Towards Dynamic Attention Masking for Simultaneous Speech Translation. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), pages 183–188, San Diego, USA (in-person and online). Association for Computational Linguistics.
Cite (Informal):
Towards Dynamic Attention Masking for Simultaneous Speech Translation (Pong, IWSLT 2026)
PDF:
https://aclanthology.org/2026.iwslt-1.20.pdf