A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation

Yulin Xue, Siqi Ouyang, Lei Li


Abstract
Simultaneous speech-to-speech translation (SimulS2ST) enables real-time cross-lingual communication, but existing evaluation has focused largely on short or pre-segmented speech rather than long-form, continuous input. Prior approaches are difficult to reproduce and make assumptions that do not hold for end-to-end systems. We present a practical evaluation method for long-form SimulS2ST. Given source speech, pre-segmented source transcripts, and reference translations, we run automatic speech recognition (ASR) and forced alignment on the generated target speech to recover token-level timestamps, then apply a sentence-embedding-based aligner to match the target text to its corresponding source sentences. This enables sentence-level computation of latency and quality metrics, including YAAL and xCOMET, which are then aggregated into final system-level scores. Experiments on representative SimulS2ST systems show that the method is effective in practice and reveal that current systems suffer from substantial latency accumulation on long speech.
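The final step described in the abstract, computing sentence-level scores and aggregating them into a system-level score, can be illustrated with a minimal sketch. It assumes the ASR, forced-alignment, and sentence-alignment stages have already produced a timestamp and a source-sentence index for every target token. The delay used here (token emission time minus the end time of the aligned source sentence) is a simplified stand-in chosen for illustration. It is not YAAL, and all names (`TargetToken`, `sentence_level_delay`, `system_level_delay`) are hypothetical rather than taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class TargetToken:
    text: str
    emit_time: float      # seconds; assumed to come from ASR + forced alignment
    source_sentence: int  # index of the source sentence chosen by the aligner


def sentence_level_delay(tokens, source_end_times):
    """Average, per source sentence, how long after the sentence ended
    its aligned target tokens were emitted (illustrative metric, not YAAL)."""
    delays = {}
    for tok in tokens:
        lag = tok.emit_time - source_end_times[tok.source_sentence]
        delays.setdefault(tok.source_sentence, []).append(lag)
    return {idx: mean(lags) for idx, lags in sorted(delays.items())}


def system_level_delay(per_sentence):
    """Aggregate sentence-level delays into one score by simple averaging."""
    return mean(list(per_sentence.values()))


# Toy input: two source sentences ending at 1.5 s and 4.0 s.
source_end_times = [1.5, 4.0]
tokens = [
    TargetToken("hello", 2.0, 0),
    TargetToken("world", 3.0, 0),
    TargetToken("good", 6.0, 1),
    TargetToken("bye", 8.0, 1),
]

per_sentence = sentence_level_delay(tokens, source_end_times)
print(per_sentence)                      # {0: 1.0, 1: 3.0}
print(system_level_delay(per_sentence))  # 2.0
```

In this toy input the second sentence has a larger delay than the first, which is the kind of growth over time that the abstract calls latency accumulation on long speech.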
Anthology ID:
2026.iwslt-1.3
Volume:
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
Month:
July
Year:
2026
Address:
San Diego, USA (in-person and online)
Editors:
Elizabeth Salesky, Antonios Anastasopoulos, Matteo Negri, Marcello Federico
Venues:
IWSLT | WS
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
32–39
URL:
https://aclanthology.org/2026.iwslt-1.3/
Cite (ACL):
Yulin Xue, Siqi Ouyang, and Lei Li. 2026. A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), pages 32–39, San Diego, USA (in-person and online). Association for Computational Linguistics.
Cite (Informal):
A Practical Evaluation Method for Long-Form Simultaneous Speech-to-Speech Translation (Xue et al., IWSLT 2026)
PDF:
https://aclanthology.org/2026.iwslt-1.3.pdf