Hurdles of Automatic Metric for Speech Translation Evaluation

Victor Eugen Zarzu, Vilem Zouhar


Abstract
Automatic evaluation of speech translation has so far relied on text-only automated metrics that ignore speech phenomena. One would expect that incorporating the source audio modality would improve the performance of automatic metrics. We implement two standard metric paradigms: a COMET-audio regression model using audio and text encoders, and one based on prompting a speech large language model. Surprisingly, both audio-infused models fail to reliably surpass text-only baselines. We attribute this failure to the noise pollution and audio-transcript mismatches present in the audio signal, which make the modality unreliable from the metric’s perspective. Furthermore, we argue that current human-annotated evaluation datasets for automated metrics predominantly feature technical content or short texts where paralinguistic features like prosody matter little, rendering the extra audio information unhelpful for quality estimation (QE).
Anthology ID:
2026.iwslt-1.34
Volume:
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)
Month:
July
Year:
2026
Address:
San Diego, USA (in-person and online)
Editors:
Elizabeth Salesky, Antonios Anastasopoulos, Matteo Negri, Marcello Federico
Venues:
IWSLT | WS
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
Pages:
305–315
URL:
https://aclanthology.org/2026.iwslt-1.34/
Cite (ACL):
Victor Eugen Zarzu and Vilem Zouhar. 2026. Hurdles of Automatic Metric for Speech Translation Evaluation. In Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026), pages 305–315, San Diego, USA (in-person and online). Association for Computational Linguistics.
Cite (Informal):
Hurdles of Automatic Metric for Speech Translation Evaluation (Zarzu & Zouhar, IWSLT 2026)
PDF:
https://aclanthology.org/2026.iwslt-1.34.pdf