Victor Eugen Zarzu

2026

Hurdles of Automatic Metric for Speech Translation Evaluation
Victor Eugen Zarzu | Vilém Zouhar
Proceedings of the 23rd International Conference on Spoken Language Translation (IWSLT 2026)

Automatic evaluation of speech translation has so far relied on text-only automated metrics that ignore speech phenomena. One would expect that incorporating the source audio modality would improve the performance of automatic metrics. We implement two standard metric paradigms: a COMET-audio regression model using audio and text encoders, and one based on prompting a speech large language model. Surprisingly, both audio-infused models fail to reliably surpass text-only baselines. We attribute this failure to the noise pollution and audio-transcript mismatches present in the audio signal, which make the modality unreliable from the metric’s perspective. Furthermore, we argue that current human-annotated evaluation datasets for automated metrics predominantly feature technical content or short texts where paralinguistic features like prosody lack importance, rendering the extra audio information unhelpful for quality estimation (QE).

Co-authors

Vilém Zouhar 1

Venues

IWSLT 1
WS 1