Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality

Sami Haq; Sheila Castilho; Yvette Graham

Abstract

Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. Despite these advances, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translations being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech’s richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.

Anthology ID: 2025.wmt-1.3
Volume: Proceedings of the Tenth Conference on Machine Translation
Month: November
Year: 2025
Address: Suzhou, China
Editors: Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue: WMT
Publisher: Association for Computational Linguistics
Pages: 52–63
URL: https://aclanthology.org/2025.wmt-1.3/
Cite (ACL): Sami Haq, Sheila Castilho, and Yvette Graham. 2025. Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality. In Proceedings of the Tenth Conference on Machine Translation, pages 52–63, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality (Haq et al., WMT 2025)
PDF: https://aclanthology.org/2025.wmt-1.3.pdf
