Sami Haq

2025

Audio-Based Crowd-Sourced Evaluation of Machine Translation Quality
Sami Haq | Sheila Castilho | Yvette Graham
Proceedings of the Tenth Conference on Machine Translation

Machine Translation (MT) has achieved remarkable performance, with growing interest in speech translation and multimodal approaches. However, despite these advancements, MT quality assessment remains largely text-centric, typically relying on human experts who read and compare texts. Since many real-world MT applications (e.g., Google Translate Voice Mode, iFLYTEK Translator) involve translation being spoken rather than printed or read, a more natural way to assess translation quality would be through speech as opposed to text-only evaluations. This study compares text-only and audio-based evaluations of 10 MT systems from the WMT General MT Shared Task, using crowd-sourced judgments collected via Amazon Mechanical Turk. We additionally performed statistical significance testing and self-replication experiments to test the reliability and consistency of the audio-based approach. Crowd-sourced assessments based on audio yield rankings largely consistent with text-only evaluations but, in some cases, identify significant differences between translation systems. We attribute this to speech’s richer, more natural modality and propose incorporating speech-based assessments into future MT evaluation frameworks.

Long-context Reference-based MT Quality Estimation
Sami Haq | Chinonso Osuji | Sheila Castilho | Brian Davis | Thiago Castro Ferreira
Proceedings of the Tenth Conference on Machine Translation

In this paper, we present our submission to the Tenth Conference on Machine Translation (WMT25) Shared Task on Automated Translation Quality Evaluation. Our systems are built upon the COMET framework and trained to predict segment-level ESA scores using augmented long-context data. To construct long-context training examples, we concatenate multiple in-domain sentences and compute a weighted average of their scores. We further integrate human judgment datasets (MQM, SQM, and DA) through score normalisation and train multilingual models on the source, hypothesis, and reference translations. Experimental results demonstrate that incorporating long-context information yields higher correlations with human judgments compared to models trained exclusively on short segments.
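
The long-context construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the weights are chosen, so weighting each score by the hypothesis token count is an assumption, and the `Segment` class and `build_long_context` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One scored translation segment (hypothetical container)."""
    source: str
    hypothesis: str
    reference: str
    score: float  # segment-level human score, e.g. ESA


def build_long_context(segments: list[Segment], sep: str = " ") -> Segment:
    """Concatenate segments and combine their scores with a weighted average.

    Assumption: each weight is the number of whitespace-separated tokens in
    the hypothesis, so longer segments contribute more to the joint score.
    """
    if not segments:
        raise ValueError("need at least one segment")

    # A weight of at least 1 keeps empty hypotheses from zeroing the total.
    weights = [max(len(s.hypothesis.split()), 1) for s in segments]
    total = sum(weights)
    score = sum(w * s.score for w, s in zip(weights, segments)) / total

    return Segment(
        source=sep.join(s.source for s in segments),
        hypothesis=sep.join(s.hypothesis for s in segments),
        reference=sep.join(s.reference for s in segments),
        score=score,
    )


if __name__ == "__main__":
    parts = [
        Segment("ein zwei drei", "one two three", "one two three", 90.0),
        Segment("vier", "four", "four", 50.0),
    ]
    merged = build_long_context(parts)
    print(merged.hypothesis, merged.score)
```

Other weighting schemes (uniform weights, source length, character counts) would fit the same description; only the `weights` line would change.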

2024

DCU ADAPT at WMT24: English to Low-resource Multi-Modal Translation Task
Sami Haq | Rudali Huidrom | Sheila Castilho
Proceedings of the Ninth Conference on Machine Translation

This paper presents the system description of “DCU_NMT’s” submission to the WMT-WAT24 English-to-Low-Resource Multimodal Translation Task. We participated in the English-to-Hindi track, developing both text-only and multimodal neural machine translation (NMT) systems. The text-only systems were trained from scratch on constrained data and augmented with back-translated data. For the multimodal approach, we implemented a context-aware transformer model that integrates visual features as additional contextual information. Specifically, image descriptions generated by an image captioning model were encoded using BERT and concatenated with the textual input. The results indicate that our multimodal system, trained solely on limited data, showed improvements over the text-only baseline in both the challenge and evaluation sets, suggesting the potential benefits of incorporating visual information.

Co-authors

Chinonso Osuji 1

Venues

WMT 3
WS 1
