Nikita Mehandru


pdf bib
Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors
Nikita Mehandru | Sweta Agrawal | Yimin Xiao | Ge Gao | Elaine Khoong | Marine Carpuat | Niloufar Salehi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

A major challenge in the practical use of Machine Translation (MT) is that users lack information on translation quality to make informed decisions about how to rely on outputs. Progress in quality estimation research provides techniques to automatically assess MT quality, but these techniques have primarily been evaluated in vitro by comparison against human judgments outside of a specific context of use. This paper evaluates quality estimation feedback in vivo with a human study in realistic high-stakes medical settings. Using Emergency Department discharge instructions, we study how interventions based on quality estimation versus backtranslation assist physicians in deciding whether to show MT outputs to a patient. We find that quality estimation improves appropriate reliance on MT, but backtranslation helps physicians detect more clinically harmful errors that QE alone often misses.


pdf bib
Quality Estimation via Backtranslation at the WMT 2022 Quality Estimation Task
Sweta Agrawal | Nikita Mehandru | Niloufar Salehi | Marine Carpuat
Proceedings of the Seventh Conference on Machine Translation (WMT)

This paper describes submission to the WMT 2022 Quality Estimation shared task (Task 1: sentence-level quality prediction). We follow a simple and intuitive approach, which consists of estimating MT quality by automatically back-translating hypotheses into the source language using a multilingual MT system. We then compare the resulting backtranslation with the original source using standard MT evaluation metrics. We find that even the best-performing backtranslation-based scores perform substantially worse than supervised QE systems, including the organizers’ baseline. However, combining backtranslation-based metrics with off-the-shelf QE scorers improves correlation with human judgments, suggesting that they can indeed complement a supervised QE system.