Benchmarking Visually-Situated Translation of Text in Natural Images

Elizabeth Salesky, Philipp Koehn, Matt Post


Abstract
We introduce a benchmark, Vistra, for visually-situated translation of English text in natural images to four target languages. We describe the dataset construction and composition. We benchmark open-source and commercial OCR and MT models on Vistra, and present both quantitative results and a taxonomy of common OCR error classes with their effect on downstream MT. Finally, we assess direct image-to-text translation with a multimodal LLM, and show that it can in some cases, though not yet consistently, use visual context to disambiguate possible translations. We show that this task remains unsolved and challenging even for strong commercial models. We hope that the creation and release of this benchmark, the first of its kind for these language pairs, will encourage further research in this direction.
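To make the two evaluated setups concrete, below is a minimal sketch of the cascaded OCR-then-MT pipeline the abstract contrasts with direct image-to-text translation. The specific tools here (Tesseract via pytesseract, an OPUS-MT model via Hugging Face transformers) are illustrative assumptions, not the systems benchmarked in the paper.

```python
# A minimal sketch of a cascaded OCR -> MT pipeline of the kind the
# paper benchmarks. Tool and model choices are assumptions for
# illustration only.
from PIL import Image
import pytesseract
from transformers import pipeline

def cascade_translate(image_path: str, translator) -> str:
    """Recognize English text in a natural image, then translate it."""
    image = Image.open(image_path)
    # Step 1: OCR. Errors introduced here propagate to the MT step,
    # which is the cascading effect the paper's error taxonomy analyzes.
    recognized = pytesseract.image_to_string(image).strip()
    # Step 2: MT on the recognized text alone, without the visual
    # context a direct image-to-text model could use to disambiguate.
    return translator(recognized)[0]["translation_text"]

if __name__ == "__main__":
    # German is one plausible target; the benchmark covers four languages.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
    print(cascade_translate("sign.jpg", translator))
```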
Anthology ID:
2024.wmt-1.115
Volume:
Proceedings of the Ninth Conference on Machine Translation
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Barry Haddow, Tom Kocmi, Philipp Koehn, Christof Monz
Venue:
WMT
Publisher:
Association for Computational Linguistics
Pages:
1167–1182
URL:
https://aclanthology.org/2024.wmt-1.115
Cite (ACL):
Elizabeth Salesky, Philipp Koehn, and Matt Post. 2024. Benchmarking Visually-Situated Translation of Text in Natural Images. In Proceedings of the Ninth Conference on Machine Translation, pages 1167–1182, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Benchmarking Visually-Situated Translation of Text in Natural Images (Salesky et al., WMT 2024)
PDF:
https://aclanthology.org/2024.wmt-1.115.pdf