ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning

Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, Sicheng Lai


Abstract
LLM-based translation agents have achieved highly human-like translation results and are capable of handling longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long-short term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our demo is available here: https://vidove.willbe03.com/
Anthology ID:
2025.emnlp-demos.17
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Ivan Habernal, Peter Schulam, Jörg Tiedemann
Venue:
EMNLP
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
228–243
Language:
URL:
https://aclanthology.org/2025.emnlp-demos.17/
DOI:
Bibkey:
Cite (ACL):
Yichen Lu, Wei Dai, Jiaen Liu, Ching Wing Kwok, Zongheng Wu, Xudong Xiao, Ao Sun, Sheng Fu, Jianyuan Zhan, Yian Wang, Takatomo Saito, and Sicheng Lai. 2025. ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 228–243, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning (Lu et al., EMNLP 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.emnlp-demos.17.pdf