Sicheng Lai
2025
ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
Yichen Lu | Wei Dai | Jiaen Liu | Ching Wing Kwok | Zongheng Wu | Xudong Xiao | Ao Sun | Sheng Fu | Jianyuan Zhan | Yian Wang | Takatomo Saito | Sicheng Lai
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
LLM-based translation agents have achieved highly human-like translation results and can handle longer, more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long- and short-term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER over previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our demo is available here: https://vidove.willbe03.com/
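To make the memory-augmented design concrete, here is a minimal sketch of one translation step that injects visual context, a long-term domain glossary, and a rolling short-term window into each model call. This is a hypothetical illustration, not ViDove's actual implementation; the names `Memory`, `translate_segment`, and `llm_call`, the glossary format, and the prompt layout are all assumptions.

```python
# Hypothetical sketch of a memory-augmented translation step.
# All names and the prompt format are illustrative assumptions,
# not ViDove's real API.
from dataclasses import dataclass, field

@dataclass
class Memory:
    long_term: dict = field(default_factory=dict)   # domain glossary: term -> preferred rendering
    short_term: list = field(default_factory=list)  # recently translated source segments

    def retrieve(self, segment: str) -> str:
        # Surface glossary entries whose source term occurs in this segment.
        hits = [f"{term} -> {rendering}"
                for term, rendering in self.long_term.items()
                if term in segment]
        return "; ".join(hits) or "none"

def translate_segment(segment: str, visual_context: str,
                      memory: Memory, llm_call) -> str:
    # Mirror a human translator's workflow: combine what is on screen,
    # domain terminology, and the last few lines of dialogue.
    prompt = (
        f"Visual context: {visual_context}\n"
        f"Domain glossary: {memory.retrieve(segment)}\n"
        f"Recent segments: {' / '.join(memory.short_term[-3:]) or 'none'}\n"
        f"Translate this subtitle line:\n{segment}"
    )
    translation = llm_call(prompt)     # any LLM completion backend
    memory.short_term.append(segment)  # keep short-term context rolling
    return translation
```

Here `llm_call` stands in for any text-completion backend; the point is that long-term retrieval and the rolling short-term window are injected into every translation call rather than re-prompted from scratch.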
Both Text and Images Leaked! A Systematic Analysis of Data Contamination in Multimodal LLM
Dingjie Song | Sicheng Lai | Mingxuan Wang | Shunian Chen | Lichao Sun | Benyou Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
The rapid advancement of multimodal large language models (MLLMs) has significantly enhanced performance across benchmarks. However, data contamination, where part or all of a benchmark's data appears in a model's training set, poses critical challenges for fair evaluation. Existing detection methods for unimodal large language models (LLMs) are inadequate for MLLMs because of the complexity of multimodal data and the models' multi-phase training. We systematically analyze multimodal data contamination using our analytical framework, MM-Detect, which defines two contamination categories, unimodal and cross-modal, and effectively quantifies contamination severity across multiple-choice and caption-based Visual Question Answering tasks. Evaluations of twelve MLLMs on five benchmarks reveal significant contamination, particularly in proprietary models and older benchmarks. Crucially, contamination sometimes originates during unimodal pre-training rather than solely from multimodal fine-tuning. Our insights refine the understanding of contamination, guiding evaluation practices and improving the reliability of multimodal models.
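One way to make the contamination-detection idea concrete is an option-order sensitivity probe for multiple-choice VQA: a model that has memorized a benchmark tends to lose accuracy when only the order of the answer options changes, since nothing about the underlying question has. The sketch below is a hedged illustration in that spirit, not the paper's released code; `ask_model` and the example format are assumptions.

```python
# Hypothetical option-order sensitivity probe for multiple-choice VQA.
# `ask_model(image, question, options)` is assumed to return the text
# of the option the model picks, so accuracy is scored on content and
# is unaffected by reordering for an uncontaminated model.
import random

def _accuracy(preds, gold):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def option_order_gap(examples, ask_model, seed=0):
    """Return accuracy(original order) - accuracy(shuffled order).

    `examples` are dicts with keys "image", "question", "options",
    and "answer" (the gold option text).
    """
    rng = random.Random(seed)
    orig, shuf, gold = [], [], []
    for ex in examples:
        gold.append(ex["answer"])
        orig.append(ask_model(ex["image"], ex["question"], ex["options"]))
        reordered = list(ex["options"])
        rng.shuffle(reordered)  # perturb presentation only, not content
        shuf.append(ask_model(ex["image"], ex["question"], reordered))
    return _accuracy(orig, gold) - _accuracy(shuf, gold)
```

Scoring on the chosen option's text rather than its letter keeps the comparison fair after shuffling; an unusually large positive gap on a given benchmark is evidence consistent with that benchmark having leaked into training.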