End-to-End Optimization for Multimodal Retrieval-Augmented Generation via Reward Backpropagation

Zhiyuan Fan; Longfei Yun; Ming Yan; Yumeng Wang; Dadi Guo; Brian Mak; James Kwok; Yi R. Fung

doi:10.18653/v1/2025.findings-emnlp.24

Abstract

Multimodal Retrieval-Augmented Generation (MM-RAG) has emerged as a promising approach for enhancing the reliability and factuality of large vision-language models (LVLMs). Because non-differentiable operations in the forward process make end-to-end loss backpropagation infeasible, current methods focus primarily on component-level optimization; they require extensive component-specific training datasets and suffer from a gap between local and global optimization objectives. In this paper, we propose a new paradigm that backpropagates global rewards from the system output to each component and then transforms these rewards into specific local losses, enabling each component to perform gradient descent and thus achieving end-to-end optimization. Specifically, we first insert two lightweight multimodal components, a query translator and an adaptive reranker, to address the heterogeneity of multimodal knowledge and the varying knowledge demands of different questions. We then tune only these inserted components using our proposed paradigm to integrate the entire system. Our method achieves state-of-the-art (SOTA) performance on multiple knowledge-intensive multimodal benchmarks with high training efficiency, relying exclusively on supervised signals from an external reward model. Experimental results and our detailed analysis of how the components evolve during training together reveal the advantages and considerable potential of this paradigm as a promising direction for MM-RAG research.

Anthology ID: 2025.findings-emnlp.24
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 443–466
URL: https://aclanthology.org/2025.findings-emnlp.24/
DOI: 10.18653/v1/2025.findings-emnlp.24
Cite (ACL): Zhiyuan Fan, Longfei Yun, Ming Yan, Yumeng Wang, Dadi Guo, Brian Mak, James Kwok, and Yi R. Fung. 2025. End-to-End Optimization for Multimodal Retrieval-Augmented Generation via Reward Backpropagation. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 443–466, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): End-to-End Optimization for Multimodal Retrieval-Augmented Generation via Reward Backpropagation (Fan et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.24.pdf
Checklist: 2025.findings-emnlp.24.checklist.pdf
