Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

JiaXin Dai; Zehang Wei; Jiamin Yan; Xiang Xiang

Abstract

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the Full RAG track, our resource-aware approach demonstrates exceptional precision in both information retrieval and persona-conditioned generation.
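The coarse-to-fine flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the chunk fields (`chunk_id`, `clean_vec`, `ocr`, `asr`), the toy two-dimensional vectors, and the `keyword_judge` function are all hypothetical, and the keyword check is only a stand-in for the LLM-based A.I.R. filtering agent. The sketch shows the division of labor only: stage one ranks on embeddings built from clean modalities, stage two re-reads the full multimodal context, and the output is JSON with chunk-level citations.

```python
import json
import math
from typing import Callable, Dict, List, Sequence

Chunk = Dict[str, object]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def coarse_retrieve(query_vec: Sequence[float], chunks: List[Chunk], top_k: int) -> List[Chunk]:
    """Stage 1: rank chunks by similarity over the clean-modality embedding only."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["clean_vec"]), reverse=True)
    return ranked[:top_k]


def fine_filter(candidates: List[Chunk], persona: str,
                judge: Callable[[str, Chunk], bool]) -> List[Chunk]:
    """Stage 2: keep only candidates the judge accepts after seeing the full context."""
    return [c for c in candidates if judge(persona, c)]


def keyword_judge(persona: str, chunk: Chunk) -> bool:
    """Hypothetical stand-in for an LLM judge: checks OCR and ASR text for a keyword."""
    full_context = f"{chunk['ocr']} {chunk['asr']}".lower()
    return "flood" in full_context


def build_response(answer: str, kept: List[Chunk]) -> str:
    """Emit a JSON response whose citations are exact chunk identifiers."""
    return json.dumps({"answer": answer, "citations": [c["chunk_id"] for c in kept]})


# Toy corpus (all values are illustrative assumptions).
chunks = [
    {"chunk_id": "v1_c0", "clean_vec": [1.0, 0.0], "ocr": "Flood warning", "asr": "water is rising"},
    {"chunk_id": "v2_c3", "clean_vec": [0.9, 0.1], "ocr": "Concert tickets", "asr": "doors open at eight"},
    {"chunk_id": "v3_c1", "clean_vec": [0.0, 1.0], "ocr": "Flood map", "asr": "evacuation routes"},
]

candidates = coarse_retrieve([1.0, 0.0], chunks, top_k=2)
kept = fine_filter(candidates, persona="emergency planner", judge=keyword_judge)
response = build_response("Placeholder answer text.", kept)
print(response)
```

In this toy run the second chunk is close to the query in embedding space but is dropped in stage two because its OCR and ASR text do not support the query, which mirrors the abstract's point about pruning semantically similar but logically irrelevant candidates.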

Anthology ID: 2026.magmar-main.12
Volume: Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026)
Month: July
Year: 2026
Address: San Diego, USA
Editors: Kenton Murray, Reno Kriz
Venues: MAGMaR | WS
Publisher: Association for Computational Linguistics
Pages: 81–91
URL: https://aclanthology.org/2026.magmar-main.12/
Cite (ACL): JiaXin Dai, Zehang Wei, Jiamin Yan, and Xiang Xiang. 2026. Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation. In Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026), pages 81–91, San Diego, USA. Association for Computational Linguistics.
Cite (Informal): Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation (Dai et al., MAGMaR 2026)
PDF: https://aclanthology.org/2026.magmar-main.12.pdf