MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering

Zhengyuan Zhu, Daniel Lee, Hong Zhang, Sai Sree Harsha, Loic Feujio, Akash Maharaj, Yunyao Li


Abstract
Recent advancements in retrieval-augmented generation have demonstrated impressive performance on the question-answering task. However, most previous work predominantly focuses on text-based answers. Although some studies have explored multimodal data, they still fall short in generating comprehensive multimodal answers, especially step-by-step tutorials for accomplishing specific goals. This capability is especially valuable in application scenarios such as enterprise chatbots, customer service systems, and educational platforms. In this paper, we propose a simple and effective framework, MuRAR (Multimodal Retrieval and Answer Refinement). MuRAR starts by generating an initial text answer based on the user’s question. It then retrieves multimodal data relevant to the snippets of the initial text answer. By leveraging the retrieved multimodal data and contextual features, MuRAR refines the initial text answer to create a more comprehensive and informative response. This highly adaptable framework can be easily integrated into an enterprise chatbot to produce multimodal answers with minimal modifications. Human evaluations demonstrate that the multimodal answers generated by MuRAR are significantly more useful and readable than plain text responses. A video demo of MuRAR is available at https://youtu.be/ykGRtyVVQpU.
Anthology ID:
2025.coling-demos.13
Volume:
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert, Brodie Mather, Mark Dras
Venue:
COLING
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
126–135
Language:
URL:
https://aclanthology.org/2025.coling-demos.13/
DOI:
Bibkey:
Cite (ACL):
Zhengyuan Zhu, Daniel Lee, Hong Zhang, Sai Sree Harsha, Loic Feujio, Akash Maharaj, and Yunyao Li. 2025. MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering. In Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations, pages 126–135, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
MuRAR: A Simple and Effective Multimodal Retrieval and Answer Refinement Framework for Multimodal Question Answering (Zhu et al., COLING 2025)
Copy Citation:
PDF:
https://aclanthology.org/2025.coling-demos.13.pdf