How Credible Is an Answer From Retrieval-Augmented LLMs? Investigation and Evaluation With Multi-Hop QA

Yujia Zhou; Zheng Liu; Zhicheng Dou (窦志成)

Abstract

Retrieval-augmented Large Language Models (RaLLMs) are reshaping knowledge acquisition, offering long-form, knowledge-grounded answers through advanced reasoning and generation capabilities. Despite the emergence of impactful systems like WebGPT and New Bing, the reliability of RaLLMs, especially in complex situations, is under scrutiny. Our study tackles this concern by evaluating RaLLMs’ question-answering performance using a novel benchmark focusing on Correctness and Groundedness. Correctness measures the logical soundness of the responses, and Groundedness checks whether they are supported by relevant references. We introduce an automated model-based evaluation pipeline for multi-hop question-answering tasks, revealing that RaLLMs are prone to generating inaccuracies when dealing with flawed or partial knowledge. To improve accuracy, we introduce two reasoning strategies, ‘Self-Reflection’ and ‘Self-Completion,’ enabling RaLLMs to identify and fill knowledge gaps, significantly improving answer quality without extensive model retraining.

Anthology ID: 2025.coling-main.285
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 4232–4242
URL: https://aclanthology.org/2025.coling-main.285/
Cite (ACL): Yujia Zhou, Zheng Liu, and Zhicheng Dou. 2025. How Credible Is an Answer From Retrieval-Augmented LLMs? Investigation and Evaluation With Multi-Hop QA. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4232–4242, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): How Credible Is an Answer From Retrieval-Augmented LLMs? Investigation and Evaluation With Multi-Hop QA (Zhou et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.285.pdf
