Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs

Nan Hu; Jiaoyan Chen; Yike Wu; Guilin Qi; Hongru Wang; Sheng Bi; Yongrui Chen; Tongtong Wu; Jeff Z. Pan

doi:10.18653/v1/2025.acl-long.837

Abstract

Attributed Question Answering (AQA) has attracted wide attention, but evaluating attributions still faces several limitations: existing approaches lack fine-grained attribution categories, rely on manual annotation, and fail to compare attributions that differ only subtly. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark that is automatically generated using Knowledge Graphs (KGs) and that contains comprehensive attribution categories and complex attribution scenarios. We conduct extensive experiments to verify the effectiveness of CAQA, including benchmarking 25 automatic evaluators, comparing them with human evaluators, and testing LLM evaluators fine-tuned on CAQA. These experiments also yield a series of findings that can benefit future research on AQA.

Anthology ID: 2025.acl-long.837
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 17096–17118
URL: https://aclanthology.org/2025.acl-long.837/
DOI: 10.18653/v1/2025.acl-long.837
Cite (ACL): Nan Hu, Jiaoyan Chen, Yike Wu, Guilin Qi, Hongru Wang, Sheng Bi, Yongrui Chen, Tongtong Wu, and Jeff Z. Pan. 2025. Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 17096–17118, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs (Hu et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.837.pdf
