DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality

Yukun Huang; Leonardo F. R. Ribeiro; Momchil Hardalov; Bhuwan Dhingra; Markus Dreyer; Venkatesh Saligrama

Abstract

Search-augmented LLM agents can produce deep research reports (DRRs), but verifying claim-level factuality remains challenging. Existing fact-checkers usually target general-domain atomic claims, and there is no benchmark to test whether such verifiers transfer to DRRs. Yet building such a benchmark for DRR fact-checkers is itself difficult because it requires expert judgments over cognitively demanding, domain-specific claims. In a controlled study with PhD-level specialists, unassisted experts achieve only 60.8% accuracy on hidden known-answer claims. We therefore propose evolving benchmarking via **Audit-then-Score** (**AtS**), in which labels and rationales remain revisable: when a verifier disagrees with the current benchmark, it submits evidence; an auditor adjudicates the dispute; and accepted revisions update the benchmark before scoring. After three additional **AtS** rounds, expert accuracy rises to 90.9%, showing that experts are better auditors than one-shot labelers. We instantiate **AtS** as **DeepFactBench**, a versioned DRR factuality benchmark with auditable rationales, and introduce **DeepFactEval**, a claim-level verifier. On the frozen **DeepFactBench** release, **DeepFactEval** achieves 83.4% accuracy, outperforming the best prior deep-research and traditional fact-checkers by 14.3 and 24.9 points, respectively, and transferring well to external factuality datasets.
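The Audit-then-Score loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Claim`, `Verdict`, `audit_then_score`, `toy_verifier`, and `toy_auditor`, the boolean label scheme, and the toy data are all assumptions made for illustration. It only shows the ordering the abstract states: disputes are raised with evidence, an auditor adjudicates, accepted revisions update the benchmark, and scoring happens afterwards.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Claim:
    text: str
    label: bool          # current benchmark label (True = supported); hypothetical scheme
    rationale: str
    history: list = field(default_factory=list)  # earlier (label, rationale) versions


@dataclass
class Verdict:
    label: bool
    evidence: str


Verifier = Callable[[str], Verdict]
Auditor = Callable[[Claim, Verdict], bool]  # returns True to accept a proposed revision


def audit_then_score(benchmark: list, verifier: Verifier, auditor: Auditor) -> float:
    """Audit disputed labels first, then score the verifier on the revised benchmark."""
    if not benchmark:
        raise ValueError("benchmark must contain at least one claim")
    verdicts = [verifier(claim.text) for claim in benchmark]

    # Audit step: each disagreement is a dispute backed by the verifier's evidence.
    for claim, verdict in zip(benchmark, verdicts):
        if verdict.label != claim.label and auditor(claim, verdict):
            claim.history.append((claim.label, claim.rationale))  # keep the old version
            claim.label = verdict.label
            claim.rationale = verdict.evidence

    # Score step: accuracy against the (possibly revised) labels.
    correct = sum(v.label == c.label for c, v in zip(benchmark, verdicts))
    return correct / len(benchmark)


# Toy data and toy agents, purely illustrative.
bench = [
    Claim("A", True, "initial rationale A"),
    Claim("B", False, "initial rationale B"),
    Claim("C", True, "initial rationale C"),
]


def toy_verifier(text: str) -> Verdict:
    return Verdict(label=(text != "C"), evidence=f"evidence for {text}")


def toy_auditor(claim: Claim, verdict: Verdict) -> bool:
    return claim.text == "B"  # accepts only the dispute about claim B


accuracy = audit_then_score(bench, toy_verifier, toy_auditor)
```

In this toy run the verifier disputes claims B and C; the auditor accepts only the revision to B, so B's label is updated before scoring while C's original label stands and counts against the verifier. Keeping the superseded label and rationale in `history` is one simple way to mirror the versioned, auditable benchmark the abstract describes.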

Anthology ID: 2026.acl-long.1586
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 34356–34386
URL: https://aclanthology.org/2026.acl-long.1586/
Cite (ACL): Yukun Huang, Leonardo F. R. Ribeiro, Momchil Hardalov, Bhuwan Dhingra, Markus Dreyer, and Venkatesh Saligrama. 2026. DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34356–34386, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DeepFact: Co-Evolving Benchmarks and Agents for Deep Research Factuality (Huang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1586.pdf
Checklist: 2026.acl-long.1586.checklist.pdf
