Identifying the Achilles’ Heel: An Iterative Method for Uncovering Factual Errors in Large Language Models

Wenxuan Wang; Yuk-Kit Chan; Zixuan Ling; Shi Juluan; Youliang Yuan; Jen-tse Huang; Yifei Zhang; Wenxiang Jiao; Zhaopeng Tu; Michael R. Lyu

Abstract

Large Language Models (LLMs) like ChatGPT are foundational in various applications due to the extensive knowledge they acquire from pre-training and fine-tuning. Despite this, they are prone to generating factual and commonsense errors, which raises concerns that they may mislead users in critical areas such as healthcare, journalism, and education. Current methods for evaluating LLMs’ veracity are limited by the need for extensive human labor, test data contamination, or limited scope, which hinders efficient and effective exposure of errors. To address these challenges, we propose HalluHunter, a novel, fully automated framework for systematically uncovering factual inaccuracies in LLMs. HalluHunter employs a knowledge-graph-based approach: it extracts fact triplets and uses rule-based Natural Language Processing (NLP) techniques to generate diverse question types for single- and multi-hop reasoning. Its iterative process starts with random triplet selection for question generation; subsequent iterations use adaptive selection, which analyzes the LLM's performance and targets the triplets on which it frequently errs. Our extensive tests on nine prominent LLMs reveal that HalluHunter can trigger factual errors in up to 55% of questions in these models. Moreover, we demonstrate that HalluHunter’s test cases, particularly those from adaptive selection, can further expose weaknesses when benchmarking the factuality of LLMs while maintaining question coverage. All code, data, and results will be released for future research.
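
The iterative loop described in the abstract (random selection first, then adaptive selection toward triplets where the model errs) might be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the question template, the exact-match check, the per-relation error weighting with add-one smoothing, and all names (`make_question`, `iterative_probe`, `toy_model`, `KG`) are hypothetical, since the abstract does not specify these details.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)


def make_question(t: Triplet) -> str:
    # Hypothetical rule-based template for a single-hop question.
    subject, relation, _ = t
    return f"What is the {relation} of {subject}?"


def iterative_probe(
    triplets: List[Triplet],
    ask: Callable[[str], str],
    rounds: int = 3,
    batch: int = 4,
    seed: int = 0,
) -> Tuple[List[Triplet], Dict[str, float]]:
    rng = random.Random(seed)
    asked: Dict[str, int] = defaultdict(int)  # questions asked per relation
    wrong: Dict[str, int] = defaultdict(int)  # wrong answers per relation
    failures: List[Triplet] = []
    for i in range(rounds):
        if i == 0:
            # First iteration: random triplet selection.
            picked = rng.sample(triplets, k=min(batch, len(triplets)))
        else:
            # Later iterations: favor relations with a high observed error
            # rate (assumed heuristic; smoothing keeps every weight positive).
            weights = [(wrong[t[1]] + 1) / (asked[t[1]] + 2) for t in triplets]
            picked = rng.choices(triplets, weights=weights, k=batch)
        for t in picked:
            answer = ask(make_question(t))
            asked[t[1]] += 1
            if answer.strip().lower() != t[2].lower():
                wrong[t[1]] += 1
                failures.append(t)
    error_rates = {r: wrong[r] / asked[r] for r in asked}
    return failures, error_rates


# Toy knowledge graph and a stand-in "model" that only knows capitals.
KG: List[Triplet] = [
    ("France", "capital", "Paris"),
    ("Japan", "capital", "Tokyo"),
    ("Hamlet", "author", "William Shakespeare"),
    ("Dracula", "author", "Bram Stoker"),
]
KNOWN = {make_question(t): t[2] for t in KG if t[1] == "capital"}


def toy_model(question: str) -> str:
    return KNOWN.get(question, "unknown")


failures, error_rates = iterative_probe(KG, toy_model)
```

In this toy setting the stand-in model fails every `author` question, so later rounds sample those triplets more often than the `capital` ones; a real system would replace `toy_model` with a call to the LLM under test and exact match with a more tolerant answer check.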

Anthology ID: 2026.findings-acl.1714
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 34288–34309
URL: https://aclanthology.org/2026.findings-acl.1714/
Cite (ACL): Wenxuan Wang, Yuk-Kit Chan, Zixuan Ling, Shi Juluan, Youliang Yuan, Jen-tse Huang, Yifei Zhang, Wenxiang Jiao, Zhaopeng Tu, and Michael R. Lyu. 2026. Identifying the Achilles’ Heel: An Iterative Method for Uncovering Factual Errors in Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2026, pages 34288–34309, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Identifying the Achilles’ Heel: An Iterative Method for Uncovering Factual Errors in Large Language Models (Wang et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1714.pdf
Checklist: 2026.findings-acl.1714.checklist.pdf
