TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation

Jialin Ouyang

doi:10.18653/v1/2025.acl-short.84


Abstract

Large language models (LLMs) now achieve near-human performance on standard math word problem benchmarks (e.g., GSM8K), yet their true reasoning ability remains disputed. A key concern is that models often produce confident yet unfounded answers to unanswerable problems. We introduce TreeCut, a synthetic dataset that can systematically generate an unlimited number of unanswerable math word problems and their answerable counterparts by representing each question as a tree and removing chosen necessary conditions. Experiments show that TreeCut effectively induces hallucinations in large language models, including GPT-4o and o3-mini, with rates of 64% and 44% in their respective worst-case scenarios under the zero-shot setting. Further analysis shows that deeper or more complex trees, composite item names, and removing a necessary condition near the middle of a path all increase the likelihood of hallucinations, underscoring the persistent challenges LLMs face in identifying unanswerable math problems. The dataset generation code and sample data are available at https://github.com/j-bagel/treecut-math.
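To make the tree-and-cut idea concrete, the following is a minimal illustrative sketch, not the authors' generator (that lives in the repository linked above). It assumes, purely for illustration, that each tree node is an unknown quantity, the root has a given value, and each edge is a linear relation from parent to child. The names `Edge` and `solve` and the item names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """Hypothetical condition: child = scale * parent + offset."""
    parent: str
    child: str
    scale: int
    offset: int


def solve(root, root_value, edges, target):
    """Propagate known values down the tree; return None if target is underivable."""
    known = {root: root_value}
    changed = True
    while changed:
        changed = False
        for e in edges:
            if e.parent in known and e.child not in known:
                known[e.child] = e.scale * known[e.parent] + e.offset
                changed = True
    return known.get(target)


# Answerable problem: the path apple -> bread -> cake is fully specified.
edges = [
    Edge("apple", "bread", 2, 1),   # bread = 2 * apple + 1
    Edge("bread", "cake", 3, 0),    # cake  = 3 * bread
    Edge("apple", "donut", 1, 4),   # donut = apple + 4 (off the queried path)
]
answerable = solve("apple", 5, edges, "cake")

# Unanswerable counterpart: cut one necessary condition on the path to "cake".
cut_edges = [e for e in edges if (e.parent, e.child) != ("bread", "cake")]
unanswerable = solve("apple", 5, cut_edges, "cake")
```

In this sketch the cut removes a condition on the path from the root to the queried variable, so the question can no longer be answered, while quantities off that path (here `donut`) remain derivable. A model that still reports a number for `cake` would be hallucinating in the sense the abstract describes.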

Anthology ID: 2025.acl-short.84
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 1073–1085
URL: https://aclanthology.org/2025.acl-short.84/
DOI: 10.18653/v1/2025.acl-short.84
Cite (ACL): Jialin Ouyang. 2025. TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1073–1085, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): TreeCut: A Synthetic Unanswerable Math Word Problem Dataset for LLM Hallucination Evaluation (Ouyang, ACL 2025)
PDF: https://aclanthology.org/2025.acl-short.84.pdf
