UTMath: A Benchmark for Math Evaluation with Unit Test

Bo Yang; Qingping Yang; Yingwei Ma; Runtao Liu

doi:10.18653/v1/2025.findings-emnlp.315

Abstract

The evaluation of mathematical reasoning capabilities constitutes a critical pathway toward achieving Artificial General Intelligence (AGI). Prevailing benchmarks, including MATH and AIME, mainly feature single-instantiation problems with fixed numbers, which permits pattern matching instead of principled deductive reasoning and leaves generalization to isomorphic problem variants untested. To address these limitations, we propose the UTMath Benchmark, which employs a rigorous unit-testing methodology that simultaneously quantifies solution accuracy and solution-space generality. It comprises 1,053 problems spanning 9 mathematical domains, each accompanied by an average of 68 varied test cases. With many answer possibilities per problem on average, UTMath sets new standards for robust reasoning while preventing memorization. UTMath is highly challenging: the best-performing model, o1-mini, solves only 32.57% of the problems, followed by o1-preview at 27.16% and GPT-4o at 26.93%. We further propose Reasoning-to-Code Thoughts (RCoT), a prompting strategy that decouples symbolic reasoning from code synthesis. RCoT guides LLMs to first derive formal reasoning structures before generating executable code, producing generalizable solutions rather than situation-specific answers. To help the community push mathematical reasoning further, we release UTMath-Train (70k samples), a companion training set generated under the same protocol. Our benchmark can be accessed via the following link: [UTMath](https://utmathhomepage.github.io/)
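
The unit-test idea described in the abstract can be illustrated with a minimal sketch. This is not the benchmark's actual harness: the function name `passes_all_cases`, the toy problem (triangular numbers), and both candidate solutions are hypothetical, chosen only to show why a solution must generalize across many test cases rather than reproduce a few fixed answers.

```python
from typing import Callable, Sequence, Tuple


def passes_all_cases(
    solution: Callable[[int], int],
    cases: Sequence[Tuple[int, int]],
) -> bool:
    """Return True only if `solution` reproduces every (input, expected) pair.

    Hypothetical helper: a crash on any input counts as a failure.
    """
    for x, expected in cases:
        try:
            if solution(x) != expected:
                return False
        except Exception:
            return False
    return True


# Toy problem (an assumption for illustration): the n-th triangular number.
# 68 cases are used only to echo the average reported in the abstract.
cases = [(n, n * (n + 1) // 2) for n in range(1, 69)]


def general_solution(n: int) -> int:
    # Closed form that holds for every n, so it passes all cases.
    return n * (n + 1) // 2


def memorized_solution(n: int) -> int:
    # Only knows three fixed answers; raises KeyError on anything else.
    return {1: 1, 2: 3, 3: 6}[n]


print(passes_all_cases(general_solution, cases))
print(passes_all_cases(memorized_solution, cases))
```

Under this all-or-nothing rule, a solution that is correct on a handful of instances still fails the problem, which is the property that separates general reasoning from memorized answers.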

Anthology ID: 2025.findings-emnlp.315
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5899–5915
URL: https://aclanthology.org/2025.findings-emnlp.315/
DOI: 10.18653/v1/2025.findings-emnlp.315
Cite (ACL): Bo Yang, Qingping Yang, Yingwei Ma, and Runtao Liu. 2025. UTMath: A Benchmark for Math Evaluation with Unit Test. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5899–5915, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): UTMath: A Benchmark for Math Evaluation with Unit Test (Yang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.315.pdf
Checklist: 2025.findings-emnlp.315.checklist.pdf
