@inproceedings{luong-etal-2025-towards,
title = "Towards Robust Mathematical Reasoning",
author = "Luong, Thang and
Hwang, Dawsen and
Nguyen, Hoang H. and
Ghiasi, Golnaz and
Chervonyi, Yuri and
Seo, Insuk and
Kim, Junsu and
Bingham, Garrett and
Lee, Jonathan and
Mishra, Swaroop and
Zhai, Alex and
Hu, Huiyi and
Michalewski, Henryk and
Kim, Jimin and
Ahn, Jeonghyun and
Bae, Junhwi and
Song, Xingyou and
Trinh, Trieu Hoang and
Le, Quoc V. and
Jung, Junehyuk",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1794/",
pages = "35406--35430",
ISBN = "979-8-89176-332-6",
abstract = "Finding the right north-star metrics is highly critical for advancing mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or only focusing on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMOAnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation for proof-writing capabilities, which includes both basic and advanced IMO problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of the gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0{\%} on IMO-AnswerBench and 65.7{\%} on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9{\%} and 42.4{\%} respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations and construct IMO-GradingBench, with 1000 human gradings on proofs, to enable further progress in automatic evaluation of long-form answers. We hope that IMO-Bench will help the community towards advancing robust mathematical reasoning and release it at https://github.com/google-deepmind/superhuman/imobench."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="luong-etal-2025-towards">
<titleInfo>
<title>Towards Robust Mathematical Reasoning</title>
</titleInfo>
<name type="personal">
<namePart type="given">Thang</namePart>
<namePart type="family">Luong</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dawsen</namePart>
<namePart type="family">Hwang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hoang</namePart>
<namePart type="given">H</namePart>
<namePart type="family">Nguyen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Golnaz</namePart>
<namePart type="family">Ghiasi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yuri</namePart>
<namePart type="family">Chervonyi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Insuk</namePart>
<namePart type="family">Seo</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junsu</namePart>
<namePart type="family">Kim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Garrett</namePart>
<namePart type="family">Bingham</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jonathan</namePart>
<namePart type="family">Lee</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Swaroop</namePart>
<namePart type="family">Mishra</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alex</namePart>
<namePart type="family">Zhai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Huiyi</namePart>
<namePart type="family">Hu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Henryk</namePart>
<namePart type="family">Michalewski</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jimin</namePart>
<namePart type="family">Kim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jeonghyun</namePart>
<namePart type="family">Ahn</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junhwi</namePart>
<namePart type="family">Bae</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xingyou</namePart>
<namePart type="family">Song</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Trieu</namePart>
<namePart type="given">Hoang</namePart>
<namePart type="family">Trinh</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Quoc</namePart>
<namePart type="given">V</namePart>
<namePart type="family">Le</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Junehyuk</namePart>
<namePart type="family">Jung</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
</titleInfo>
<name type="personal">
<namePart type="given">Christos</namePart>
<namePart type="family">Christodoulopoulos</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tanmoy</namePart>
<namePart type="family">Chakraborty</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Carolyn</namePart>
<namePart type="family">Rose</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Violet</namePart>
<namePart type="family">Peng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-332-6</identifier>
</relatedItem>
<abstract>Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation of proof-writing capabilities, which includes both basic and advanced IMO problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations, and we constructed IMO-GradingBench, with 1000 human gradings of proofs, to enable further progress in the automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning; we release it at https://github.com/google-deepmind/superhuman/imobench.</abstract>
<identifier type="citekey">luong-etal-2025-towards</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-main.1794/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>35406</start>
<end>35430</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Towards Robust Mathematical Reasoning
%A Luong, Thang
%A Hwang, Dawsen
%A Nguyen, Hoang H.
%A Ghiasi, Golnaz
%A Chervonyi, Yuri
%A Seo, Insuk
%A Kim, Junsu
%A Bingham, Garrett
%A Lee, Jonathan
%A Mishra, Swaroop
%A Zhai, Alex
%A Hu, Huiyi
%A Michalewski, Henryk
%A Kim, Jimin
%A Ahn, Jeonghyun
%A Bae, Junhwi
%A Song, Xingyou
%A Trinh, Trieu Hoang
%A Le, Quoc V.
%A Jung, Junehyuk
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F luong-etal-2025-towards
%X Finding the right north-star metrics is critical for advancing the mathematical reasoning capabilities of foundation models, especially given that existing evaluations are either too easy or focus only on getting correct short answers. To address these issues, we present IMO-Bench, a suite of advanced reasoning benchmarks that specifically targets the level of the International Mathematical Olympiad (IMO), the most prestigious venue for young mathematicians. IMO-AnswerBench first tests models on 400 diverse Olympiad problems with verifiable short answers. IMO-ProofBench is the next-level evaluation of proof-writing capabilities, which includes both basic and advanced IMO problems as well as detailed grading guidelines to facilitate automatic grading. These benchmarks played a crucial role in our historic achievement of gold-level performance at IMO 2025 with Gemini Deep Think (Luong and Lockhart, 2025). Our model achieved 80.0% on IMO-AnswerBench and 65.7% on the advanced IMO-ProofBench, surpassing the best non-Gemini models by large margins of 6.9% and 42.4%, respectively. We also showed that autograders built with Gemini reasoning correlate well with human evaluations, and we constructed IMO-GradingBench, with 1000 human gradings of proofs, to enable further progress in the automatic evaluation of long-form answers. We hope that IMO-Bench will help the community advance robust mathematical reasoning; we release it at https://github.com/google-deepmind/superhuman/imobench.
%U https://aclanthology.org/2025.emnlp-main.1794/
%P 35406-35430
Markdown (Informal)
[Towards Robust Mathematical Reasoning](https://aclanthology.org/2025.emnlp-main.1794/) (Luong et al., EMNLP 2025)

ACL
Thang Luong, Dawsen Hwang, Hoang H. Nguyen, Golnaz Ghiasi, Yuri Chervonyi, Insuk Seo, Junsu Kim, Garrett Bingham, Jonathan Lee, Swaroop Mishra, Alex Zhai, Huiyi Hu, Henryk Michalewski, Jimin Kim, Jeonghyun Ahn, Junhwi Bae, Xingyou Song, Trieu Hoang Trinh, Quoc V. Le, and Junehyuk Jung. 2025. Towards Robust Mathematical Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 35406–35430, Suzhou, China. Association for Computational Linguistics.
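For anyone consuming these records programmatically rather than copy-pasting them, here is a minimal sketch of parsing the BibTeX entry above into a Python dict. It assumes the third-party `bibtexparser` package (v1 API; v2 renamed the entry points), and the `BIBTEX` string below is an abridged copy of the record, shortened purely to keep the example compact.

```python
# Minimal sketch: load the BibTeX record above programmatically.
# Assumes the bibtexparser v1 API (pip install "bibtexparser<2").
import bibtexparser

# Abridged copy of the record above; most author fields are omitted
# here only for brevity.
BIBTEX = r"""
@inproceedings{luong-etal-2025-towards,
    title = "Towards Robust Mathematical Reasoning",
    author = "Luong, Thang and Hwang, Dawsen and Jung, Junehyuk",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    year = "2025",
    pages = "35406--35430",
    url = "https://aclanthology.org/2025.emnlp-main.1794/",
}
"""

db = bibtexparser.loads(BIBTEX)   # parse the string into a BibDatabase
entry = db.entries[0]             # each entry is a plain dict
print(entry["ID"])                # luong-etal-2025-towards
print(entry["ENTRYTYPE"])         # inproceedings
print(entry["title"], "--", entry["pages"])
```

The same metadata can of course be read from the MODS XML record with any standard XML parser; BibTeX is shown here only because it is the format most reference managers ingest directly.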