Rationales for Answers to Simple Math Word Problems Confuse Large Language Models

Yidan Zhang, Mingfeng Xue, Dayiheng Liu, Zhenan He


Abstract
Recently, large language models (LLMs) have demonstrated breakthrough mathematical problem-solving capabilities on grade school math word problems (MWPs). For example, on the MWP benchmark GSM8K, GPT-3.5-Turbo and MetaMath-70B reach accuracies of 80.80% and 82.30%, respectively. A question arises: does this mean that LLMs have truly mastered the underlying mathematical problem-solving abilities? In this paper, we present two benchmarks: MCGSM8K, which requires selecting the one correct solution among four candidates, and GSM8K-Judgement, which requires judging whether a given solution to a question is true or false. With these, we demonstrate that most LLMs' ability to evaluate the mathematical reasoning process of MWPs is far from sufficient. To address this weakness, we construct hybrid supervised fine-tuning data from the training sets of GSM8K, MCGSM8K, and GSM8K-Judgement, which significantly improves performance on the proposed reasoning process evaluation benchmarks. For example, fine-tuning raises the accuracy of LLaMA-2-13B on MCGSM8K from 33.51% to 70.89%. In conclusion, we experimentally demonstrate that most LLMs have limited ability to evaluate the mathematical reasoning process of MWPs, and that this ability can be enhanced through fine-tuning.
Anthology ID:
2024.findings-acl.524
Volume:
Findings of the Association for Computational Linguistics: ACL 2024
Month:
August
Year:
2024
Address:
Bangkok, Thailand and virtual meeting
Editors:
Lun-Wei Ku, Andre Martins, Vivek Srikumar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
8853–8869
URL:
https://aclanthology.org/2024.findings-acl.524
Cite (ACL):
Yidan Zhang, Mingfeng Xue, Dayiheng Liu, and Zhenan He. 2024. Rationales for Answers to Simple Math Word Problems Confuse Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8853–8869, Bangkok, Thailand and virtual meeting. Association for Computational Linguistics.
Cite (Informal):
Rationales for Answers to Simple Math Word Problems Confuse Large Language Models (Zhang et al., Findings 2024)
PDF:
https://aclanthology.org/2024.findings-acl.524.pdf