Haoxin Xu
2025
TMATH: A Dataset for Evaluating Large Language Models in Generating Educational Hints for Math Word Problems
Changyong Qi, Yuang Wei, Haoxin Xu, Longwei Zheng, Peiji Chen, Xiaoqing Gu
Proceedings of the 31st International Conference on Computational Linguistics
Large Language Models (LLMs) are increasingly being applied in education, showing significant potential in personalized instruction, student feedback, and intelligent tutoring. Generating hints for Math Word Problems (MWPs) has become a critical application, particularly in helping students understand problem-solving steps and logic. However, existing models struggle to provide pedagogically sound guidance that fosters learning without offering direct answers. To address this issue, we introduce TMATH, a dataset specifically designed to evaluate LLMs’ ability to generate high-quality hints for MWPs. TMATH contains diverse mathematical problems paired with carefully crafted, human-generated hints. To assess its impact, we fine-tuned a series of 7B-scale language models using TMATH. Our results, based on quantitative evaluations and expert assessments, show that while LLMs still face challenges in complex reasoning, the TMATH dataset significantly enhances their ability to generate more accurate and contextually appropriate educational hints.