Evaluating Step-by-step Reasoning Traces: A Survey

Jinu Lee, Julia Hockenmaier


Abstract
Step-by-step reasoning is widely used to enhance the reasoning ability of large language models (LLMs) on complex problems. Evaluating the quality of reasoning traces is crucial for understanding and improving LLM reasoning. However, existing evaluation practices are highly inconsistent, resulting in fragmented progress across evaluator design and benchmark development. To address this gap, this survey provides a comprehensive overview of step-by-step reasoning evaluation, proposing a taxonomy of evaluation criteria with four top-level categories (factuality, validity, coherence, and utility). Based on this taxonomy, we review datasets, evaluator implementations, and recent findings, and identify promising directions for future research.
Anthology ID: 2025.findings-emnlp.94
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 1789–1814
URL: https://aclanthology.org/2025.findings-emnlp.94/
Cite (ACL): Jinu Lee and Julia Hockenmaier. 2025. Evaluating Step-by-step Reasoning Traces: A Survey. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1789–1814, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Evaluating Step-by-step Reasoning Traces: A Survey (Lee & Hockenmaier, Findings 2025)
PDF: https://aclanthology.org/2025.findings-emnlp.94.pdf
Checklist: 2025.findings-emnlp.94.checklist.pdf