Dissecting Logical Reasoning in LLMs: A Fine-Grained Evaluation and Supervision Study
Yujun Zhou | Jiayi Ye | Zipeng Ling | Yufei Han | Yue Huang | Haomin Zhuang | Zhenwen Liang | Kehan Guo | Taicheng Guo | Xiangqi Wang | Xiangliang Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Logical reasoning is a core capability for large language models (LLMs), yet existing benchmarks that rely solely on final-answer accuracy fail to capture the quality of the reasoning process. To address this, we introduce FineLogic, a fine-grained evaluation framework that assesses logical reasoning across three dimensions: overall accuracy, stepwise soundness, and representation-level probing. Leveraging this framework, we conduct a comprehensive study on how different supervision formats in fine-tuning shape reasoning abilities. We fine-tune LLMs on four supervision styles—one in natural language and three symbolic variants—and find a key trade-off: natural language supervision excels at generalization to out-of-distribution and long-chain problems, whereas symbolic supervision is superior at instilling structurally sound, atomic reasoning steps. Furthermore, our probing analysis indicates that fine-tuning primarily refines the model’s step-by-step generation process, rather than improving its ability to converge on an answer early. Together, our framework and analysis provide a more rigorous lens for evaluating and improving logical reasoning in LLMs. The code is available at https://github.com/YujunZhou/FineLogic.
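To make the representation-level probing dimension concrete, here is a minimal sketch of how such a probe is typically run: a linear classifier is trained on a layer's hidden states to predict the final answer, and high held-out accuracy at early reasoning steps would indicate the model converges on an answer before finishing its chain. This is an illustrative sketch under assumed inputs, not code from the FineLogic repository; the names `probe_layer`, `hidden_states`, and `labels` are hypothetical.

```python
# Sketch of a representation-level probe, assuming per-layer hidden states
# have already been extracted. Illustrative only; not the FineLogic code
# at https://github.com/YujunZhou/FineLogic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layer(hidden_states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a linear probe on one layer's hidden states and return held-out
    accuracy. hidden_states: (n_examples, hidden_dim); labels: (n_examples,)
    binary answers (e.g., True/False conclusions)."""
    X_train, X_test, y_train, y_test = train_test_split(
        hidden_states, labels, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return probe.score(X_test, y_test)

# Usage: probe every layer and compare accuracies across reasoning steps.
# states[l] is an (n_examples, hidden_dim) array taken at layer l.
# per_layer = [probe_layer(states[l], answers) for l in range(num_layers)]
```

Comparing probe accuracy across layers and across positions in the reasoning chain is one way to test whether fine-tuning improves early answer convergence or, as the paper's analysis suggests, mainly refines step-by-step generation.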