Structure Trumps Size: Rethinking Data Quality for LLM Reasoning

Hu Xu, Zeyan Li, Rui Wang, Jianfeng Xu


Abstract
As domain-specific datasets continue to expand, Large Language Models (LLMs) have achieved significant improvements across various fields through supervised fine-tuning (SFT). However, is more data always better for model fine-tuning? Through a series of controlled experiments, we discover that dataset structure—rather than mere size—plays a decisive role in enhancing LLM reasoning capabilities. While existing methods acknowledge that good data quality can make training more efficient, they primarily rely on simple heuristic strategies and lack systematic, quantitative frameworks for evaluating data quality. To address this gap, we introduce MCSQ—the first multi-dimensional quantitative framework for reasoning data management. MCSQ rigorously evaluates and optimizes datasets along six orthogonal dimensions. Through comprehensive controlled experiments, we find that selectively incorporating “distorted” (model-disagreed) or “mismatched” (low-relevance) samples—which are typically discarded in traditional approaches—can outperform conventional “clean” data on certain advanced reasoning benchmarks. Our findings challenge traditional assumptions about data “quality” in LLM fine-tuning and provide actionable, quantitative guidance for efficient, structure-aware dataset management. The datasets and code are available at https://github.com/xuhu0115/MCSQ.
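
The abstract describes scoring data along six orthogonal dimensions and deliberately mixing in "distorted" and "mismatched" samples, but does not enumerate the dimensions here. The sketch below is a minimal illustration of how such structure-aware selection might be wired up; all dimension names, thresholds, and mixing fractions are hypothetical placeholders rather than the paper's actual design (see the linked repository for the authors' implementation).

# A minimal, hypothetical sketch of MCSQ-style multi-dimensional data
# selection. Dimension names, thresholds, and mixing fractions are
# assumptions made for illustration, not the paper's actual design.

from dataclasses import dataclass, field

@dataclass
class Sample:
    prompt: str
    response: str
    scores: dict = field(default_factory=dict)  # dimension name -> score in [0, 1]

# Six orthogonal dimensions (placeholder names, not the paper's).
DIMENSIONS = ("correctness", "complexity", "relevance",
              "diversity", "consistency", "difficulty")

def is_distorted(s: Sample) -> bool:
    # "Distorted": a reference model disagrees with the labeled answer.
    return s.scores.get("consistency", 1.0) < 0.5

def is_mismatched(s: Sample) -> bool:
    # "Mismatched": low relevance to the target domain.
    return s.scores.get("relevance", 1.0) < 0.5

def select(samples, budget, distorted_frac=0.1, mismatched_frac=0.1):
    # Keep mostly "clean" data, but deliberately retain small fractions
    # of distorted and mismatched samples, which the paper reports can
    # help on some advanced reasoning benchmarks.
    clean, distorted, mismatched = [], [], []
    for s in samples:
        if is_distorted(s):
            distorted.append(s)
        elif is_mismatched(s):
            mismatched.append(s)
        else:
            clean.append(s)
    # Rank each pool by its mean score across all six dimensions.
    mean_score = lambda s: sum(s.scores.get(d, 0.0) for d in DIMENSIONS) / len(DIMENSIONS)
    for pool in (clean, distorted, mismatched):
        pool.sort(key=mean_score, reverse=True)
    n_dist = int(budget * distorted_frac)
    n_mis = int(budget * mismatched_frac)
    return clean[: budget - n_dist - n_mis] + distorted[:n_dist] + mismatched[:n_mis]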
Anthology ID:
2025.findings-emnlp.616
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
11489–11513
URL:
https://aclanthology.org/2025.findings-emnlp.616/
Cite (ACL):
Hu Xu, Zeyan Li, Rui Wang, and Jianfeng Xu. 2025. Structure Trumps Size: Rethinking Data Quality for LLM Reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11489–11513, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Structure Trumps Size: Rethinking Data Quality for LLM Reasoning (Xu et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.616.pdf
Checklist:
2025.findings-emnlp.616.checklist.pdf