Structure Trumps Size: Rethinking Data Quality for LLM Reasoning
Hu Xu | Zeyan Li | Rui Wang | Jianfeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2025
As domain-specific datasets continue to expand, Large Language Models (LLMs) have achieved significant improvements across many fields through supervised fine-tuning (SFT). But is more data always better for fine-tuning? Through a series of controlled experiments, we find that dataset structure, rather than sheer size, plays the decisive role in enhancing LLM reasoning capabilities. Existing methods acknowledge that higher-quality data makes training more efficient, but they rely primarily on simple heuristics and lack a systematic, quantitative framework for evaluating data quality. To address this gap, we introduce MCSQ, the first multi-dimensional quantitative framework for reasoning-data management, which rigorously evaluates and optimizes datasets along six orthogonal dimensions. Through comprehensive controlled experiments, we find that selectively incorporating “distorted” (model-disagreed) or “mismatched” (low-relevance) samples, which traditional approaches typically discard, can outperform conventional “clean” data on certain advanced reasoning benchmarks. Our findings challenge common assumptions about data “quality” in LLM fine-tuning and provide actionable, quantitative guidance for efficient, structure-aware dataset management. The datasets and code are available at https://github.com/xuhu0115/MCSQ.