Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction

Yuxin Jiang; Yufei Wang; Chuhan Wu; Xinyi Dai; Yan Xu; Weinan Gan; Yasheng Wang; Xin Jiang; Lifeng Shang; Ruiming Tang; Wei Wang

doi:10.18653/v1/2025.findings-acl.343

Abstract

Improving the instruction-following capabilities of LLMs depends critically on the availability of high-quality instruction-response pairs. While existing automatic data synthesis methods alleviate the burden of manual curation, they often rely heavily on either the quality of seed data or strong assumptions about the structure and content of web documents. To tackle these challenges, we propose Web Reconstruction (WebR), a fully automated framework for synthesizing high-quality instruction-tuning (IT) data directly from raw web documents with minimal assumptions. Leveraging the inherent diversity of raw web content, we conceptualize web reconstruction as an instruction-tuning data synthesis task via a novel dual-perspective paradigm (Web as Instruction and Web as Response), in which each web document is assigned either the input role or the output role to trigger the reconstruction process. Comprehensive experiments show that datasets generated by WebR outperform state-of-the-art baselines by up to 16.65% across four instruction-following benchmarks. Notably, WebR demonstrates superior compatibility, data efficiency, and scalability, enabling enhanced domain adaptation with minimal effort.
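
The dual-perspective idea can be illustrated with a minimal Python sketch. It is not the WebR implementation: the names `ReconstructionTask` and `build_task`, the 50/50 role split, and the prompt wording are all hypothetical, chosen only to show how a raw document might be routed to the input role or the output role before being passed to an LLM.

```python
import random
from dataclasses import dataclass

# Hypothetical role labels mirroring the two perspectives named in the abstract.
WEB_AS_INSTRUCTION = "web_as_instruction"
WEB_AS_RESPONSE = "web_as_response"


@dataclass
class ReconstructionTask:
    role: str    # which perspective the document was assigned to
    prompt: str  # text that would be sent to an LLM (wording is illustrative)


def build_task(document: str, rng: random.Random,
               p_instruction: float = 0.5) -> ReconstructionTask:
    """Assign a raw web document to one of the two roles and build a prompt.

    The split ratio and both prompt templates are assumptions made for
    illustration; the abstract does not specify them.
    """
    if rng.random() < p_instruction:
        # Input role: the document becomes part of the instruction.
        prompt = ("Write a task that uses the following text as its input, "
                  "then complete that task.\n\n" + document)
        return ReconstructionTask(WEB_AS_INSTRUCTION, prompt)
    # Output role: the document is treated as a draft response.
    prompt = ("Write an instruction for which the following text would be "
              "a suitable response.\n\n" + document)
    return ReconstructionTask(WEB_AS_RESPONSE, prompt)
```

Calling `build_task(doc, random.Random(0))` on each raw document would yield a stream of role-tagged prompts; generating and filtering the resulting instruction-response pairs is outside the scope of this sketch.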

Anthology ID: 2025.findings-acl.343
Volume: Findings of the Association for Computational Linguistics: ACL 2025
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 6603–6618
URL: https://aclanthology.org/2025.findings-acl.343/
DOI: 10.18653/v1/2025.findings-acl.343
Cite (ACL): Yuxin Jiang, Yufei Wang, Chuhan Wu, Xinyi Dai, Yan Xu, Weinan Gan, Yasheng Wang, Xin Jiang, Lifeng Shang, Ruiming Tang, and Wei Wang. 2025. Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction. In Findings of the Association for Computational Linguistics: ACL 2025, pages 6603–6618, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction (Jiang et al., Findings 2025)
PDF: https://aclanthology.org/2025.findings-acl.343.pdf
