HSS-Synth: Humanities and Social Sciences Data Synthesis for LLMs

Ru Peng; Tianyu Zhao; Xijun Gu; Zhiting Fan; Haokai Xu; Jinyang Zhang; Yawen Zeng; Yihong Zhuang; Kexin Yang; Junyang Lin; Dayiheng Liu; Junbo Zhao

Abstract

High-quality, diverse data are vital for large language models (LLMs) but remain scarce and costly. Data synthesis is a viable alternative and succeeds on closed tasks, yet the humanities and social sciences (HSS) are overlooked, and their open-ended nature makes synthesis challenging. Moving beyond prior capability-centric, fragmented attempts, we adopt a subject-centric paradigm, define the first HSS domain system covering 14 mainstream fields, and introduce HSS-Synth, the first data synthesis pipeline for HSS. HSS-Synth comprises: (1) constructing seed documents from web corpora via multi-step filtering and text refinement evaluated by a judge; (2) specifying “requirements + persona” to backtranslate seed documents into diverse yet faithful instructions with a strict Q&A alignment check; and (3) breaking LLM response limits via teacher-forced answering, which feeds seed documents to the model during response generation to anchor semantics, reduce hallucinations, and preserve tone and integrity. HSS-Synth yields 237k high-quality, diverse instruction-tuning samples that outperform 14 leading baselines on 16 benchmarks. The fine-tuned Qwen3-8B-Base sets a new state of the art (SOTA) and approaches the official Qwen3-8B, improving both human preference and knowledge capability without performance seesaws. Extensive experiments demonstrate HSS-Synth’s robustness and transferability. Our code is publicly available at https://github.com/pengr/HSS-Synth.
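
As a rough illustration of steps (2) and (3) only, the following Python sketch shows how a seed document might be turned into an instruction and then fed back to the model while answering. It is not taken from the HSS-Synth repository: the prompt wording, the names `backtranslation_prompt`, `answering_prompt`, `synthesize`, and `Sample`, and the caller-supplied `generate` and `is_aligned` callables are all hypothetical. Applying the alignment check to the instruction against the seed document before answering is likewise an assumption, since the abstract does not specify where the check runs. The filtering and refinement of step (1) are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Sample:
    """One synthesized instruction-tuning example (hypothetical layout)."""
    instruction: str
    response: str
    seed_document: str


def backtranslation_prompt(seed_document: str, requirement: str, persona: str) -> str:
    """Ask a model for the instruction that the seed document would answer."""
    return (
        f"You are {persona}.\n"
        f"Requirement: {requirement}\n"
        "Write one question that the following document answers faithfully.\n\n"
        f"Document:\n{seed_document}"
    )


def answering_prompt(instruction: str, seed_document: str) -> str:
    """Ask a model to answer with the seed document supplied as grounding."""
    return (
        "Answer the question using the reference document. "
        "Preserve its facts and tone.\n\n"
        f"Reference document:\n{seed_document}\n\n"
        f"Question: {instruction}"
    )


def synthesize(
    seed_document: str,
    requirement: str,
    persona: str,
    generate: Callable[[str], str],          # hypothetical: prompt -> model text
    is_aligned: Callable[[str, str], bool],  # hypothetical: (instruction, document) -> bool
) -> Optional[Sample]:
    """Backtranslate, check alignment, then answer with the document in context."""
    instruction = generate(backtranslation_prompt(seed_document, requirement, persona))
    # Assumed placement of the alignment check: drop instructions the document cannot answer.
    if not is_aligned(instruction, seed_document):
        return None
    response = generate(answering_prompt(instruction, seed_document))
    return Sample(instruction, response, seed_document)
```

In this sketch `generate` can be any function that maps a prompt string to model output, so the same flow can be exercised with a stub before wiring in a real model.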

Anthology ID: 2026.findings-acl.1880
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 37706–37732
URL: https://aclanthology.org/2026.findings-acl.1880/
Cite (ACL): Ru Peng, Tianyu Zhao, Xijun Gu, Zhiting Fan, Haokai Xu, Jinyang Zhang, Yawen Zeng, Yihong Zhuang, Kexin Yang, Junyang Lin, Dayiheng Liu, and Junbo Zhao. 2026. HSS-Synth: Humanities and Social Sciences Data Synthesis for LLMs. In Findings of the Association for Computational Linguistics: ACL 2026, pages 37706–37732, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): HSS-Synth: Humanities and Social Sciences Data Synthesis for LLMs (Peng et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1880.pdf
Checklist: 2026.findings-acl.1880.checklist.pdf