LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points

Xuemiao Zhang; Can Ren; Chengying Tu; Rongxiang Weng; Hongfei Yan; Jingang Wang; Xunliang Cai

Abstract

Progress in large language models (LLMs) is constrained by the scarcity of high-quality, diverse training data. To address this limitation, we propose LinkSyn, a synthesis framework built on a knowledge point (KP) graph that, for the first time, enables flexible control over discipline and difficulty distributions while balancing KP coverage and popularity. LinkSyn extracts KPs from question-answering (QA) seed data and constructs a KP graph; it then synthesizes diverse QA data from multiple seeds that are strongly linked by KPs and sampled through graph walks. Specifically, LinkSyn incorporates (1) a knowledge value function that adjusts path sampling probabilities to balance KP coverage and popularity during graph walks; (2) diffusion-based synthesis with a strong reasoning model, which leverages the multiple seeds with dense logical associations along each path; and (3) enhancement of high-difficulty QA within given disciplines through flexible difficulty adjustment. Running LinkSyn, we synthesize LinkQA, a diverse multi-disciplinary QA dataset of 50B tokens. Extensive experiments on Llama-3 8B show that continual pre-training with LinkQA yields an average improvement of 11.51% on MMLU and CMMLU, establishing new SOTA results. LinkQA consistently improves performance across model sizes and initial FLOPs scales.
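To make the graph-walk idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that KPs are linked when they co-occur in a seed, and it uses a hypothetical value function (popularity discounted by prior visits) as a stand-in for the knowledge value function, whose actual form the abstract does not give. All function names, the toy seeds, and the weighting scheme are illustrative assumptions; the synthesis step with a reasoning model is omitted.

```python
import random
from collections import defaultdict
from itertools import combinations


def build_kp_graph(seeds):
    """Assumption: two KPs are linked whenever they co-occur in one seed QA item."""
    edges = defaultdict(lambda: defaultdict(int))  # KP -> neighbor KP -> co-occurrence count
    kp_to_seeds = defaultdict(set)                 # KP -> ids of seeds that mention it
    for seed_id, kps in seeds.items():
        for kp in kps:
            kp_to_seeds[kp].add(seed_id)
        for a, b in combinations(sorted(set(kps)), 2):
            edges[a][b] += 1
            edges[b][a] += 1
    return edges, kp_to_seeds


def knowledge_value(kp, kp_to_seeds, visits):
    """Hypothetical value: popularity (seed count) discounted by earlier visits,
    so frequently sampled KPs become less likely and coverage spreads out."""
    return len(kp_to_seeds[kp]) / (1 + visits[kp])


def sample_path(edges, kp_to_seeds, visits, start, length, rng):
    """Walk the graph, picking each next KP with probability proportional to
    edge weight times the hypothetical knowledge value."""
    path = [start]
    visits[start] += 1
    while len(path) < length:
        current = path[-1]
        neighbors = [n for n in edges[current] if n not in path]
        if not neighbors:
            break  # dead end: return the shorter path
        weights = [edges[current][n] * knowledge_value(n, kp_to_seeds, visits)
                   for n in neighbors]
        nxt = rng.choices(neighbors, weights=weights, k=1)[0]
        path.append(nxt)
        visits[nxt] += 1
    return path


def seeds_on_path(path, kp_to_seeds):
    """Collect every seed linked to at least one KP on the path."""
    return sorted(set().union(*(kp_to_seeds[kp] for kp in path)))


# Toy seeds (invented for illustration): seed id -> extracted KPs.
seeds = {
    "q1": ["derivative", "chain rule"],
    "q2": ["chain rule", "backpropagation"],
    "q3": ["backpropagation", "gradient descent"],
    "q4": ["derivative", "gradient descent"],
}
edges, kp_to_seeds = build_kp_graph(seeds)
visits = defaultdict(int)
path = sample_path(edges, kp_to_seeds, visits, "derivative", 3, random.Random(0))
linked = seeds_on_path(path, kp_to_seeds)
print(path, linked)
```

In a pipeline of this shape, the seeds returned by `seeds_on_path` would be passed together to a generation model as context for a new QA pair. Because `visits` persists across walks, repeated sampling gradually shifts probability toward less-visited KPs.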

Anthology ID: 2026.acl-long.1036
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 22620–22651
URL: https://aclanthology.org/2026.acl-long.1036/
Cite (ACL): Xuemiao Zhang, Can Ren, Chengying Tu, Rongxiang Weng, Hongfei Yan, Jingang Wang, and Xunliang Cai. 2026. LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 22620–22651, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): LinkQA: Synthesizing Diverse QA from Multiple Seeds Strongly Linked by Knowledge Points (Zhang et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.1036.pdf
Checklist: 2026.acl-long.1036.checklist.pdf
