Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs

Yu Li; Xiaoran Shang; Qizhi Pei; Yun Zhu; Xin Gao; Honglin Lin; Zhanping Zhong; Zhuoshi Pan; Zheng Liu; Xiaoyang Wang; Conghui He; Dahua Lin; Feng Zhao; Lijun Wu

Abstract

Post-training data plays a pivotal role in shaping the capabilities of Large Language Models (LLMs), yet datasets are often treated as isolated artifacts, overlooking the systemic connections that underlie their evolution. To disentangle these complex relationships, we introduce the concept of data lineage to the LLM ecosystem and propose an automated multi-agent framework to reconstruct the evolutionary graph of dataset development. Through large-scale lineage analysis, we characterize domain-specific structural patterns, such as vertical refinement in Math-oriented datasets and horizontal aggregation in General-domain corpora. Moreover, we uncover pervasive systemic issues, including structural redundancy induced by implicit dataset intersections and the propagation of benchmark contamination along lineage paths. To demonstrate the practical value of lineage analysis for data construction, we leverage the reconstructed lineage graph to create a lineage-aware diversity-oriented dataset. By anchoring instruction sampling at upstream leaf sources, this approach mitigates downstream homogenization and hidden redundancy, yielding a more diverse post-training corpus. We further highlight lineage-centric analysis as an efficient and robust topological alternative to sample-level dataset comparison for large-scale data ecosystems. By grounding data construction in explicit lineage structures, our work advances post-training data curation toward a more systematic and controllable paradigm.
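The abstract's two structural ideas, tracing a dataset back to its upstream leaf sources and spotting redundancy from implicit intersections, can be illustrated with a minimal sketch. This is not the authors' framework: the edge list, the dataset names, and the helper functions `build_parents`, `leaf_sources`, and `implicit_overlaps` are all hypothetical, and the lineage graph is assumed to be a DAG given as (upstream, downstream) pairs.

```python
from collections import defaultdict

# Hypothetical lineage edges: (upstream_source, downstream_dataset).
# All dataset names are placeholders, not taken from the paper.
EDGES = [
    ("seed_a", "mix_1"),
    ("seed_b", "mix_1"),
    ("seed_a", "mix_2"),
    ("mix_1", "final"),
    ("mix_2", "final"),
]


def build_parents(edges):
    """Map each dataset to the set of datasets it was directly built from."""
    parents = defaultdict(set)
    for upstream, downstream in edges:
        parents[downstream].add(upstream)
    return parents


def leaf_sources(dataset, parents):
    """Return every ancestor of `dataset` that has no recorded parents.

    A dataset with no parents is treated as its own leaf source.
    """
    stack, seen, leaves = [dataset], set(), set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        ups = parents.get(node, set())
        if ups:
            stack.extend(ups)
        else:
            leaves.add(node)
    return leaves


def implicit_overlaps(dataset, parents):
    """Return leaf sources reached through more than one direct parent."""
    counts = defaultdict(int)
    for parent in parents.get(dataset, set()):
        for leaf in leaf_sources(parent, parents):
            counts[leaf] += 1
    return {leaf for leaf, n in counts.items() if n > 1}


parents = build_parents(EDGES)
roots = leaf_sources("final", parents)
shared = implicit_overlaps("final", parents)
```

In this toy graph, `final` merges `mix_1` and `mix_2`, both of which draw on `seed_a`, so `seed_a` is reported as an implicit intersection. Sampling from `roots` rather than from the merged intermediates is one simple way to read "anchoring instruction sampling at upstream leaf sources".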

Anthology ID: 2026.acl-long.435
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 9606–9625
URL: https://aclanthology.org/2026.acl-long.435/
Cite (ACL): Yu Li, Xiaoran Shang, Qizhi Pei, Yun Zhu, Xin Gao, Honglin Lin, Zhanping Zhong, Zhuoshi Pan, Zheng Liu, Xiaoyang Wang, Conghui He, Dahua Lin, Feng Zhao, and Lijun Wu. 2026. Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9606–9625, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Tracing the Roots: A Multi-Agent Framework for Uncovering Data Lineage in Post-Training LLMs (Li et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-long.435.pdf
Checklist: 2026.acl-long.435.checklist.pdf