RepoDistill: Distilling Repository Knowledge through Compression-Aware Budget Allocation and Policy Optimization

Xin Yin; Zixiang Ding; Yiang Zhang; Qiang Wang; Rui Wang; Chao Ni; Zhe Cui

Abstract

Large Language Models (LLMs) have achieved strong performance on many code-related tasks, yet they still struggle with repository-level scenarios where reasoning depends on long, noisy, and structurally complex contexts. While existing retrieval methods, both similarity-based and graph-based, can identify relevant code snippets, they often retrieve excessive context, which intensifies the "lost-in-the-middle" phenomenon and dilutes model attention with redundant material. To address this, we present RepoDistill, a novel framework that integrates retrieval with learned budget allocation for fine-grained context compression. RepoDistill first employs a plug-and-play lightweight GraphRAG to retrieve context that follows logical flows. It then applies Compression-Aware Budget Allocation guided by Compression-Aware Policy Optimization, which formulates context management as a multi-step decision problem and learns allocation policies for contexts. Experiments show that RepoDistill outperforms baselines, achieving gains of up to +7.00 on SWE-QA, +24.4% on CoderEval, and +0.25 on LongCodeU. Furthermore, a compact 4B-parameter model trained with RepoDistill can serve as an effective context compressor for closed-source LLMs, reducing input tokens by up to 66% while maintaining comparable performance. We release our code at https://anonymous.4open.science/r/RepoDistill-12B0.

Anthology ID: 2026.findings-acl.217
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4425–4443
URL: https://aclanthology.org/2026.findings-acl.217/
Cite (ACL): Xin Yin, Zixiang Ding, Yiang Zhang, Qiang Wang, Rui Wang, Chao Ni, and Zhe Cui. 2026. RepoDistill: Distilling Repository Knowledge through Compression-Aware Budget Allocation and Policy Optimization. In Findings of the Association for Computational Linguistics: ACL 2026, pages 4425–4443, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): RepoDistill: Distilling Repository Knowledge through Compression-Aware Budget Allocation and Policy Optimization (Yin et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.217.pdf
Checklist: 2026.findings-acl.217.checklist.pdf