Ijun Jang

2026

Stable On-Policy Distillation through Adaptive Target Reformulation
Ijun Jang | Jewon Yeom | Juan Yeo | Hyunggyu Lim | Taesup Kim
Findings of the Association for Computational Linguistics: ACL 2026

Knowledge distillation (KD) is a widely adopted technique for transferring capabilities from large language models to smaller student models. However, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities and noisy teacher feedback during early optimization stages. These challenges manifest as pathological gradients in forward KL objectives when students encounter unfamiliar tokens, or as a collapse in distributional diversity within reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric target distribution in logit space to emphasize agreement between the teacher and the student. By introducing a tunable parameter 𝛽, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.

Legal QA benchmarks have predominantly focused on case law, overlooking the unique challenges of statute-centric regulatory reasoning. In statutory domains, relevant evidence is distributed across hierarchically linked documents, creating a statutory retrieval gap where conventional retrievers fail and models often hallucinate under incomplete context. We introduce SearchFireSafety, a structure- and safety-aware benchmark for statute-centric legal QA. Instantiated on fire-safety regulations as a representative case, the benchmark evaluates whether models can retrieve hierarchically fragmented evidence and safely abstain when statutory context is insufficient. SearchFireSafety adopts a dual-track evaluation framework combining real-world questions that require citation-aware retrieval and synthetic partial-context scenarios that stress-test hallucination and refusal behavior. Experiments across multiple large language models show that graph-guided retrieval substantially improves performance, but also reveal a critical safety trade-off: domain-adapted models are more likely to hallucinate when key statutory evidence is missing. Our findings highlight the need for benchmarks that jointly evaluate hierarchical retrieval and model safety in statute-centric regulatory settings.

Venues

ACL (1)
Findings (1)
