Zhengze Zhang

2025

Large language models (LLMs) have demonstrated exceptional performance across various applications, but their conversational abilities decline sharply as model size decreases, presenting a barrier to their deployment in resource-constrained environments. Knowledge distillation (KD) with Direct Preference Optimization (DPO) has emerged as a promising approach to enhance the conversational abilities of smaller models using a larger teacher model. However, current methods primarily focus on “black-box” KD, which only uses the teacher’s responses, overlooking the rich distributional information within the teacher’s probability distribution. This paper addresses this gap by introducing daDPO (Distillation-Aware DPO), a novel framework that integrates the teacher’s distributional information into DPO distillation while preserving theoretical guarantees. Our framework offers a unified objective that enhances both preference optimization and distribution-based distillation. We provide rigorous theoretical analysis and empirical validation, showing that daDPO outperforms existing methods in restoring performance for pruned models and enhancing smaller models within the same LLM family. Notably, in in-domain evaluation, our method enables a 20% pruned Vicuna1.5-7B to achieve near-teacher performance (-7.3% preference rate), and allows Qwen2.5-1.5B to occasionally outperform its 7b teacher model (14.0% win rate).

pdf bib abs

KV Cache is commonly used to accelerate LLM inference with long contexts, yet its high memory demand drives the need for cache compression. Existing compression methods, however, are largely heuristic and lack dynamic budget allocation. To address this limitation, we introduce a unified framework for cache compression by minimizing information loss in Transformer residual streams. Building on it, we analyze the layer attention output loss and derive a new metric to compare cache entries across heads, enabling layer-wise compression with dynamic head budgets. Additionally, by contrasting cross-layer information, we also achieve dynamic layer budgets. LAVa is the first unified strategy for cache eviction and dynamic budget allocation that, unlike prior methods, does not rely on training or the combination of multiple strategies. Experiments with four benchmarks (LongBench, Needle-In-A-Haystack, Ruler, and InfiniteBench) demonstrate its superiority over strong baselines. Moreover, our experiments reveal a new insight: dynamic layer budgets are crucial for generation tasks (e.g., code completion), while dynamic head budgets play a key role in extraction tasks (e.g., extractive QA). As a fully dynamic compression method, LAVa consistently maintains top performance across task types.

2024

pdf bib abs

Reward Difference Optimization For Sample Reweighting In Offline RLHF
Shiqi Wang | Zhengze Zhang | Rui Zhao | Fei Tan | Cam Tu Nguyen
Findings of the Association for Computational Linguistics: EMNLP 2024

With the wide deployment of Large Language Models (LLMs), aligning LLMs with human values becomes increasingly important. Although Reinforcement Learning with Human Feedback (RLHF) proves effective, it is complicated and highly resource-intensive. As such, offline RLHF has been introduced as an alternative solution, which directly optimizes LLMs with ranking losses on a fixed preference dataset. Current offline RLHF only captures the ordering relationship between responses, overlooking the crucial aspect of “how much” one is preferred over the others. To address this issue, we propose a simple yet effective solution based on reward difference prediction. Specifically, we introduce reward difference coefficients to reweigh sample pairs in offline RLHF. We then propose a difference model that considers rich interactions between a pair of responses for predicting these difference coefficients. Experiments with 7B LLMs on the HH and TL;DR dataset verify the effectiveness of our method in both automatic metrics and human evaluation, highlighting its potential for aligning LLMs with human values.

Co-authors

Venues

Findings3

Fix author