Haoqi Yang
2026
Faster MoE LLM Inference for Extremely Large Models
Haoqi Yang | Luohe Shi | Qiwei Li | Zuchao Li | Ping Wang | Hao Huang | Hai Zhao
Findings of the Association for Computational Linguistics: ACL 2026
In fine-grained sparse Mixture-of-Experts (MoE) models, a large pool of specialized experts replaces a small homogeneous set, shifting performance and throughput to be governed by inference-time expert activation. Yet most existing optimization recipes implicitly assume a fixed activation budget (e.g., a constant Top-k per layer), whose behavior in fine-grained MoEs is poorly understood. We first characterize runtime skipping strategies, quantifying the accuracy–efficiency trade-off of (i) uniform fixed activation and (ii) static layer-wise Top-k allocation found by search. Our analysis reveals that static skipping can already provide substantial throughput gains, but optimal static schedules vary significantly across models and routing mechanisms. We therefore introduce Adaptive Skipping with Entropy-Penalized Thresholding (ASET), a training-free policy that adapts token-level activation using router confidence and entropy while remaining within the model’s original budget. Across the fine-grained MoEs we study, static skipping policies yield 10–78% throughput gains with minimal performance degradation, including ≥10% improvement on DeepSeek-V3 without measurable loss. On the OLMoE testbed, ASET yields a Pareto frontier between average activation and task quality. Overall, these results identify expert skipping as a practical lever for faster fine-grained MoE inference, with adaptive activation helping when fixed budgets are too rigid.
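The abstract describes adaptive skipping only at a high level: activation is adapted per token from router confidence and entropy, without exceeding the original Top-k budget. The following is a minimal sketch of one way such a rule could look, not the paper's ASET policy. The function names, the parameters `tau` and `lam`, and the specific cutoff formula are all assumptions made for illustration.

```python
import math

def softmax(logits):
    # Numerically stable softmax over one token's router logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    # Shannon entropy scaled to [0, 1] by the maximum entropy log(n).
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def select_experts(logits, k_max, tau=0.5, lam=0.5):
    """Hypothetical token-level skipping rule (illustrative only).

    Starts from the usual Top-k_max experts, then drops those whose
    router probability falls below a cutoff relative to the top expert.
    The cutoff is lowered when router entropy is high, so uncertain
    tokens keep more experts. Never exceeds the original budget k_max.
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k_max]
    h = normalized_entropy(probs)
    cutoff = tau * (1.0 - lam * h) * probs[ranked[0]]  # assumed formula
    return [i for i in ranked if probs[i] >= cutoff]

# A confident router keeps one expert; a flat router keeps the full budget.
confident = select_experts([10.0, 0.0, 0.0, 0.0], k_max=2)
uncertain = select_experts([0.0, 0.0, 0.0, 0.0], k_max=2)
print(len(confident), len(uncertain))
```

In this sketch the top-ranked expert always survives, so every token activates between 1 and `k_max` experts, which mirrors the abstract's constraint of staying within the model's original budget.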
2025
XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression
Haoqi Yang | Yao Yao | Zuchao Li | Baoyuan Qi | Liu Guoming | Hai Zhao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy. The source code is available at https://github.com/brinenick511/XQuant.
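For readers unfamiliar with low-bit KV cache quantization, the sketch below shows generic asymmetric min-max quantization of a small vector to a given bit-width. It is background only: it does not implement XQuant's data-free calibration or its cross-layer compression, and the function names and sample values are assumptions chosen for illustration.

```python
def quantize(values, bits):
    # Map floats onto integer codes in [0, 2**bits - 1] using the
    # group's own min and max (asymmetric min-max quantization).
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize(codes, scale, zero_point):
    # Reconstruct approximate floats from the integer codes.
    return [c * scale + zero_point for c in codes]

# Illustrative values standing in for one group of cached key/value entries.
group = [-1.0, -0.2, 0.3, 1.0]
codes, scale, zero_point = quantize(group, bits=2)
restored = dequantize(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(codes, round(max_error, 3))
```

With plain rounding, the reconstruction error of each entry is bounded by half the step size `scale`, which is why fewer bits (a larger step) cost accuracy and why calibration and cross-layer sharing matter at very low bit-widths.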