LightMoE: Task-Aware Expert Availability Management for Memory-Efficient MoE-LLM Inference

Puhan Luo; Yunhao Yao; Junyang Wang; Junyang Zhang; Xiangyang Li

Abstract

Mixture-of-Experts (MoE) models offer a promising path for scaling model capacity, yet their massive memory footprint poses significant challenges for deployment on resource-constrained edge devices. Existing solutions, such as static pruning or dynamic offloading, often struggle to balance model accuracy with inference latency due to irreversible information loss or prohibitive I/O overhead. In this paper, we propose LightMoE, a novel framework for memory-efficient MoE inference that exploits the inherent functional redundancy and temporal locality of expert activation. LightMoE employs a frequency-aware expert initialization strategy to retain a compact core of resident experts and introduces a similarity-based redirection mechanism to compensate for missing experts without incurring I/O costs. Furthermore, it incorporates a lightweight runtime manager that performs coarse-grained, task-level expert replacement to adapt to shifting data distributions. Empirical evaluations on representative edge platforms demonstrate that LightMoE achieves a superior accuracy-efficiency trade-off, improving average accuracy by 4.3% over static pruning and 2.4% over dynamic swapping methods, while maintaining inference latency comparable to strictly pruned models.
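The two memory-saving ideas named in the abstract, frequency-aware selection of resident experts and similarity-based redirection of requests for missing experts, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how frequency or similarity is measured, so the activation log, the per-expert `signatures` vectors, the use of cosine similarity, and all function names below are assumptions made purely for illustration.

```python
import math
from collections import Counter


def select_resident_experts(activation_log, budget):
    """Keep the `budget` most frequently activated expert ids.

    Assumption: frequency is a plain activation count over a calibration
    log; ties are broken in favour of the lower expert id.
    """
    counts = Counter(activation_log)
    ranked = sorted(counts, key=lambda e: (-counts[e], e))
    return set(ranked[:budget])


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_redirect_table(signatures, resident):
    """Map every expert id to an expert that is actually in memory.

    Resident experts map to themselves. A missing expert maps to the
    resident expert whose signature is most similar (assumption: cosine
    similarity over a hypothetical per-expert signature vector).
    """
    if not resident:
        raise ValueError("at least one resident expert is required")
    table = {}
    for expert_id, sig in signatures.items():
        if expert_id in resident:
            table[expert_id] = expert_id
        else:
            table[expert_id] = max(
                sorted(resident),
                key=lambda r: cosine(sig, signatures[r]),
            )
    return table


# Toy data: four experts, a memory budget of two resident experts.
activation_log = [0, 0, 0, 1, 2, 2, 3, 2, 0]
signatures = {
    0: [1.0, 0.0],
    1: [0.9, 0.1],
    2: [0.0, 1.0],
    3: [0.1, 0.9],
}

resident = select_resident_experts(activation_log, budget=2)
redirect = build_redirect_table(signatures, resident)
```

In this toy setup, experts 0 and 2 are kept resident, and requests routed to experts 1 and 3 are served by their nearest resident neighbours without loading any weights. The task-level replacement manager described in the abstract would periodically recompute the resident set; that step is omitted here because the abstract gives no detail on its trigger or policy.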

Anthology ID: 2026.findings-acl.1062
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 21124–21134
URL: https://aclanthology.org/2026.findings-acl.1062/
Cite (ACL): Puhan Luo, Yunhao Yao, Junyang Wang, Junyang Zhang, and Xiangyang Li. 2026. LightMoE: Task-Aware Expert Availability Management for Memory-Efficient MoE-LLM Inference. In Findings of the Association for Computational Linguistics: ACL 2026, pages 21124–21134, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): LightMoE: Task-Aware Expert Availability Management for Memory-Efficient MoE-LLM Inference (Luo et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.1062.pdf
Checklist: 2026.findings-acl.1062.checklist.pdf
