Yutong Lu

2026

Neural Processing Units (NPUs) are critical for AI infrastructure, yet kernel development remains a bottleneck because of the complexity of vendor-specific Domain-Specific Languages (DSLs). While LLMs excel at general coding, they fail to meet the stringent constraints of NPU development, showing a near-zero success rate on complex kernels in our preliminary study. To address these challenges, we present AscendKernelGen, the first comprehensive framework for NPU kernel development. The framework consists of three interconnected components: (1) Ascend-CoT, the first dataset in the NPU kernel domain that incorporates chain-of-thought reasoning from real-world kernel implementations; (2) KernelGen-LM, a domain-adaptive model trained on this dataset using supervised fine-tuning and reinforcement learning; and (3) NPUKernelBench, the first benchmark platform designed to evaluate the compilation, correctness, and performance of generated NPU kernels. Experimental results show that our approach substantially narrows the gap in hardware-specific coding: compilation success on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), with 64% functional correctness. AscendKernelGen is available at AscendKernGen and NPUKernelBench.
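The abstract reports compilation success as Pass@10 but does not say how the metric is computed. As a minimal sketch, assuming the widely used unbiased pass@k estimator (draw n samples per task, count the c that pass, and estimate the chance that at least one of k randomly chosen samples passes), the calculation could look like the following. The function name and arguments are illustrative and are not taken from NPUKernelBench.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single task.

    n: number of samples generated for the task
    c: number of those samples that pass (e.g. compile successfully)
    k: evaluation budget, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k chosen samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks given (n, c) pairs, one pair per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a benchmark-level figure is the mean of the per-task estimates, so a task with zero passing samples out of ten contributes 0 and a task with at least one passing sample contributes 1 when k equals n.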


FLARE: Fine-Grained Length-Aware Routing for Resource-Efficient Heterogeneous LLM Serving
Yujia Fu | Heming Zhong | Dan Huang | Yutong Lu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

With the rapid proliferation of large language models (LLMs), model pools have become increasingly heterogeneous in both capability and efficiency. Larger LLMs can improve quality but incur higher latency and cost, while smaller LLMs are faster and cheaper but less capable, making per-query model selection crucial in practice. This has spawned LLM routers that dispatch each query to an appropriate model. Existing routers lack fine-grained resource awareness across deployment settings, which degrades efficiency metrics in real-world serving. To this end, we propose FLARE, a length-centric, resource-aware multi-LLM routing framework that uses length-based models to estimate per-query latency and cost. FLARE formulates routing as a discrete multi-objective optimization problem to achieve an efficient trade-off among these objectives. Experiments show that FLARE reduces latency and cost by up to 68% and 75%, respectively, while maintaining competitive accuracy, and that it can be easily applied to new datasets and LLMs.
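The routing idea in this abstract can be illustrated with a small sketch. It is not FLARE's actual formulation: it assumes a linear length-based latency and cost model per LLM and resolves the multi-objective choice with a simple weighted-sum scalarization over the discrete model pool. All names, fields, and numbers below (ModelProfile, the per-token rates, the weights, the predicted output length) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model profile used by the length-based estimates."""
    name: str
    accuracy: float              # estimated answer quality in [0, 1]
    prefill_s_per_token: float   # seconds per input token
    decode_s_per_token: float    # seconds per output token
    usd_per_input_token: float
    usd_per_output_token: float


def estimate(m: ModelProfile, in_tokens: int, out_tokens: int) -> tuple[float, float]:
    """Return (latency in seconds, cost in USD) as linear functions of length."""
    latency = m.prefill_s_per_token * in_tokens + m.decode_s_per_token * out_tokens
    cost = m.usd_per_input_token * in_tokens + m.usd_per_output_token * out_tokens
    return latency, cost


def route(models: list[ModelProfile], in_tokens: int, out_tokens: int,
          w_acc: float = 1.0, w_lat: float = 0.0, w_cost: float = 0.0) -> ModelProfile:
    """Pick the model with the best weighted score for one query.

    Latency and cost are normalized by the worst candidate so that the
    three objectives are on comparable scales before weighting.
    """
    estimates = [estimate(m, in_tokens, out_tokens) for m in models]
    max_lat = max(lat for lat, _ in estimates) or 1.0
    max_cost = max(cost for _, cost in estimates) or 1.0

    def score(item: tuple[ModelProfile, tuple[float, float]]) -> float:
        m, (lat, cost) = item
        return w_acc * m.accuracy - w_lat * lat / max_lat - w_cost * cost / max_cost

    best_model, _ = max(zip(models, estimates), key=score)
    return best_model


# Illustrative pool; the values are made up for the example.
small = ModelProfile("small", 0.70, 0.0001, 0.01, 1e-7, 2e-7)
large = ModelProfile("large", 0.90, 0.0005, 0.05, 1e-6, 2e-6)

quality_first = route([small, large], in_tokens=100, out_tokens=200)
balanced = route([small, large], in_tokens=100, out_tokens=200,
                 w_acc=1.0, w_lat=1.0, w_cost=1.0)
```

With accuracy as the only objective the router selects the larger model, and once latency and cost carry weight it falls back to the smaller one. In a real deployment the output length would itself have to be predicted per query, which this sketch simply takes as an input.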