Jing Zhang
2026
Optimizing Native Sparse Attention with Latent Attention and Local Global Alternating Strategies
Yuxuan Hu | Jianchao Tan | Jiaqi Zhang | Wen Zan | Pingwei Sun | Yifan Lu | Xunliang Cai | Jing Zhang
Findings of the Association for Computational Linguistics: ACL 2026
In this work, we conduct a systematic analysis of Native Sparse Attention (NSA) and propose targeted improvements that enhance long-context modeling. A key insight is that alternating between local (sliding-window) and global (compression/selective) attention across layers, rather than using fixed patterns, enables more effective propagation of long-range dependencies and substantially boosts performance on long-sequence tasks. We further refine NSA’s branches with latent attention: the sliding-window branch is enhanced with Multi-head Latent Attention (MLA), while the compression and selective branches adopt Group-head Latent Attention (GLA). These changes reduce KV-cache memory by 50% versus NSA while improving the model’s common-sense reasoning and long-text understanding capabilities. Experiments on models from 340M to 1.3B parameters (trained on 15B and 100B tokens) show that our method matches or exceeds full attention and native sparse attention on both common-sense reasoning and long-context understanding tasks.
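The layer-wise alternation described in this abstract can be pictured as a per-layer schedule. The following is a minimal sketch, not the authors' implementation: the function name `alternating_schedule`, the labels `"local"` and `"global"`, and the choice of which type comes first are all assumptions made for illustration, since the abstract does not specify them.

```python
def alternating_schedule(num_layers, first="local"):
    """Assign an attention type to each layer, alternating local and global.

    "local" stands for the sliding-window branch and "global" for the
    compression/selective branches. The starting type is an assumption.
    """
    if first not in ("local", "global"):
        raise ValueError("first must be 'local' or 'global'")
    other = "global" if first == "local" else "local"
    # Even-indexed layers get the starting type, odd-indexed layers the other.
    return [first if i % 2 == 0 else other for i in range(num_layers)]


if __name__ == "__main__":
    for index, kind in enumerate(alternating_schedule(6)):
        print(f"layer {index}: {kind}")
```

A fixed pattern, by contrast, would assign the same attention type (or the same combination of branches) to every layer.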
MTP-RL: Acceleration of Reinforcement Learning Rollouts with Policy-Aligned Multi-Token Prediction
Ke Wang | Aohan Zeng | Zhengxiao Du | Yuxuan Hu | Bohan Zhang | Xinyi Wang | Jie Tang | Jing Zhang
Findings of the Association for Computational Linguistics: ACL 2026
Reinforcement learning (RL) is widely applied to boost the performance of pretrained models, yet its training efficiency is severely constrained by rollout generation. While speculative decoding based on multi-token prediction (MTP) offers a potential acceleration pathway, its widespread adoption is hindered by the absence of MTP in vanilla pretrained models and the rapid degradation of the MTP acceptance length in RL training. To address these issues, this paper proposes MTP-RL, a two-stage framework that pioneers effective training of MTPs in RL and accelerates the rollout phase for diverse models. It comprises a pipeline that equips any model with a multi-layer, parameter-sharing MTP and an advantage-aware MTP optimization strategy that facilitates policy-aligned training of MTPs. Experiments demonstrate that our method not only achieves stable growth of acceptance length during RL training, but also accelerates RL rollouts, achieving an average 23.1%–55.3% reduction in rollout time compared to baselines.
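The acceptance length mentioned in this abstract is the number of drafted tokens the main model accepts per verification step. As a rough sketch under a simplifying assumption (greedy verification, where a drafted token is accepted only if it equals the target model's own choice at that position), it can be computed as below. The function names are hypothetical and this is not the MTP-RL method itself.

```python
def acceptance_length(draft_tokens, target_tokens):
    """Count drafted tokens accepted before the first mismatch.

    Assumes greedy verification: a drafted token is accepted only if it
    matches the target model's token at the same position.
    """
    accepted = 0
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            break
        accepted += 1
    return accepted


def mean_acceptance_length(steps):
    """Average acceptance length over (draft, target) pairs, one per step."""
    if not steps:
        raise ValueError("steps must be non-empty")
    return sum(acceptance_length(d, t) for d, t in steps) / len(steps)
```

When the policy drifts away from the draft head during RL training, more drafts mismatch early and this average drops, which is the degradation the abstract refers to.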
Sparse-RL: Breaking the Memory Wall in LLM Reinforcement Learning via Stable Sparse Rollouts
Sijia Luo | Xiaokang Zhang | Yuxuan Hu | Bohan Zhang | Ke Wang | Jinbo Su | Mengshu Sun | Lei Liang | Jing Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning (RL) has become essential for eliciting complex reasoning capabilities in Large Language Models (LLMs). However, the substantial memory overhead of storing Key-Value (KV) caches during long-horizon rollouts acts as a critical bottleneck, often prohibiting efficient training on limited hardware. While existing KV compression techniques offer a remedy for inference, directly applying them to RL training induces a severe policy mismatch, leading to catastrophic performance collapse. To address this, we introduce Sparse-RL, which enables stable RL training under sparse rollouts. We show that instability arises from a fundamental policy mismatch among the dense old policy, the sparse sampler policy, and the learner policy. To mitigate this issue, Sparse-RL incorporates Sparsity-Aware Rejection Sampling and Importance-based Reweighting to correct the off-policy bias introduced by compression-induced information loss. Experimental results show that Sparse-RL reduces rollout overhead compared to dense baselines while preserving performance. Furthermore, Sparse-RL inherently implements sparsity-aware training, significantly enhancing model robustness during sparse inference deployment.
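The general idea behind correcting a mismatch between a sparse sampler policy and a dense policy can be illustrated with generic importance ratios. The sketch below is an assumption-laden illustration, not Sparse-RL's actual algorithm: the function name, the rejection threshold `reject_below`, the clipping bound `clip_max`, and the per-token rule are all hypothetical, since the abstract gives no such details.

```python
import math


def sparse_rollout_weights(dense_logps, sparse_logps,
                           reject_below=1e-4, clip_max=2.0):
    """Return per-token importance weights, or None if the rollout is rejected.

    dense_logps:  log-probabilities of the sampled tokens under the dense policy
    sparse_logps: log-probabilities of the same tokens under the sparse sampler
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    if not dense_logps or len(dense_logps) != len(sparse_logps):
        raise ValueError("inputs must be non-empty and of equal length")
    # Importance ratio dense/sparse for each sampled token.
    ratios = [math.exp(d - s) for d, s in zip(dense_logps, sparse_logps)]
    # Reject rollouts containing a token the dense policy finds very unlikely.
    if min(ratios) < reject_below:
        return None
    # Clip large ratios to keep the variance of the weighted update bounded.
    return [min(r, clip_max) for r in ratios]
```

Rollouts that survive rejection would then have their per-token loss terms scaled by these weights; how Sparse-RL defines and applies its own criteria is described in the paper, not here.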