Jianjin Wang
2026
Bypassing Neural Evaluations for Fast Audio Editing via Adaptive Trajectory Extrapolation
Xiaoqian Liu | Zhengkun Ge | Jianjin Wang | Haoran Zhang | Yuan Ge | Kaiyan Chang | Chen Xu | Tong Xiao | Zhengtao Yu | Linfeng Zhang | JingBo Zhu
Findings of the Association for Computational Linguistics: ACL 2026
Recent advancements in audio diffusion models have significantly improved text-to-audio editing via inversion techniques. However, these models typically rely on dense, fixed-step sampling trajectories to maintain structural integrity during inversion and generation, leading to prohibitive computational costs. We propose AdaTE, a model-agnostic Adaptive Trajectory Extrapolation framework that accelerates the inversion-based editing process by dynamically evaluating only the most critical generative phases. Specifically, we introduce a hierarchical probing mechanism that monitors curvature acceleration and information gain to detect pivotal transitions within the latent flow. This allows the model to selectively skip redundant segments via linear extrapolation while preserving dense neural evaluations for complex semantic changes. Extensive experiments across AudioLDM2, Auffusion, and Tango2 demonstrate that AdaTE achieves up to a 3.9× speedup with negligible loss in fidelity. AdaTE significantly shifts the Pareto frontier, providing an efficient solution for high-fidelity audio synthesis and editing.
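The idea of skipping redundant segments via linear extrapolation can be illustrated with a toy sampler. The sketch below is not the AdaTE algorithm: it does not reproduce the paper's hierarchical probing, curvature acceleration, or information gain criteria. It assumes a hypothetical `velocity` function standing in for an expensive neural evaluation, a plain Euler integrator, and a simple relative-change test (with an illustrative threshold `tol`) to decide when the next evaluation can be replaced by extrapolating from the two most recent velocities.

```python
import numpy as np

def velocity(x, t):
    """Toy stand-in for an expensive denoiser call (hypothetical, not from the paper).
    The Gaussian bump around t = 0.5 mimics a phase where the trajectory bends sharply."""
    return -x * (1.0 + 4.0 * np.exp(-((t - 0.5) ** 2) / 0.005))

def adaptive_sample(x0, num_steps=50, tol=0.05):
    """Euler integration that skips an evaluation when the velocity is changing slowly.

    Returns the final state and the number of calls made to `velocity`.
    """
    dt = 1.0 / num_steps
    x = np.asarray(x0, dtype=float)
    prev_v, last_v = None, None   # two most recent velocities
    skipped_last = False          # never extrapolate twice in a row
    evals = 0
    for i in range(num_steps):
        t = i * dt
        can_skip = False
        if prev_v is not None and not skipped_last:
            # Relative change between the last two velocities (a crude curvature proxy).
            change = np.linalg.norm(last_v - prev_v) / (np.linalg.norm(last_v) + 1e-12)
            can_skip = change < tol
        if can_skip:
            v = last_v + (last_v - prev_v)  # linear extrapolation, no model call
            skipped_last = True
        else:
            v = velocity(x, t)              # dense evaluation
            evals += 1
            skipped_last = False
        prev_v, last_v = last_v, v
        x = x + dt * v
    return x, evals

if __name__ == "__main__":
    x_fast, n_fast = adaptive_sample([1.0, -2.0])
    x_dense, n_dense = adaptive_sample([1.0, -2.0], tol=0.0)  # tol=0 disables skipping
    print("evaluations:", n_fast, "vs", n_dense)
    print("max abs difference:", np.max(np.abs(x_fast - x_dense)))
```

In this toy setting, evaluations are skipped in the flat parts of the trajectory and kept near the bump, which is the qualitative behavior the abstract describes. The actual speedup and fidelity reported for AdaTE depend on its own probing criteria and models.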
On the Emotion Understanding of Synthesized Speech
Yuan Ge | Haishu Zhao | AoKai Hao | Junxiang Zhang | Bei Li | Xiaoqian Liu | Chenglong Wang | Jianjin Wang | Bingsen Zhou | Bingyu Liu | JingBo Zhu | Zhengtao Yu | Tong Xiao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and that paralinguistic understanding in SLMs remains challenging.