DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding

Jiaming Zhou; Xuxin Cheng; Shiwan Zhao; Yuhang Jia; Cao Liu; Ke Zeng; Xunliang Cai; Yong Qin

Abstract

Autoregressive (AR) large audio language models (LALMs) such as Qwen-2.5-Omni have achieved strong performance on audio understanding and interaction, but scaling them remains costly in data and computation, and strictly sequential decoding limits inference efficiency. Diffusion large language models (dLLMs) have recently been shown to make effective use of limited training data, and prior work on DIFFA indicates that replacing an AR backbone with a diffusion counterpart can substantially improve audio understanding under matched settings, albeit at a proof-of-concept scale without large-scale instruction tuning, preference alignment, or practical decoding schemes. We introduce DIFFA-2, a practical diffusion-based LALM for general audio understanding. DIFFA-2 upgrades the speech encoder, employs dual semantic and acoustic adapters, and is trained with a four-stage curriculum that combines semantic and acoustic alignment, large-scale supervised fine-tuning, and variance-reduced preference optimization, using only fully open-source corpora. Experiments on MMSU, MMAU, and MMAR show that DIFFA-2 consistently improves over DIFFA and is competitive with strong AR LALMs under practical training budgets, supporting diffusion-based modeling as a viable backbone for large-scale audio understanding.

Anthology ID: 2026.findings-acl.235
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 4789–4805
URL: https://aclanthology.org/2026.findings-acl.235/
Cite (ACL): Jiaming Zhou, Xuxin Cheng, Shiwan Zhao, Yuhang Jia, Cao Liu, Ke Zeng, Xunliang Cai, and Yong Qin. 2026. DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding. In Findings of the Association for Computational Linguistics: ACL 2026, pages 4789–4805, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): DIFFA-2: A Practical Diffusion Large Language Model for General Audio Understanding (Zhou et al., Findings 2026)
PDF: https://aclanthology.org/2026.findings-acl.235.pdf
Checklist: 2026.findings-acl.235.checklist.pdf
