The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues
Abstract
This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.
- Anthology ID:
- 2026.iwsds-1.7
- Volume:
- Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
- Month:
- February
- Year:
- 2026
- Address:
- Trento, Italy
- Editors:
- Giuseppe Riccardi, Seyed Mahed Mousavi, Maria Ines Torres, Koichiro Yoshino, Zoraida Callejas, Shammur Absar Chowdhury, Yun-Nung Chen, Frederic Bechet, Joakim Gustafson, Géraldine Damnati, Alex Papangelis, Luis Fernando D’Haro, John Mendonça, Raffaella Bernardi, Dilek Hakkani-Tur, Giuseppe "Pino" Di Fabbrizio, Tatsuya Kawahara, Firoj Alam, Gokhan Tur, Michael Johnston
- Venue:
- IWSDS
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 76–82
- Language:
- URL:
- https://aclanthology.org/2026.iwsds-1.7/
- DOI:
- Bibkey:
- tam-etal-2026-context
- Cite (ACL):
- Zhi Rui Tam, Wen Yu Chang, and Yun-Nung Chen. 2026. The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues. In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 76–82, Trento, Italy. Association for Computational Linguistics.
- Cite (Informal):
- The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues (Tam et al., IWSDS 2026)
- PDF:
- https://aclanthology.org/2026.iwsds-1.7.pdf
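For orientation only: the abstract contrasts a modular ASR → LLM → TTS cascade with an end-to-end AudioLM. The sketch below is a minimal, hypothetical rendering of that structural difference, assuming stubbed-out components; it is not code from the paper, and every name in it is invented.

from dataclasses import dataclass

# Illustrative-only sketch of the two architectures the abstract compares;
# all names here are hypothetical and none of this comes from the paper.

@dataclass
class Turn:
    speaker: str   # "user" or "system"
    text: str      # transcript of what was said

# Stub components standing in for real models.
def asr(audio: bytes) -> str:
    return "<transcript>"    # speech -> text

def llm(history: list) -> str:
    return "<reply>"         # full text history -> text reply

def tts(text: str) -> bytes:
    return b"<audio>"        # text -> speech

def audio_lm(audio_history: list) -> bytes:
    return b"<audio>"        # audio context -> audio reply, no text pivot

# Modular cascade: dialogue state is explicit text, so the LLM
# re-reads the whole transcript on every turn.
def modular_turn(audio_in: bytes, history: list) -> bytes:
    history.append(Turn("user", asr(audio_in)))
    reply = llm(history)
    history.append(Turn("system", reply))
    return tts(reply)

# E2E AudioLM: context lives only in the accumulated audio stream;
# the paper attributes the observed cross-turn degradation to weaknesses
# here, specifically context maintenance and topic tracking.
def e2e_turn(audio_in: bytes, audio_history: list) -> bytes:
    audio_history.append(audio_in)
    return audio_lm(audio_history)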
Export citation
BibTeX
@inproceedings{tam-etal-2026-context,
title = "The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues",
author = "Tam, Zhi Rui and
Chang, Wen Yu and
Chen, Yun-Nung",
editor = "Riccardi, Giuseppe and
Mousavi, Seyed Mahed and
Torres, Maria Ines and
Yoshino, Koichiro and
Callejas, Zoraida and
Chowdhury, Shammur Absar and
Chen, Yun-Nung and
Bechet, Frederic and
Gustafson, Joakim and
Damnati, G{\'e}raldine and
Papangelis, Alex and
D{'}Haro, Luis Fernando and
Mendon{\c{c}}a, John and
Bernardi, Raffaella and
Hakkani-Tur, Dilek and
Di Fabbrizio, Giuseppe {``}Pino{''} and
Kawahara, Tatsuya and
Alam, Firoj and
Tur, Gokhan and
Johnston, Michael",
booktitle = "Proceedings of the 16th International Workshop on Spoken Dialogue System Technology",
month = feb,
year = "2026",
address = "Trento, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.iwsds-1.7/",
pages = "76--82",
abstract = "This study systematically compares end-to-end ({E}2{E}) audio language models ({A}udio{LM}s) against modular ({ASR}, {LLM}, {TTS}) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that {E}2{E} configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the {E}2{E} models' dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of {A}udio{LM}s and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements."
}

MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="tam-etal-2026-context">
<titleInfo>
<title>The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues</title>
</titleInfo>
<name type="personal">
<namePart type="given">Zhi</namePart>
<namePart type="given">Rui</namePart>
<namePart type="family">Tam</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wen</namePart>
<namePart type="given">Yu</namePart>
<namePart type="family">Chang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-02</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 16th International Workshop on Spoken Dialogue System Technology</title>
</titleInfo>
<name type="personal">
<namePart type="given">Giuseppe</namePart>
<namePart type="family">Riccardi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Seyed</namePart>
<namePart type="given">Mahed</namePart>
<namePart type="family">Mousavi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Maria</namePart>
<namePart type="given">Ines</namePart>
<namePart type="family">Torres</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Koichiro</namePart>
<namePart type="family">Yoshino</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zoraida</namePart>
<namePart type="family">Callejas</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shammur</namePart>
<namePart type="given">Absar</namePart>
<namePart type="family">Chowdhury</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Frederic</namePart>
<namePart type="family">Bechet</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joakim</namePart>
<namePart type="family">Gustafson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Géraldine</namePart>
<namePart type="family">Damnati</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alex</namePart>
<namePart type="family">Papangelis</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Luis</namePart>
<namePart type="given">Fernando</namePart>
<namePart type="family">D’Haro</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">John</namePart>
<namePart type="family">Mendonça</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Raffaella</namePart>
<namePart type="family">Bernardi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dilek</namePart>
<namePart type="family">Hakkani-Tur</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Giuseppe</namePart>
<namePart type="given">”Pino”</namePart>
<namePart type="family">Di Fabbrizio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tatsuya</namePart>
<namePart type="family">Kawahara</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Firoj</namePart>
<namePart type="family">Alam</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Gokhan</namePart>
<namePart type="family">Tur</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Michael</namePart>
<namePart type="family">Johnston</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Trento, Italy</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.</abstract>
<identifier type="citekey">tam-etal-2026-context</identifier>
<location>
<url>https://aclanthology.org/2026.iwsds-1.7/</url>
</location>
<part>
<date>2026-02</date>
<extent unit="page">
<start>76</start>
<end>82</end>
</extent>
</part>
</mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues
%A Tam, Zhi Rui
%A Chang, Wen Yu
%A Chen, Yun-Nung
%Y Riccardi, Giuseppe
%Y Mousavi, Seyed Mahed
%Y Torres, Maria Ines
%Y Yoshino, Koichiro
%Y Callejas, Zoraida
%Y Chowdhury, Shammur Absar
%Y Chen, Yun-Nung
%Y Bechet, Frederic
%Y Gustafson, Joakim
%Y Damnati, Géraldine
%Y Papangelis, Alex
%Y D’Haro, Luis Fernando
%Y Mendonça, John
%Y Bernardi, Raffaella
%Y Hakkani-Tur, Dilek
%Y Di Fabbrizio, Giuseppe "Pino"
%Y Kawahara, Tatsuya
%Y Alam, Firoj
%Y Tur, Gokhan
%Y Johnston, Michael
%S Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
%D 2026
%8 February
%I Association for Computational Linguistics
%C Trento, Italy
%F tam-etal-2026-context
%X This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.
%U https://aclanthology.org/2026.iwsds-1.7/
%P 76-82
Markdown (Informal)
[The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues](https://aclanthology.org/2026.iwsds-1.7/) (Tam et al., IWSDS 2026)