The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues
Abstract
This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.
- Anthology ID:
- 2026.iwsds-1.7
- Volume:
- Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
- Month:
- February
- Year:
- 2026
- Address:
- Trento, Italy
- Editors:
- Giuseppe Riccardi, Seyed Mahed Mousavi, Maria Ines Torres, Koichiro Yoshino, Zoraida Callejas, Shammur Absar Chowdhury, Yun-Nung Chen, Frederic Bechet, Joakim Gustafson, Géraldine Damnati, Alex Papangelis, Luis Fernando D’Haro, John Mendonça, Raffaella Bernardi, Dilek Hakkani-Tur, Giuseppe "Pino" Di Fabbrizio, Tatsuya Kawahara, Firoj Alam, Gokhan Tur, Michael Johnston
- Venue:
- IWSDS
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 76–82
- Language:
- URL:
- https://aclanthology.org/2026.iwsds-1.7/
- DOI:
- Bibkey:
- tam-etal-2026-context
- Cite (ACL):
- Zhi Rui Tam, Wen Yu Chang, and Yun-Nung Chen. 2026. The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues. In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 76–82, Trento, Italy. Association for Computational Linguistics.
- Cite (Informal):
- The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues (Tam et al., IWSDS 2026)
- PDF:
- https://aclanthology.org/2026.iwsds-1.7.pdf
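For orientation only: the abstract contrasts a modular ASR → LLM → TTS cascade with an end-to-end AudioLM. The sketch below is a minimal, hypothetical rendering of that structural difference, assuming stubbed-out components; it is not code from the paper, and every name in it is invented.

from dataclasses import dataclass

# Illustrative-only sketch of the two architectures the abstract compares;
# all names here are hypothetical and none of this comes from the paper.

@dataclass
class Turn:
    speaker: str   # "user" or "system"
    text: str      # transcript of what was said

# Stub components standing in for real models.
def asr(audio: bytes) -> str:
    return "<transcript>"    # speech -> text

def llm(history: list) -> str:
    return "<reply>"         # full text history -> text reply

def tts(text: str) -> bytes:
    return b"<audio>"        # text -> speech

def audio_lm(audio_history: list) -> bytes:
    return b"<audio>"        # audio context -> audio reply, no text pivot

# Modular cascade: dialogue state is explicit text, so the LLM
# re-reads the whole transcript on every turn.
def modular_turn(audio_in: bytes, history: list) -> bytes:
    history.append(Turn("user", asr(audio_in)))
    reply = llm(history)
    history.append(Turn("system", reply))
    return tts(reply)

# E2E AudioLM: context lives only in the accumulated audio stream;
# the paper attributes the observed cross-turn degradation to weaknesses
# here, specifically context maintenance and topic tracking.
def e2e_turn(audio_in: bytes, audio_history: list) -> bytes:
    audio_history.append(audio_in)
    return audio_lm(audio_history)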
Export citation
BibTeX
@inproceedings{tam-etal-2026-context,
title = "The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues",
author = "Tam, Zhi Rui and
Chang, Wen Yu and
Chen, Yun-Nung",
editor = "Riccardi, Giuseppe and
Mousavi, Seyed Mahed and
Torres, Maria Ines and
Yoshino, Koichiro and
Callejas, Zoraida and
Chowdhury, Shammur Absar and
Chen, Yun-Nung and
Bechet, Frederic and
Gustafson, Joakim and
Damnati, G{\'e}raldine and
Papangelis, Alex and
D{'}Haro, Luis Fernando and
Mendon{\c{c}}a, John and
Bernardi, Raffaella and
Hakkani-Tur, Dilek and
Di Fabbrizio, Giuseppe {``}Pino{''} and
Kawahara, Tatsuya and
Alam, Firoj and
Tur, Gokhan and
Johnston, Michael",
booktitle = "Proceedings of the 16th International Workshop on Spoken Dialogue System Technology",
month = feb,
year = "2026",
address = "Trento, Italy",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.iwsds-1.7/",
pages = "76--82",
abstract = "This study systematically compares end-to-end ({E}2{E}) audio language models ({A}udio{LM}s) against modular ({ASR}, {LLM}, {TTS}) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that {E}2{E} configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the {E}2{E} models' dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of {A}udio{LM}s and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements."
}

MODS XML
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="tam-etal-2026-context">
<titleInfo>
<title>The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues</title>
</titleInfo>
<name type="personal">
<namePart type="given">Zhi</namePart>
<namePart type="given">Rui</namePart>
<namePart type="family">Tam</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wen</namePart>
<namePart type="given">Yu</namePart>
<namePart type="family">Chang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-02</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 16th International Workshop on Spoken Dialogue System Technology</title>
</titleInfo>
<name type="personal">
<namePart type="given">Giuseppe</namePart>
<namePart type="family">Riccardi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Seyed</namePart>
<namePart type="given">Mahed</namePart>
<namePart type="family">Mousavi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Maria</namePart>
<namePart type="given">Ines</namePart>
<namePart type="family">Torres</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Koichiro</namePart>
<namePart type="family">Yoshino</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zoraida</namePart>
<namePart type="family">Callejas</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Shammur</namePart>
<namePart type="given">Absar</namePart>
<namePart type="family">Chowdhury</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yun-Nung</namePart>
<namePart type="family">Chen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Frederic</namePart>
<namePart type="family">Bechet</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Joakim</namePart>
<namePart type="family">Gustafson</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Géraldine</namePart>
<namePart type="family">Damnati</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Alex</namePart>
<namePart type="family">Papangelis</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Luis</namePart>
<namePart type="given">Fernando</namePart>
<namePart type="family">D’Haro</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">John</namePart>
<namePart type="family">Mendonça</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Raffaella</namePart>
<namePart type="family">Bernardi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Dilek</namePart>
<namePart type="family">Hakkani-Tur</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Giuseppe</namePart>
<namePart type="given">”Pino”</namePart>
<namePart type="family">Di Fabbrizio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Tatsuya</namePart>
<namePart type="family">Kawahara</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Firoj</namePart>
<namePart type="family">Alam</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Gokhan</namePart>
<namePart type="family">Tur</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Michael</namePart>
<namePart type="family">Johnston</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Trento, Italy</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.</abstract>
<identifier type="citekey">tam-etal-2026-context</identifier>
<location>
<url>https://aclanthology.org/2026.iwsds-1.7/</url>
</location>
<part>
<date>2026-02</date>
<extent unit="page">
<start>76</start>
<end>82</end>
</extent>
</part>
</mods>
</modsCollection>
Endnote
%0 Conference Proceedings
%T The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues
%A Tam, Zhi Rui
%A Chang, Wen Yu
%A Chen, Yun-Nung
%Y Riccardi, Giuseppe
%Y Mousavi, Seyed Mahed
%Y Torres, Maria Ines
%Y Yoshino, Koichiro
%Y Callejas, Zoraida
%Y Chowdhury, Shammur Absar
%Y Chen, Yun-Nung
%Y Bechet, Frederic
%Y Gustafson, Joakim
%Y Damnati, Géraldine
%Y Papangelis, Alex
%Y D’Haro, Luis Fernando
%Y Mendonça, John
%Y Bernardi, Raffaella
%Y Hakkani-Tur, Dilek
%Y Di Fabbrizio, Giuseppe "Pino"
%Y Kawahara, Tatsuya
%Y Alam, Firoj
%Y Tur, Gokhan
%Y Johnston, Michael
%S Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
%D 2026
%8 February
%I Association for Computational Linguistics
%C Trento, Italy
%F tam-etal-2026-context
%X This study systematically compares end-to-end (E2E) audio language models (AudioLMs) against modular (ASR, LLM, TTS) systems for multi-phase task-oriented dialogues. We evaluate open-source models on key metrics: conversational naturalness and dialogue consistency. Our findings show that E2E configurations consistently underperform their modular counterparts, exhibiting severe degradation in dialogue quality across turns. Investigating this failure, our analysis reveals that the core issue lies in the E2E models’ dialogue modeling capabilities, specifically in context maintenance and topic tracking. This work highlights a critical gap between the purported low-latency benefit of AudioLMs and their practical ability to maintain coherence in complex, multi-turn dialogues, suggesting a need for focused architectural improvements.
%U https://aclanthology.org/2026.iwsds-1.7/
%P 76-82
Markdown (Informal)
[The Context Trap: Why End-to-End Audio Language Models Fail Multi-turn Dialogues](https://aclanthology.org/2026.iwsds-1.7/) (Tam et al., IWSDS 2026)