Effects of Dialogue Corpora Properties on Fine-Tuning a Moshi-Based Spoken Dialogue Model

Yuto Abe, Mao Saeki, Atsumoto Ohashi, Shinnosuke Takamichi, Shiyna Fujie, Tetsunori Kobayashi, Tetsuji Ogawa, Ryuichiro Higashinaka


Abstract
This study investigates how interactional characteristics of spoken dialogue corpora influence the learning process and resulting behavior of speech language models for full-duplex dialogue systems. While previous research has mainly focused on improving acoustic and linguistic quality, an effective dialogue system must also capture and reproduce task-dependent interactional dynamics such as conversational tempo and turn-taking patterns. To analyze these properties, we evaluated multiple dialogue corpora using NISQA for speech quality, LLM-as-a-Judge for linguistic and semantic appropriateness, and four timing-based indicators: inter-pausal units, pause, gap, and overlap. A curriculum learning strategy was applied to fine-tune a Moshi-based full-duplex dialogue model by incrementally combining corpora with different interactional characteristics. Experimental results on a dialogue continuation task showed that corpus-specific interactional patterns effectively shape model behavior. Chat-style corpora facilitated natural rhythms with moderate overlaps and gaps, whereas consultation-style corpora promoted more stable and deliberate timing. Fine-tuning with high-quality audio improved speech quality, while using task-mismatched data degraded linguistic coherence.
Anthology ID:
2026.iwsds-1.10
Volume:
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Month:
February
Year:
2026
Address:
Trento, Italy
Editors:
Giuseppe Riccardi, Seyed Mahed Mousavi, Maria Ines Torres, Koichiro Yoshino, Zoraida Callejas, Shammur Absar Chowdhury, Yun-Nung Chen, Frederic Bechet, Joakim Gustafson, Géraldine Damnati, Alex Papangelis, Luis Fernando D’Haro, John Mendonça, Raffaella Bernardi, Dilek Hakkani-Tur, Giuseppe "Pino" Di Fabbrizio, Tatsuya Kawahara, Firoj Alam, Gokhan Tur, Michael Johnston
Venue:
IWSDS
SIG:
Publisher:
Association for Computational Linguistics
Note:
Pages:
104–108
Language:
URL:
https://aclanthology.org/2026.iwsds-1.10/
DOI:
Bibkey:
Cite (ACL):
Yuto Abe, Mao Saeki, Atsumoto Ohashi, Shinnosuke Takamichi, Shiyna Fujie, Tetsunori Kobayashi, Tetsuji Ogawa, and Ryuichiro Higashinaka. 2026. Effects of Dialogue Corpora Properties on Fine-Tuning a Moshi-Based Spoken Dialogue Model. In Proceedings of the 16th International Workshop on Spoken Dialogue System Technology, pages 104–108, Trento, Italy. Association for Computational Linguistics.
Cite (Informal):
Effects of Dialogue Corpora Properties on Fine-Tuning a Moshi-Based Spoken Dialogue Model (Abe et al., IWSDS 2026)
Copy Citation:
PDF:
https://aclanthology.org/2026.iwsds-1.10.pdf