Investigating the Impact of Incremental Processing and Voice Activity Projection on Spoken Dialogue Systems

Yuya Chiba, Ryuichiro Higashinaka


Abstract
The introduction of large language models (LLMs) has significantly improved the naturalness of responses in spoken dialogue systems, but many challenges remain before human-like turn-taking can be achieved. A turn-taking model called Voice Activity Projection (VAP) is gaining attention because it can be trained in an unsupervised manner on spoken dialogue data between two speakers. For such a turn-taking model to be fully effective, a system must begin generating its response as soon as a turn-shift is detected. Incremental response generation makes this possible: the system processes user speech incrementally and generates responses from partial speech recognition results, which reduces the delay before it responds. Combining incremental response generation with VAP-based turn-taking should enable spoken dialogue systems to achieve faster and more natural turn-taking. However, their effectiveness remains unclear because they have not yet been evaluated in real-world systems. In this study, we developed spoken dialogue systems that incorporate incremental response generation and VAP-based turn-taking and evaluated their impact on task success and dialogue satisfaction through user assessments.
Anthology ID:
2025.coling-main.249
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
3687–3696
URL:
https://aclanthology.org/2025.coling-main.249/
Cite (ACL):
Yuya Chiba and Ryuichiro Higashinaka. 2025. Investigating the Impact of Incremental Processing and Voice Activity Projection on Spoken Dialogue Systems. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3687–3696, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Investigating the Impact of Incremental Processing and Voice Activity Projection on Spoken Dialogue Systems (Chiba & Higashinaka, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.249.pdf