Mao Saeki
2026
Reproducing Proficiency-Conditioned Dialogue Features with Full-duplex Spoken Dialogue Models
Takao Obi | Sadahiro Yoshikawa | Mao Saeki | Masaki Eguchi | Yoichi Matsuyama
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
Real-time, human-centered conversational AI requires systems that handle spoken dialogue with overlap and rapid turn-taking. Although full-duplex models promise these capabilities, empirical work applying them to conversational AI is still nascent. To fill this gap, this study investigates whether full-duplex models can reproduce human dialogue features. We adapt a full-duplex spoken dialogue model to a large corpus of second-language (L2) learner interviews and train proficiency-conditioned models. We then conduct real-time interview sessions between these models and a spoken dialogue system designed to elicit spontaneous learner speech, and analyze reaction time, response frequency, and fluency metrics across aggregated CEFR levels (A/B/C). Our results show that the proficiency-conditioned models partially reproduce the levelwise trends and distributions observed in human interviews across multiple metrics. These findings suggest that full-duplex models can reproduce features of human dialogue and offer a promising foundation for conversational AI systems.
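As an illustration of the levelwise analysis the abstract describes, here is a minimal, hypothetical sketch (not code from the paper): per-session metrics such as reaction time are grouped into the aggregated CEFR bands A/B/C and compared between human and model interviews. The data layout, field names, and toy values are assumptions.

```python
from collections import defaultdict
from statistics import mean

# Map fine-grained CEFR sublevels to the aggregated A/B/C bands.
CEFR_BAND = {"A1": "A", "A2": "A", "B1": "B", "B2": "B", "C1": "C", "C2": "C"}

def levelwise_means(sessions):
    """sessions: iterable of (cefr_sublevel, metric_value) pairs,
    e.g., mean reaction time per interview session (assumed format)."""
    by_band = defaultdict(list)
    for level, value in sessions:
        by_band[CEFR_BAND[level]].append(value)
    return {band: mean(vals) for band, vals in sorted(by_band.items())}

# Toy numbers only: check whether levelwise trends align.
human = levelwise_means([("A2", 1.4), ("B1", 1.1), ("C1", 0.8)])
model = levelwise_means([("A1", 1.5), ("B2", 1.0), ("C2", 0.9)])
print(human, model)
```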
Effects of Dialogue Corpora Properties on Fine-Tuning a Moshi-Based Spoken Dialogue Model
Yuto Abe | Mao Saeki | Atsumoto Ohashi | Shinnosuke Takamichi | Shinya Fujie | Tetsunori Kobayashi | Tetsuji Ogawa | Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
This study investigates how interactional characteristics of spoken dialogue corpora influence the learning process and resulting behavior of speech language models for full-duplex dialogue systems. While previous research has mainly focused on improving acoustic and linguistic quality, an effective dialogue system must also capture and reproduce task-dependent interactional dynamics such as conversational tempo and turn-taking patterns. To analyze these properties, we evaluated multiple dialogue corpora using NISQA for speech quality, LLM-as-a-Judge for linguistic and semantic appropriateness, and four timing-based indicators: inter-pausal units, pause, gap, and overlap. A curriculum learning strategy was applied to fine-tune a Moshi-based full-duplex dialogue model by incrementally combining corpora with different interactional characteristics. Experimental results on a dialogue continuation task showed that corpus-specific interactional patterns effectively shape model behavior. Chat-style corpora facilitated natural rhythms with moderate overlaps and gaps, whereas consultation-style corpora promoted more stable and deliberate timing. Fine-tuning with high-quality audio improved speech quality, while using task-mismatched data degraded linguistic coherence.
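The four timing-based indicators named above are standard conversational timing measures; the following is a minimal sketch of how they could be computed from per-speaker speech activity segments, under assumed definitions (IPUs merged at a 200 ms silence threshold; pauses between same-speaker IPUs, gaps at speaker changes, overlaps as simultaneous speech). The segment format and thresholds are assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    speaker: str   # e.g., "A" (system) or "B" (user)
    start: float   # seconds
    end: float

def timing_indicators(segments, ipu_threshold=0.2):
    """Merge each speaker's segments into inter-pausal units (IPUs),
    then classify inter-IPU silences as pauses (same speaker) or gaps
    (speaker change), and measure overlapping speech. Simplification:
    only consecutive IPUs in the merged timeline are compared."""
    ipus = []
    for spk in {s.speaker for s in segments}:
        cur = None
        for s in sorted((x for x in segments if x.speaker == spk),
                        key=lambda x: x.start):
            if cur and s.start - cur.end < ipu_threshold:
                cur = Segment(spk, cur.start, max(cur.end, s.end))  # merge
            else:
                if cur:
                    ipus.append(cur)
                cur = Segment(spk, s.start, s.end)
        if cur:
            ipus.append(cur)
    ipus.sort(key=lambda s: s.start)

    pauses, gaps, overlaps = [], [], []
    for prev, nxt in zip(ipus, ipus[1:]):
        silence = nxt.start - prev.end
        if silence >= 0:
            (pauses if nxt.speaker == prev.speaker else gaps).append(silence)
        else:
            overlaps.append(min(prev.end, nxt.end) - nxt.start)
    return {"n_ipus": len(ipus), "pause": pauses, "gap": gaps, "overlap": overlaps}
```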
2024
InteLLA: Intelligent Language Learning Assistant for Assessing Language Proficiency through Interviews and Roleplays
Mao Saeki | Hiroaki Takatsu | Fuma Kurata | Shungo Suzuki | Masaki Eguchi | Ryuki Matsuura | Kotaro Takizawa | Sadahiro Yoshikawa | Yoichi Matsuyama
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
In this paper, we propose a multimodal dialogue system designed to elicit spontaneous speech samples from second language learners for reliable oral proficiency assessment. The primary challenge in utilizing dialogue systems for language testing lies in obtaining ratable speech samples that demonstrate the user's full interactional capabilities. To address this, we developed a virtual agent capable of conducting extended interactions, consisting of a 15-minute interview and a 10-minute roleplay. The interview component is a system-led dialogue featuring questions that aim to elicit specific language functions from the user. The system dynamically adjusts the topic difficulty based on real-time assessments to provoke linguistic breakdowns as evidence of the user's upper limit of proficiency. The roleplay component is a mixed-initiative, collaborative conversation aimed at evaluating the user's interactional competence. Two experiments were conducted to evaluate our system's reliability in assessing oral proficiency. In experiment 1, we collected a total of 340 interview sessions, 45-72% of which successfully elicited the user's upper linguistic limit through adjustment of the topic difficulty levels. In experiment 2, based on the roleplay dataset of 75 speakers, the interactional speech elicited by our system was found to be as ratable as that elicited by human examiners, as indicated by the reliability index of interactional ratings. These results demonstrate that our system can elicit ratable interactional performances comparable to those elicited by human interviewers. Finally, we report on the deployment of our system with over 10,000 university students in a real-world testing scenario.
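The adaptive topic-difficulty behavior described above could look like the following hypothetical sketch: after each topic, a real-time proficiency estimate steps the difficulty up when the learner copes and down after a breakdown, probing for the upper limit. The function name, difficulty scale, and update rule are illustrative assumptions, not the paper's algorithm.

```python
def next_difficulty(current: int, estimated_level: int,
                    min_d: int = 1, max_d: int = 6) -> int:
    """Step the topic difficulty toward (and slightly past) the
    estimated proficiency so breakdowns can surface at the upper limit."""
    if estimated_level >= current:
        return min(current + 1, max_d)   # learner coped: probe harder
    return max(current - 1, min_d)       # breakdown: back off

difficulty = 3
for estimate in [3, 4, 4, 2]:            # simulated per-topic estimates
    difficulty = next_difficulty(difficulty, estimate)
    print("next topic difficulty:", difficulty)
```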
Assessing Interactional Competence with Multimodal Dialog Systems
Mao Saeki
Proceedings of the 20th Workshop of Young Researchers' Roundtable on Spoken Dialogue Systems
My research interests lie in multimodal dialog systems, especially in turn-taking and the understanding and generation of non-verbal cues. I am also interested in bringing dialog system research into industry and making virtual agents practical in real-world settings. I have been working on the Intelligent Language Learning Assistant (InteLLA) system, a virtual agent designed to provide fully automated English proficiency assessments through oral conversations. This project is driven by the practical need to address the lack of opportunities for second-language learners to assess and practice their conversation skills.