Yoichi Matsuyama


2026

Real-time, human-centered conversational AI requires systems that handle spoken dialogue with overlap and rapid turn-taking. Although full-duplex models promise these capabilities, empirical work applying them to conversational AI is still nascent. To fill this gap, this study investigates whether the full-duplex model can reproduce the human dialogue features. We adapt a full-duplex spoken dialogue model to a large corpus of second-language (L2) learner interviews and train proficiency-conditioned models. We then conduct real-time interview sessions between these models and a spoken dialogue system designed to elicit spontaneous learner speech, and analyze reaction time, response frequency, and fluency metrics across aggregated CEFR levels (A/B/C). Our results show that proficiency-conditioned models partially reproduce levelwise trends and distributions observed in human interviews across multiple metrics. These findings suggest that full-duplex models can reproduce dialogue features of human dialogues and offer a promising foundation for conversational AI systems.

2024

In this paper, we propose a multimodal dialogue system designed to elicit spontaneous speech samples from second language learners for reliable oral proficiency assessment. The primary challenge in utilizing dialogue systems for language testing lies in obtaining ratable speech samples that demonstrates the user’s full capabilities of interactional skill. To address this, we developed a virtual agent capable of conducting extended interactions, consisting of a 15-minute interview and 10-minute roleplay. The interview component is a system-led dialogue featuring questions that aim to elicit specific language functions from the user. The system dynamically adjusts the topic difficulty based on real-time assessments to provoke linguistic breakdowns as evidence of their upper limit of proficiency. The roleplay component is a mixed-initiative, collaborative conversation aimed at evaluating the user’s interactional competence. Two experiments were conducted to evaluate our system’s reliability in assessing oral proficiency. In experiment 1, we collected a total of 340 interview sessions, 45-72% of which successfully elicited upper linguistic limit by adjusting the topic difficulty levels. In experiment 2, based on the ropleplay dataset of 75 speakers, the interactional speech elicited by our system was found to be as ratable as those by human examiners, indicated by the reliability index of interactional ratings. These results demonstrates that our system can elicit ratable interactional performances comparable to those elicited by human interviewers. Finally, we report on the deployment of our system with over 10,000 university students in a real-world testing scenario.

2021

We propose a personalized dialogue scenario generation system which transmits efficient and coherent information with a real-time extractive summarization method optimized by an Ising machine. The summarization problem is formulated as a quadratic unconstraint binary optimization (QUBO) problem, which extracts sentences that maximize the sum of the degree of user’s interest in the sentences of documents with the discourse structure of each document and the total utterance time as constraints. To evaluate the proposed method, we constructed a news article corpus with annotations of the discourse structure, users’ profiles, and interests in sentences and topics. The experimental results confirmed that a Digital Annealer, which is a simulated annealing-based Ising machine, can solve our QUBO model in a practical time without violating the constraints using this dataset.

2020

As smart speakers and conversational robots become ubiquitous, the demand for expressive speech synthesis has increased. In this paper, to control the emotional parameters of the speech synthesis according to certain dialogue contents, we construct a news dataset with emotion labels (“positive,” “negative,” or “neutral”) annotated for each sentence. We then propose a method to identify emotion labels using a model combining BERT and BiLSTM-CRF, and evaluate its effectiveness using the constructed dataset. The results showed that the classification model performance can be efficiently improved by preferentially annotating news articles with low confidence in the human-in-the-loop machine learning framework.

2016

2013