Kouki Miyazawa
2025
Paralinguistic Attitude Recognition for Spoken Dialogue Systems
Kouki Miyazawa
|
Zhi Zhu
|
Yoshinao Sato
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Although paralinguistic information is critical for human communication, most spoken dialogue systems ignore such information, hindering natural communication between humans and machines. This study addresses the recognition of paralinguistic attitudes in user speech. Specifically, we focus on four essential attitudes for generating an appropriate system response, namely agreement, disagreement, questions, and stalling. The proposed model can help a dialogue system better understand what the user is trying to convey. In our experiments, we trained and evaluated a model that classified paralinguistic attitudes on a reading-speech dataset without using linguistic information. The proposed model outperformed human perception. Furthermore, experimental results indicate that speech enhancement alleviates the degradation of model performance caused by background noise, whereas reverberation remains a challenge.
Transition Relevance Point Detection for Spoken Dialogue Systems with Self-Attention Transformer
Kouki Miyazawa
|
Yoshinao Sato
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Most conventional spoken dialogue systems determine when to respond based on the elapsed time of silence following user speech utterances. This approach often results in failures of turn-taking, disrupting smooth communications with users. This study addresses the detection of when it is acceptable for the dialogue system to start speaking. Specifically, we aim to detect transition relevant points (TRPs) rather than predict whether the dialogue participants will actually start speaking. To achieve this, we employ a self-supervised speech representation using contrastive predictive coding and a self-attention transformer. The proposed model, TRPDformer, was trained and evaluated on the corpus of everyday Japanese conversation. TRPDformer outperformed a baseline model based on the elapsed time of silence. Furthermore, third-party listeners rated the timing of system responses determined using the proposed model as superior to that of the baseline in a preference test.
2020
Quality Estimation for Partially Subjective Classification Tasks via Crowdsourcing
Yoshinao Sato
|
Kouki Miyazawa
Proceedings of the Twelfth Language Resources and Evaluation Conference
The quality estimation of artifacts generated by creators via crowdsourcing has great significance for the construction of a large-scale data resource. A common approach to this problem is to ask multiple reviewers to evaluate the same artifacts. However, the commonly used majority voting method to aggregate reviewers’ evaluations does not work effectively for partially subjective or purely subjective tasks because reviewers’ sensitivity and bias of evaluation tend to have a wide variety. To overcome this difficulty, we propose a probabilistic model for subjective classification tasks that incorporates the qualities of artifacts as well as the abilities and biases of creators and reviewers as latent variables to be jointly inferred. We applied this method to the partially subjective task of speech classification into the following four attitudes: agreement, disagreement, stalling, and question. The result shows that the proposed method estimates the quality of speech more effectively than a vote aggregation, measured by correlation with a fine-grained classification by experts.