Shinnosuke Takamichi
2026
Effects of Dialogue Corpora Properties on Fine-Tuning a Moshi-Based Spoken Dialogue Model
Yuto Abe | Mao Saeki | Atsumoto Ohashi | Shinnosuke Takamichi | Shinya Fujie | Tetsunori Kobayashi | Tetsuji Ogawa | Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology
This study investigates how interactional characteristics of spoken dialogue corpora influence the learning process and resulting behavior of speech language models for full-duplex dialogue systems. While previous research has mainly focused on improving acoustic and linguistic quality, an effective dialogue system must also capture and reproduce task-dependent interactional dynamics such as conversational tempo and turn-taking patterns. To analyze these properties, we evaluated multiple dialogue corpora using NISQA for speech quality, LLM-as-a-Judge for linguistic and semantic appropriateness, and four timing-based indicators: inter-pausal units, pause, gap, and overlap. A curriculum learning strategy was applied to fine-tune a Moshi-based full-duplex dialogue model by incrementally combining corpora with different interactional characteristics. Experimental results on a dialogue continuation task showed that corpus-specific interactional patterns effectively shape model behavior. Chat-style corpora facilitated natural rhythms with moderate overlaps and gaps, whereas consultation-style corpora promoted more stable and deliberate timing. Fine-tuning with high-quality audio improved speech quality, while using task-mismatched data degraded linguistic coherence.
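The four timing-based indicators can be derived from each speaker's speech/silence segmentation. A minimal sketch, not from the paper (the interval format, silence handling, and function names are assumptions):

```python
# Illustrative sketch: deriving timing indicators from per-speaker
# speech intervals (inter-pausal units), each a (start, end) pair in
# seconds. Thresholds and smoothing from the actual pipeline are omitted.

def overlap_total(a, b):
    """Total duration in which intervals of speakers a and b intersect."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

def pauses_and_gaps(turns):
    """turns: chronologically ordered (speaker, start, end) tuples.
    A silence between consecutive turns counts as a 'pause' when the
    same speaker continues and as a 'gap' when the floor changes."""
    pauses, gaps = [], []
    for (spk1, _, e1), (spk2, s2, _) in zip(turns, turns[1:]):
        silence = s2 - e1
        if silence <= 0:
            continue  # abutting or overlapping speech: no silence here
        (pauses if spk1 == spk2 else gaps).append(silence)
    return pauses, gaps
```

Comparing these distributions across corpora is one way to surface the chat-style versus consultation-style timing differences the abstract describes.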
2025
VitaEval: Open-source Human Evaluation Tool for Video-to-Text and Video-to-Audio Systems
Goran Topić | Yuki Saito | Katsuhito Sudoh | Shinnosuke Takamichi | Hiroya Takamura | Graham Neubig | Tatsuya Ishigaki
Proceedings of the 18th International Natural Language Generation Conference: System Demonstrations
Analysis of the Correlation Between Theory of Mind and Dialogue Ability to Identify Essential ToM for Dialogue Systems
Haruhisa Iseno | Atsumoto Ohashi | Tetsuji Ogawa | Shinnosuke Takamichi | Ryuichiro Higashinaka
Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation
2022
Personalized Filled-pause Generation with Group-wise Prediction Models
Yuta Matsunaga | Takaaki Saeki | Shinnosuke Takamichi | Hiroshi Saruwatari
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In this paper, we propose a method to generate personalized filled pauses (FPs) with group-wise prediction models. Compared with fluent text generation, disfluent text generation has not been widely explored; to generate more human-like texts, we address disfluent text generation here. The usage of disfluencies such as FPs, rephrases, and word fragments differs from speaker to speaker, so personalized FP generation is required. However, FPs are difficult to predict because of their positional sparsity and the frequency imbalance between commonly and rarely used FPs. Moreover, adapting an FP prediction model to each speaker is sometimes difficult because usage tendencies vary widely even within a single speaker. To address these issues, we propose building group-dependent prediction models by grouping speakers on the basis of their tendency to use FPs. This method does not require a large amount of data or training time for each speaker's model. We further introduce a loss function and a word embedding model suited to FP prediction. Our experimental results demonstrate that the group-dependent models predict FPs with higher scores than a non-personalized model, and that the introduced loss function and word embedding model improve prediction performance.
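The grouping step can be pictured as clustering speakers by how they use FPs. An illustrative sketch under assumptions (the FP inventory, tokenization, and the crude rate-based grouping are placeholders, not the authors' method):

```python
# Illustrative sketch: build a per-speaker filled-pause usage profile,
# then split speakers into groups so one prediction model can be
# trained per group.
from collections import Counter

def fp_profile(utterances, fp_inventory=("uh", "um")):
    """Return each filled pause's per-token rate for one speaker."""
    counts = Counter()
    n_tokens = 0
    for utt in utterances:
        for tok in utt.lower().split():
            n_tokens += 1
            if tok in fp_inventory:
                counts[tok] += 1
    return {fp: counts[fp] / max(n_tokens, 1) for fp in fp_inventory}

def group_speakers(profiles, n_groups=2):
    """Rank speakers by overall FP rate and split into equal-size
    groups -- a crude stand-in for tendency-based grouping."""
    ranked = sorted(profiles, key=lambda spk: sum(profiles[spk].values()))
    size = -(-len(ranked) // n_groups)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]
```

Each resulting group would then get its own FP prediction model, trained on the pooled data of its members.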
2020
DNN-based Speech Synthesis Using Abundant Tags of Spontaneous Speech Corpus
Yuki Yamashita | Tomoki Koriyama | Yuki Saito | Shinnosuke Takamichi | Yusuke Ijima | Ryo Masumura | Hiroshi Saruwatari
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we investigate the effectiveness of using rich annotations in deep neural network (DNN)-based statistical speech synthesis. DNN-based frameworks typically use linguistic information, called context, as input features instead of using text directly. In such frameworks, we can synthesize not only reading-style speech but also speech with paralinguistic and nonlinguistic features by adding such information to the context. However, it is not clear what kind of information is crucial for reproducing paralinguistic and nonlinguistic features. We therefore investigate the effectiveness of rich tags in DNN-based speech synthesis using the Corpus of Spontaneous Japanese (CSJ), which has a large amount of annotation on paralinguistic features such as prosody, disfluency, and morphological features. Experimental evaluation results show that adding such information to the context enhanced the reproducibility of the paralinguistic features of the synthetic speech.
SMASH Corpus: A Spontaneous Speech Corpus Recording Third-person Audio Commentaries on Gameplay
Yuki Saito | Shinnosuke Takamichi | Hiroshi Saruwatari
Proceedings of the Twelfth Language Resources and Evaluation Conference
Developing a spontaneous speech corpus is beneficial for spoken language processing and understanding. We present the SMASH corpus, which contains the spontaneous speech of two Japanese male commentators who gave third-person audio commentaries during gameplay of a fighting game. Each commentator ad-libbed while watching the gameplay, covering topics ranging from explanations of each moment, which convey information about the fight, to comments meant to entertain listeners. We annotated the recorded commentaries with transcriptions and topic tags using a two-step method: we first made automatic and manual transcriptions of the commentaries and then manually annotated the topic tags. This paper describes how we constructed the SMASH corpus and reports some results of the annotations.
2018
CPJD Corpus: Crowdsourced Parallel Speech Corpus of Japanese Dialects
Shinnosuke Takamichi | Hiroshi Saruwatari
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2012
A method for translation of paralinguistic information
Takatomo Kano | Sakriani Sakti | Shinnosuke Takamichi | Graham Neubig | Tomoki Toda | Satoshi Nakamura
Proceedings of the 9th International Workshop on Spoken Language Translation: Papers
This paper is concerned with speech-to-speech translation that is sensitive to paralinguistic information. Of the many paralinguistic features that could be handled, we chose duration and power as a first step, proposing a method that translates these features from the input speech to the output speech in continuous space. This is done in a simple, language-independent fashion by training a regression model that maps source-language duration and power information to the target language. We evaluate the proposed method on a digit translation task and show that paralinguistic information in the input speech appears in the output speech, and that target-language listeners can use this information to detect emphasis.
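The core of such an approach can be sketched as an ordinary least-squares regression from source to target paralinguistic features. A minimal sketch under assumptions (the linear model, feature layout, and function names are illustrative, not the paper's exact formulation):

```python
# Illustrative sketch: fit a linear map from source-language
# duration/power feature vectors (one row per utterance) to the
# corresponding target-language feature vectors, with a bias term.
import numpy as np

def fit_paraling_map(X_src, Y_tgt):
    """Least-squares W such that [x, 1] @ W approximates the target."""
    X = np.hstack([X_src, np.ones((X_src.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, Y_tgt, rcond=None)
    return W

def apply_paraling_map(W, x_src):
    """Map one source utterance's features into the target space."""
    return np.append(x_src, 1.0) @ W
```

At synthesis time, the predicted target-side duration and power values would then condition the output speech so that source-side emphasis carries over.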
Co-authors
- Hiroshi Saruwatari 4
- Yuki Saito 3
- Ryuichiro Higashinaka 2
- Graham Neubig 2
- Tetsuji Ogawa 2
- Atsumoto Ohashi 2
- Yuto Abe 1
- Shinya Fujie 1
- Yusuke Ijima 1
- Haruhisa Iseno 1
- Tatsuya Ishigaki 1
- Takatomo Kano 1
- Tetsunori Kobayashi 1
- Tomoki Koriyama 1
- Ryo Masumura 1
- Yuta Matsunaga 1
- Satoshi Nakamura 1
- Takaaki Saeki 1
- Mao Saeki 1
- Sakriani Sakti 1
- Katsuhito Sudoh 1
- Hiroya Takamura 1
- Tomoki Toda 1
- Goran Topić 1
- Yuki Yamashita 1