Norihide Kitaoka


pdf bib
Improving Speech Recognition for the Elderly: A New Corpus of Elderly Japanese Speech and Investigation of Acoustic Modeling for Speech Recognition
Meiko Fukuda | Hiromitsu Nishizaki | Yurie Iribe | Ryota Nishimura | Norihide Kitaoka
Proceedings of the 12th Language Resources and Evaluation Conference

In an aging society like Japan, a highly accurate speech recognition system is needed for use in electronic devices for the elderly, but this level of accuracy cannot be obtained using conventional speech recognition systems due to the unique features of the speech of elderly people. S-JNAS, a corpus of elderly Japanese speech, is widely used for acoustic modeling in Japan, but the average age of its speakers is 67.6 years old. Since average life expectancy in Japan is now 84.2 years, we are constructing a new speech corpus, which currently consists of the utterances of 221 speakers with an average age of 79.2, collected from four regions of Japan. In addition, we expand on our previous study (Fukuda, 2019) by further investigating the construction of acoustic models suitable for elderly speech. We create new acoustic models and train them using a combination of existing Japanese speech corpora (JNAS, S-JNAS, CSJ), with and without our ‘super-elderly’ speech data, and conduct speech recognition experiments. Our new acoustic models achieve word error rates (WER) as low as 13.38%, exceeding the results of our previous study in which we used the CSJ acoustic model adapted for elderly speech (17.4% WER).


pdf bib
Speech Corpus Spoken by Young-old, Old-old and Oldest-old Japanese
Yurie Iribe | Norihide Kitaoka | Shuhei Segawa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We have constructed a new speech data corpus, using the utterances of 100 elderly Japanese people, to improve speech recognition accuracy of the speech of older people. Humanoid robots are being developed for use in elder care nursing homes. Interaction with such robots is expected to help maintain the cognitive abilities of nursing home residents, as well as providing them with companionship. In order for these robots to interact with elderly people through spoken dialogue, a high performance speech recognition system for speech of elderly people is needed. To develop such a system, we recorded speech uttered by 100 elderly Japanese, most of them are living in nursing homes, with an average age of 77.2. Previously, a seniors’ speech corpus named S-JNAS was developed, but the average age of the participants was 67.6 years, but the target age for nursing home care is around 75 years old, much higher than that of the S-JNAS samples. In this paper we compare our new corpus with an existing Japanese read speech corpus, JNAS, which consists of adult speech, and with the above mentioned S-JNAS, the senior version of JNAS.


pdf bib
Causal analysis of task completion errors in spoken music retrieval interactions
Sunao Hara | Norihide Kitaoka | Kazuya Takeda
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we analyze the causes of task completion errors in spoken dialog systems, using a decision tree with N-gram features of the dialog to detect task-incomplete dialogs. The dialog for a music retrieval task is described by a sequence of tags related to user and system utterances and behaviors. The dialogs are manually classified into two classes: completed and uncompleted music retrieval tasks. Differences in tag classification performance between the two classes are discussed. We then construct decision trees which can detect if a dialog finished with the task completed or not, using information gain criterion. Decision trees using N-grams of manual tags and automatic tags achieved 74.2% and 80.4% classification accuracy, respectively, while the tree using interaction parameters achieved an accuracy rate of 65.7%. We also discuss more details of the causality of task incompletion for spoken dialog systems using such trees.


pdf bib
Estimation Method of User Satisfaction Using N-gram-based Dialog History Model for Spoken Dialog System
Sunao Hara | Norihide Kitaoka | Kazuya Takeda
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we propose an estimation method of user satisfaction for a spoken dialog system using an N-gram-based dialog history model. We have collected a large amount of spoken dialog data accompanied by usability evaluation scores by users in real environments. The database is made by a field-test in which naive users used a client-server music retrieval system with a spoken dialog interface on their own PCs. An N-gram model is trained from the sequences that consist of users' dialog acts and/or the system's dialog acts for each one of six user satisfaction levels: from 1 to 5 and φ (task not completed). Then, the satisfaction level is estimated based on the N-gram likelihood. Experiments were conducted on the large real data and the results show that our proposed method achieved good classification performance; the classification accuracy was 94.7% in the experiment on a classification into dialogs with task completion and those without task completion. Even if the classifier detected all of the task incomplete dialog correctly, our proposed method achieved the false detection rate of only 6%.


pdf bib
Evaluation Framework for Distant-talking Speech Recognition under Reverberant Environments: newest Part of the CENSREC Series -
Takanobu Nishiura | Masato Nakayama | Yuki Denda | Norihide Kitaoka | Kazumasa Yamamoto | Takeshi Yamada | Satoru Tsuge | Chiyomi Miyajima | Masakiyo Fujimoto | Tetsuya Takiguchi | Satoshi Tamura | Shingo Kuroiwa | Kazuya Takeda | Satoshi Nakamura
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Recently, speech recognition performance has been drastically improved by statistical methods and huge speech databases. Now performance improvement under such realistic environments as noisy conditions is being focused on. Since October 2001, we from the working group of the Information Processing Society in Japan have been working on evaluation methodologies and frameworks for Japanese noisy speech recognition. We have released frameworks including databases and evaluation tools called CENSREC-1 (Corpus and Environment for Noisy Speech RECognition 1; formerly AURORA-2J), CENSREC-2 (in-car connected digits recognition), CENSREC-3 (in-car isolated word recognition), and CENSREC-1-C (voice activity detection under noisy conditions). In this paper, we newly introduce a collection of databases and evaluation tools named CENSREC-4, which is an evaluation framework for distant-talking speech under hands-free conditions. Distant-talking speech recognition is crucial for a hands-free speech interface. Therefore, we measured room impulse responses to investigate reverberant speech recognition. The results of evaluation experiments proved that CENSREC-4 is an effective database suitable for evaluating the new dereverberation method because the traditional dereverberation process had difficulty sufficiently improving the recognition performance. The framework was released in March 2008, and many studies are being conducted with it in Japan.

pdf bib
In-car Speech Data Collection along with Various Multimodal Signals
Akira Ozaki | Sunao Hara | Takashi Kusakawa | Chiyomi Miyajima | Takanori Nishino | Norihide Kitaoka | Katunobu Itou | Kazuya Takeda
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, a large-scale real-world speech database is introduced along with other multimedia driving data. We designed a data collection vehicle equipped with various sensors to synchronously record twelve-channel speech, three-channel video, driving behavior including gas and brake pedal pressures, steering angles, and vehicle velocities, physiological signals including driver heart rate, skin conductance, and emotion-based sweating on the palms and soles, etc. These multimodal data are collected while driving on city streets and expressways under four different driving task conditions including two kinds of monologues, human-human dialog, and human-machine dialog. We investigated the response timing of drivers against navigator utterances and found that most overlapped with the preceding utterance due to the task characteristics and the features of Japanese. When comparing utterance length, speaking rate, and the filler rate of driver utterances in human-human and human-machine dialogs, we found that drivers tended to use longer and faster utterances with more fillers to talk with humans than machines.