Tatsuya Kawahara

2026

Estimating Relationships between Participants in Multi-Party Chat Corpus
Akane Fukushige | Koji Inoue | Keiko Ochi | Tatsuya Kawahara | Sanae Yamashita | Ryuichiro Higashinaka
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology

While most existing dialogue studies focus on dyadic (one-on-one) interactions, research on multi-party dialogues has gained increasing importance. One key challenge in multi-party dialogues is identifying and interpreting the relationships between participants. This study focuses on a multi-party chat corpus and aims to estimate participant pairs with specific relationships, such as family and acquaintances. We evaluated the performance of large language models (LLMs) in estimating these relationships, comparing them with a logistic regression model that uses interpretable textual features, including the number of turns and the frequency of honorific expressions. The results show that even advanced LLMs struggle with social relationship estimation, performing worse than a simple heuristic-based approach. This finding highlights the need for further improvement in enabling LLMs to naturally capture social relationships in multi-party dialogues.

pdf bib abs

We present a multilingual, continuous backchannel prediction model for Japanese, English, and Chinese, and use it to investigate cross-linguistic timing behavior. The model is Transformer-based and operates at the frame level, jointly trained with auxiliary tasks on approximately 300 hours of dyadic conversations. Across all three languages, the multilingual model matches or surpasses monolingual baselines, indicating that it learns both language-universal cues and language-specific timing patterns. Zero-shot transfer with two-language training remains limited, underscoring substantive cross-lingual differences. Perturbation analyses reveal distinct cue usage: Japanese relies more on short-term linguistic information, whereas English and Chinese are more sensitive to silence duration and prosodic variation; multilingual training encourages shared yet adaptable representations and reduces overreliance on pitch in Chinese. A context-length study further shows that Japanese is relatively robust to shorter contexts, while Chinese benefits markedly from longer contexts. Finally, we integrate the trained model into real-time processing software, demonstrating CPU-only inference. Together, these findings provide a unified model and empirical evidence for how backchannel timing differs across languages, informing the design of more natural, culturally aware spoken dialogue systems.

Analysing Next Speaker Prediction in Multi-Party Conversation Using Multimodal Large Language Models
Taiga Mori | Koji Inoue | Divesh Lala | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 16th International Workshop on Spoken Dialogue System Technology

This study analyses how state-of-the-art multimodal large language models (MLLMs) can predict the next speaker in multi-party conversations. Through experimental and qualitative analyses, we found that MLLMs are able to infer a plausible next speaker based solely on linguistic context and their internalized knowledge. However, even in cases where the next speaker is not uniquely determined, MLLMs exhibit a bias toward overpredicting a single participant as the next speaker. We further showed that this bias can be mitigated by explicitly providing knowledge of turn-taking rules. In addition, we observed that visual input can sometimes contribute to more accurate predictions, while in other cases it leads to erroneous judgments. Overall, however, no clear effect of visual input was observed.

2025

Human-Like Embodied AI Interviewer: Employing Android ERICA in Real International Conference
Zi Haur Pang | Yahui Fu | Divesh Lala | Mikey Elmers | Koji Inoue | Tatsuya Kawahara
Proceedings of the 31st International Conference on Computational Linguistics: System Demonstrations

This paper introduces a human-like embodied AI interviewer that integrates android robots equipped with advanced conversational capabilities, including attentive listening, conversational repairs, and user fluency adaptation. Moreover, it can analyze and present results post-interview. We conducted a real-world case study at SIGDIAL 2024 with 42 participants, of whom 69% reported positive experiences. This study demonstrated the system’s effectiveness in conducting interviews just like a human and marked the first employment of such a system at an international conference. The demonstration video is available at https://youtu.be/jCuw9g99KuE.

pdf bib abs

Turn-taking prediction models are essential components in spoken dialogue systems and conversational robots. Recent approaches leverage transformer-based architectures to predict speech activity continuously and in real-time. In this study, we propose a novel model that enables turn-taking prediction to be dynamically controlled via textual prompts. This approach allows intuitive and explicit control through instructions such as “faster” or “calmer,” adapting dynamically to conversational partners and contexts. The proposed model builds upon a transformer-based voice activity projection (VAP) model, incorporating textual prompt embeddings into both channel-wise transformers and a cross-channel transformer. We evaluated the feasibility of our approach using over 950 hours of human-human spoken dialogue data. Since textual prompt data for the proposed approach was not available in existing datasets, we utilized a large language model (LLM) to generate synthetic prompt sentences. Experimental results demonstrated that the proposed model improved prediction accuracy and effectively varied turn-taking timing behaviors according to the textual prompts.

pdf bib abs

Implementation of spoken dialogue systems can be time-consuming, in particular for people who are not familiar with managing dialogue states and turn-taking in real-time. A GUI-based system where the user can quickly understand the dialogue flow allows rapid prototyping of experimental and real-world systems. In this demonstration we present ScriptBoard, a tool for creating dialogue scenarios which is independent of any specific robot platform. ScriptBoard has been designed with multi-party scenarios in mind and makes use of large language models to both generate dialogue and make decisions about the dialogue flow. This program promotes both flexibility and reproducibility in spoken dialogue research and provides everyone the opportunity to design and test their own dialogue scenarios.

Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection
Koji Inoue | Divesh Lala | Gabriel Skantze | Tatsuya Kawahara
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In human conversations, short backchannel utterances such as “yeah” and “oh” play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.

Minority-Aware Satisfaction Estimation in Dialogue Systems via Preference-Adaptive Reinforcement Learning
Yahui Fu | Zi Haur Pang | Tatsuya Kawahara
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

User satisfaction in dialogue systems is inherently subjective. When the same response strategy is applied across users, minority users may assign different satisfaction ratings than majority users due to variations in individual intents and preferences. However, existing alignment methods typically train one-size-fits-all models that aim for broad consensus, often overlooking minority perspectives and user-specific adaptation. We propose a unified framework that models both individual- and group-level preferences for user satisfaction estimation. First, we introduce Chain-of-Personalized-Reasoning (CoPeR) to capture individual preferences through interpretable reasoning chains. Second, we propose an expectation-maximization-based Majority-Minority Preference-Aware Clustering (M²PC) algorithm that discovers distinct user groups in an unsupervised manner to learn group-level preferences. Finally, we integrate these components into a preference-adaptive reinforcement learning framework (PAda-PPO) that jointly optimizes alignment with both individual and group preferences. Experiments on the Emotional Support Conversation dataset demonstrate consistent improvements in user satisfaction estimation, particularly for underrepresented user groups.

Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation
Koji Inoue | Mikey Elmers | Divesh Lala | Tatsuya Kawahara
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

Laughter serves as a multifaceted communicative signal in human interaction, yet its identification within dialogue presents a significant challenge for conversational AI systems. This study addresses this challenge by annotating laughable contexts in Japanese spontaneous text conversation data and developing a taxonomy to classify the underlying reasons for such contexts. Initially, multiple annotators manually labeled laughable contexts using a binary decision (laughable or non-laughable). Subsequently, an LLM was used to generate explanations for the binary annotations of laughable contexts, which were then categorized into a taxonomy comprising ten categories, including “Empathy and Affinity” and “Humor and Surprise,” highlighting the diverse range of laughter-inducing scenarios. The study also evaluated GPT-4o’s performance in recognizing the majority labels of laughable contexts, achieving an F1 score of 43.14%. These findings contribute to the advancement of conversational AI by establishing a foundation for more nuanced recognition and generation of laughter, ultimately fostering more natural and engaging human-AI interactions.

An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
Koji Inoue | Divesh Lala | Mikey Elmers | Keiko Ochi | Tatsuya Kawahara
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task’s complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.

Enhancing Long-term RAG Chatbots with Psychological Models of Memory Importance and Forgetting
Ryuichi Sumida | Koji Inoue | Tatsuya Kawahara
Dialogue Discourse Volume 16

This study addresses the issue of what a Retrieval-Augmented Generation (RAG) chatbot should remember and what it should forget, based on findings from psychology. RAG retrieves relevant memories from past interactions to generate responses, and its effectiveness has been demonstrated. As conversations continue, however, the amount of stored memory keeps growing, which not only requires large storage capacity but also risks retaining unnecessary information, potentially reducing retrieval efficiency. To tackle this problem, we propose LUFY (Long-term Understanding and identiFYing key exchanges), a RAG chatbot that evaluates six distinct memory-related metrics derived from psychological models and real-world data. Instead of simply summing these metrics, it uses learned weights to account for the importance of each one. By using these weighted scores, the system can prioritize and retain relevant memories while gradually forgetting less important ones during both retrieval and memory management. To evaluate the effectiveness of LUFY in long-term conversations, we conducted experiments with human participants, who engaged in text-based conversations with three types of chatbots, each using different forgetting mechanisms, for at least two hours. The length of these conversations was more than 4.5 times longer than the longest conversations reported in previous studies. The results showed that prioritizing emotionally engaging memories while forgetting most of the conversation significantly enhanced user satisfaction.
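
The weighted-scoring idea in the abstract above can be sketched roughly as follows. This is only an illustration and not the LUFY implementation: the six metric values, the weights, the `Memory` class, and the `retain` helper are all hypothetical. It shows how learned weights, as opposed to a plain sum, can change which memory is kept.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Memory:
    text: str
    metrics: List[float]  # six hypothetical metric values, each scaled to [0, 1]


def weighted_score(memory: Memory, weights: List[float]) -> float:
    """Weighted sum of the metrics; the weights stand in for learned values."""
    return sum(w * m for w, m in zip(weights, memory.metrics))


def retain(memories: List[Memory], weights: List[float], keep: int) -> List[Memory]:
    """Keep the `keep` highest-scoring memories and forget the rest."""
    ranked = sorted(memories, key=lambda m: weighted_score(m, weights), reverse=True)
    return ranked[:keep]


# Hypothetical weights: the first metric is assumed to matter most.
WEIGHTS = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

memories = [
    Memory("talked about the weather", [0.1, 0.9, 0.8, 0.7, 0.6, 0.5]),
    Memory("shared news about a new job", [0.9, 0.2, 0.4, 0.3, 0.5, 0.1]),
    Memory("mentioned lunch", [0.2, 0.1, 0.1, 0.2, 0.1, 0.1]),
]

kept = retain(memories, WEIGHTS, keep=1)
print([m.text for m in kept])
```

In this toy data the weather memory has the largest unweighted sum, but the weighted score ranks the new-job memory first, so only that memory is retained.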

2024

Multilingual Turn-taking Prediction Using Voice Activity Projection
Koji Inoue | Bing’er Jiang | Erik Ekstedt | Tatsuya Kawahara | Gabriel Skantze
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, on multilingual data, encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two different audio encoders, contrastive predictive coding (CPC) pre-trained on English, with a recent model based on multilingual wav2vec 2.0 (MMS).

Video Retrieval System Using Automatic Speech Recognition for the Japanese Diet
Mikitaka Masuyama | Tatsuya Kawahara | Kenjiro Matsuda
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

The Japanese House of Representatives, one of the two houses of the Diet, has adopted an Automatic Speech Recognition (ASR) system, which directly transcribes parliamentary speech with an accuracy of 95 percent. The ASR system also provides a timestamp for every word, which enables retrieval of the video segments of the Parliamentary meetings. The video retrieval system we have developed allows one to pinpoint and play the parliamentary video clips corresponding to the meeting minutes by keyword search. In this paper, we provide its overview and suggest various ways we can utilize the system. The system is currently being extended to cover meetings of local governments, which will allow us to investigate dialectal linguistic variations.

Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Tatsuya Kawahara | Vera Demberg | Stefan Ultes | Koji Inoue | Shikib Mehri | David Howcroft | Kazunori Komatani
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

StyEmp: Stylizing Empathetic Response Generation via Multi-Grained Prefix Encoder and Personality Reinforcement
Yahui Fu | Chenhui Chu | Tatsuya Kawahara
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Recent approaches for empathetic response generation mainly focus on emotional resonance and user understanding, without considering the system’s personality. Consistent personality is evident in real human expression and is important for creating trustworthy systems. To address this problem, we propose StyEmp, which aims to stylize the empathetic response generation with a consistent personality. Specifically, it incorporates a multi-grained prefix mechanism designed to capture the intricate relationship between a system’s personality and its empathetic expressions. Furthermore, we introduce a personality reinforcement module that leverages contrastive learning to calibrate the generation model, ensuring that responses are both empathetic and reflective of a distinct personality. Automatic and human evaluations on the EMPATHETICDIALOGUES benchmark show that StyEmp outperforms competitive baselines in terms of both empathy and personality expressions. Our code is available at https://github.com/fuyahuii/StyEmp.

Quantitative Analysis of Editing in Transcription Process in Japanese and European Parliaments and its Diachronic Changes
Tatsuya Kawahara
Proceedings of the IV Workshop on Creating, Analysing, and Increasing Accessibility of Parliamentary Corpora (ParlaCLARIN) @ LREC-COLING 2024

In making official transcripts for meeting records in Parliament, some edits are made from faithful transcripts of utterances for linguistic correction and formality. Classification of these edits is provided in this paper, and quantitative analysis is conducted for Japanese and European Parliamentary meetings by comparing the faithful transcripts of audio recordings against the official meeting records. Different trends are observed between the two Parliaments due to the nature of the language used and the meeting style. Moreover, diachronic changes in the Japanese transcripts are presented, showing a significant decrease in the edits over the past decades. It was found that a majority of edits in the Japanese Parliament (Diet) simply remove fillers and redundant words, keeping the transcripts as verbatim as possible. This property is useful for the evaluation of the automatic speech transcription system, which was developed by us and has been used in the Japanese Parliament.

2023

RealPersonaChat: A Realistic Persona Chat Corpus with Interlocutors’ Own Personalities
Sanae Yamashita | Koji Inoue | Ao Guo | Shota Mochizuki | Tatsuya Kawahara | Ryuichiro Higashinaka
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

Reasoning before Responding: Integrating Commonsense-based Causality Explanation for Empathetic Response Generation
Yahui Fu | Koji Inoue | Chenhui Chu | Tatsuya Kawahara
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Recent approaches to empathetic response generation try to incorporate commonsense knowledge or reasoning about the causes of emotions to better understand the user’s experiences and feelings. However, these approaches mainly focus on understanding the causalities of context from the user’s perspective, ignoring the system’s perspective. In this paper, we propose a commonsense-based causality explanation approach for diverse empathetic response generation that considers both the user’s perspective (user’s desires and reactions) and the system’s perspective (system’s intentions and reactions). We enhance ChatGPT’s ability to reason for the system’s perspective by integrating in-context learning with commonsense knowledge. Then, we integrate the commonsense-based causality explanation with both ChatGPT and a T5-based model. Experimental evaluations demonstrate that our method outperforms other comparable methods on both automatic and human evaluations.

2022

Simultaneous Job Interview System Using Multiple Semi-autonomous Agents
Haruki Kawai | Yusuke Muraki | Kenta Yamamoto | Divesh Lala | Koji Inoue | Tatsuya Kawahara
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

In recent years, spoken dialogue systems have been applied to job interviews where an applicant talks to a system that asks pre-defined questions, called on-demand and self-paced job interviews. We propose a simultaneous job interview system, where one interviewer can conduct one-on-one interviews with multiple applicants simultaneously by cooperating with the multiple autonomous job interview dialogue systems. However, it is challenging for interviewers to monitor and understand all the parallel interviews done by the autonomous system at the same time. As a solution to this issue, we implemented two automatic dialogue understanding functions: (1) response evaluation of each applicant’s responses and (2) keyword extraction as a summary of the responses. It is expected that interviewers, as needed, can intervene in one dialogue and smoothly ask a proper question that elaborates the interview. We report a pilot experiment where an interviewer conducted simultaneous job interviews with three candidates.

2021

Multi-Referenced Training for Dialogue Response Generation
Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

In open-domain dialogue response generation, a dialogue context can be continued with diverse responses, and the dialogue models should capture such one-to-many relations. In this work, we first analyze the training objective of dialogue models from the view of Kullback-Leibler divergence (KLD) and show that the gap between the real-world probability distribution and the single-referenced data’s probability distribution prevents the model from learning the one-to-many relations efficiently. Then we explore approaches to multi-referenced training in two aspects. Data-wise, we generate diverse pseudo references from a powerful pretrained model to build multi-referenced data that provides a better approximation of the real-world distribution. Model-wise, we propose to equip variational models with an expressive prior, named linear Gaussian model (LGM). Experimental results of automated evaluation and human evaluation show that the methods yield significant improvements over baselines.
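
As a rough numerical illustration of the KLD argument in the abstract above (not the authors' analysis), the sketch below compares a made-up "real-world" distribution over four candidate responses with the empirical distribution implied by one reference and by several references. The response labels, probabilities, and reference lists are hypothetical. Add-alpha smoothing is an assumption introduced here only to keep the divergence finite.

```python
import math
from collections import Counter


def empirical(refs, vocab, alpha=0.1):
    """Add-alpha smoothed empirical distribution of the references over vocab."""
    counts = Counter(refs)
    total = len(refs) + alpha * len(vocab)
    return {r: (counts[r] + alpha) / total for r in vocab}


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same keys."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p if p[k] > 0)


# Hypothetical real-world distribution over four valid responses to one context.
real = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
vocab = list(real)

# One reference per context versus several references per context.
kl_single = kl(real, empirical(["a"], vocab))
kl_multi = kl(real, empirical(["a", "a", "b", "c"], vocab))

print(f"single-referenced: {kl_single:.3f}")
print(f"multi-referenced:  {kl_multi:.3f}")
```

With these toy numbers the multi-referenced data lies closer to the real-world distribution, which is the intuition behind building multi-referenced training data.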

Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation
Hirofumi Inaguma | Tatsuya Kawahara | Shinji Watanabe
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A conventional approach to improving the performance of end-to-end speech translation (E2E-ST) models is to leverage the source transcription via pre-training and joint training with automatic speech recognition (ASR) and neural machine translation (NMT) tasks. However, since the input modalities are different, it is difficult to leverage source language text successfully. In this work, we focus on sequence-level knowledge distillation (SeqKD) from external text-based NMT models. To leverage the full potential of the source language information, we propose backward SeqKD, SeqKD from a target-to-source backward NMT model. To this end, we train a bilingual E2E-ST model to predict paraphrased transcriptions as an auxiliary task with a single decoder. The paraphrases are generated from the translations in bitext via back-translation. We further propose bidirectional SeqKD in which SeqKD from both forward and backward NMT models is combined. Experimental evaluations on both autoregressive and non-autoregressive models show that SeqKD in each direction consistently improves the translation performance, and the effectiveness is complementary regardless of the model capacity.

ERICA: An Empathetic Android Companion for Covid-19 Quarantine
Etsuko Ishii | Genta Indra Winata | Samuel Cahyawijaya | Divesh Lala | Tatsuya Kawahara | Pascale Fung
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Over the past year, research in various domains, including Natural Language Processing (NLP), has been accelerated to fight against the COVID-19 pandemic, yet such research has just started on dialogue systems. In this paper, we introduce an end-to-end dialogue system which aims to ease the isolation of people under self-quarantine. We conduct a control simulation experiment to assess the effects of the user interface: a web-based virtual agent, Nora vs. the android ERICA via a video call. The experimental results show that the android can offer a more valuable user experience by giving the impression of being more empathetic and engaging in the conversation due to its nonverbal information, such as facial expressions and body gestures.

A multi-party attentive listening robot which stimulates involvement from side participants
Koji Inoue | Hiromi Sakamoto | Kenta Yamamoto | Divesh Lala | Tatsuya Kawahara
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

We demonstrate the moderating abilities of a multi-party attentive listening robot system when multiple people are speaking in turns. Our conventional one-on-one attentive listening system generates listener responses such as backchannels, repeats, elaborating questions, and assessments. In this paper, additional robot responses that stimulate a listening user (side participant) to become more involved in the dialogue are proposed. The additional responses elicit assessments and questions from the side participant, making the dialogue more empathetic and lively.

2020

An Attentive Listening System with Android ERICA: Comparison of Autonomous and WOZ Interactions
Koji Inoue | Divesh Lala | Kenta Yamamoto | Shizuka Nakamura | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We describe an attentive listening system for the autonomous android robot ERICA. The proposed system generates several types of listener responses: backchannels, repeats, elaborating questions, assessments, generic sentimental responses, and generic responses. In this paper, we report a subjective experiment with 20 elderly people. First, we evaluated each system utterance, excluding backchannels and generic responses, in an offline manner. It was found that most of the system utterances were linguistically appropriate, and they elicited positive reactions from the subjects. Furthermore, 58.2% of the responses were acknowledged as being appropriate listener responses. We also compared the proposed system with a WOZ system where a human operator was operating the robot. From the subjective evaluation, the proposed system achieved comparable scores in basic skills of attentive listening such as encouraging the user to talk, focusing on the talk, and actively listening. It was also found that there is still a gap between the system and the WOZ for more sophisticated skills such as dialogue understanding, showing interest, and empathy towards the user.

Topic-relevant Response Generation using Optimal Transport for an Open-domain Dialog System
Shuying Zhang | Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 28th International Conference on Computational Linguistics

Conventional neural generative models tend to generate safe and generic responses which have little semantic connection with previous utterances and would disengage users in a dialog system. To generate relevant responses, we propose a method that employs two types of constraints: a topical constraint and a semantic constraint. Under the hypothesis that a response and its context have higher relevance when they share the same topics, the topical constraint encourages the topics of a response to match its context by conditioning response decoding on topic words’ embeddings. The semantic constraint encourages a response to be semantically related to its context by regularizing the decoding objective function with semantic distance. Optimal transport is applied to compute a weighted semantic distance between the representation of a response and the context. Generated responses are evaluated by automatic metrics, as well as human judgment, showing that the proposed method can generate more topic-relevant and content-rich responses than conventional models.

Designing Precise and Robust Dialogue Response Evaluators
Tianyu Zhao | Divesh Lala | Tatsuya Kawahara
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Automatic dialogue response evaluators have been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (> 0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data at https://github.com/ZHAOTING/dialog-processing.

Speech Corpus of Ainu Folklore and End-to-end Speech Recognition for Ainu Language
Kohei Matsuura | Sei Ueno | Masato Mimura | Shinsuke Sakai | Tatsuya Kawahara
Proceedings of the Twelfth Language Resources and Evaluation Conference

Ainu is an unwritten language that has been spoken by the Ainu people, one of the ethnic groups in Japan. It is recognized as critically endangered by UNESCO, and archiving and documentation of its language heritage is of paramount importance. Although a considerable amount of voice recordings of Ainu folklore has been produced and accumulated to save their culture, only a very limited portion of them has been transcribed so far. Thus, we started a project of automatic speech recognition (ASR) for the Ainu language in order to contribute to the development of annotated language archives. In this paper, we report speech corpus development and the structure and performance of end-to-end ASR for Ainu. We investigated four modeling units (phone, syllable, word piece, and word) and found that the syllable-based model performed best in terms of both word and phone recognition accuracy, which were about 60% and over 85%, respectively, in the speaker-open condition. Furthermore, word and phone accuracies of 80% and 90% have been achieved in a speaker-closed setting. We also found that multilingual ASR training with additional speech corpora of English and Japanese further improves the speaker-open test accuracy.

2018

pdf bib abs

A Unified Neural Architecture for Joint Dialog Act Segmentation and Recognition in Spoken Dialog System
Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

In spoken dialog systems (SDSs), dialog act (DA) segmentation and recognition provide essential information for response generation. A majority of previous works assumed ground-truth segmentation of DA units, which is not available from automatic speech recognition (ASR) in SDS. We propose a unified architecture based on neural networks, which consists of a sequence tagger for segmentation and a classifier for recognition. The DA recognition model is based on hierarchical neural networks to incorporate the context of preceding sentences. We investigate sharing some layers of the two components so that they can be trained jointly and learn generalized features from both tasks. An evaluation on the Switchboard Dialog Act (SwDA) corpus shows that the jointly-trained models outperform independently-trained models, single-step models, and other reported results in DA segmentation, recognition, and joint tasks.

2017

pdf bib abs

Joint Learning of Dialog Act Segmentation and Recognition in Spoken Dialog Using Neural Networks
Tianyu Zhao | Tatsuya Kawahara
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Dialog act segmentation and recognition are basic natural language understanding tasks in spoken dialog systems. This paper investigates a unified architecture for these two tasks, which aims to improve the model’s performance on both of the tasks. Compared with past joint models, the proposed architecture can (1) incorporate contextual information in dialog act recognition, and (2) integrate models for tasks of different levels as a whole, i.e. dialog act segmentation on the word level and dialog act recognition on the segment level. Experimental results show that the joint training system outperforms the simple cascading system and the joint coding system on both dialog act segmentation and recognition tasks.

pdf bib abs

Attentive listening system with backchanneling, response generation and flexible turn-taking
Divesh Lala | Pierrick Milhorat | Koji Inoue | Masanari Ishida | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

Attentive listening systems are designed to let people, especially senior people, keep talking to maintain communication ability and mental health. This paper addresses key components of an attentive listening system which encourages users to talk smoothly. First, we introduce continuous prediction of end-of-utterances and generation of backchannels, rather than generating backchannels after end-point detection of utterances. This improves subjective evaluations of backchannels. Second, we propose an effective statement response mechanism which detects focus words and responds in the form of a question or partial repeat. This can be applied to any statement. Moreover, a flexible turn-taking mechanism is designed which uses backchannels or fillers when the turn-switch is ambiguous. These techniques are integrated into a humanoid robot to conduct attentive listening. We test the feasibility of the system in a pilot experiment and show that it can produce coherent dialogues during conversation.

2016

pdf bib

Talking with ERICA, an autonomous android
Koji Inoue | Pierrick Milhorat | Divesh Lala | Tianyu Zhao | Tatsuya Kawahara
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib abs

Automatic Speech Recognition Errors as a Predictor of L2 Listening Difficulties
Maryam Sadat Mirzaei | Kourosh Meshgi | Tatsuya Kawahara
Proceedings of the Workshop on Computational Linguistics for Linguistic Complexity (CL4LC)

This paper investigates the use of automatic speech recognition (ASR) errors as indicators of second language (L2) learners’ listening difficulties and in doing so strives to overcome the shortcomings of the Partial and Synchronized Caption (PSC) system. PSC is a system that generates a partial caption including difficult words detected based on high speech rate, low frequency, and specificity. To improve the choice of words in this system, and to explore a better method to detect speech challenges, ASR errors were investigated as a model of the L2 listener, hypothesizing that some of these errors are similar to those of language learners when transcribing the videos. To investigate this hypothesis, ASR errors in the transcription of several TED talks were analyzed and compared with PSC’s selected words. Both the overlapping and mismatching cases were analyzed to investigate possible improvements for the PSC system. Those ASR errors that were not detected by PSC as cases of learners’ difficulties were further analyzed and classified into four categories: homophones, minimal pairs, breached boundaries and negatives. These errors were embedded into the baseline PSC to make the enhanced version and were evaluated in an experiment with L2 learners. The results indicated that the enhanced version, which encompasses the ASR errors, addresses most of the L2 learners’ difficulties and better assists them in comprehending challenging video segments as compared with the baseline.

2014

pdf bib

Information Navigation System Based on POMDP that Tracks User Focus
Koichiro Yoshino | Tatsuya Kawahara
Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL)

pdf bib abs

Japanese-to-English patent translation system based on domain-adapted word segmentation and post-ordering
Katsuhito Sudoh | Masaaki Nagata | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track

This paper presents a Japanese-to-English statistical machine translation system specialized for patent translation. Patents are practically useful technical documents, but their translation needs different efforts from general-purpose translation. There are two important problems in the Japanese-to-English patent translation: long distance reordering and lexical translation of many domain-specific terms. We integrated novel lexical translation of domain-specific terms with a syntax-based post-ordering framework that divides the machine translation problem into lexical translation and reordering explicitly for efficient syntax-based translation. The proposed lexical translation consists of a domain-adapted word segmentation and an unknown word transliteration. Experimental results show our system achieves better translation accuracy in BLEU and TER compared to the baseline methods.

2013

pdf bib

Predicate Argument Structure Analysis using Partially Annotated Corpora
Koichiro Yoshino | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the Sixth International Joint Conference on Natural Language Processing

2012

pdf bib abs

Designing an Evaluation Framework for Spoken Term Detection and Spoken Document Retrieval at the NTCIR-9 SpokenDoc Task
Tomoyosi Akiba | Hiromitsu Nishizaki | Kiyoaki Aikawa | Tatsuya Kawahara | Tomoko Matsui
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe the evaluation framework for spoken document retrieval for the IR for the Spoken Documents Task, conducted in the ninth NTCIR Workshop. The two parts of this task were a spoken term detection (STD) subtask and an ad hoc spoken document retrieval subtask (SDR). Both subtasks target search terms, passages and documents included in academic and simulated lectures of the Corpus of Spontaneous Japanese. Seven teams participated in the STD subtask and five in the SDR subtask. The results obtained through the evaluation in the workshop are discussed.

pdf bib

Multi-modal Sensing and Analysis of Poster Conversations: Toward Smart Posterboard
Tatsuya Kawahara
Proceedings of the 13th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib

Machine Translation without Words through Substring Alignment
Graham Neubig | Taro Watanabe | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib

Language Modeling for Spoken Dialogue System based on Filtering using Predicate-Argument Structures
Koichiro Yoshino | Shinsuke Mori | Tatsuya Kawahara
Proceedings of COLING 2012

2011

pdf bib

Spoken Dialogue System based on Information Extraction using Similarity of Predicate Argument Structures
Koichiro Yoshino | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the SIGDIAL 2011 Conference

pdf bib

An Unsupervised Model for Joint Phrase Alignment and Extraction
Graham Neubig | Taro Watanabe | Eiichiro Sumita | Shinsuke Mori | Tatsuya Kawahara
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2008

pdf bib abs

The Spoken Document Processing Working Group, which is part of the special interest group of spoken language processing of the Information Processing Society of Japan, is developing a test collection for evaluation of spoken document retrieval systems. A prototype of the test collection consists of a set of textual queries, relevant segment lists, and transcriptions by an automatic speech recognition system, allowing retrieval from the Corpus of Spontaneous Japanese (CSJ). From about 100 initial queries, applying the criterion that a query should have more than five relevant segments consisting of about one-minute speech segments yielded 39 queries. Targeting the test collection, an ad hoc retrieval experiment was also conducted to assess the baseline retrieval performance by applying a standard method for spoken document retrieval.

pdf bib

Bayes Risk-based Dialogue Management for Document Retrieval System with Speech Interface
Teruhisa Misu | Tatsuya Kawahara
Coling 2008: Companion volume: Posters

2006

pdf bib abs

Dependency-structure Annotation to Corpus of Spontaneous Japanese
Kiyotaka Uchimoto | Ryoji Hamabe | Takehiko Maruyama | Katsuya Takanashi | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In Japanese, syntactic structure of a sentence is generally represented by the relationship between phrasal units, or bunsetsus in Japanese, based on a dependency grammar. In the same way, the syntactic structure of a sentence in a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ), is represented by dependency relationships between bunsetsus. This paper describes the criteria and definitions of dependency relationships between bunsetsus in the CSJ. The dependency structure of the CSJ is investigated, and the difference in the dependency structures of written text and spontaneous speech is discussed in terms of the dependency accuracies obtained by using a corpus-based model. It is shown that the accuracy of automatic dependency-structure analysis can be improved if characteristic phenomena of spontaneous speech such as self-corrections, basic utterance units in spontaneous speech, and bunsetsus that have no modifiee are detected and used for dependency-structure analysis.

pdf bib

Detection of Quotations and Inserted Clauses and Its Application to Dependency Structure Analysis in Spontaneous Japanese
Ryoji Hamabe | Kiyotaka Uchimoto | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions