Katsuya Takanashi

2025

Do Multimodal Large Language Models Truly See What We Point At? Investigating Indexical, Iconic, and Symbolic Gesture Comprehension
Noriki Nishida | Koji Inoue | Hideki Nakayama | Mayumi Bono | Katsuya Takanashi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Understanding hand gestures is essential for human communication, yet it remains unclear how well multimodal large language models (MLLMs) comprehend them. In this paper, we examine MLLMs’ ability to interpret indexical gestures, which require external referential grounding, in comparison to iconic gestures, which depict imagery, and symbolic gestures, which are conventionally defined. We hypothesize that MLLMs, lacking real-world referential understanding, will struggle significantly with indexical gestures. To test this, we manually annotated 925 gesture instances from the Miraikan SC Corpus with five gesture type labels and analyzed gesture descriptions generated by state-of-the-art MLLMs, including GPT-4o. Our findings reveal a consistent weakness across models in interpreting indexical gestures, suggesting that MLLMs rely heavily on linguistic priors or commonsense knowledge rather than grounding their interpretations in visual or contextual cues.

2020

An Attentive Listening System with Android ERICA: Comparison of Autonomous and WOZ Interactions
Koji Inoue | Divesh Lala | Kenta Yamamoto | Shizuka Nakamura | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 21th Annual Meeting of the Special Interest Group on Discourse and Dialogue

We describe an attentive listening system for the autonomous android robot ERICA. The proposed system generates several types of listener responses: backchannels, repeats, elaborating questions, assessments, generic sentimental responses, and generic responses. In this paper, we report a subjective experiment with 20 elderly people. First, we evaluated each system utterance excluding backchannels and generic responses, in an offline manner. It was found that most of the system utterances were linguistically appropriate, and they elicited positive reactions from the subjects. Furthermore, 58.2% of the responses were acknowledged as being appropriate listener responses. We also compared the proposed system with a WOZ system where a human operator was operating the robot. From the subjective evaluation, the proposed system achieved comparable scores in basic skills of attentive listening such as encouragement to talk, focused on the talk, and actively listening. It was also found that there is still a gap between the system and the WOZ for more sophisticated skills such as dialogue understanding, showing interest, and empathy towards the user.

2017

Attentive listening system with backchanneling, response generation and flexible turn-taking
Divesh Lala | Pierrick Milhorat | Koji Inoue | Masanari Ishida | Katsuya Takanashi | Tatsuya Kawahara
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

Attentive listening systems are designed to let people, especially senior people, keep talking to maintain communication ability and mental health. This paper addresses key components of an attentive listening system which encourages users to talk smoothly. First, we introduce continuous prediction of end-of-utterances and generation of backchannels, rather than generating backchannels after end-point detection of utterances. This improves subjective evaluations of backchannels. Second, we propose an effective statement response mechanism which detects focus words and responds in the form of a question or partial repeat. This can be applied to any statement. Moreover, a flexible turn-taking mechanism is designed which uses backchannels or fillers when the turn-switch is ambiguous. These techniques are integrated into a humanoid robot to conduct attentive listening. We test the feasibility of the system in a pilot experiment and show that it can produce coherent dialogues during conversation.

2012

Annotation of response tokens and their triggering expressions in Japanese multi-party conversations
Yasuharu Den | Hanae Koiso | Katsuya Takanashi | Nao Yoshida
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

In this paper, we propose a new scheme for annotating response tokens (RTs) and their triggering expressions in Japanese multi-party conversations. In the proposed scheme, RTs are first identified and classified according to their forms, and then sub-classified according to their sequential positions in the discourse. To deeply study the contexts in which RTs are used, the scheme also provides procedures for annotating triggering expressions, which are considered to trigger the listener's production of RTs. RTs are classified according to whether or not there is a particular object or proposition in the speaker's turn for which the listener shows a positive or aligned stance. Triggering expressions are then identified in the speaker's turn; they include surprising facts and other newsworthy things, opinions and assessments, focus of a response to a question or repair initiation, keywords in narratives, and embedded propositions quoted from other's statement or thought, which are to be agreed upon, assessed, or noticed. As an illustrative application of our scheme, we present a preliminary analysis on the distribution of the latency of the listener's response to the triggering expression, showing how it differs according to RT's forms and positions.

2010

Two-level Annotation of Utterance-units in Japanese Dialogs: An Empirically Emerged Scheme
Yasuharu Den | Hanae Koiso | Takehiko Maruyama | Kikuo Maekawa | Katsuya Takanashi | Mika Enomoto | Nao Yoshida
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we propose a scheme for annotating utterance-level units in Japanese dialogs, which emerged from an analysis of the interrelationship among four schemes, i) inter-pausal units, ii) intonation units, iii) clause units, and iv) pragmatic units. The associations among the labels of these four units were illustrated by multiple correspondence analysis and hierarchical cluster analysis. Based on these results, we prescribe utterance-unit identification rules, which identify two sorts of utterance-units with different granularities: short and long utterance-units. Short utterance-units are identified by acoustic and prosodic disjuncture, and they are considered to constitute units of speaker's planning and hearer's understanding. Long utterance-units, on the other hand, are recognized by syntactic and pragmatic disjuncture, and they are regarded as units of interaction. We explore some characteristics of these utterance-units, focusing particularly on unit duration and syntactic property, other participants' responses, and mismatch between the two-levels. We also discuss how our two-level utterance-units are useful in analyzing cognitive and communicative aspects of spoken dialogs.

2008

2006

Dependency-structure Annotation to Corpus of Spontaneous Japanese
Kiyotaka Uchimoto | Ryoji Hamabe | Takehiko Maruyama | Katsuya Takanashi | Tatsuya Kawahara | Hitoshi Isahara
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In Japanese, syntactic structure of a sentence is generally represented by the relationship between phrasal units, or bunsetsus in Japanese, based on a dependency grammar. In the same way, the syntactic structure of a sentence in a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ), is represented by dependency relationships between bunsetsus. This paper describes the criteria and definitions of dependency relationships between bunsetsus in the CSJ. The dependency structure of the CSJ is investigated, and the difference in the dependency structures of written text and spontaneous speech is discussed in terms of the dependency accuracies obtained by using a corpus-based model. It is shown that the accuracy of automatic dependency-structure analysis can be improved if characteristic phenomena of spontaneous speech such as self-corrections, basic utterance units in spontaneous speech, and bunsetsus that have no modifiee are detected and used for dependency-structure analysis.
