Koichiro Yoshino

2025

pdf bib abs
Disambiguating Reference in Visually Grounded Dialogues through Joint Modeling of Textual and Multimodal Semantic Structures
Shun Inadumi | Nobuhiro Ueda | Koichiro Yoshino
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Multimodal reference resolution, including phrase grounding, aims to understand the semantic relations between mentions and real-world objects. Phrase grounding between images and their captions is a well-established task. In contrast, for real-world applications, it is essential to integrate textual and multimodal reference resolution to unravel the reference relations within dialogue, especially in handling ambiguities caused by pronouns and ellipses. This paper presents a framework that unifies textual and multimodal reference resolution by mapping mention embeddings to object embeddings and selecting mentions or objects based on their similarity. Our experiments show that learning textual reference resolution, such as coreference resolution and predicate-argument structure analysis, positively affects performance in multimodal reference resolution. In particular, our model with coreference resolution performs better in pronoun phrase grounding than representative models for this task, MDETR and GLIP. Our qualitative analysis demonstrates that incorporating textual reference relations strengthens the confidence scores between mentions, including pronouns and predicates, and objects, which can reduce the ambiguities that arise in visually grounded dialogues.

pdf bib abs
Teaching Text Agents to Learn Sequential Decision Making from Failure
Canasai Kruengkrai | Koichiro Yoshino
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text-based reinforcement-learning agents improve their policies by interacting with their environments to collect more training data. However, these self-collected data inevitably contain intermediate failed actions caused by attempting physically infeasible behaviors and/or hallucinations. Directly learning a policy from such trajectories can reinforce incorrect behaviors and reduce task success rates. In this paper, we propose a failed action-aware objective that suppresses the negative impact of failed actions during training by assigning zero return based on textual feedback. Building on this objective, we introduce a perturbation method that leverages unsuccessful trajectories to construct new successful ones that share the same goal. This allows agents to benefit from diverse experiences without further interaction with the environment. Experiments in ALFWorld and ScienceWorld demonstrate that our method significantly outperforms strong baselines and generalizes across environments. Code is available at https://github.com/riken-grp/text-agent.

Chat-oriented dialogue systems that deliver tangible benefits, such as sharing news or frailty prevention for seniors, require proactive acquisition of specific user information via chats on user-favored topics. This study proposes the Proactive Information Acquisition (PIA) task to support the development of these systems. In this task, a system needs to acquire a user’s answers to predefined questions without making the user feel abrupt while engaging in a chat on a predefined topic. We created and analyzed a dataset of 650 PIA chats, identifying key challenges and effective strategies for recent LLMs. Our system, designed from these insights, surpassed the performance of LLMs prompted solely with task instructions. Finally, we demonstrate that automatic evaluation of this task is reasonably accurate, suggesting its potential as a framework to efficiently develop techniques for systems dealing with complex dialogue goals, extending beyond the scope of PIA alone. Our dataset is available at: https://github.com/CyberAgentAILab/PIA

pdf bib abs
Pragmatic Theories Enhance Understanding of Implied Meanings in LLMs
Takuma Sato | Seiya Kawano | Koichiro Yoshino
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

The ability to accurately interpret implied meanings plays a crucial role in human communication and language use, and language models are also expected to possess this capability. This study demonstrates that providing language models with pragmatic theories as prompts is an effective in-context learning approach for tasks to understand implied meanings. Specifically, we propose an approach in which an overview of pragmatic theories, such as Gricean pragmatics and Relevance Theory, is presented as a prompt to the language model, guiding it through a step-by-step reasoning process to derive a final interpretation. Experimental results showed that, compared to the baseline, which prompts intermediate reasoning without presenting pragmatic theories (0-shot Chain-of-Thought), our methods enabled language models to achieve up to 9.6% higher scores on pragmatic reasoning tasks. Furthermore, we show that even without explaining the details of pragmatic theories, merely mentioning their names in the prompt leads to a certain performance improvement (around 1-3%) in larger models compared to the baseline.

pdf bib abs
Modeling Turn-Taking Speed and Speaker Characteristics
Kazuyo Onishi | Hien Ohnaka | Koichiro Yoshino
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

Modeling turn-taking speed while considering speaker characteristics and the relationships between speakers is essential for realizing dialogue systems capable of natural interactions. In this study, we focused on dialogue participants’ roles, relationships, and personality, analyzing and modeling turn-taking speeds observed in real conversations. The analysis confirmed that the expression of these attributes—role, relationship, and personality—is closely associated with turn-taking speed. Based on these findings, we constructed a model that predicts the distribution of turn-taking speeds according to each attribute using a gamma distribution. Evaluation results demonstrated that appropriate parameter fitting to the three-parameter gamma distribution enables effective modeling of turn-taking speeds based on participants’ roles, relationships, and characteristics.

pdf bib abs
Multi-step or Direct: A Proactive Home-Assistant System Based on Commonsense Reasoning
Konosuke Yamasaki | Shohei Tanaka | Akishige Yuguchi | Seiya Kawano | Koichiro Yoshino
Proceedings of the 26th Annual Meeting of the Special Interest Group on Discourse and Dialogue

There is a growing expectation for the realization of proactive home-assistant robots that can assist users in their daily lives. It is essential to develop a framework that closely observes the user’s surrounding context, selectively extracts relevant information, and infers the user’s needs to proactively propose appropriate assistance. In this study, we first extend the Do-I-Demand dataset to define expected proactive assistance actions in domestic situations, where users make ambiguous utterances. These behaviors were defined based on common patterns of support that a majority of users would expect from a robot. We subsequently constructed a framework that infers users’ expected assistance actions from ambiguous utterances through commonsense reasoning. We explored two approaches: (1) multi-step reasoning using COMET as a commonsense reasoning engine, and (2) direct reasoning using large language models. Our experimental results suggest that both the multi-step and direct reasoning methods can successfully derive necessary assistance actions even when dealing with ambiguous user utterances.

2024

pdf bib abs
A Gaze-grounded Visual Question Answering Dataset for Clarifying Ambiguous Japanese Questions
Shun Inadumi | Seiya Kawano | Akishige Yuguchi | Yasutomo Kawanishi | Koichiro Yoshino
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Situated conversations, which refer to visual information as visual question answering (VQA), often contain ambiguities caused by reliance on directive information. This problem is exacerbated because some languages, such as Japanese, often omit subjective or objective terms. Such ambiguities in questions are often clarified by the contexts in conversational situations, such as joint attention with a user or user gaze information. In this study, we propose the Gaze-grounded VQA dataset (GazeVQA) that clarifies ambiguous questions using gaze information by focusing on a clarification process complemented by gaze information. We also propose a method that utilizes gaze target estimation results to improve the accuracy of GazeVQA tasks. Our experimental results showed that the proposed method improved the performance in some cases of a VQA system on GazeVQA and identified some typical problems of GazeVQA tasks that need to be improved.

pdf bib abs
J-CRe3: A Japanese Conversation Dataset for Real-world Reference Resolution
Nobuhiro Ueda | Hideko Habe | Akishige Yuguchi | Seiya Kawano | Yasutomo Kawanishi | Sadao Kurohashi | Koichiro Yoshino
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Understanding expressions that refer to the physical world is crucial for such human-assisting systems in the real world, as robots that must perform actions that are expected by users. In real-world reference resolution, a system must ground the verbal information that appears in user interactions to the visual information observed in egocentric views. To this end, we propose a multimodal reference resolution task and construct a Japanese Conversation dataset for Real-world Reference Resolution (J-CRe3). Our dataset contains egocentric video and dialogue audio of real-world conversations between two people acting as a master and an assistant robot at home. The dataset is annotated with crossmodal tags between phrases in the utterances and the object bounding boxes in the video frames. These tags include indirect reference relations, such as predicate-argument structures and bridging references as well as direct reference relations. We also constructed an experimental model and clarified the challenges in multimodal reference resolution tasks.

2023

pdf bib abs
Analysis of Style-Shifting on Social Media: Using Neural Language Model Conditioned by Social Meanings
Seiya Kawano | Shota Kanezaki | Angel Fernando Garcia Contreras | Akishige Yuguchi | Marie Katsurai | Koichiro Yoshino
Findings of the Association for Computational Linguistics: EMNLP 2023

In this paper, we propose a novel framework for evaluating style-shifting in social media conversations. Our proposed framework captures changes in an individual’s conversational style based on surprisals predicted by a personalized neural language model for individuals. Our personalized language model integrates not only the linguistic contents of conversations but also non-linguistic factors, such as social meanings, including group membership, personal attributes, and individual beliefs. We incorporate these factors directly or implicitly into our model, leveraging large, pre-trained language models and feature vectors derived from a relationship graph on social media. Compared to existing models, our personalized language model demonstrated superior performance in predicting an individual’s language in a test set. Furthermore, an analysis of style-shifting utilizing our proposed metric based on our personalized neural language model reveals a correlation between our metric and various conversation factors as well as human evaluation of style-shifting.

pdf bib
Language and Robotics: Toward Building Robots Coexisting with Human Society Using Language Interface
Yutaka Nakamura | Shuhei Kurita | Koichiro Yoshino
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: Tutorial Abstract

Narratives include a rich source of events unfolding over time and context. Automatic understanding of these events provides a summarised comprehension of the narrative for further computation (such as reasoning). In this paper, we study the Information Status (IS) of the events and propose a novel challenging task: the automatic identification of new events in a narrative. We define an event as a triplet of subject, predicate, and object. The event is categorized as new with respect to the discourse context and whether it can be inferred through commonsense reasoning. We annotated a publicly available corpus of narratives with the new events at sentence level using human annotators. We present the annotation protocol and study the quality of the annotation and the difficulty of the task. We publish the annotated dataset, annotation materials, and machine learning baseline models for the task of new event extraction for narrative understanding.

2022

pdf bib abs
Pseudo Ambiguous and Clarifying Questions Based on Sentence Structures Toward Clarifying Question Answering System
Yuya Nakano | Seiya Kawano | Koichiro Yoshino | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the Second DialDoc Workshop on Document-grounded Dialogue and Conversational Question Answering

Question answering (QA) with disambiguation questions is essential for practical QA systems because user questions often do not contain information enough to find their answers. We call this task clarifying question answering, a task to find answers to ambiguous user questions by disambiguating their intents through interactions. There are two major problems in building a clarifying question answering system: data preparation of possible ambiguous questions and the generation of clarifying questions. In this paper, we tackle these problems by sentence generation methods using sentence structures. Ambiguous questions are generated by eliminating a part of a sentence considering the sentence structure. Clarifying the question generation method based on case frame dictionary and sentence structure is also proposed. Our experimental results verify that our pseudo ambiguous question generation successfully adds ambiguity to questions. Moreover, the proposed clarifying question generation recovers the performance drop by asking the user for missing information.

2021

pdf bib abs
ARTA: Collection and Classification of Ambiguous Requests and Thoughtful Actions
Shohei Tanaka | Koichiro Yoshino | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Human-assisting systems such as dialogue systems must take thoughtful, appropriate actions not only for clear and unambiguous user requests, but also for ambiguous user requests, even if the users themselves are not aware of their potential requirements. To construct such a dialogue agent, we collected a corpus and developed a model that classifies ambiguous user requests into corresponding system actions. In order to collect a high-quality corpus, we asked workers to input antecedent user requests whose pre-defined actions could be regarded as thoughtful. Although multiple actions could be identified as thoughtful for a single user request, annotating all combinations of user requests and system actions is impractical. For this reason, we fully annotated only the test data and left the annotation of the training data incomplete. In order to train the classification model on such training data, we applied the positive/unlabeled (PU) learning method, which assumes that only a part of the data is labeled with positive examples. The experimental results show that the PU learning method achieved better performance than the general positive/negative (PN) learning method to classify thoughtful actions given an ambiguous user request.

2020

pdf bib abs
Reflection-based Word Attribute Transfer
Yoichi Ishibashi | Katsuhito Sudoh | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Word embeddings, which often represent such analogic relations as king - man + woman queen, can be used to change a word’s attribute, including its gender. For transferring king into queen in this analogy-based manner, we subtract a difference vector man - woman based on the knowledge that king is male. However, developing such knowledge is very costly for words and attributes. In this work, we propose a novel method for word attribute transfer based on reflection mappings without such an analogy operation. Experimental results show that our proposed method can transfer the word attributes of the given words without changing the words that do not have the target attributes.

pdf bib abs
Improving Spoken Language Understanding by Wisdom of Crowds
Koichiro Yoshino | Kana Ikeuchi | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the 28th International Conference on Computational Linguistics

Spoken language understanding (SLU), which converts user requests in natural language to machine-interpretable expressions, is becoming an essential task. The lack of training data is an important problem, especially for new system tasks, because existing SLU systems are based on statistical approaches. In this paper, we proposed to use two sources of the “wisdom of crowds,” crowdsourcing and knowledge community website, for improving the SLU system. We firstly collected paraphrasing variations for new system tasks through crowdsourcing as seed data, and then augmented them using similar questions from a knowledge community website. We investigated the effects of the proposed data augmentation method in SLU task, even with small seed data. In particular, the proposed architecture augmented more than 120,000 samples to improve SLU accuracies.

pdf bib abs
Emotional Speech Corpus for Persuasive Dialogue System
Sara Asai | Koichiro Yoshino | Seitaro Shinagawa | Sakriani Sakti | Satoshi Nakamura
Proceedings of the Twelfth Language Resources and Evaluation Conference

Expressing emotion is known as an efficient way to persuade one’s dialogue partner to accept one’s claim or proposal. Emotional expression in speech can express the speaker’s emotion more directly than using only emotion expression in the text, which will lead to a more persuasive dialogue. In this paper, we built a speech dialogue corpus in a persuasive scenario that uses emotional expressions to build a persuasive dialogue system with emotional expressions. We extended an existing text dialogue corpus by adding variations of emotional responses to cover different combinations of broad dialogue context and a variety of emotional states by crowd-sourcing. Then, we recorded emotional speech consisting of of collected emotional expressions spoken by a voice actor. The experimental results indicate that the collected emotional expressions with their speeches have higher emotional expressiveness for expressing the system’s emotion to users.

2019

pdf bib abs
Conversational Response Re-ranking Based on Event Causality and Role Factored Tensor Event Embedding
Shohei Tanaka | Koichiro Yoshino | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the First Workshop on NLP for Conversational AI

We propose a novel method for selecting coherent and diverse responses for a given dialogue context. The proposed method re-ranks response candidates generated from conversational models by using event causality relations between events in a dialogue history and response candidates (e.g., “be stressed out” precedes “relieve stress”). We use distributed event representation based on the Role Factored Tensor Model for a robust matching of event causality relations due to limited event causality knowledge of the system. Experimental results showed that the proposed method improved coherency and dialogue continuity of system responses.

pdf bib abs
Neural Conversation Model Controllable by Given Dialogue Act Based on Adversarial Learning and Label-aware Objective
Seiya Kawano | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 12th International Conference on Natural Language Generation

Building a controllable neural conversation model (NCM) is an important task. In this paper, we focus on controlling the responses of NCMs by using dialogue act labels of responses as conditions. We introduce an adversarial learning framework for the task of generating conditional responses with a new objective to a discriminator, which explicitly distinguishes sentences by using labels. This change strongly encourages the generation of label-conditioned sentences. We compared the proposed method with some existing methods for generating conditional responses. The experimental results show that our proposed method has higher controllability for dialogue acts even though it has higher or comparable naturalness to existing methods.

2018

pdf bib
Dialogue Scenario Collection of Persuasive Dialogue with Emotional Expressions via Crowdsourcing
Koichiro Yoshino | Yoko Ishikawa | Masahiro Mizukami | Yu Suzuki | Sakriani Sakti | Satoshi Nakamura
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Japanese Dialogue Corpus of Information Navigation and Attentive Listening Annotated with Extended ISO-24617-2 Dialogue Act Tags
Koichiro Yoshino | Hiroki Tanaka | Kyoshiro Sugiyama | Makoto Kondo | Satoshi Nakamura
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib abs
Unsupervised Counselor Dialogue Clustering for Positive Emotion Elicitation in Neural Dialogue System
Nurul Lubis | Sakriani Sakti | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

Positive emotion elicitation seeks to improve user’s emotional state through dialogue system interaction, where a chat-based scenario is layered with an implicit goal to address user’s emotional needs. Standard neural dialogue system approaches still fall short in this situation as they tend to generate only short, generic responses. Learning from expert actions is critical, as these potentially differ from standard dialogue acts. In this paper, we propose using a hierarchical neural network for response generation that is conditioned on 1) expert’s action, 2) dialogue context, and 3) user emotion, encoded from user input. We construct a corpus of interactions between a counselor and 30 participants following a negative emotional exposure to learn expert actions and responses in a positive emotion elicitation scenario. Instead of relying on the expensive, labor intensive, and often ambiguous human annotations, we unsupervisedly cluster the expert’s responses and use the resulting labels to train the network. Our experiments and evaluation show that the proposed approach yields lower perplexity and generates a larger variety of responses.

2017

The IWSLT 2017 evaluation campaign has organised three tasks. The Multilingual task, which is about training machine translation systems handling many-to-many language directions, including so-called zero-shot directions. The Dialogue task, which calls for the integration of context information in machine translation, in order to resolve anaphoric references that typically occur in human-human dialogue turns. And, finally, the Lecture task, which offers the challenge of automatically transcribing and translating real-life university lectures. Following the tradition of these reports, we will described all tasks in detail and present the results of all runs submitted by their participants.

pdf bib abs
Acquisition and Assessment of Semantic Content for the Generation of Elaborateness and Indirectness in Spoken Dialogue Systems
Louisa Pragst | Koichiro Yoshino | Wolfgang Minker | Satoshi Nakamura | Stefan Ultes
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

In a dialogue system, the dialogue manager selects one of several system actions and thereby determines the system’s behaviour. Defining all possible system actions in a dialogue system by hand is a tedious work. While efforts have been made to automatically generate such system actions, those approaches are mostly focused on providing functional system behaviour. Adapting the system behaviour to the user becomes a difficult task due to the limited amount of system actions available. We aim to increase the adaptability of a dialogue system by automatically generating variants of system actions. In this work, we introduce an approach to automatically generate action variants for elaborateness and indirectness. Our proposed algorithm extracts RDF triplets from a knowledge base and rates their relevance to the original system action to find suitable content. We show that the results of our algorithm are mostly perceived similarly to human generated elaborateness and indirectness and can be used to adapt a conversation to the current user and situation. We also discuss where the results of our algorithm are still lacking and how this could be improved: Taking into account the conversation topic as well as the culture of the user is likely to have beneficial effect on the user’s perception.

pdf bib abs
Neural Machine Translation via Binary Code Prediction
Yusuke Oda | Philip Arthur | Graham Neubig | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we propose a new method for calculating the output layer in neural machine translation systems. The method is based on predicting a binary code for each word and can reduce computation time/memory requirements of the output layer to be logarithmic in vocabulary size in the best case. In addition, we also introduce two advanced approaches to improve the robustness of the proposed model: using error-correcting codes and combining softmax and binary codes. Experiments on two English-Japanese bidirectional translation tasks show proposed models achieve BLEU scores that approach the softmax, while reducing memory usage to the order of less than 1/10 and improving decoding speed on CPUs by x5 to x10.

pdf bib abs
An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation
Makoto Morishita | Yusuke Oda | Graham Neubig | Koichiro Yoshino | Katsuhito Sudoh | Satoshi Nakamura
Proceedings of the First Workshop on Neural Machine Translation

Training of neural machine translation (NMT) models usually uses mini-batches for efficiency purposes. During the mini-batched training process, it is necessary to pad shorter sentences in a mini-batch to be equal in length to the longest sentence therein for efficient computation. Previous work has noted that sorting the corpus based on the sentence length before making mini-batches reduces the amount of padding and increases the processing speed. However, despite the fact that mini-batch creation is an essential step in NMT training, widely used NMT toolkits implement disparate strategies for doing so, which have not been empirically validated or compared. This work investigates mini-batch creation strategies with experiments over two different datasets. Our results suggest that the choice of a mini-batch creation strategy has a large effect on NMT training and some length-based sorting strategies do not always work well compared with simple shuffling.

pdf bib abs
Information Navigation System with Discovering User Interests
Koichiro Yoshino | Yu Suzuki | Satoshi Nakamura
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

We demonstrate an information navigation system for sightseeing domains that has a dialogue interface for discovering user interests for tourist activities. The system discovers interests of a user with focus detection on user utterances, and proactively presents related information to the discovered user interest. A partially observable Markov decision process (POMDP)-based dialogue manager, which is extended with user focus states, controls the behavior of the system to provide information with several dialogue acts for providing information. We transferred the belief-update function and the policy of the manager from other system trained on a different domain to show the generality of defined dialogue acts for our information navigation system.

2016

pdf bib abs
Construction of Japanese Audio-Visual Emotion Database and Its Application in Emotion Recognition
Nurul Lubis | Randy Gomez | Sakriani Sakti | Keisuke Nakamura | Koichiro Yoshino | Satoshi Nakamura | Kazuhiro Nakadai
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Emotional aspects play a vital role in making human communication a rich and dynamic experience. As we introduce more automated system in our daily lives, it becomes increasingly important to incorporate emotion to provide as natural an interaction as possible. To achieve said incorporation, rich sets of labeled emotional data is prerequisite. However, in Japanese, existing emotion database is still limited to unimodal and bimodal corpora. Since emotion is not only expressed through speech, but also visually at the same time, it is essential to include multiple modalities in an observation. In this paper, we present the first audio-visual emotion corpora in Japanese, collected from 14 native speakers. The corpus contains 100 minutes of annotated and transcribed material. We performed preliminary emotion recognition experiments on the corpus and achieved an accuracy of 61.42% for five classes of emotion.

pdf bib abs
Parallel Speech Corpora of Japanese Dialects
Koichiro Yoshino | Naoki Hirayama | Shinsuke Mori | Fumihiko Takahashi | Katsutoshi Itoyama | Hiroshi G. Okuno
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Binary file summaries/549.html matches

pdf bib
Cultural Communication Idiosyncrasies in Human-Computer Interaction
Juliana Miehle | Koichiro Yoshino | Louisa Pragst | Stefan Ultes | Satoshi Nakamura | Wolfgang Minker
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

pdf bib
Analyzing the Effect of Entrainment on Dialogue Acts
Masahiro Mizukami | Koichiro Yoshino | Graham Neubig | David Traum | Satoshi Nakamura
Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue

2015