Seungwhan Moon


pdf bib
KETOD: Knowledge-Enriched Task-Oriented Dialogue
Zhiyu Chen | Bing Liu | Seungwhan Moon | Chinnadhurai Sankar | Paul Crook | William Yang Wang
Findings of the Association for Computational Linguistics: NAACL 2022

Existing studies in dialogue system research mostly treat task-oriented dialogue and chit-chat as separate domains. Towards building a human-like assistant that can converse naturally and seamlessly with users, it is important to build a dialogue system that conducts both types of conversations effectively. In this work, we investigate how task-oriented dialogue and knowledge-grounded chit-chat can be effectively integrated into a single model. To this end, we create a new dataset, KETOD (Knowledge-Enriched Task-Oriented Dialogue), where we naturally enrich task-oriented dialogues with chit-chat based on relevant entity knowledge. We also propose two new models, SimpleToDPlus and Combiner, for the proposed task. Experimental results on both automatic and human evaluations show that the proposed methods can significantly improve the performance in knowledge-enriched response generation while maintaining a competitive task-oriented dialog performance. We believe our new dataset will be a valuable resource for future studies. Our dataset and code are publicly available at


pdf bib
SIMMC 2.0: A Task-oriented Dialog Dataset for Immersive Multimodal Conversations
Satwik Kottur | Seungwhan Moon | Alborz Geramifard | Babak Damavandi
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Next generation task-oriented dialog systems need to understand conversational contexts with their perceived surroundings, to effectively help users in the real-world multimodal environment. Existing task-oriented dialog datasets aimed towards virtual assistance fall short and do not situate the dialog in the user’s multimodal context. To overcome, we present a new dataset for Situated and Interactive Multimodal Conversations, SIMMC 2.0, which includes 11K task-oriented user<->assistant dialogs (117K utterances) in the shopping domain, grounded in immersive and photo-realistic scenes. The dialogs are collection using a two-phase pipeline: (1) A novel multimodal dialog simulator generates simulated dialog flows, with an emphasis on diversity and richness of interactions, (2) Manual paraphrasing of generating utterances to draw from natural language distribution. We provide an in-depth analysis of the collected dataset, and describe in detail the four main benchmark tasks we propose for SIMMC 2.0. Our baseline model, powered by the state-of-the-art language model, shows promising results, and highlights new challenges and directions for the community to study.

pdf bib
Continual Learning in Task-Oriented Dialogue Systems
Andrea Madotto | Zhaojiang Lin | Zhenpeng Zhou | Seungwhan Moon | Paul Crook | Bing Liu | Zhou Yu | Eunjoon Cho | Pascale Fung | Zhiguang Wang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Continual learning in task-oriented dialogue systems allows the system to add new domains and functionalities overtime after deployment, without incurring the high cost of retraining the whole system each time. In this paper, we propose a first-ever continual learning benchmark for task-oriented dialogue systems with 37 domains to be learned continuously in both modularized and end-to-end learning settings. In addition, we implement and compare multiple existing continual learning baselines, and we propose a simple yet effective architectural method based on residual adapters. We also suggest that the upper bound performance of continual learning should be equivalent to multitask learning when data from all domain is available at once. Our experiments demonstrate that the proposed architectural method and a simple replay-based strategy perform better, by a large margin, compared to other continuous learning techniques, and only slightly worse than the multitask learning upper bound while being 20X faster in learning new domains. We also report several trade-offs in terms of parameter usage, memory size and training time, which are important in the design of a task-oriented dialogue system. The proposed benchmark is released to promote more research in this direction.

pdf bib
Zero-Shot Dialogue State Tracking via Cross-Task Transfer
Zhaojiang Lin | Bing Liu | Andrea Madotto | Seungwhan Moon | Zhenpeng Zhou | Paul Crook | Zhiguang Wang | Zhou Yu | Eunjoon Cho | Rajen Subba | Pascale Fung
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Zero-shot transfer learning for dialogue state tracking (DST) enables us to handle a variety of task-oriented dialogue domains without the expense of collecting in-domain data. In this work, we propose to transfer the cross-task knowledge from general question answering (QA) corpora for the zero-shot DST task. Specifically, we propose TransferQA, a transferable generative QA model that seamlessly combines extractive QA and multi-choice QA via a text-to-text transformer framework, and tracks both categorical slots and non-categorical slots in DST. In addition, we introduce two effective ways to construct unanswerable questions, namely, negative question sampling and context truncation, which enable our model to handle none value slots in the zero-shot DST setting. The extensive experiments show that our approaches substantially improve the existing zero-shot and few-shot results on MultiWoz. Moreover, compared to the fully trained baseline on the Schema-Guided Dialogue dataset, our approach shows better generalization ability in unseen domains.

pdf bib
NUANCED: Natural Utterance Annotation for Nuanced Conversation with Estimated Distributions
Zhiyu Chen | Honglei Liu | Hu Xu | Seungwhan Moon | Hao Zhou | Bing Liu
Findings of the Association for Computational Linguistics: EMNLP 2021

Existing conversational systems are mostly agent-centric, which assumes the user utterances will closely follow the system ontology. However, in real-world scenarios, it is highly desirable that users can speak freely and naturally. In this work, we attempt to build a user-centric dialogue system for conversational recommendation. As there is no clean mapping for a user’s free form utterance to an ontology, we first model the user preferences as estimated distributions over the system ontology and map the user’s utterances to such distributions. Learning such a mapping poses new challenges on reasoning over various types of knowledge, ranging from factoid knowledge, commonsense knowledge to the users’ own situations. To this end, we build a new dataset named NUANCED that focuses on such realistic settings, with 5.1k dialogues, 26k turns of high-quality user responses. We conduct experiments, showing both the usefulness and challenges of our problem setting. We believe NUANCED can serve as a valuable resource to push existing research from the agent-centric system to the user-centric system. The code and data are publicly available.

pdf bib
DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
Hung Le | Chinnadhurai Sankar | Seungwhan Moon | Ahmad Beirami | Alborz Geramifard | Satwik Kottur
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogue. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from 11k CATER synthetic videos and contains 10 instances of 10-round dialogues for each video, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset are publicly available.

pdf bib
Adding Chit-Chat to Enhance Task-Oriented Dialogues
Kai Sun | Seungwhan Moon | Paul Crook | Stephen Roller | Becka Silvert | Bing Liu | Zhiguang Wang | Honglei Liu | Eunjoon Cho | Claire Cardie
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Existing dialogue corpora and models are typically designed under two disjoint motives: while task-oriented systems focus on achieving functional goals (e.g., booking hotels), open-domain chatbots aim at making socially engaging conversations. In this work, we propose to integrate both types of systems by Adding Chit-Chat to ENhance Task-ORiented dialogues (ACCENTOR), with the goal of making virtual assistant conversations more engaging and interactive. Specifically, we propose a Human <-> AI collaborative data collection approach for generating diverse chit-chat responses to augment task-oriented dialogues with minimal annotation effort. We then present our new chit-chat-based annotations to 23.8K dialogues from two popular task-oriented datasets (Schema-Guided Dialogue and MultiWOZ 2.1) and demonstrate their advantage over the originals via human evaluation. Lastly, we propose three new models for adding chit-chat to task-oriented dialogues, explicitly trained to predict user goals and to generate contextually relevant chit-chat responses. Automatic and human evaluations show that, compared with the state-of-the-art task-oriented baseline, our models can code-switch between task and chit-chat to be more engaging, interesting, knowledgeable, and humanlike, while maintaining competitive task performance.

pdf bib
Leveraging Slot Descriptions for Zero-Shot Cross-Domain Dialogue StateTracking
Zhaojiang Lin | Bing Liu | Seungwhan Moon | Paul Crook | Zhenpeng Zhou | Zhiguang Wang | Zhou Yu | Andrea Madotto | Eunjoon Cho | Rajen Subba
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Zero-shot cross-domain dialogue state tracking (DST) enables us to handle unseen domains without the expense of collecting in-domain data. In this paper, we propose a slot descriptions enhanced generative approach for zero-shot cross-domain DST. Specifically, our model first encodes a dialogue context and a slot with a pre-trained self-attentive encoder, and generates slot value in auto-regressive manner. In addition, we incorporate Slot Type Informed Descriptions that capture the shared information of different slots to facilitates the cross-domain knowledge transfer. Experimental results on MultiWOZ shows that our model significantly improve existing state-of-the-art results in zero-shot cross-domain setting.

pdf bib
An Analysis of State-of-the-Art Models for Situated Interactive MultiModal Conversations (SIMMC)
Satwik Kottur | Paul Crook | Seungwhan Moon | Ahmad Beirami | Eunjoon Cho | Rajen Subba | Alborz Geramifard
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

There is a growing interest in virtual assistants with multimodal capabilities, e.g., inferring the context of a conversation through scene understanding. The recently released situated and interactive multimodal conversations (SIMMC) dataset addresses this trend by enabling research to create virtual assistants, which are capable of taking into account the scene that user sees when conversing with the user and also interacting with items in the scene. The SIMMC dataset is novel in that it contains fully annotated user-assistant, task-orientated dialogs where the user and an assistant co-observe the same visual elements and the latter can take actions to update the scene. The SIMMC challenge, held as part of theNinth Dialog System Technology Challenge(DSTC9), propelled the development of various models which together set a new state-of-the-art on the SIMMC dataset. In this work, we compare and analyze these models to identify‘what worked?’, and the remaining gaps;‘whatnext?’. Our analysis shows that even though pretrained language models adapted to this set-ting show great promise, there are indications that multimodal context isn’t fully utilised, and there is a need for better and scalable knowledge base integration. We hope this first-of-its-kind analysis for SIMMC models provides useful insights and opportunities for further research in multimodal conversational agents


pdf bib
Situated and Interactive Multimodal Conversations
Seungwhan Moon | Satwik Kottur | Paul Crook | Ankita De | Shivani Poddar | Theodore Levin | David Whitney | Daniel Difranco | Ahmad Beirami | Eunjoon Cho | Rajen Subba | Alborz Geramifard
Proceedings of the 28th International Conference on Computational Linguistics

Next generation virtual assistants are envisioned to handle multimodal inputs (e.g., vision, memories of previous interactions, and the user’s utterances), and perform multimodal actions (, displaying a route while generating the system’s utterance). We introduce Situated Interactive MultiModal Conversations (SIMMC) as a new direction aimed at training agents that take multimodal actions grounded in a co-evolving multimodal input context in addition to the dialog history. We provide two SIMMC datasets totalling ~13K human-human dialogs (~169K utterances) collected using a multimodal Wizard-of-Oz (WoZ) setup, on two shopping domains: (a) furniture – grounded in a shared virtual environment; and (b) fashion – grounded in an evolving set of images. Datasets include multimodal context of the items appearing in each scene, and contextual NLU, NLG and coreference annotations using a novel and unified framework of SIMMC conversational acts for both user and assistant utterances. Finally, we present several tasks within SIMMC as objective evaluation protocols, such as structural API prediction, response generation, and dialog state tracking. We benchmark a collection of existing models on these SIMMC tasks as strong baselines, and demonstrate rich multimodal conversational interactions. Our data, annotations, and models will be made publicly available.

pdf bib
User Memory Reasoning for Conversational Recommendation
Hu Xu | Seungwhan Moon | Honglei Liu | Bing Liu | Pararth Shah | Bing Liu | Philip Yu
Proceedings of the 28th International Conference on Computational Linguistics

We study an end-to-end approach for conversational recommendation that dynamically manages and reasons over users’ past (offline) preferences and current (online) requests through a structured and cumulative user memory knowledge graph. This formulation extends existing state tracking beyond the boundary of a single dialog to user state tracking (UST). For this study, we create a new Memory Graph (MG) <-> Conversational Recommendation parallel corpus called MGConvRex with 7K+ human-to-human role-playing dialogs, grounded on a large-scale user memory bootstrapped from real-world user scenarios. MGConvRex captures human-level reasoning over user memory and has disjoint training/testing sets of users for zero-shot (cold-start) reasoning for recommendation. We propose a simple yet expandable formulation for constructing and updating the MG, and an end-to-end graph-based reasoning model that updates MG from unstructured utterances and predicts optimal dialog policies (eg recommendation) based on updated MG. The prediction of our proposed model inherits the graph structure, providing a natural way to explain policies. Experiments are conducted for both offline metrics and online simulation, showing competitive results.

pdf bib
Information Seeking in the Spirit of Learning: A Dataset for Conversational Curiosity
Pedro Rodriguez | Paul Crook | Seungwhan Moon | Zhiguang Wang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Open-ended human learning and information-seeking are increasingly mediated by digital assistants. However, such systems often ignore the user’s pre-existing knowledge. Assuming a correlation between engagement and user responses such as “liking” messages or asking followup questions, we design a Wizard-of-Oz dialog task that tests the hypothesis that engagement increases when users are presented with facts related to what they know. Through crowd-sourcing of this experiment, we collect and release 14K dialogs (181K utterances) where users and assistants converse about geographic topics like geopolitical entities and locations. This dataset is annotated with pre-existing user knowledge, message-level dialog acts, grounding to Wikipedia, and user reactions to messages. Responses using a user’s prior knowledge increase engagement. We incorporate this knowledge into a multi-task model that reproduces human assistant policies and improves over a bert content model by 13 mean reciprocal rank points.


pdf bib
Memory Grounded Conversational Reasoning
Seungwhan Moon | Pararth Shah | Rajen Subba | Anuj Kumar
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

We demonstrate a conversational system which engages the user through a multi-modal, multi-turn dialog over the user’s memories. The system can perform QA over memories by responding to user queries to recall specific attributes and associated media (e.g. photos) of past episodic memories. The system can also make proactive suggestions to surface related events or facts from past memories to make conversations more engaging and natural. To implement such a system, we collect a new corpus of memory grounded conversations, which comprises human-to-human role-playing dialogs given synthetic memory graphs with simulated attributes. Our proof-of-concept system operates on these synthetic memory graphs, however it can be trained and applied to real-world user memory data (e.g. photo albums, etc.) We present the architecture of the proposed conversational system, and example queries that the system supports.

pdf bib
OpenDialKG: Explainable Conversational Reasoning with Attention-based Walks over Knowledge Graphs
Seungwhan Moon | Pararth Shah | Anuj Kumar | Rajen Subba
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We study a conversational reasoning model that strategically traverses through a large-scale common fact knowledge graph (KG) to introduce engaging and contextually diverse entities and attributes. For this study, we collect a new Open-ended Dialog <-> KG parallel corpus called OpenDialKG, where each utterance from 15K human-to-human role-playing dialogs is manually annotated with ground-truth reference to corresponding entities and paths from a large-scale KG with 1M+ facts. We then propose the DialKG Walker model that learns the symbolic transitions of dialog contexts as structured traversals over KG, and predicts natural entities to introduce given previous dialog contexts via a novel domain-agnostic, attention-based graph path decoder. Automatic and human evaluations show that our model can retrieve more natural and human-like responses than the state-of-the-art baselines or rule-based models, in both in-domain and cross-domain tasks. The proposed model also generates a KG walk path for each entity retrieved, providing a natural way to explain conversational reasoning.

pdf bib
Memory Graph Networks for Explainable Memory-grounded Question Answering
Seungwhan Moon | Pararth Shah | Anuj Kumar | Rajen Subba
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We introduce Episodic Memory QA, the task of answering personal user questions grounded on memory graph (MG), where episodic memories and related entity nodes are connected via relational edges. We create a new benchmark dataset first by generating synthetic memory graphs with simulated attributes, and by composing 100K QA pairs for the generated MG with bootstrapped scripts. To address the unique challenges for the proposed task, we propose Memory Graph Networks (MGN), a novel extension of memory networks to enable dynamic expansion of memory slots through graph traversals, thus able to answer queries in which contexts from multiple linked episodes and external knowledge are required. We then propose the Episodic Memory QA Net with multiple module networks to effectively handle various question types. Empirical results show improvement over the QA baselines in top-k answer prediction accuracy in the proposed task. The proposed model also generates a graph walk path and attention vectors for each predicted answer, providing a natural way to explain its QA reasoning.


pdf bib
Multimodal Named Entity Disambiguation for Noisy Social Media Posts
Seungwhan Moon | Leonardo Neves | Vitor Carvalho
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce the new Multimodal Named Entity Disambiguation (MNED) task for multimodal social media posts such as Snapchat or Instagram captions, which are composed of short captions with accompanying images. Social media posts bring significant challenges for disambiguation tasks because 1) ambiguity not only comes from polysemous entities, but also from inconsistent or incomplete notations, 2) very limited context is provided with surrounding words, and 3) there are many emerging entities often unseen during training. To this end, we build a new dataset called SnapCaptionsKB, a collection of Snapchat image captions submitted to public and crowd-sourced stories, with named entity mentions fully annotated and linked to entities in an external knowledge base. We then build a deep zeroshot multimodal network for MNED that 1) extracts contexts from both text and image, and 2) predicts correct entity in the knowledge graph embeddings space, allowing for zeroshot disambiguation of entities unseen in training set as well. The proposed model significantly outperforms the state-of-the-art text-only NED models, showing efficacy and potentials of the MNED task.

pdf bib
Multimodal Named Entity Recognition for Short Social Media Posts
Seungwhan Moon | Leonardo Neves | Vitor Carvalho
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We introduce a new task called Multimodal Named Entity Recognition (MNER) for noisy user-generated data such as tweets or Snapchat captions, which comprise short text with accompanying images. These social media posts often come in inconsistent or incomplete syntax and lexical notations with very limited surrounding textual contexts, bringing significant challenges for NER. To this end, we create a new dataset for MNER called SnapCaptions (Snapchat image-caption pairs submitted to public and crowd-sourced stories with fully annotated named entities). We then build upon the state-of-the-art Bi-LSTM word/character based NER models with 1) a deep image network which incorporates relevant visual context to augment textual information, and 2) a generic modality-attention module which learns to attenuate irrelevant modalities while amplifying the most informative ones to extract contexts from, adaptive to each sample and token. The proposed MNER model with modality attention significantly outperforms the state-of-the-art text-only NER models by successfully leveraging provided visual contexts, opening up potential applications of MNER on myriads of social media platforms.


pdf bib
Metaphor Detection with Topic Transition, Emotion and Cognition in Context
Hyeju Jang | Yohan Jo | Qinlan Shen | Michael Miller | Seungwhan Moon | Carolyn Rosé
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)


pdf bib
Metaphor Detection in Discourse
Hyeju Jang | Seungwhan Moon | Yohan Jo | Carolyn Rosé
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue


pdf bib
Identifying Student Leaders from MOOC Discussion Forums through Language Influence
Seungwhan Moon | Saloni Potdar | Lara Martin
Proceedings of the EMNLP 2014 Workshop on Analysis of Large Scale Social Interaction in MOOCs