Wenhao Liu


2022

pdf bib
DialFact: A Benchmark for Fact-Checking in Dialogue
Prakhar Gupta | Chien-Sheng Wu | Wenhao Liu | Caiming Xiong
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Fact-checking is an essential tool to mitigate the spread of misinformation and disinformation. We introduce the task of fact-checking in dialogue, which is a relatively unexplored area. We construct DialFact, a testing benchmark dataset of 22,245 annotated conversational claims, paired with pieces of evidence from Wikipedia. There are three sub-tasks in DialFact: 1) Verifiable claim detection task distinguishes whether a response carries verifiable factual information; 2) Evidence retrieval task retrieves the most relevant Wikipedia snippets as evidence; 3) Claim verification task predicts a dialogue response to be supported, refuted, or not enough information. We found that existing fact-checking models trained on non-dialogue data like FEVER fail to perform well on our task, and thus, we propose a simple yet data-efficient solution to effectively improve fact-checking performance in dialogue. We point out unique challenges in DialFact such as handling the colloquialisms, coreferences, and retrieval ambiguities in the error analysis to shed light on future research in this direction.

pdf bib
QAConv: Question Answering on Informative Conversations
Chien-Sheng Wu | Andrea Madotto | Wenhao Liu | Pascale Fung | Caiming Xiong
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper introduces QAConv, a new question answering (QA) dataset that uses conversations as a knowledge source. We focus on informative conversations, including business emails, panel discussions, and work channels. Unlike open-domain and task-oriented dialogues, these conversations are usually long, complex, asynchronous, and involve strong domain knowledge. In total, we collect 34,608 QA pairs from 10,259 selected conversations with both human-written and machine-generated questions. We use a question generator and a dialogue summarizer as auxiliary tools to collect and recommend questions. The dataset has two testing scenarios: chunk mode and full mode, depending on whether the grounded partial conversation is provided or retrieved. Experimental results show that state-of-the-art pretrained QA systems have limited zero-shot performance and tend to predict our questions as unanswerable. Our dataset provides a new training and evaluation testbed to facilitate QA on conversations research.

2021

pdf bib
Few-Shot Intent Classification by Gauging Entailment Relationship Between Utterance and Semantic Label
Jin Qu | Kazuma Hashimoto | Wenhao Liu | Caiming Xiong | Yingbo Zhou
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Zhang et al. (2020) proposed to formulate few-shot intent classification as natural language inference (NLI) between query utterances and examples in the training set. The method is known as discriminative nearest neighbor classification or DNNC. Inspired by this work, we propose to simplify the NLI-style classification pipeline to be the entailment prediction on the utterance-semantic-label-pair (USLP). The semantic information in the labels can thus been infused into the classification process. Compared with DNNC, our proposed method is more efficient in both training and serving since it is based upon the entailment between query utterance and labels instead of all the training examples. The DNNC method requires more than one example per intent while the USLP approach does not have such constraint. In the 1-shot experiments on the CLINC150 (Larson et al., 2019) dataset, the USLP method outperforms traditional classification approach by >20 points (in-domain ac- curacy). We also find that longer and semantically meaningful labels tend to benefit model performance, however, the benefit shrinks as more training data is available.

pdf bib
Controllable Abstractive Dialogue Summarization with Sketch Supervision
Chien-Sheng Wu | Linqing Liu | Wenhao Liu | Pontus Stenetorp | Caiming Xiong
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

pdf bib
Discriminative Nearest Neighbor Few-Shot Intent Detection by Transferring Natural Language Inference
Jianguo Zhang | Kazuma Hashimoto | Wenhao Liu | Chien-Sheng Wu | Yao Wan | Philip Yu | Richard Socher | Caiming Xiong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Intent detection is one of the core components of goal-oriented dialog systems, and detecting out-of-scope (OOS) intents is also a practically important skill. Few-shot learning is attracting much attention to mitigate data scarcity, but OOS detection becomes even more challenging. In this paper, we present a simple yet effective approach, discriminative nearest neighbor classification with deep self-attention. Unlike softmax classifiers, we leverage BERT-style pairwise encoding to train a binary classifier that estimates the best matched training example for a user input. We propose to boost the discriminative ability by transferring a natural language inference (NLI) model. Our extensive experiments on a large-scale multi-domain intent detection task show that our method achieves more stable and accurate in-domain and OOS detection accuracy than RoBERTa-based classifiers and embedding-based nearest neighbor approaches. More notably, the NLI transfer enables our 10-shot model to perform competitively with 50-shot or even full-shot classifiers, while we can keep the inference time constant by leveraging a faster embedding retrieval model.

pdf bib
Simple Data Augmentation with the Mask Token Improves Domain Adaptation for Dialog Act Tagging
Semih Yavuz | Kazuma Hashimoto | Wenhao Liu | Nitish Shirish Keskar | Richard Socher | Caiming Xiong
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The concept of Dialogue Act (DA) is universal across different task-oriented dialogue domains - the act of “request” carries the same speaker intention whether it is for restaurant reservation or flight booking. However, DA taggers trained on one domain do not generalize well to other domains, which leaves us with the expensive need for a large amount of annotated data in the target domain. In this work, we investigate how to better adapt DA taggers to desired target domains with only unlabeled data. We propose MaskAugment, a controllable mechanism that augments text input by leveraging the pre-trained Mask token from BERT model. Inspired by consistency regularization, we use MaskAugment to introduce an unsupervised teacher-student learning scheme to examine the domain adaptation of DA taggers. Our extensive experiments on the Simulated Dialogue (GSim) and Schema-Guided Dialogue (SGD) datasets show that MaskAugment is useful in improving the cross-domain generalization for DA tagging.