Xiaodong Liu


2022

LiST: Lite Prompted Self-training Makes Parameter-efficient Few-shot Learners
Yaqing Wang | Subhabrata Mukherjee | Xiaodong Liu | Jing Gao | Ahmed Awadallah | Jianfeng Gao
Findings of the Association for Computational Linguistics: NAACL 2022

We present a new method LiST for efficient fine-tuning of large pre-trained language models (PLMs) in few-shot learning settings. LiST improves over recent methods that adopt prompt-based fine-tuning (FN) using two key techniques. The first is the use of self-training to leverage large amounts of unlabeled data for prompt-based FN in few-shot settings. We use self-training in conjunction with meta-learning for re-weighting noisy pseudo-prompt labels. Traditionally, self-training is expensive as it requires updating all the model parameters repeatedly. Therefore, we use a second technique for light-weight fine-tuning where we introduce a small number of task-specific parameters that are fine-tuned during self-training while keeping the PLM encoder frozen. Our experiments show that LiST can effectively leverage unlabeled data to improve the model performance for few-shot learning. Additionally, the fine-tuning process is efficient as it only updates a small percentage of the parameters and the overall model footprint is reduced since several tasks can share a common PLM encoder as backbone. We present a comprehensive study on six NLU tasks to validate the effectiveness of LiST. The results show that LiST improves by 35% over classic fine-tuning methods and 6% over prompt-based FN with 96% reduction in number of trainable parameters when fine-tuned with no more than 30 labeled examples from each task. With only 14M tunable parameters, LiST outperforms GPT-3 in-context learning by 33% on few-shot NLU tasks.

Open Domain Question Answering with A Unified Knowledge Interface
Kaixin Ma | Hao Cheng | Xiaodong Liu | Eric Nyberg | Jianfeng Gao
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The retriever-reader framework is popular for open-domain question answering (ODQA) due to its ability to use explicit knowledge. Although prior work has sought to increase the knowledge coverage by incorporating structured knowledge beyond text, accessing heterogeneous knowledge sources through a unified interface remains an open question. While data-to-text generation has the potential to serve as a universal interface for data and text, its feasibility for downstream tasks remains largely unknown. In this work, we bridge this gap and use the data-to-text method as a means for encoding structured knowledge for open-domain question answering. Specifically, we propose a verbalizer-retriever-reader framework for ODQA over data and text where verbalized tables from Wikipedia and graphs from Wikidata are used as augmented knowledge sources. We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines. Notably, our approach sets the single-model state-of-the-art on Natural Questions. Furthermore, our analyses indicate that verbalized knowledge is preferred for answer reasoning for both adapted and hot-swap settings.

2021

Adversarial Regularization as Stackelberg Game: An Unrolled Optimization Approach
Simiao Zuo | Chen Liang | Haoming Jiang | Xiaodong Liu | Pengcheng He | Jianfeng Gao | Weizhu Chen | Tuo Zhao
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Adversarial regularization has been shown to improve the generalization performance of deep learning models in various natural language processing tasks. Existing works usually formulate the method as a zero-sum game, which is solved by alternating gradient descent/ascent algorithms. Such a formulation treats the adversarial and the defending players equally, which is undesirable because only the defending player contributes to the generalization performance. To address this issue, we propose Stackelberg Adversarial Regularization (SALT), which formulates adversarial regularization as a Stackelberg game. This formulation induces a competition between a leader and a follower, where the follower generates perturbations, and the leader trains the model subject to the perturbations. Different from conventional approaches, in SALT, the leader is in an advantageous position. When the leader moves, it recognizes the strategy of the follower and takes the anticipated follower’s outcomes into consideration. Such a leader’s advantage enables us to improve the model fitting to the unperturbed data. The leader’s strategic information is captured by the Stackelberg gradient, which is obtained using an unrolling algorithm. Our experimental results on a set of machine translation and natural language understanding tasks show that SALT outperforms existing adversarial regularization baselines across all tasks. Our code is publicly available.

HittER: Hierarchical Transformers for Knowledge Graph Embeddings
Sanxing Chen | Xiaodong Liu | Jianfeng Gao | Jian Jiao | Ruofei Zhang | Yangfeng Ji
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

This paper examines the challenging problem of learning representations of entities and relations in a complex multi-relational knowledge graph. We propose HittER, a Hierarchical Transformer model to jointly learn Entity-relation composition and Relational contextualization based on a source entity’s neighborhood. Our proposed model consists of two different Transformer blocks: the bottom block extracts features of each entity-relation pair in the local neighborhood of the source entity and the top block aggregates the relational information from outputs of the bottom block. We further design a masked entity prediction task to balance information from the relational context and the source entity itself. Experimental results show that HittER achieves new state-of-the-art results on multiple link prediction datasets. We additionally propose a simple approach to integrate HittER into BERT and demonstrate its effectiveness on two Freebase factoid question answering datasets.

Reader-Guided Passage Reranking for Open-Domain Question Answering
Yuning Mao | Pengcheng He | Xiaodong Liu | Yelong Shen | Jianfeng Gao | Jiawei Han | Weizhu Chen
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Token-wise Curriculum Learning for Neural Machine Translation
Chen Liang | Haoming Jiang | Xiaodong Liu | Pengcheng He | Weizhu Chen | Jianfeng Gao | Tuo Zhao
Findings of the Association for Computational Linguistics: EMNLP 2021

Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of “easy” samples from training data at the early training stage. This is not always achievable for low-resource languages where the amount of training data is limited. To address such a limitation, we propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples. Specifically, the model learns to predict a short sub-sequence from the beginning part of each target sentence at the early stage of training. Then the sub-sequence is gradually expanded as the training progresses. Such a new curriculum design is inspired by the cumulative effect of translation errors, which makes the latter tokens more challenging to predict than the beginning ones. Extensive experiments show that our approach can consistently outperform baselines on five language pairs, especially for low-resource languages. Combining our approach with sentence-level methods further improves the performance of high-resource languages.
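
The prefix-growing idea described in this abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the function names, the linear growth schedule, the hard 0/1 cutoff, and the `start_fraction` parameter are all assumptions made for illustration. It only shows how a per-token loss mask could cover a short prefix of the target early in training and expand to the full sentence later.

```python
import math


def prefix_length(target_len, step, total_steps, start_fraction=0.2):
    """Number of leading target tokens that contribute to the loss.

    Hypothetical linear schedule: starts at roughly `start_fraction` of the
    sentence and grows to the full length by `total_steps`.
    """
    if target_len <= 0 or total_steps <= 0:
        raise ValueError("target_len and total_steps must be positive")
    progress = min(1.0, max(0.0, step / total_steps))
    start_len = max(1, min(target_len, math.ceil(start_fraction * target_len)))
    return start_len + int((target_len - start_len) * progress)


def loss_mask(target_len, step, total_steps, start_fraction=0.2):
    """Per-token weights: 1.0 for the trained prefix, 0.0 for the rest."""
    k = prefix_length(target_len, step, total_steps, start_fraction)
    return [1.0] * k + [0.0] * (target_len - k)


if __name__ == "__main__":
    for step in (0, 50, 100):
        print(step, loss_mask(10, step, 100))
```

In a training loop, such a mask would typically be multiplied into the per-token cross-entropy before averaging, so that later tokens are ignored until the schedule reaches them.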

ARCH: Efficient Adversarial Regularized Training with Caching
Simiao Zuo | Chen Liang | Haoming Jiang | Pengcheng He | Xiaodong Liu | Jianfeng Gao | Weizhu Chen | Tuo Zhao
Findings of the Association for Computational Linguistics: EMNLP 2021

Adversarial regularization can improve model generalization in many natural language processing tasks. However, conventional approaches are computationally expensive since they need to generate a perturbation for each sample in each epoch. We propose a new adversarial regularization method ARCH (adversarial regularization with caching), where perturbations are generated and cached once every several epochs. As caching all the perturbations imposes memory usage concerns, we adopt a K-nearest neighbors-based strategy to tackle this issue. The strategy only requires caching a small amount of perturbations, without introducing additional training time. We evaluate our proposed method on a set of neural machine translation and natural language understanding tasks. We observe that ARCH significantly eases the computational burden (saves up to 70% of computational time in comparison with conventional approaches). More surprisingly, by reducing the variance of stochastic gradients, ARCH produces a notably better (in most of the tasks) or comparable model generalization. Our code is publicly available.

UnitedQA: A Hybrid Approach for Open Domain Question Answering
Hao Cheng | Yelong Shen | Xiaodong Liu | Pengcheng He | Weizhu Chen | Jianfeng Gao
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

To date, most recent work under the retrieval-reader framework for open-domain QA focuses exclusively on either extractive or generative readers. In this paper, we study a hybrid approach for leveraging the strengths of both models. We apply novel techniques to enhance both extractive and generative readers built upon recent pretrained neural language models, and find that proper training methods can provide large improvements over previous state-of-the-art models. We demonstrate that a simple hybrid approach that combines answers from both readers can efficiently take advantage of extractive and generative answer inference strategies and outperforms single models as well as homogeneous ensembles. Our approach outperforms previous state-of-the-art models by 3.3 and 2.7 points in exact match on NaturalQuestions and TriviaQA respectively.

Generation-Augmented Retrieval for Open-Domain Question Answering
Yuning Mao | Pengcheng He | Xiaodong Liu | Yelong Shen | Jianfeng Gao | Jiawei Han | Weizhu Chen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We propose Generation-Augmented Retrieval (GAR) for answering open-domain questions, which augments a query through text generation of heuristically discovered relevant contexts without external resources as supervision. We demonstrate that the generated contexts substantially enrich the semantics of the queries and GAR with sparse representations (BM25) achieves comparable or better performance than state-of-the-art dense retrieval methods such as DPR. We show that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy. Moreover, as sparse and dense representations are often complementary, GAR can be easily combined with DPR to achieve even better performance. GAR achieves state-of-the-art performance on Natural Questions and TriviaQA datasets under the extractive QA setup when equipped with an extractive reader, and consistently outperforms other retrieval methods when the same generative reader is used.

Super Tickets in Pre-Trained Language Models: From Model Compression to Improving Generalization
Chen Liang | Simiao Zuo | Minshuo Chen | Haoming Jiang | Xiaodong Liu | Pengcheng He | Tuo Zhao | Weizhu Chen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The Lottery Ticket Hypothesis suggests that an over-parametrized network consists of “lottery tickets”, and training a certain collection of them (i.e., a subnetwork) can match the performance of the full model. In this paper, we study such a collection of tickets, which is referred to as “winning tickets”, in extremely over-parametrized models, e.g., pre-trained language models. We observe that at certain compression ratios, the generalization performance of the winning tickets can not only match but also exceed that of the full model. In particular, we observe a phase transition phenomenon: As the compression ratio increases, generalization performance of the winning tickets first improves then deteriorates after a certain threshold. We refer to the tickets on the threshold as “super tickets”. We further show that the phase transition is task and model dependent: as the model size becomes larger and the training data set becomes smaller, the transition becomes more pronounced. Our experiments on the GLUE benchmark show that the super tickets improve single task fine-tuning by 0.9 points on BERT-base and 1.0 points on BERT-large, in terms of task-average score. We also demonstrate that adaptively sharing the super tickets across tasks benefits multi-task learning.

Posterior Differential Regularization with f-divergence for Improving Model Robustness
Hao Cheng | Xiaodong Liu | Lis Pereira | Yaoliang Yu | Jianfeng Gao
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We address the problem of enhancing model robustness through regularization. Specifically, we focus on methods that regularize the model posterior difference between clean and noisy inputs. Theoretically, we provide a connection of two recent methods, Jacobian Regularization and Virtual Adversarial Training, under this framework. Additionally, we generalize the posterior differential regularization to the family of f-divergences and characterize the overall framework in terms of the Jacobian matrix. Empirically, we compare those regularizations and standard BERT training on a diverse set of tasks to provide a comprehensive profile of their effect on model generalization. For both fully supervised and semi-supervised settings, we show that regularizing the posterior difference with f-divergence can result in well-improved model robustness. In particular, with a proper f-divergence, a BERT-base model can achieve comparable generalization as its BERT-large counterpart for in-domain, adversarial and domain shift scenarios, indicating the great potential of the proposed framework for enhancing NLP model robustness.

Targeted Adversarial Training for Natural Language Understanding
Lis Pereira | Xiaodong Liu | Hao Cheng | Hoifung Poon | Jianfeng Gao | Ichiro Kobayashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We present a simple yet effective Targeted Adversarial Training (TAT) algorithm to improve adversarial training for natural language understanding. The key idea is to introspect current mistakes and prioritize adversarial training steps to where the model errs the most. Experiments show that TAT can significantly improve accuracy over standard adversarial training on GLUE and attain new state-of-the-art zero-shot results on XNLI. Our code will be released upon acceptance of the paper.

2020

SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
Haoming Jiang | Pengcheng He | Weizhu Chen | Xiaodong Liu | Jianfeng Gao | Tuo Zhao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Transfer learning has fundamentally changed the landscape of natural language processing (NLP). Many state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely high complexity of pre-trained models, aggressive fine-tuning often causes the fine-tuned model to overfit the training data of downstream tasks and fail to generalize to unseen data. To address such an issue in a principled manner, we propose a new learning framework for robust and efficient fine-tuning for pre-trained models to attain better generalization performance. The proposed framework contains two important ingredients: 1. Smoothness-inducing regularization, which effectively manages the complexity of the model; 2. Bregman proximal point optimization, which is an instance of trust-region methods and can prevent aggressive updating. Our experiments show that the proposed framework achieves new state-of-the-art performance on a number of NLP tasks including GLUE, SNLI, SciTail and ANLI. Moreover, it also outperforms the state-of-the-art T5 model, which is the largest pre-trained model containing 11 billion parameters, on GLUE.

RAT-SQL: Relation-Aware Schema Encoding and Linking for Text-to-SQL Parsers
Bailin Wang | Richard Shin | Xiaodong Liu | Oleksandr Polozov | Matthew Richardson
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

When translating natural language questions into SQL queries to answer questions from a database, contemporary semantic parsing models struggle to generalize to unseen database schemas. The generalization challenge lies in (a) encoding the database relations in an accessible way for the semantic parser, and (b) modeling alignment between database columns and their mentions in a given query. We present a unified framework, based on the relation-aware self-attention mechanism, to address schema encoding, schema linking, and feature representation within a text-to-SQL encoder. On the challenging Spider dataset this framework boosts the exact match accuracy to 57.2%, surpassing its best counterparts by 8.7% absolute improvement. Further augmented with BERT, it achieves the new state-of-the-art performance of 65.6% on the Spider leaderboard. In addition, we observe qualitative improvements in the model’s understanding of schema linking and alignment. Our implementation will be open-sourced at https://github.com/Microsoft/rat-sql.

The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding
Xiaodong Liu | Yu Wang | Jianshu Ji | Hao Cheng | Xueyun Zhu | Emmanuel Awa | Pengcheng He | Weizhu Chen | Hoifung Poon | Guihong Cao | Jianfeng Gao
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains. The software and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.

Adversarial Training for Commonsense Inference
Lis Pereira | Xiaodong Liu | Fei Cheng | Masayuki Asahara | Ichiro Kobayashi
Proceedings of the 5th Workshop on Representation Learning for NLP

We apply small perturbations to word embeddings and minimize the resultant adversarial risk to regularize the model. We exploit a novel combination of two different approaches to estimate these perturbations: 1) using the true label and 2) using the model prediction. Without relying on any human-crafted features, knowledge bases, or additional datasets other than the target datasets, our model boosts the fine-tuning performance of RoBERTa, achieving competitive results on multiple reading comprehension datasets that require commonsense inference.

A Tale of Two Linkings: Dynamically Gating between Schema Linking and Structural Linking for Text-to-SQL Parsing
Sanxing Chen | Aidan San | Xiaodong Liu | Yangfeng Ji
Proceedings of the 28th International Conference on Computational Linguistics

In Text-to-SQL semantic parsing, selecting the correct entities (tables and columns) for the generated SQL query is both crucial and challenging; the parser is required to connect the natural language (NL) question and the SQL query to the structured knowledge in the database. We formulate two linking processes to address this challenge: schema linking which links explicit NL mentions to the database and structural linking which links the entities in the output SQL with their structural relationships in the database schema. Intuitively, the effectiveness of these two linking processes changes based on the entity being generated, thus we propose to dynamically choose between them using a gating mechanism. Integrating the proposed method with two graph neural network-based semantic parsers together with BERT representations demonstrates substantial gains in parsing accuracy on the challenging Spider dataset. Analyses show that our proposed method helps to enhance the structure of the model output when generating complicated SQL queries and offers more explainable predictions.

Understanding the Difficulty of Training Transformers
Liyuan Liu | Xiaodong Liu | Jianfeng Gao | Weizhu Chen | Jiawei Han
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand what complicates Transformer training from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially: for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin (Adaptive model initialization) to stabilize the early stage’s training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance.

2019

Adversarial Domain Adaptation for Machine Reading Comprehension
Huazheng Wang | Zhe Gan | Xiaodong Liu | Jingjing Liu | Jianfeng Gao | Hongning Wang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

In this paper, we focus on unsupervised domain adaptation for Machine Reading Comprehension (MRC), where the source domain has a large amount of labeled data, while only unlabeled passages are available in the target domain. To this end, we propose an Adversarial Domain Adaptation framework (AdaMRC), where (i) pseudo questions are first generated for unlabeled passages in the target domain, and then (ii) a domain classifier is incorporated into an MRC model to predict which domain a given passage-question pair comes from. The classifier and the passage-question encoder are jointly trained using adversarial learning to enforce domain-invariant representation learning. Comprehensive evaluations demonstrate that our approach (i) is generalizable to different MRC models and datasets, (ii) can be combined with pre-trained large-scale language models (such as ELMo and BERT), and (iii) can be extended to semi-supervised learning.

A Hybrid Neural Network Model for Commonsense Reasoning
Pengcheng He | Xiaodong Liu | Weizhu Chen | Jianfeng Gao
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

This paper proposes a hybrid neural network (HNN) model for commonsense reasoning. An HNN consists of two component models, a masked language model and a semantic similarity model, which share a BERT-based contextual encoder but use different model-specific input and output layers. HNN obtains new state-of-the-art results on three classic commonsense reasoning tasks, pushing the WNLI benchmark to 89%, the Winograd Schema Challenge (WSC) benchmark to 75.1%, and the PDP60 benchmark to 90.0%. An ablation study shows that language models and semantic similarity models are complementary approaches to commonsense reasoning, and HNN effectively combines the strengths of both. The code and pre-trained models will be publicly available at https://github.com/namisan/mt-dnn.

Cyclical Annealing Schedule: A Simple Approach to Mitigating KL Vanishing
Hao Fu | Chunyuan Li | Xiaodong Liu | Jianfeng Gao | Asli Celikyilmaz | Lawrence Carin
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Variational autoencoders (VAEs) with an auto-regressive decoder have been applied to many natural language processing (NLP) tasks. The VAE objective consists of two terms, the KL regularization term and the reconstruction term, balanced by a weighting hyper-parameter 𝛽. One notorious training difficulty is that the KL term tends to vanish. In this paper we study different scheduling schemes for 𝛽, and show that KL vanishing is caused by the lack of good latent codes when training the decoder at the beginning of optimization. To remedy the issue, we propose a cyclical annealing schedule, which simply repeats the process of increasing 𝛽 multiple times. This new procedure allows us to learn more meaningful latent codes progressively by leveraging the results of previous learning cycles as a warm restart. The effectiveness of the cyclical annealing schedule is validated on a broad range of NLP tasks, including language modeling, dialog response generation and semi-supervised text classification.
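
As a rough illustration of the schedule this abstract describes, the Python sketch below computes 𝛽 for a given training step by repeating a ramp several times. It is a minimal sketch under assumptions, not the paper's code: the linear ramp shape, the `n_cycles` and `ramp_fraction` parameters, their default values, and the function name are all hypothetical.

```python
def cyclical_beta(step, total_steps, n_cycles=4, ramp_fraction=0.5, beta_max=1.0):
    """KL weight for `step` under a repeated (cyclical) ramp.

    Within each cycle, beta rises linearly from 0 to `beta_max` over the
    first `ramp_fraction` of the cycle, then stays at `beta_max` until the
    next cycle resets it to 0. All parameter choices here are illustrative.
    """
    if total_steps <= 0 or n_cycles <= 0:
        raise ValueError("total_steps and n_cycles must be positive")
    if not 0.0 < ramp_fraction <= 1.0:
        raise ValueError("ramp_fraction must be in (0, 1]")
    cycle_len = total_steps / n_cycles
    position = (step % cycle_len) / cycle_len  # position within the cycle, in [0, 1)
    return beta_max * min(1.0, position / ramp_fraction)


if __name__ == "__main__":
    # Print the schedule at a few steps of a 100-step run with 4 cycles.
    for step in (0, 5, 12, 24, 25, 30, 99):
        print(step, round(cyclical_beta(step, 100), 3))
```

The returned value would multiply the KL term of the VAE loss at each step; setting `n_cycles=1` reduces the sketch to a single monotonic annealing ramp.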

Unsupervised Deep Structured Semantic Models for Commonsense Reasoning
Shuohang Wang | Sheng Zhang | Yelong Shen | Xiaodong Liu | Jingjing Liu | Jianfeng Gao | Jing Jiang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Commonsense reasoning is fundamental to natural language understanding. While traditional methods rely heavily on human-crafted features and knowledge bases, we explore learning commonsense knowledge from a large amount of raw text via unsupervised learning. We propose two neural network models based on the Deep Structured Semantic Models (DSSM) framework to tackle two classic commonsense reasoning tasks, Winograd Schema challenges (WSC) and Pronoun Disambiguation (PDP). Evaluation shows that the proposed models effectively capture contextual information in the sentence and co-reference information between pronouns and nouns, and achieve significant improvement over previous state-of-the-art approaches.

Multi-task Learning with Sample Re-weighting for Machine Reading Comprehension
Yichong Xu | Xiaodong Liu | Yelong Shen | Jingjing Liu | Jianfeng Gao
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We propose a multi-task learning framework to learn a joint Machine Reading Comprehension (MRC) model that can be applied to a wide range of MRC tasks in different domains. Inspired by recent ideas of data selection in machine translation, we develop a novel sample re-weighting scheme to assign sample-specific weights to the loss. Empirical study shows that our approach can be applied to many existing MRC models. Combined with contextual representations from pre-trained language models (such as ELMo), we achieve new state-of-the-art results on a set of MRC benchmark datasets. We release our code at https://github.com/xycforgithub/MultiTask-MRC.

DoubleTransfer at MEDIQA 2019: Multi-Source Transfer Learning for Natural Language Understanding in the Medical Domain
Yichong Xu | Xiaodong Liu | Chunyuan Li | Hoifung Poon | Jianfeng Gao
Proceedings of the 18th BioNLP Workshop and Shared Task

This paper describes the system we entered in the MEDIQA-2019 competition. We use a multi-source transfer learning approach to transfer the knowledge from MT-DNN and SciBERT to natural language understanding tasks in the medical domain. For transfer learning fine-tuning, we use multi-task learning on NLI, RQE and QA tasks on general and medical domains to improve performance. The proposed methods prove effective for natural language understanding in the medical domain, and we ranked first on the QA task.

Multi-Task Deep Neural Networks for Natural Language Understanding
Xiaodong Liu | Pengcheng He | Weizhu Chen | Jianfeng Gao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we present a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations to help adapt to new tasks and domains. MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.7% (2.2% absolute improvement) as of February 25, 2019 on the latest GLUE test set. We also demonstrate using the SNLI and SciTail datasets that the representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. Our code and pre-trained models will be made publicly available.

Conversing by Reading: Contentful Neural Conversation with On-demand Machine Reading
Lianhui Qin | Michel Galley | Chris Brockett | Xiaodong Liu | Xiang Gao | Bill Dolan | Yejin Choi | Jianfeng Gao
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Although neural conversational models are effective in learning how to produce fluent responses, their primary challenge lies in knowing what to say to make the conversation contentful and non-vacuous. We present a new end-to-end approach to contentful neural conversation that jointly models response generation and on-demand machine reading. The key idea is to provide the conversation model with relevant long-form text on the fly as a source of external knowledge. The model performs QA-style reading comprehension on this text in response to each conversational turn, thereby allowing for more focused integration of external knowledge than has been possible in prior approaches. To support further research on knowledge-grounded conversation, we introduce a new large-scale conversation dataset grounded in external web pages (2.8M turns, 7.4M sentences of grounding). Both human evaluation and automated metrics show that our approach results in more contentful responses compared to a variety of previous methods, improving both the informativeness and diversity of generated output.

2018

Stochastic Answer Networks for Machine Reading Comprehension
Xiaodong Liu | Yelong Shen | Kevin Duh | Jianfeng Gao
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a simple yet robust stochastic answer network (SAN) that simulates multi-step reasoning in machine reading comprehension. Compared to previous work such as ReasoNet which used reinforcement learning to determine the number of steps, the unique feature is the use of a kind of stochastic prediction dropout on the answer module (final layer) of the neural network during the training. We show that this simple trick improves robustness and achieves results competitive to the state-of-the-art on the Stanford Question Answering Dataset (SQuAD), the Adversarial SQuAD, and the Microsoft MAchine Reading COmprehension Dataset (MS MARCO).
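
One way to picture the stochastic prediction dropout mentioned in this abstract is the Python sketch below. It assumes, beyond what the abstract states, that the answer module emits one probability distribution per reasoning step and that the final prediction is the average of the steps that survive dropout. The function name, the `drop_rate` default, and the fallback of keeping one random step when all are dropped are hypothetical choices for illustration.

```python
import random


def average_step_predictions(step_probs, drop_rate=0.4, training=True, rng=None):
    """Average per-step answer distributions, dropping whole steps in training.

    step_probs: list of equal-length probability lists, one per reasoning step.
    During training each step is dropped with probability `drop_rate`;
    at inference all steps are averaged.
    """
    if not step_probs:
        raise ValueError("step_probs must contain at least one step")
    rng = rng or random.Random()
    if training:
        kept = [p for p in step_probs if rng.random() >= drop_rate]
        if not kept:
            # Assumed fallback: never drop every step, keep one at random.
            kept = [rng.choice(step_probs)]
    else:
        kept = list(step_probs)
    return [sum(column) / len(kept) for column in zip(*kept)]


if __name__ == "__main__":
    steps = [[0.2, 0.8], [0.6, 0.4], [0.5, 0.5]]
    print(average_step_predictions(steps, training=False))
    print(average_step_predictions(steps, rng=random.Random(0)))
```

Because dropout acts on whole step predictions rather than on hidden units, no single reasoning step can dominate the averaged output during training.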

2017

An Empirical Analysis of Multiple-Turn Reasoning Strategies in Reading Comprehension Tasks
Yelong Shen | Xiaodong Liu | Kevin Duh | Jianfeng Gao
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Reading comprehension (RC) is a challenging task that requires synthesis of information across sentences and multiple turns of reasoning. Using a state-of-the-art RC model, we empirically investigate the performance of single-turn and multiple-turn reasoning on the SQuAD and MS MARCO datasets. The RC model is an end-to-end neural network with iterative attention, and uses reinforcement learning to dynamically control the number of turns. We find that multiple-turn reasoning outperforms single-turn reasoning for all question and answer types; further, we observe that enabling a flexible number of turns generally improves upon a fixed multiple-turn strategy. We achieve results competitive to the state-of-the-art on these two datasets.

Lexical Simplification with the Deep Structured Similarity Model
Lis Pereira | Xiaodong Liu | John Lee
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

We explore the application of a Deep Structured Similarity Model (DSSM) to ranking in lexical simplification. Our results show that the DSSM can effectively capture fine-grained features to perform semantic matching when ranking substitution candidates, outperforming the state-of-the-art on two standard datasets used for the task.

2015

Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval
Xiaodong Liu | Jianfeng Gao | Xiaodong He | Li Deng | Kevin Duh | Ye-yi Wang
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2013

Topic Models + Word Alignment = A Flexible Framework for Extracting Bilingual Dictionary from Comparable Corpus
Xiaodong Liu | Kevin Duh | Yuji Matsumoto
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

A Hybrid Chinese Spelling Correction Using Language Model and Statistical Machine Translation with Reranking
Xiaodong Liu | Kevin Cheng | Yanyan Luo | Kevin Duh | Yuji Matsumoto
Proceedings of the Seventh SIGHAN Workshop on Chinese Language Processing