Yu Su


2021

pdf bib
ReasonBERT: Pre-trained to Reason with Distant Supervision
Xiang Deng | Yu Su | Alyssa Lees | You Wu | Cong Yu | Huan Sun
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present ReasonBert, a pre-training method that augments language models with the ability to reason over long-range relations and multiple, possibly hybrid contexts. Unlike existing pre-training methods that only harvest learning signals from local contexts of naturally occurring texts, we propose a generalized notion of distant supervision to automatically connect multiple pieces of text and tables to create pre-training examples that require long-range reasoning. Different types of reasoning are simulated, including intersecting multiple pieces of evidence, bridging from one piece of evidence to another, and detecting unanswerable cases. We conduct a comprehensive evaluation on a variety of extractive question answering datasets ranging from single-hop to multi-hop and from text-only to table-only to hybrid that require various reasoning capabilities and show that ReasonBert achieves remarkable improvement over an array of strong baselines. Few-shot experiments further demonstrate that our pre-training method substantially improves sample efficiency.

pdf bib
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)
Royi Lachmy | Ziyu Yao | Greg Durrett | Milos Gligoric | Junyi Jessy Li | Ray Mooney | Graham Neubig | Yu Su | Huan Sun | Reut Tsarfaty
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)

pdf bib
An Investigation of Language Model Interpretability via Sentence Editing
Samuel Stevens | Yu Su
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Pre-trained language models (PLMs) like BERT are being used for almost all language-related tasks, but interpreting their behavior still remains a significant challenge and many important questions remain largely unanswered. In this work, we re-purpose a sentence editing dataset, where faithful high-quality human rationales can be automatically extracted and compared with extracted model rationales, as a new testbed for interpretability. This enables us to conduct a systematic investigation on an array of questions regarding PLMs’ interpretability, including the role of pre-training procedure, comparison of rationale extraction methods, and different layers in the PLM. The investigation generates new insights, for example, contrary to the common understanding, we find that attention weights correlate well with human rationales and work better than gradient-based saliency in extracting model rationales. Both the dataset and code will be released to facilitate future interpretability research.

pdf bib
ITNLP at SemEval-2021 Task 11: Boosting BERT with Sampling and Adversarial Training for Knowledge Extraction
Genyu Zhang | Yu Su | Changhong He | Lei Lin | Chengjie Sun | Lili Shan
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper describes the winning system in the End-to-end Pipeline phase for the NLPContributionGraph task. The system is composed of three BERT-based models and the three models are used to extract sentences, entities and triples respectively. Experiments show that sampling and adversarial training can greatly boost the system. In End-to-end Pipeline phase, our system got an average F1 of 0.4703, significantly higher than the second-placed system which got an average F1 of 0.3828.

pdf bib
Compositional Generalization for Neural Semantic Parsing via Span-level Supervised Attention
Pengcheng Yin | Hao Fang | Graham Neubig | Adam Pauls | Emmanouil Antonios Platanios | Yu Su | Sam Thomson | Jacob Andreas
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We describe a span-level supervised attention loss that improves compositional generalization in semantic parsers. Our approach builds on existing losses that encourage attention maps in neural sequence-to-sequence models to imitate the output of classical word alignment algorithms. Where past work has used word-level alignments, we focus on spans; borrowing ideas from phrase-based machine translation, we align subtrees in semantic parses to spans of input sentences, and encourage neural attention mechanisms to mimic these alignments. This method improves the performance of transformers, RNNs, and structured decoders on three benchmarks of compositional generalization.

pdf bib
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop
Esin Durmus | Vivek Gupta | Nelson Liu | Nanyun Peng | Yu Su
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib
A Systematic Investigation of KB-Text Embedding Alignment at Scale
Vardaan Pahuja | Yu Gu | Wenhu Chen | Mehdi Bahrami | Lei Liu | Wei-Peng Chen | Yu Su
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Knowledge bases (KBs) and text often contain complementary knowledge: KBs store structured knowledge that can support long range reasoning, while text stores more comprehensive and timely knowledge in an unstructured way. Separately embedding the individual knowledge sources into vector spaces has demonstrated tremendous successes in encoding the respective knowledge, but how to jointly embed and reason with both knowledge sources to fully leverage the complementary information is still largely an open problem. We conduct a large-scale, systematic investigation of aligning KB and text embeddings for joint reasoning. We set up a novel evaluation framework with two evaluation tasks, few-shot link prediction and analogical reasoning, and evaluate an array of KB-text embedding alignment methods. We also demonstrate how such alignment can infuse textual information into KB embeddings for more accurate link prediction on emerging entities and events, using COVID-19 as a case study.

2020

pdf bib
Task-Oriented Dialogue as Dataflow Synthesis
Jacob Andreas | John Bufe | David Burkett | Charles Chen | Josh Clausman | Jean Crawford | Kate Crim | Jordan DeLoach | Leah Dorner | Jason Eisner | Hao Fang | Alan Guo | David Hall | Kristin Hayes | Kellie Hill | Diana Ho | Wendy Iwaszuk | Smriti Jha | Dan Klein | Jayant Krishnamurthy | Theo Lanman | Percy Liang | Christopher H. Lin | Ilya Lintsbakh | Andy McGovern | Aleksandr Nisnevich | Adam Pauls | Dmitrij Petters | Brent Read | Dan Roth | Subhro Roy | Jesse Rusak | Beth Short | Div Slomin | Ben Snyder | Stephon Striplin | Yu Su | Zachary Tellman | Sam Thomson | Andrei Vorobev | Izabela Witoszko | Jason Wolfe | Abby Wray | Yuchen Zhang | Alexander Zotov
Transactions of the Association for Computational Linguistics, Volume 8

We describe an approach to task-oriented dialogue in which dialogue state is represented as a dataflow graph. A dialogue agent maps each user utterance to a program that extends this graph. Programs include metacomputation operators for reference and revision that reuse dataflow fragments from previous turns. Our graph-based state enables the expression and manipulation of complex user intents, and explicit metacomputation makes these intents easier for learned models to predict. We introduce a new dataset, SMCalFlow, featuring complex dialogues about events, weather, places, and people. Experiments show that dataflow graphs and metacomputation substantially improve representability and predictability in these natural dialogues. Additional experiments on the MultiWOZ dataset show that our dataflow representation enables an otherwise off-the-shelf sequence-to-sequence model to match the best existing task-specific state tracking model. The SMCalFlow dataset, code for replicating experiments, and a public leaderboard are available at https://www.microsoft.com/en-us/research/project/dataflow-based-dialogue-semantic-machines.

pdf bib
Logical Natural Language Generation from Open-Domain Tables
Wenhu Chen | Jianshu Chen | Yu Su | Zhiyu Chen | William Yang Wang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural natural language generation (NLG) models have recently shown remarkable progress in fluency and coherence. However, existing studies on neural NLG are primarily focused on surface-level realizations with limited emphasis on logical inference, an important aspect of human thinking and language. In this paper, we suggest a new NLG task where a model is tasked with generating natural language statements that can be logically entailed by the facts in an open-domain semi-structured table. To facilitate the study of the proposed logical NLG problem, we use the existing TabFact dataset~(CITATION) featured with a wide range of logical/symbolic inferences as our testbed, and propose new automatic metrics to evaluate the fidelity of generation models w.r.t. logical inference. The new task poses challenges to the existing monotonic generation frameworks due to the mismatch between sequence order and logical order. In our experiments, we comprehensively survey different generation architectures (LSTM, Transformer, Pre-Trained LM) trained with different algorithms (RL, Adversarial Training, Coarse-to-Fine) on the dataset and made following observations: 1) Pre-Trained LM can significantly boost both the fluency and logical fidelity metrics, 2) RL and Adversarial Training are trading fluency for fidelity, 3) Coarse-to-Fine generation can help partially alleviate the fidelity issue while maintaining high language fluency. The code and data are available at https://github.com/wenhuchen/LogicNLG.

pdf bib
Document Classification for COVID-19 Literature
Bernal Jiménez Gutiérrez | Juncheng Zeng | Dongdong Zhang | Ping Zhang | Yu Su
Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020

The global pandemic has made it more important than ever to quickly and accurately retrieve relevant scientific literature for effective consumption by researchers in a wide range of fields. We provide an analysis of several multi-label document classification models on the LitCovid dataset. We find that pre-trained language models outperform other models in both low and high data regimes, achieving a maximum F1 score of around 86%. We note that even the highest performing models still struggle with label correlation, distraction from introductory text and CORD-19 generalization. Both data and code are available on GitHub.

pdf bib
An Imitation Game for Learning Semantic Parsers from User Interaction
Ziyu Yao | Yiqi Tang | Wen-tau Yih | Huan Sun | Yu Su
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Despite the widely successful applications, bootstrapping and fine-tuning semantic parsers are still a tedious process with challenges such as costly data annotation and privacy risks. In this paper, we suggest an alternative, human-in-the-loop methodology for learning semantic parsers directly from users. A semantic parser should be introspective of its uncertainties and prompt for user demonstrations when uncertain. In doing so it also gets to imitate the user behavior and continue improving itself autonomously with the hope that eventually it may become as good as the user in interpreting their questions. To combat the sparsity of demonstrations, we propose a novel annotation-efficient imitation learning algorithm, which iteratively collects new datasets by mixing demonstrated states and confident predictions and retrains the semantic parser in a Dataset Aggregation fashion (Ross et al., 2011). We provide a theoretical analysis of its cost bound and also empirically demonstrate its promising performance on the text-to-SQL problem. Code will be available at https://github.com/sunlab-osu/MISP.

pdf bib
KGPT: Knowledge-Grounded Pre-Training for Data-to-Text Generation
Wenhu Chen | Yu Su | Xifeng Yan | William Yang Wang
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Data-to-text generation has recently attracted substantial interests due to its wide applications. Existing methods have shown impressive performance on an array of tasks. However, they rely on a significant amount of labeled data for each task, which is costly to acquire and thus limits their application to new tasks and domains. In this paper, we propose to leverage pre-training and transfer learning to address this issue. We propose a knowledge-grounded pre-training (KGPT), which consists of two parts, 1) a general knowledge-grounded generation model to generate knowledge-enriched text. 2) a pre-training paradigm on a massive knowledge-grounded text corpus crawled from the web. The pre-trained model can be fine-tuned on various data-to-text generation tasks to generate task-specific text. We adopt three settings, namely fully-supervised, zero-shot, few-shot to evaluate its effectiveness. Under the fully-supervised setting, our model can achieve remarkable gains over the known baselines. Under zero-shot setting, our model without seeing any examples achieves over 30 ROUGE-L on WebNLG while all other baselines fail. Under the few-shot setting, our model only needs about one-fifteenth as many labeled examples to achieve the same level of performance as baseline models. These experiments consistently prove the strong generalization ability of our proposed framework.

pdf bib
Document Classification for COVID-19 Literature
Bernal Jimenez Gutierrez | Jucheng Zeng | Dongdong Zhang | Ping Zhang | Yu Su
Findings of the Association for Computational Linguistics: EMNLP 2020

The global pandemic has made it more important than ever to quickly and accurately retrieve relevant scientific literature for effective consumption by researchers in a wide range of fields. We provide an analysis of several multi-label document classification models on the LitCovid dataset, a growing collection of 23,000 research papers regarding the novel 2019 coronavirus. We find that pre-trained language models fine-tuned on this dataset outperform all other baselines and that BioBERT surpasses the others by a small margin with micro-F1 and accuracy scores of around 86% and 75% respectively on the test set. We evaluate the data efficiency and generalizability of these models as essential features of any system prepared to deal with an urgent situation like the current health crisis. We perform a data ablation study to determine how important article titles are for achieving reasonable performance on this dataset. Finally, we explore 50 errors made by the best performing models on LitCovid documents and find that they often (1) correlate certain labels too closely together and (2) fail to focus on discriminative sections of the articles; both of which are important issues to address in future work. Both data and code are available on GitHub.

pdf bib
Proceedings of the First Workshop on Natural Language Interfaces
Ahmed Hassan Awadallah | Yu Su | Huan Sun | Scott Wen-tau Yih
Proceedings of the First Workshop on Natural Language Interfaces

2019

pdf bib
Model-based Interactive Semantic Parsing: A Unified Framework and A Text-to-SQL Case Study
Ziyu Yao | Yu Su | Huan Sun | Wen-tau Yih
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

As a promising paradigm, interactive semantic parsing has shown to improve both semantic parsing accuracy and user confidence in the results. In this paper, we propose a new, unified formulation of the interactive semantic parsing problem, where the goal is to design a model-based intelligent agent. The agent maintains its own state as the current predicted semantic parse, decides whether and where human intervention is needed, and generates a clarification question in natural language. A key part of the agent is a world model: it takes a percept (either an initial question or subsequent feedback from the user) and transitions to a new state. We then propose a simple yet remarkably effective instantiation of our framework, demonstrated on two text-to-SQL datasets (WikiSQL and Spider) with different state-of-the-art base semantic parsers. Compared to an existing interactive semantic parsing approach that treats the base parser as a black box, our approach solicits less user feedback but yields higher run-time accuracy.

pdf bib
Global Textual Relation Embedding for Relational Understanding
Zhiyu Chen | Hanwen Zha | Honglei Liu | Wenhu Chen | Xifeng Yan | Yu Su
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Pre-trained embeddings such as word embeddings and sentence embeddings are fundamental tools facilitating a wide range of downstream NLP tasks. In this work, we investigate how to learn a general-purpose embedding of textual relations, defined as the shortest dependency path between entities. Textual relation embedding provides a level of knowledge between word/phrase level and sentence level, and we show that it can facilitate downstream tasks requiring relational understanding of the text. To learn such an embedding, we create the largest distant supervision dataset by linking the entire English ClueWeb09 corpus to Freebase. We use global co-occurrence statistics between textual and knowledge base relations as the supervision signal to train the embedding. Evaluation on two relational understanding tasks demonstrates the usefulness of the learned textual relation embedding. The data and code can be found at https://github.com/czyssrs/GloREPlus

pdf bib
How Large a Vocabulary Does Text Classification Need? A Variational Approach to Vocabulary Selection
Wenhu Chen | Yu Su | Yilin Shen | Zhiyu Chen | Xifeng Yan | William Yang Wang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

With the rapid development in deep learning, deep neural networks have been widely adopted in many real-life natural language applications. Under deep neural networks, a pre-defined vocabulary is required to vectorize text inputs. The canonical approach to select pre-defined vocabulary is based on the word frequency, where a threshold is selected to cut off the long tail distribution. However, we observed that such a simple approach could easily lead to under-sized vocabulary or over-sized vocabulary issues. Therefore, we are interested in understanding how the end-task classification accuracy is related to the vocabulary size and what is the minimum required vocabulary size to achieve a specific performance. In this paper, we provide a more sophisticated variational vocabulary dropout (VVD) based on variational dropout to perform vocabulary selection, which can intelligently select the subset of the vocabulary to achieve the required performance. To evaluate different algorithms on the newly proposed vocabulary selection problem, we propose two new metrics: Area Under Accuracy-Vocab Curve and Vocab Size under X% Accuracy Drop. Through extensive experiments on various NLP classification tasks, our variational framework is shown to significantly outperform the frequency-based and other selection baselines on these metrics.

2018

pdf bib
XL-NBT: A Cross-lingual Neural Belief Tracking Framework
Wenhu Chen | Jianshu Chen | Yu Su | Xin Wang | Dong Yu | Xifeng Yan | William Yang Wang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Task-oriented dialog systems are becoming pervasive, and many companies heavily rely on them to complement human agents for customer service in call centers. With globalization, the need for providing cross-lingual customer support becomes more urgent than ever. However, cross-lingual support poses great challenges—it requires a large amount of additional annotated data from native speakers. In order to bypass the expensive human annotation and achieve the first step towards the ultimate goal of building a universal dialog system, we set out to build a cross-lingual state tracking framework. Specifically, we assume that there exists a source language with dialog belief tracking annotations while the target languages have no annotated dialog data of any form. Then, we pre-train a state tracker for the source language as a teacher, which is able to exploit easy-to-access parallel data. We then distill and transfer its own knowledge to the student state tracker in target languages. We specifically discuss two types of common parallel resources: bilingual corpus and bilingual dictionary, and design different transfer learning strategies accordingly. Experimentally, we successfully use English state tracker as the teacher to transfer its knowledge to both Italian and German trackers and achieve promising results.

pdf bib
What It Takes to Achieve 100% Condition Accuracy on WikiSQL
Semih Yavuz | Izzeddin Gur | Yu Su | Xifeng Yan
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

WikiSQL is a newly released dataset for studying the natural language sequence to SQL translation problem. The SQL queries in WikiSQL are simple: Each involves one relation and does not have any join operation. Despite of its simplicity, none of the publicly reported structured query generation models can achieve an accuracy beyond 62%, which is still far from enough for practical use. In this paper, we ask two questions, “Why is the accuracy still low for such simple queries?” and “What does it take to achieve 100% accuracy on WikiSQL?” To limit the scope of our study, we focus on the WHERE clause in SQL. The answers will help us gain insights about the directions we should explore in order to further improve the translation accuracy. We will then investigate alternative solutions to realize the potential ceiling performance on WikiSQL. Our proposed solution can reach up to 88.6% condition accuracy on the WikiSQL dataset.

pdf bib
DialSQL: Dialogue Based Structured Query Generation
Izzeddin Gur | Semih Yavuz | Yu Su | Xifeng Yan
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The recent advance in deep learning and semantic parsing has significantly improved the translation accuracy of natural language questions to structured queries. However, further improvement of the existing approaches turns out to be quite challenging. Rather than solely relying on algorithmic innovations, in this work, we introduce DialSQL, a dialogue-based structured query generation framework that leverages human intelligence to boost the performance of existing algorithms via user interaction. DialSQL is capable of identifying potential errors in a generated SQL query and asking users for validation via simple multi-choice questions. User feedback is then leveraged to revise the query. We design a generic simulator to bootstrap synthetic training dialogues and evaluate the performance of DialSQL on the WikiSQL dataset. Using SQLNet as a black box query generation tool, DialSQL improves its performance from 61.3% to 69.0% using only 2.4 validation questions per dialogue.

pdf bib
Global Relation Embedding for Relation Extraction
Yu Su | Honglei Liu | Semih Yavuz | Izzeddin Gür | Huan Sun | Xifeng Yan
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We study the problem of textual relation embedding with distant supervision. To combat the wrong labeling problem of distant supervision, we propose to embed textual relations with global statistics of relations, i.e., the co-occurrence statistics of textual and knowledge base relations collected from the entire corpus. This approach turns out to be more robust to the training noise introduced by distant supervision. On a popular relation extraction dataset, we show that the learned textual relation embedding can be used to augment existing relation extraction models and significantly improve their performance. Most remarkably, for the top 1,000 relational facts discovered by the best existing model, the precision can be improved from 83.9% to 89.3%.

2017

pdf bib
Recovering Question Answering Errors via Query Revision
Semih Yavuz | Izzeddin Gur | Yu Su | Xifeng Yan
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The existing factoid QA systems often lack a post-inspection component that can help models recover from their own mistakes. In this work, we propose to crosscheck the corresponding KB relations behind the predicted answers and identify potential inconsistencies. Instead of developing a new model that accepts evidences collected from these relations, we choose to plug them back to the original questions directly and check if the revised question makes sense or not. A bidirectional LSTM is applied to encode revised questions. We develop a scoring mechanism over the revised question encodings to refine the predictions of a base QA system. This approach can improve the F1 score of STAGG (Yih et al., 2015), one of the leading QA systems, from 52.5% to 53.9% on WEBQUESTIONS data.

pdf bib
Cross-domain Semantic Parsing via Paraphrasing
Yu Su | Xifeng Yan
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Existing studies on semantic parsing mainly focus on the in-domain setting. We formulate cross-domain semantic parsing as a domain adaptation problem: train a semantic parser on some source domains and then adapt it to the target domain. Due to the diversity of logical forms in different domains, this problem presents unique and intriguing challenges. By converting logical forms into canonical utterances in natural language, we reduce semantic parsing to paraphrasing, and develop an attentive sequence-to-sequence paraphrase model that is general and flexible to adapt to different domains. We discover two problems, small micro variance and large macro variance, of pre-trained word embeddings that hinder their direct use in neural networks, and propose standardization techniques as a remedy. On the popular Overnight dataset, which contains eight domains, we show that both cross-domain training and standardized pre-trained word embeddings can bring significant improvement.

pdf bib
An End-to-End Deep Framework for Answer Triggering with a Novel Group-Level Objective
Jie Zhao | Yu Su | Ziyu Guan | Huan Sun
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Given a question and a set of answer candidates, answer triggering determines whether the candidate set contains any correct answers. If yes, it then outputs a correct one. In contrast to existing pipeline methods which first consider individual candidate answers separately and then make a prediction based on a threshold, we propose an end-to-end deep neural network framework, which is trained by a novel group-level objective function that directly optimizes the answer triggering performance. Our objective function penalizes three potential types of error and allows training the framework in an end-to-end manner. Experimental results on the WikiQA benchmark show that our framework outperforms the state of the arts by a 6.6% absolute gain under F1 measure.

2016

pdf bib
Improving Semantic Parsing via Answer Type Inference
Semih Yavuz | Izzeddin Gur | Yu Su | Mudhakar Srivatsa | Xifeng Yan
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
On Generating Characteristic-rich Question Sets for QA Evaluation
Yu Su | Huan Sun | Brian Sadler | Mudhakar Srivatsa | Izzeddin Gür | Zenghui Yan | Xifeng Yan
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing