Pengcheng Yin


2021

Compositional Generalization for Neural Semantic Parsing via Span-level Supervised Attention
Pengcheng Yin | Hao Fang | Graham Neubig | Adam Pauls | Emmanouil Antonios Platanios | Yu Su | Sam Thomson | Jacob Andreas
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We describe a span-level supervised attention loss that improves compositional generalization in semantic parsers. Our approach builds on existing losses that encourage attention maps in neural sequence-to-sequence models to imitate the output of classical word alignment algorithms. Where past work has used word-level alignments, we focus on spans; borrowing ideas from phrase-based machine translation, we align subtrees in semantic parses to spans of input sentences, and encourage neural attention mechanisms to mimic these alignments. This method improves the performance of transformers, RNNs, and structured decoders on three benchmarks of compositional generalization.

2020

Incorporating External Knowledge through Pre-training for Natural Language to Code Generation
Frank F. Xu | Zhengbao Jiang | Pengcheng Yin | Bogdan Vasilescu | Graham Neubig
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.

TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data
Pengcheng Yin | Graham Neubig | Wen-tau Yih | Sebastian Riedel
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent years have witnessed the burgeoning of pretrained language models (LMs) for text-based natural language (NL) understanding tasks. Such models are typically trained on free-form NL text, hence may not be suitable for tasks like semantic parsing over structured data, which require reasoning over both free-form NL questions and structured tabular data (e.g., database tables). In this paper we present TaBERT, a pretrained LM that jointly learns representations for NL sentences and (semi-)structured tables. TaBERT is trained on a large corpus of 26 million tables and their English contexts. In experiments, neural semantic parsers using TaBERT as feature representation layers achieve new best results on the challenging weakly-supervised semantic parsing benchmark WikiTableQuestions, while performing competitively on the text-to-SQL dataset Spider.

Proceedings of the First Workshop on Interactive and Executable Semantic Parsing
Ben Bogin | Srinivasan Iyer | Victoria Lin | Dragomir Radev | Alane Suhr | Panupong | Caiming Xiong | Pengcheng Yin | Tao Yu | Rui Zhang | Victor Zhong
Proceedings of the First Workshop on Interactive and Executable Semantic Parsing

2019

Reranking for Neural Semantic Parsing
Pengcheng Yin | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Semantic parsing considers the task of transducing natural language (NL) utterances into machine executable meaning representations (MRs). While neural network-based semantic parsers have achieved impressive improvements over previous methods, results are still far from perfect, and cursory manual inspection can easily identify obvious problems such as lack of adequacy or coherence of the generated MRs. This paper presents a simple approach to quickly iterate and improve the performance of an existing neural semantic parser by reranking an n-best list of predicted MRs, using features that are designed to fix observed problems with baseline models. We implement our reranker in a competitive neural semantic parser and test on four semantic parsing (GEO, ATIS) and Python code generation (Django, CoNaLa) tasks, improving the strong baseline parser by up to 5.7% absolute in BLEU (CoNaLa) and 2.9% in accuracy (Django), outperforming the best published neural parser results on all four datasets.
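The reranking idea in this abstract can be illustrated with a minimal sketch: score each candidate in the n-best list with the base parser's log-probability plus a weighted sum of feature values, then sort. Everything named here (`rerank`, `keyword_coverage`, the weights, and the toy candidates) is a hypothetical assumption for illustration, not the paper's actual features or implementation.

```python
from typing import Callable, Dict, List, Tuple

# A candidate is (meaning representation, base parser log-probability).
Candidate = Tuple[str, float]
# A feature maps (utterance, meaning representation) to a real value.
Feature = Callable[[str, str], float]


def keyword_coverage(utterance: str, mr: str) -> float:
    """Hypothetical adequacy feature: fraction of utterance tokens
    that occur as substrings of the candidate MR."""
    tokens = utterance.lower().split()
    if not tokens:
        return 0.0
    mr_lower = mr.lower()
    return sum(tok in mr_lower for tok in tokens) / len(tokens)


def rerank(
    utterance: str,
    candidates: List[Candidate],
    features: Dict[str, Feature],
    weights: Dict[str, float],
    base_weight: float = 1.0,
) -> List[Candidate]:
    """Sort an n-best list by a linear combination of the base
    log-probability and the weighted feature values (best first)."""

    def score(candidate: Candidate) -> float:
        mr, log_prob = candidate
        total = base_weight * log_prob
        for name, feature in features.items():
            total += weights.get(name, 0.0) * feature(utterance, mr)
        return total

    return sorted(candidates, key=score, reverse=True)


# Toy n-best list: the base parser slightly prefers the inadequate candidate.
utterance = "sort my_list in reverse order"
n_best = [
    ("sorted(my_list)", -0.9),
    ("sorted(my_list, reverse=True)", -1.1),
]
reranked = rerank(
    utterance,
    n_best,
    features={"coverage": keyword_coverage},
    weights={"coverage": 2.0},  # assumed weight; in practice it would be learned
)
print(reranked[0][0])
```

In this toy case the coverage feature promotes the candidate that mentions "reverse", even though the base parser scored it lower. In a real system the weights would be tuned on held-out data rather than set by hand.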

Improving Open Information Extraction via Iterative Rank-Aware Learning
Zhengbao Jiang | Pengcheng Yin | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Open information extraction (IE) is the task of extracting open-domain assertions from natural language sentences. A key step in open IE is confidence modeling, ranking the extractions based on their estimated quality to adjust precision and recall of extracted assertions. We found that the extraction likelihood, a confidence measure used by current supervised open IE systems, is not well calibrated when comparing the quality of assertions extracted from different sentences. We propose an additional binary classification loss to calibrate the likelihood to make it more globally comparable, and an iterative learning process, where extractions generated by the open IE model are incrementally included as training samples to help the model learn from trial and error. Experiments on OIE2016 demonstrate the effectiveness of our method. Code and data are available at https://github.com/jzbjyb/oie_rank.

2018

StructVAE: Tree-structured Latent Variable Models for Semi-supervised Semantic Parsing
Pengcheng Yin | Chunting Zhou | Junxian He | Graham Neubig
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Semantic parsing is the task of transducing natural language (NL) utterances into formal meaning representations (MRs), commonly represented as tree structures. Annotating NL utterances with their corresponding MRs is expensive and time-consuming, and thus the limited availability of labeled data often becomes the bottleneck of data-driven, supervised models. We introduce StructVAE, a variational auto-encoding model for semi-supervised semantic parsing, which learns both from limited amounts of parallel data, and readily-available unlabeled NL utterances. StructVAE models latent MRs not observed in the unlabeled data as tree-structured latent variables. Experiments on semantic parsing on the ATIS domain and Python code generation show that with extra unlabeled data, StructVAE outperforms strong supervised models.

Retrieval-Based Neural Code Generation
Shirley Anugrah Hayati | Raphael Olivier | Pravalika Avvaru | Pengcheng Yin | Anthony Tomasic | Graham Neubig
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In models to generate program source code from natural language, representing this code in a tree structure has been a common approach. However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures. We introduce RECODE, a method based on subtree retrieval that makes it possible to explicitly reference existing code examples within a neural code generation model. First, we retrieve sentences that are similar to input sentences using a dynamic-programming-based sentence similarity scoring method. Next, we extract n-grams of action sequences that build the associated abstract syntax tree. Finally, we increase the probability of actions that cause the retrieved n-gram action subtree to be in the predicted code. We show that our approach improves the performance on two code generation tasks by up to +2.6 BLEU.

A Tree-based Decoder for Neural Machine Translation
Xinyi Wang | Hieu Pham | Pengcheng Yin | Graham Neubig
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent advances in Neural Machine Translation (NMT) show that adding syntactic information to NMT systems can improve the quality of their translations. Most existing work utilizes some specific types of linguistically-inspired tree structures, like constituency and dependency parse trees. This is often done via a standard RNN decoder that operates on a linearized target tree structure. However, it remains an open question which linguistic formalism, if any, is the best structural representation for NMT. In this paper, we (1) propose an NMT model that can naturally generate the topology of an arbitrary tree structure on the target side, and (2) experiment with various target tree structures. Our experiments show the surprising result that our model delivers the best improvements with balanced binary trees constructed without any linguistic knowledge; this model outperforms standard seq2seq models by up to 2.1 BLEU points, and other methods for incorporating target-side syntax by up to 0.7 BLEU.

TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation
Pengcheng Yin | Graham Neubig
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present TRANX, a transition-based neural semantic parser that maps natural language (NL) utterances into formal meaning representations (MRs). TRANX uses a transition system based on the abstract syntax description language for the target MR, which gives it two major advantages: (1) it is highly accurate, using information from the syntax of the target MR to constrain the output space and model the information flow, and (2) it is highly generalizable, and can easily be applied to new types of MR by just writing a new abstract syntax description corresponding to the allowable structures in the MR. Experiments on four different semantic parsing and code generation tasks show that our system is generalizable, extensible, and effective, registering strong results compared to existing neural semantic parsers.

2017

A Syntactic Neural Model for General-Purpose Code Generation
Pengcheng Yin | Graham Neubig
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that well outperform previous code generation and semantic parsing approaches.

2016

Neural Enquirer: Learning to Query Tables in Natural Language
Pengcheng Yin | Zhengdong Lu | Hang Li | Ben Kao
Proceedings of the Workshop on Human-Computer Question Answering