J. Edward Hu


2021

pdf bib
Iterative Paraphrastic Augmentation with Discriminative Span Alignment
Ryan Culkin | J. Edward Hu | Elias Stengel-Eskin | Guanghui Qin | Benjamin Van Durme
Transactions of the Association for Computational Linguistics, Volume 9

We introduce a novel paraphrastic augmentation strategy based on sentence-level lexically constrained paraphrasing and discriminative span alignment. Our approach allows for the large-scale expansion of existing datasets or the rapid creation of new datasets using a small, manually produced seed corpus. We demonstrate our approach with experiments on the Berkeley FrameNet Project, a large-scale language understanding effort spanning more than two decades of human labor. With four days of training data collection for a span alignment model and one day of parallel compute, we automatically generate and release to the community 495,300 unique (Frame,Trigger) pairs in diverse sentential contexts, a roughly 50-fold expansion atop FrameNet v1.7. The resulting dataset is intrinsically and extrinsically evaluated in detail, showing positive results on a downstream task.

2019

pdf bib
Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering
J. Edward Hu | Abhinav Singh | Nils Holzenberger | Matt Post | Benjamin Van Durme
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.

pdf bib
Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting
J. Edward Hu | Huda Khayrallah | Ryan Culkin | Patrick Xia | Tongfei Chen | Matt Post | Benjamin Van Durme
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Lexically-constrained sequence decoding allows for explicit positive or negative phrase-based constraints to be placed on target output strings in generation tasks such as machine translation or monolingual text rewriting. We describe vectorized dynamic beam allocation, which extends work in lexically-constrained decoding to work with batching, leading to a five-fold improvement in throughput when working with positive constraints. Faster decoding enables faster exploration of constraint strategies: we illustrate this via data augmentation experiments with a monolingual rewriter applied to the tasks of natural language inference, question answering and machine translation, showing improvements in all three.

2018

pdf bib
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation
Adam Poliak | Aparajita Haldar | Rachel Rudinger | J. Edward Hu | Ellie Pavlick | Aaron Steven White | Benjamin Van Durme
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We present a large scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation encoded by a neural network captures distinct types of reasoning. The collection results from recasting 13 existing datasets from 7 semantic phenomena into a common NLI structure, resulting in over half a million labeled context-hypothesis pairs in total. Our collection of diverse datasets is available at http://www.decomp.net/, and will grow over time as additional resources are recast and added from novel sources.

pdf bib
Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation
Adam Poliak | Aparajita Haldar | Rachel Rudinger | J. Edward Hu | Ellie Pavlick | Aaron Steven White | Benjamin Van Durme
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present a large-scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation captures distinct types of reasoning. The collection results from recasting 13 existing datasets from 7 semantic phenomena into a common NLI structure, resulting in over half a million labeled context-hypothesis pairs in total. We refer to our collection as the DNC: Diverse Natural Language Inference Collection. The DNC is available online at https://www.decomp.net, and will grow over time as additional resources are recast and added from novel sources.