Angelica Chen


2022

BBQ: A hand-built bias benchmark for question answering
Alicia Parrish | Angelica Chen | Nikita Nangia | Vishakh Padmakumar | Jason Phang | Jana Thompson | Phu Mon Htut | Samuel Bowman
Findings of the Association for Computational Linguistics: ACL 2022

It is well documented that NLP models learn social biases, but little work has been done on how these biases manifest in model outputs for applied tasks like question answering (QA). We introduce the Bias Benchmark for QA (BBQ), a dataset of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine social dimensions relevant for U.S. English-speaking contexts. Our task evaluates model responses at two levels: (i) given an under-informative context, we test how strongly responses reflect social biases, and (ii) given an adequately informative context, we test whether the model’s biases override a correct answer choice. We find that models often rely on stereotypes when the context is under-informative, meaning the model’s outputs consistently reproduce harmful biases in this setting. Though models are more accurate when the context provides an informative answer, they still rely on stereotypes and average up to 3.4 percentage points higher accuracy when the correct answer aligns with a social bias than when it conflicts, with this difference widening to over 5 points on examples targeting gender for most models tested.

Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions
Alicia Parrish | Harsh Trivedi | Ethan Perez | Angelica Chen | Nikita Nangia | Jason Phang | Samuel Bowman
Proceedings of the First Workshop on Learning with Natural Language Supervision

Current QA systems can generate reasonable-sounding yet false answers without explanation or evidence for the generated answer, which is especially problematic when humans cannot readily check the model’s answers. This presents a challenge for building trust in machine learning systems. We take inspiration from real-world situations where difficult questions are answered by considering opposing sides (see Irving et al., 2018). For multiple-choice QA examples, we build a dataset of single arguments for both a correct and an incorrect answer option in a debate-style set-up as an initial step in training models to produce explanations for two candidate answers. We use long contexts: humans familiar with the context write convincing explanations for pre-selected correct and incorrect answers, and we test if those explanations allow humans who have not read the full context to more accurately determine the correct answer. We do not find that explanations in our set-up improve human accuracy, but a baseline condition shows that providing human-selected text snippets does improve accuracy. We use these findings to suggest ways of improving the debate set-up for future data collection efforts.

2019

Generating Logical Forms from Graph Representations of Text and Entities
Peter Shaw | Philip Massey | Angelica Chen | Francesco Piccinno | Yasemin Altun
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Structured information about entities is critical for many semantic parsing tasks. We present an approach that uses a Graph Neural Network (GNN) architecture to incorporate information about relevant entities and their relations during parsing. Combined with a decoder copy mechanism, this approach provides a conceptually simple mechanism to generate logical forms with entities. We demonstrate that this approach is competitive with the state of the art across several tasks without pre-training, and outperforms existing approaches when combined with BERT pre-training.