David Reitter - ACL Anthology

David Reitter

2024

Investigating Content Planning for Navigating Trade-offs in Knowledge-Grounded Dialogue
Kushal Chawla | Hannah Rashkin | Gaurav Singh Tomar | David Reitter
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Knowledge-grounded dialogue generation is a challenging task because it requires satisfying two fundamental, yet often competing constraints: being responsive in a manner that is specific to what the conversation partner has said while also being attributable to an underlying source document. In this work, we bring this trade-off between these two objectives (specificity and attribution) to light, and ask the question: Can explicit content planning before the response generation help the model to address this challenge? To answer this question, we design a framework called PLEDGE, which allows us to experiment with various plan variables explored in prior work supporting both metric-agnostic and metric-aware approaches. While content planning shows promise, our results on whether it can actually help to navigate this trade-off are mixed – planning mechanisms that are metric-aware (use automatic metrics during training) are better at automatic evaluations but underperform in human judgment compared to metric-agnostic mechanisms. We discuss how this may be caused by over-fitting to automatic metrics, and the need for future work to better calibrate these metrics towards human judgment. We hope the observations from our analysis will inform future work that aims to apply content planning in this context.

2023

Measuring Attribution in Natural Language Generation Models
Hannah Rashkin | Vitaly Nikolaev | Matthew Lamm | Lora Aroyo | Michael Collins | Dipanjan Das | Slav Petrov | Gaurav Singh Tomar | Iulia Turc | David Reitter
Computational Linguistics, Volume 49, Issue 4 - December 2023

Large neural models have brought a new challenge to natural language generation (NLG): It has become imperative to ensure the safety and reliability of the output of models that generate freely. To this end, we present an evaluation framework, Attributable to Identified Sources (AIS), stipulating that NLG output pertaining to the external world is to be verified against an independent, provided source. We define AIS and a two-stage annotation pipeline for allowing annotators to evaluate model output according to annotation guidelines. We successfully validate this approach on generation datasets spanning three tasks (two conversational QA datasets, a summarization dataset, and a table-to-text dataset). We provide full annotation guidelines in the appendices and publicly release the annotated data at https://github.com/google-research-datasets/AIS.

Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)
Jing Jiang | David Reitter | Shumin Deng
Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL)

How do decoding algorithms distribute information in dialogue responses?
Saranya Venkatraman | He He | David Reitter
Findings of the Association for Computational Linguistics: EACL 2023

Humans tend to follow the Uniform Information Density (UID) principle by distributing information evenly in utterances. We study if decoding algorithms implicitly follow this UID principle, and under what conditions adherence to UID might be desirable for dialogue generation. We generate responses using different decoding algorithms with GPT-2 on the Persona-Chat dataset and collect human judgments on their quality using Amazon Mechanical Turk. We find that (i) surprisingly, model-generated responses follow the UID principle to a greater extent than human responses, and (ii) decoding algorithms that promote UID do not generate higher-quality responses. Instead, when we control for surprisal, non-uniformity of information density correlates with the quality of responses with very low/high surprisal. Our findings indicate that encouraging non-uniform responses is a potential solution to the “likelihood trap” problem (quality degradation in very high-likelihood text). Our dataset containing multiple candidate responses per dialog history along with human-annotated quality ratings is available at: https://huggingface.co/datasets/saranya132/dialog_uid_gpt2.

2022

Dungeons and Dragons as a Dialog Challenge for Artificial Intelligence
Chris Callison-Burch | Gaurav Singh Tomar | Lara J. Martin | Daphne Ippolito | Suma Bailis | David Reitter
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

AI researchers have posited Dungeons and Dragons (D&D) as a challenge problem to test systems on various language-related capabilities. In this paper, we frame D&D specifically as a dialogue system challenge, where the tasks are to both generate the next conversational turn in the game and predict the state of the game given the dialogue history. We create a gameplay dataset consisting of nearly 900 games, with a total of 7,000 players, 800,000 dialogue turns, 500,000 dice rolls, and 58 million words. We automatically annotate the data with partial state information about the game play. We train a large language model (LM) to generate the next game turn, conditioning it on different information. The LM can respond as a particular character or as the player who runs the game—i.e., the Dungeon Master (DM). It is trained to produce dialogue that is either in-character (roleplaying in the fictional world) or out-of-character (discussing rules or strategy). We perform a human evaluation to determine what factors make the generated output plausible and interesting. We further perform an automatic evaluation to determine how well the model can predict the game state given the history and examine how well tracking the game state improves its ability to produce plausible conversational output.

CONQRR: Conversational Query Rewriting for Retrieval with Reinforcement Learning
Zeqiu Wu | Yi Luan | Hannah Rashkin | David Reitter | Hannaneh Hajishirzi | Mari Ostendorf | Gaurav Singh Tomar
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Compared to standard retrieval tasks, passage retrieval for conversational question answering (CQA) poses new challenges in understanding the current user question, as each question needs to be interpreted within the dialogue context. Moreover, it can be expensive to re-train well-established retrievers such as search engines that are originally developed for non-conversational queries. To facilitate their use, we develop a query rewriting model CONQRR that rewrites a conversational question in the context into a standalone question. It is trained with a novel reward function to directly optimize towards retrieval using reinforcement learning and can be adapted to any off-the-shelf retriever. CONQRR achieves state-of-the-art results on a recent open-domain CQA dataset containing conversations from three different sources, and is effective for two different off-the-shelf retrievers. Our extensive analysis also shows the robustness of CONQRR to out-of-domain dialogues as well as to zero query rewriting supervision.

Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark
Nouha Dziri | Hannah Rashkin | Tal Linzen | David Reitter
Transactions of the Association for Computational Linguistics, Volume 10

Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (BEGIN), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models’ responses can be attributed to the given background information. We then use BEGIN to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make BEGIN publicly available at https://github.com/google/BEGIN-dataset.

2021

Increasing Faithfulness in Knowledge-Grounded Dialogue with Controllable Features
Hannah Rashkin | David Reitter | Gaurav Singh Tomar | Dipanjan Das
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Knowledge-grounded dialogue systems are intended to convey information that is based on evidence provided in a given source text. We discuss the challenges of training a generative neural dialogue model for such systems that is controlled to stay faithful to the evidence. Existing datasets contain a mix of conversational responses that are faithful to selected evidence as well as more subjective or chit-chat style responses. We propose different evaluation measures to disentangle these different styles of responses by quantifying the informativeness and objectivity. At training time, additional inputs based on these evaluation measures are given to the dialogue model. At generation time, these additional inputs act as stylistic controls that encourage the model to generate responses that are faithful to the provided evidence. We also investigate the usage of additional controls at decoding time using resampling techniques. In addition to automatic metrics, we perform a human evaluation study where raters judge the output of these controlled generation models to be generally more objective and faithful to the evidence compared to baseline dialogue systems.

Are BERTs Sensitive to Native Interference in L2 Production?
Zixin Tang | Prasenjit Mitra | David Reitter
Proceedings of the Second Workshop on Insights from Negative Results in NLP

Using the essay portion of the International Corpus Network of Asian Learners of English (ICNALE) and the TOEFL11 corpus, we fine-tuned neural language models based on BERT to predict English learners’ native languages. Results showed that neural models can learn to represent and detect such native language impacts, but multilingually trained models have no advantage in doing so.

2020

Surprisal Predicts Code-Switching in Chinese-English Bilingual Text
Jesús Calvillo | Le Fang | Jeremy Cole | David Reitter
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Why do bilinguals switch languages within a sentence? The present observational study asks whether word surprisal and word entropy predict code-switching in bilingual written conversation. We describe and model a new dataset of Chinese-English text with 1476 clean code-switched sentences, translated back into Chinese. The model includes known control variables together with word surprisal and word entropy. We found that word surprisal, but not entropy, is a significant predictor that explains code-switching above and beyond other well-known predictors. We also found sentence length to be a significant predictor, which has been related to sentence complexity. We propose high cognitive effort as a reason for code-switching, as it leaves fewer resources for inhibition of the alternative language. We also corroborate previous findings, but this time using a computational model of surprisal, a new language pair, and doing so for written language.

2019

Fusion of Detected Objects in Text for Visual Question Answering
Chris Alberti | Jeffrey Ling | Michael Collins | David Reitter
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The “Bounding Boxes in Text Transformer” (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark, achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard (as of May 22, 2019). A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture. A reference implementation of our models is provided.

Like a Baby: Visually Situated Neural Language Acquisition
Alexander Ororbia | Ankur Mali | Matthew Kelly | David Reitter
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We examine the benefits of visual context in training neural language models to perform next-word prediction. A multi-modal neural architecture is introduced that outperforms its equivalent trained on language alone with a 2% decrease in perplexity, even when no visual context is available at test. Fine-tuning the embeddings of a pre-trained state-of-the-art bidirectional language model (BERT) in the language modeling framework yields a 3.5% improvement. The advantage for training with visual context when testing without is robust across different languages (English, German and Spanish) and different models (GRU, LSTM, Delta-RNN, as well as those that use BERT embeddings). Thus, language models perform better when they learn like a baby, i.e., in a multi-modal environment. This finding is compatible with the theory of situated cognition: language is inseparable from its physical context.

Treat the Word As a Whole or Look Inside? Subword Embeddings Model Language Change and Typology
Yang Xu | Jiasheng Zhang | David Reitter
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

We use a variant of the word embedding model that incorporates subword information to characterize the degree of compositionality in lexical semantics. Our models reveal some interesting yet contrastive patterns of long-term change in multiple languages: Indo-European languages put more weight on subword units in newer words, while conversely Chinese puts less weight on the subwords, but more weight on the word as a whole. Our method provides novel evidence and methodology that enriches existing theories in evolutionary linguistics. The resulting word vectors also have decent performance in NLP-related tasks.

2018

The Timing of Lexical Memory Retrievals in Language Production
Jeremy Cole | David Reitter
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

This paper explores the time course of lexical memory retrieval by modeling fluent language production. The duration of retrievals is predicted using the ACT-R cognitive architecture. In a large-scale observational study of a spoken corpus, we find that language production at a time point preceding a word is sped up or slowed down depending on activation of that word. This computational analysis has consequences for the theoretical model of language production. The results point to interference between lexical and phonological stages as well as a quantifiable buffer for lexical information that opens up the possibility of non-sequential retrievals.

Not that much power: Linguistic alignment is influenced more by low-level linguistic features rather than social power
Yang Xu | Jeremy Cole | David Reitter
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Linguistic alignment between dialogue partners has been claimed to be affected by their relative social power. A common finding has been that interlocutors of higher power tend to receive more alignment than those of lower power. However, these studies overlook some low-level linguistic features that can also affect alignment, which casts doubt on these findings. This work characterizes the effect of power on alignment with logistic regression models in two datasets, finding that the effect vanishes or is reversed after controlling for low-level features such as utterance length. Thus, linguistic alignment is explained better by low-level features than by social power. We argue that a wider range of factors, especially cognitive factors, needs to be taken into account in future studies on observational data when social factors of language use are in question.

2017

Event Ordering with a Generalized Model for Sieve Prediction Ranking
Bill McDowell | Nathanael Chambers | Alexander Ororbia II | David Reitter
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

This paper improves on several aspects of a sieve-based event ordering architecture, CAEVO (Chambers et al., 2014), which creates globally consistent temporal relations between events and time expressions. First, we examine the usage of word embeddings and semantic role features. With the incorporation of these new features, we demonstrate a 5% relative F1 gain over our replicated version of CAEVO. Second, we reformulate the architecture’s sieve-based inference algorithm as a prediction reranking method that approximately optimizes a scoring function computed using classifier precisions. Within this prediction reranking framework, we propose an alternative scoring function, showing an 8.8% relative gain over the original CAEVO. We further include an in-depth analysis of one of the main datasets that is used to evaluate temporal classifiers, and we show how despite using the densest corpus, there is still a danger of overfitting. While this paper focuses on temporal ordering, its results are applicable to other areas that use sieve-based architectures.

Spectral Analysis of Information Density in Dialogue Predicts Collaborative Task Performance
Yang Xu | David Reitter
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We propose a perspective on dialogue that focuses on relative information contributions of conversation partners as a key to successful communication. We predict the success of collaborative tasks in English and Danish corpora of task-oriented dialogue. Two features are extracted from the frequency domain representations of the lexical entropy series of each interlocutor: power spectrum overlap (PSO) and relative phase (RP). We find that PSO is a negative predictor of task success, while RP is a positive one. An SVM with these features significantly improved on previous task success prediction models. Our findings suggest that the strategic distribution of information density between interlocutors is relevant to task success.

2016

Entropy Converges Between Dialogue Participants: Explanations from an Information-Theoretic Perspective
Yang Xu | David Reitter
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Convergence of Syntactic Complexity in Conversation
Yang Xu | David Reitter
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

Learning a Deep Hybrid Model for Semi-Supervised Text Classification
Alexander Ororbia II | C. Lee Giles | David Reitter
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Pragmatic Alignment on Social Support Type in Health Forum Conversations
Yafei Wang | John Yen | David Reitter
Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics

An Evaluation and Comparison of Linguistic Alignment Measures
Yang Xu | David Reitter
Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics

2014

A Model to Qualify the Linguistic Adaptation Phenomenon in Online Conversation Threads: Analyzing Priming Effect in Online Health Community
Yafei Wang | David Reitter | John Yen
Proceedings of the Fifth Workshop on Cognitive Modeling and Computational Linguistics

2012

Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012)
David Reitter | Roger Levy
Proceedings of the 3rd Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2012)

2011

Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics
Frank Keller | David Reitter
Proceedings of the 2nd Workshop on Cognitive Modeling and Computational Linguistics

2010

Did Social Networks Shape Language Evolution? A Multi-Agent Cognitive Simulation
David Reitter | Christian Lebiere
Proceedings of the 2010 Workshop on Cognitive Modeling and Computational Linguistics

2007

Predicting Success in Dialogue
David Reitter | Johanna D. Moore
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Computational Modelling of Structural Priming in Dialogue
David Reitter | Frank Keller | Johanna D. Moore
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

Dimensionality Reduction Aids Term Co-Occurrence Based Multi-Document Summarization
Ben Hachey | Gabriel Murray | David Reitter
Proceedings of the Workshop on Task-Focused Summarization and Question Answering

Priming Effects in Combinatory Categorial Grammar
David Reitter | Julia Hockenmaier | Frank Keller
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

2004

UI on the Fly: Generating a Multimodal User Interface
David Reitter | Erin Panttaja | Fred Cummins
Proceedings of HLT-NAACL 2004: Short Papers

2003

Step by step: underspecified markup in incremental rhetorical analysis
David Reitter | Manfred Stede
Proceedings of 4th International Workshop on Linguistically Interpreted Corpora (LINC-03) at EACL 2003

2002

XML/XSL in the Dictionary: The Case of Discourse Markers
Daniela Berger | David Reitter | Manfred Stede
COLING-02: The 2nd Workshop on NLP and XML (NLPXML-2002)
