Jordan Boyd-Graber


2021

Fool Me Twice: Entailment from Wikipedia Gamification
Julian Eisenschlos | Bhuwan Dhingra | Jannis Bulian | Benjamin Börschinger | Jordan Boyd-Graber
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We release FoolMeTwice (FM2 for short), a large dataset of challenging entailment pairs collected through a fun multi-player game. Gamification encourages adversarial examples, drastically lowering the number of examples that can be solved using “shortcuts” compared to other popular entailment datasets. Players are presented with two tasks. The first task asks the player to write a plausible claim based on the evidence from a Wikipedia page. The second one shows two plausible claims written by other players, one of which is false, and the goal is to identify it before the time runs out. Players “pay” to see clues retrieved from the evidence pool: the more evidence the player needs, the harder the claim. Game-play between motivated players leads to diverse strategies for crafting claims, such as temporal inference and diverting to unrelated evidence, and results in higher quality data for the entailment and evidence retrieval tasks. We open source the dataset and the game code.

Multi-Step Reasoning Over Unstructured Text with Beam Dense Retrieval
Chen Zhao | Chenyan Xiong | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Complex question answering often requires finding a reasoning chain that consists of multiple evidence pieces. Current approaches incorporate the strengths of structured knowledge and unstructured text, assuming text corpora are semi-structured. Building on dense retrieval methods, we propose a new multi-step retrieval approach (BeamDR) that iteratively forms an evidence chain through beam search in dense representations. When evaluated on multi-hop question answering, BeamDR is competitive with state-of-the-art systems, without using any semi-structured information. Through query composition in dense space, BeamDR captures the implicit relationships between evidence in the reasoning chain. The code is available at https://github.com/henryzhao5852/BeamDR.
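
A minimal sketch of the beam search described above. The functions encode(text) (a dense dual encoder) and search(vector, k) (a nearest-neighbor index, e.g., FAISS) are hypothetical stand-ins; the released code is the authoritative implementation.

    def beam_dense_retrieval(question, encode, search, hops=2, beam=5):
        # Each chain is (list of evidence passages, cumulative score).
        chains = [([], 0.0)]
        for _ in range(hops):
            grown = []
            for chain, score in chains:
                # Query composition: re-encode the question together
                # with the evidence retrieved so far.
                q = encode(" ".join([question] + chain))
                for passage, s in search(q, beam):
                    grown.append((chain + [passage], score + s))
            # Keep only the top-scoring partial chains (the beam).
            chains = sorted(grown, key=lambda c: c[1], reverse=True)[:beam]
        return chains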

Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?
Pedro Rodriguez | Joe Barrow | Alexander Miserlis Hoyle | John P. Lalor | Robin Jia | Jordan Boyd-Graber
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Leaderboards are widely used in NLP and push the field forward. While leaderboards are a straightforward ranking of NLP models, this simplicity can mask nuances in evaluation items (examples) and subjects (NLP models). Rather than replace leaderboards, we advocate a re-imagining so that they better highlight if and where progress is made. Building on educational testing, we create a Bayesian leaderboard model where latent subject skill and latent item difficulty predict correct responses. Using this model, we analyze the ranking reliability of leaderboards. Afterwards, we show the model can guide what to annotate, identify annotation errors, detect overfitting, and identify informative examples. We conclude with recommendations for future benchmark tasks.
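
The response model at the heart of such a leaderboard comes from item response theory. Below is a maximum-likelihood numpy sketch of the simplest (Rasch) variant, where P(subject j answers item i correctly) = sigmoid(skill_j - difficulty_i); the paper's actual model is Bayesian, with priors over the latent variables, so treat this only as the response function it builds on.

    import numpy as np

    def fit_rasch(R, n_iter=500, lr=0.1):
        # R[i, j] = 1 if subject (model) j answers item (example) i
        # correctly, else 0.
        n_items, n_subjects = R.shape
        b = np.zeros(n_items)     # item difficulty
        s = np.zeros(n_subjects)  # subject skill
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(b[:, None] - s[None, :]))
            err = R - p           # gradient of the Bernoulli log-likelihood
            s += lr * err.sum(axis=0) / n_items
            b -= lr * err.sum(axis=1) / n_subjects
        return b, s               # large b = hard item, large s = strong model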

2020

Which Evaluations Uncover Sense Representations that Actually Make Sense?
Jordan Boyd-Graber | Fenfei Guo | Leah Findlater | Mohit Iyyer
Proceedings of the 12th Language Resources and Evaluation Conference

Text representations are critical for modern natural language processing. One form of text representation, sense-specific embeddings, reflects a word’s sense in a sentence better than single-prototype word embeddings tied to each type. However, existing sense representations are not uniformly better: although they work well for computer-centric evaluations, they fail for human-centric tasks like inspecting a language’s sense inventory. To expose this discrepancy, we propose a new coherence evaluation for sense embeddings. We also describe a minimal model (Gumbel Attention for Sense Induction) optimized for discovering interpretable sense representations that are more coherent than existing sense embeddings.

An Attentive Recurrent Model for Incremental Prediction of Sentence-final Verbs
Wenyan Li | Alvin Grissom II | Jordan Boyd-Graber
Findings of the Association for Computational Linguistics: EMNLP 2020

Verb prediction is important for understanding human processing of verb-final languages, with practical applications to real-time simultaneous interpretation from verb-final to verb-medial languages. While previous approaches use classical statistical models, we introduce an attention-based neural model to incrementally predict final verbs in incomplete Japanese and German SOV sentences. To offer flexibility to the model, we further incorporate synonym awareness. Our approach both better predicts the final verbs in Japanese and German and provides more interpretable explanations of why those verbs are selected.

On the Potential of Lexico-logical Alignments for Semantic Parsing to SQL Queries
Tianze Shi | Chen Zhao | Jordan Boyd-Graber | Hal Daumé III | Lillian Lee
Findings of the Association for Computational Linguistics: EMNLP 2020

Large-scale semantic parsing datasets annotated with logical forms have enabled major advances in supervised approaches. But can richer supervision help even more? To explore the utility of fine-grained, lexical-level supervision, we introduce SQUALL, a dataset that enriches 11,276 WIKITABLEQUESTIONS English-language questions with manually created SQL equivalents plus alignments between SQL and question fragments. Our annotation enables new training possibilities for encoder-decoder models, including approaches from machine translation previously precluded by the absence of alignments. We propose and test two methods: (1) supervised attention; (2) adopting an auxiliary objective of disambiguating references in the input queries to table columns. In 5-fold cross-validation, these strategies improve over strong baselines by 4.4% execution accuracy. Oracle experiments suggest that annotated alignments can support further accuracy gains of up to 23.9%.

Why Overfitting Isn’t Always Bad: Retrofitting Cross-Lingual Word Embeddings to Dictionaries
Mozhi Zhang | Yoshinari Fujinuma | Michael J. Paul | Jordan Boyd-Graber
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Cross-lingual word embeddings (CLWE) are often evaluated on bilingual lexicon induction (BLI). Recent CLWE methods use linear projections, which underfit the training dictionary, to generalize on BLI. However, underfitting can hinder generalization to other downstream tasks that rely on words from the training dictionary. We address this limitation by retrofitting CLWE to the training dictionary, which pulls training translation pairs closer in the embedding space and overfits the training dictionary. This simple post-processing step often improves accuracy on two downstream tasks, despite lowering BLI test accuracy. We also retrofit to both the training dictionary and a synthetic dictionary induced from CLWE, which sometimes generalizes even better on downstream tasks. Our results confirm the importance of fully exploiting the training dictionary in downstream tasks and explain why BLI is a flawed CLWE evaluation.
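
A sketch of the post-processing step, in the retrofitting style of Faruqui et al. (2015) that it builds on: each word is pulled toward its training-dictionary translations while staying near its original vector. The weights and iteration count here are illustrative assumptions.

    import numpy as np

    def retrofit_to_dictionary(E, pairs, n_iter=10, alpha=1.0, beta=1.0):
        # E: (vocab, dim) CLWE matrix covering both languages;
        # pairs: (i, j) row-index pairs from the training dictionary.
        Y = E.copy()
        neighbors = {}
        for i, j in pairs:
            neighbors.setdefault(i, []).append(j)
            neighbors.setdefault(j, []).append(i)
        for _ in range(n_iter):
            for i, js in neighbors.items():
                # Stay close to the original vector, move toward translations.
                Y[i] = (alpha * E[i] + beta * Y[js].sum(axis=0)) \
                       / (alpha + beta * len(js))
        return Y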

It Takes Two to Lie: One to Lie, and One to Listen
Denis Peskov | Benny Cheng | Ahmed Elgohary | Joe Barrow | Cristian Danescu-Niculescu-Mizil | Jordan Boyd-Graber
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Trust is implicit in many online text conversations—striking up new friendships, or asking for tech support. But trust can be betrayed through deception. We study the language and dynamics of deception in the negotiation-based game Diplomacy, where seven players compete for world domination by forging and breaking alliances with each other. Our study with players from the Diplomacy community gathers 17,289 messages annotated by the sender for their intended truthfulness and by the receiver for their perceived truthfulness. Unlike existing datasets, this captures deception in long-lasting relationships, where the interlocutors strategically combine truth with lies to advance objectives. A model that uses power dynamics and conversational contexts can predict when a lie occurs nearly as well as human players.

What Question Answering can Learn from Trivia Nerds
Jordan Boyd-Graber | Benjamin Börschinger
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In addition to the traditional task of machines answering questions, question answering (QA) research creates interesting, challenging questions that help systems learn how to answer questions and reveal the best systems. We argue that creating a QA dataset—and the ubiquitous leaderboard that goes with it—closely resembles running a trivia tournament: you write questions, have agents (either humans or machines) answer the questions, and declare a winner. However, the research community has ignored the hard-learned lessons from decades of the trivia community creating vibrant, fair, and effective question answering competitions. After detailing problems with existing QA datasets, we outline the key lessons—removing ambiguity, discriminating skill, and adjudicating disputes—that can transfer to QA research and how they might be implemented.

Interactive Refinement of Cross-Lingual Word Embeddings
Michelle Yuan | Mozhi Zhang | Benjamin Van Durme | Leah Findlater | Jordan Boyd-Graber
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Cross-lingual word embeddings transfer knowledge between languages: models trained on high-resource languages can predict in low-resource languages. We introduce CLIME, an interactive system to quickly refine cross-lingual word embeddings for a given classification problem. First, CLIME ranks words by their salience to the downstream task. Then, users mark similarity between keywords and their nearest neighbors in the embedding space. Finally, CLIME updates the embeddings using the annotations. We evaluate CLIME on identifying health-related text in four low-resource languages: Ilocano, Sinhalese, Tigrinya, and Uyghur. Embeddings refined by CLIME capture more nuanced word semantics and have higher test accuracy than the original embeddings. CLIME often improves accuracy faster than an active learning baseline and can be easily combined with active learning to improve results.

Cold-start Active Learning through Self-supervised Language Modeling
Michelle Yuan | Hsuan-Tien Lin | Jordan Boyd-Graber
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Active learning strives to reduce annotation costs by choosing the most critical examples to label. Typically, the active learning strategy is contingent on the classification model. For instance, uncertainty sampling depends on poorly calibrated model confidence scores. In the cold-start setting, active learning is impractical because of model instability and data scarcity. Fortunately, modern NLP provides an additional source of information: pre-trained language models. The pre-training loss can find examples that surprise the model and should be labeled for efficient fine-tuning. Therefore, we treat the language modeling loss as a proxy for classification uncertainty. With BERT, we develop a simple strategy based on the masked language modeling loss that minimizes labeling costs for text classification. Compared to other baselines, our approach reaches higher accuracy with fewer sampling iterations and less computation time.
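
A minimal sketch of the loss-as-uncertainty proxy, assuming bert-base-uncased and the Hugging Face transformers API; the paper's full cold-start strategy builds on this signal but is not reproduced here.

    import torch
    from transformers import AutoModelForMaskedLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

    def surprisal(text, mask_prob=0.15):
        # Higher masked-LM loss = the pre-trained model is more
        # "surprised" by this example, so label it earlier.
        enc = tok(text, return_tensors="pt", truncation=True)
        labels = enc["input_ids"].clone()
        mask = torch.rand(labels.shape) < mask_prob
        mask &= labels != tok.cls_token_id
        mask &= labels != tok.sep_token_id
        if not mask.any():
            mask[0, 1] = True            # mask at least one position
        inputs = labels.clone()
        inputs[mask] = tok.mask_token_id
        labels[~mask] = -100             # score only the masked positions
        with torch.no_grad():
            out = mlm(input_ids=inputs,
                      attention_mask=enc["attention_mask"],
                      labels=labels)
        return out.loss.item()

    # Rank the unlabeled pool by surprisal; annotate the top examples first.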

2019

A Multilingual Topic Model for Learning Weighted Topic Links Across Corpora with Low Comparability
Weiwei Yang | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Multilingual topic models (MTMs) learn topics on documents in multiple languages. Past models align topics across languages by implicitly assuming the documents in different languages are highly comparable, often a false assumption. We introduce a new model that does not rely on this assumption, particularly useful in important low-resource language scenarios. Our MTM learns weighted topic links and connects cross-lingual topics only when the dominant words defining them are similar, outperforming LDA and previous MTMs in classification tasks using documents’ topic posteriors as features. It also learns coherent topics on documents with low comparability.

Can You Unpack That? Learning to Rewrite Questions-in-Context
Ahmed Elgohary | Denis Peskov | Jordan Boyd-Graber
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Question answering is an AI-complete problem, but existing datasets lack key elements of language understanding such as coreference and ellipsis resolution. We consider sequential question answering: multiple questions are asked one-by-one in a conversation between a questioner and an answerer. Answering these questions is only possible through understanding the conversation history. We introduce the task of question-in-context rewriting: given the context of a conversation’s history, rewrite a context-dependent question into a self-contained question with the same answer. We construct CANARD, a dataset of 40,527 questions based on QuAC (Choi et al., 2018), and train Seq2Seq models that incorporate context into standalone questions.

How Pre-trained Word Representations Capture Commonsense Physical Comparisons
Pranav Goel | Shi Feng | Jordan Boyd-Graber
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Understanding common sense is important for effective natural language reasoning. One type of common sense is how two objects compare on physical properties such as size and weight: e.g., ‘is a house bigger than a person?’. We probe whether pre-trained representations capture comparisons and find they, in fact, have higher accuracy than previous approaches. They also generalize to comparisons involving objects not seen during training. We investigate how such comparisons are made: models learn a consistent ordering over all the objects in the comparisons. Probing models have significantly higher accuracy than baseline models that exploit dataset artifacts, e.g., memorizing that some words are larger than any other word.

Automatic Evaluation of Local Topic Quality
Jeffrey Lund | Piper Armstrong | Wilson Fearn | Stephen Cowley | Courtni Byun | Jordan Boyd-Graber | Kevin Seppi
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Topic models are typically evaluated with respect to the global topic distributions that they generate, using metrics such as coherence, but without regard to local (token-level) topic assignments. Token-level assignments are important for downstream tasks such as classification. Even recent models, which aim to improve the quality of these token-level topic assignments, have been evaluated only with respect to global metrics. We propose a task designed to elicit human judgments of token-level topic assignments. We use a variety of topic model types and parameters and discover that global metrics agree poorly with human assignments. Since human evaluation is expensive, we propose a variety of automated metrics to evaluate topic models at a local level. Finally, we correlate our proposed metrics with human judgments from the task on several datasets. We show that an evaluation based on the percent of topic switches correlates most strongly with human judgment of local topic quality. We suggest that this new metric, which we call consistency, be adopted alongside global metrics such as topic coherence when evaluating new topic models.
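
One plausible reading of the consistency metric, sketched as the fraction of adjacent tokens whose topic assignment switches; the paper's exact formulation and corpus-level aggregation may differ.

    def switch_percent(token_topics):
        # token_topics: the topic id assigned to each token of a
        # document, in order. A "switch" is an adjacent pair of tokens
        # with different topics; fewer switches = more locally consistent.
        pairs = list(zip(token_topics, token_topics[1:]))
        if not pairs:
            return 0.0
        return sum(a != b for a, b in pairs) / len(pairs)

    print(switch_percent([3, 3, 3, 7, 7, 3]))  # 2 switches / 5 pairs = 0.4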

Are Girls Neko or Shōjo? Cross-Lingual Alignment of Non-Isomorphic Embeddings with Iterative Normalization
Mozhi Zhang | Keyulu Xu | Ken-ichi Kawarabayashi | Stefanie Jegelka | Jordan Boyd-Graber
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Cross-lingual word embeddings (CLWE) underlie many multilingual natural language processing systems, often through orthogonal transformations of pre-trained monolingual embeddings. However, orthogonal mapping only works on language pairs whose embeddings are naturally isomorphic. For non-isomorphic pairs, our method (Iterative Normalization) transforms monolingual embeddings to make orthogonal alignment easier by simultaneously enforcing that (1) individual word vectors are unit length, and (2) each language’s average vector is zero. Iterative Normalization consistently improves word translation accuracy of three CLWE methods, with the largest improvement observed on English-Japanese (from 2% to 44% test accuracy).
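
A minimal numpy sketch of the two constraints named above, alternated until both (approximately) hold; each language's embeddings are transformed this way before learning the orthogonal map (e.g., by Procrustes).

    import numpy as np

    def iterative_normalization(X, n_iter=5):
        # X: (vocab, dim) monolingual embeddings for one language.
        X = X.astype(float).copy()
        for _ in range(n_iter):
            # (1) every word vector has unit length
            X /= np.linalg.norm(X, axis=1, keepdims=True)
            # (2) the language's mean vector is zero
            X -= X.mean(axis=0)
        # finish with a length normalization so vectors lie on the unit sphere
        return X / np.linalg.norm(X, axis=1, keepdims=True)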

A Resource-Free Evaluation Metric for Cross-Lingual Word Embeddings Based on Graph Modularity
Yoshinari Fujinuma | Jordan Boyd-Graber | Michael J. Paul
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Cross-lingual word embeddings encode the meaning of words from different languages into a shared low-dimensional space. An important requirement for many downstream tasks is that word similarity should be independent of language—i.e., word vectors within one language should not be more similar to each other than to words in another language. We measure this characteristic using modularity, a network measure of the strength of clusters in a graph. Modularity has a moderate to strong correlation with three downstream tasks, even though modularity is based only on the structure of embeddings and does not require any external resources. We show through experiments that modularity can serve as an intrinsic validation metric to improve unsupervised cross-lingual word embeddings, particularly on distant language pairs in low-resource settings.
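
A sketch of the measurement under one concrete graph construction, a k-nearest-neighbor graph over the shared embedding space with languages as the communities; the paper's graph-building details are not reproduced here.

    import numpy as np

    def knn_modularity(X, langs, k=3):
        # X: (n, dim) embeddings from the shared space; langs: the
        # language label of each row. High modularity = words cluster
        # by language, which correlates with poor downstream transfer.
        X = X / np.linalg.norm(X, axis=1, keepdims=True)
        sim = X @ X.T
        np.fill_diagonal(sim, -np.inf)           # no self edges
        A = np.zeros(sim.shape)
        for i in range(len(X)):
            A[i, np.argsort(sim[i])[-k:]] = 1.0  # connect k nearest neighbors
        A = np.maximum(A, A.T)                   # symmetrize
        m = A.sum() / 2.0
        deg = A.sum(axis=1)
        same = np.asarray(langs)[:, None] == np.asarray(langs)[None, :]
        Q = ((A - np.outer(deg, deg) / (2 * m)) * same).sum() / (2 * m)
        return float(Q)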

Misleading Failures of Partial-input Baselines
Shi Feng | Eric Wallace | Jordan Boyd-Graber
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent work establishes dataset difficulty and removes annotation artifacts via partial-input baselines (e.g., hypothesis-only model for SNLI or question-only model for VQA). A successful partial-input baseline indicates that the dataset is cheatable. But the converse is not necessarily true: failures of partial-input baselines do not mean the dataset is free of artifacts. We first design artificial datasets to illustrate how the trivial patterns that are only visible in the full input can evade any partial-input baseline. Next, we identify such artifacts in the SNLI dataset—a hypothesis-only model augmented with trivial patterns in the premise can solve 15% of previously-thought “hard” examples. Our work provides a caveat for the use and creation of partial-input baselines for datasets.
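
A hypothesis-only baseline of the kind described above, sketched with scikit-learn (an assumed toolchain). The paper's caveat: accuracy well above the majority class signals artifacts, but accuracy at chance does not certify the dataset artifact-free.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def hypothesis_only_baseline(train_hypotheses, train_labels):
        # A classifier that never sees the premise: any signal it finds
        # must come from annotation artifacts in the hypotheses.
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                            LogisticRegression(max_iter=1000))
        return clf.fit(train_hypotheses, train_labels)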

Why Didn’t You Listen to Me? Comparing User Control of Human-in-the-Loop Topic Models
Varun Kumar | Alison Smith-Renner | Leah Findlater | Kevin Seppi | Jordan Boyd-Graber
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

To address the lack of comparative evaluation of Human-in-the-Loop Topic Modeling (HLTM) systems, we implement and evaluate three contrasting HLTM modeling approaches using simulation experiments. These approaches extend previously proposed frameworks, including constraints and informed prior-based methods. Users should have a sense of control in HLTM systems, so we propose a control metric to measure whether refinement operations’ results match users’ expectations. Informed prior-based methods provide better control than constraints, but constraints yield higher quality topics.

Trick Me If You Can: Human-in-the-Loop Generation of Adversarial Examples for Question Answering
Eric Wallace | Pedro Rodriguez | Shi Feng | Ikuya Yamada | Jordan Boyd-Graber
Transactions of the Association for Computational Linguistics, Volume 7

Adversarial evaluation stress-tests a model’s understanding of natural language. Because past approaches expose superficial patterns, the resulting adversarial examples are limited in complexity and diversity. We propose human-in-the-loop adversarial generation, where human authors are guided to break models. We aid the authors with interpretations of model predictions through an interactive user interface. We apply this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions. The resulting questions are validated via live human–computer matches: Although the questions appear ordinary to humans, they systematically stump neural and information retrieval models. The adversarial questions cover diverse phenomena from multi-hop reasoning to entity type distractors, exposing open challenges in robust question answering.

2018

Learning from Measurements in Crowdsourcing Models: Inferring Ground Truth from Diverse Annotation Types
Paul Felt | Eric Ringger | Jordan Boyd-Graber | Kevin Seppi
Proceedings of the 27th International Conference on Computational Linguistics

Annotated corpora enable supervised machine learning and data analysis. To reduce the cost of manual annotation, tasks are often assigned to internet workers whose judgments are reconciled by crowdsourcing models. We approach the problem of crowdsourcing using a framework for learning from rich prior knowledge, and we identify a family of crowdsourcing models with the novel ability to combine annotations with differing structures: e.g., document labels and word labels. Annotator judgments are given in the form of the predicted expected value of measurement functions computed over annotations and the data, unifying annotation models. Our model, a specific instance of this framework, compares favorably with previous work. Furthermore, it enables active sample selection, jointly selecting annotator, data item, and annotation structure to reduce annotation effort.

Lessons from the Bible on Modern Topics: Low-Resource Multilingual Topic Model Evaluation
Shudong Hao | Jordan Boyd-Graber | Michael J. Paul
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Multilingual topic models enable document analysis across languages through coherent multilingual summaries of the data. However, there is no standard and effective metric to evaluate the quality of multilingual topics. We introduce a new intrinsic evaluation of multilingual topic models that correlates well with human judgments of multilingual topic coherence as well as performance in downstream applications. Importantly, we also study evaluation for low-resource languages. Because standard metrics fail to accurately measure topic quality when robust external resources are unavailable, we propose an adaptation model that improves the accuracy and reliability of these metrics in low-resource settings.

Learning to Color from Language
Varun Manjunatha | Mohit Iyyer | Jordan Boyd-Graber | Larry Davis
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Automatic colorization is the process of adding color to greyscale images. We condition this process on language, allowing end users to manipulate a colorized image by feeding in different captions. We present two different architectures for language-conditioned colorization, both of which produce more accurate and plausible colorizations than a language-agnostic version. Furthermore, we demonstrate through crowdsourced experiments that we can dramatically alter colorizations simply by manipulating descriptive color words in captions.

Automatic Estimation of Simultaneous Interpreter Performance
Craig Stewart | Nikolai Vogler | Junjie Hu | Jordan Boyd-Graber | Graham Neubig
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Simultaneous interpretation, translation of the spoken word in real-time, is both highly challenging and physically demanding. Methods to predict interpreter confidence and the adequacy of the interpreted message have a number of potential applications, such as in computer-assisted interpretation interfaces or pedagogical tools. We propose the task of predicting simultaneous interpreter performance by building on existing methodology for quality estimation (QE) of machine translation output. In experiments over five settings in three language pairs, we extend a QE pipeline to estimate interpreter performance (as approximated by the METEOR evaluation metric) and propose novel features reflecting interpretation strategy and evaluation measures that further improve prediction accuracy.

Trick Me If You Can: Adversarial Writing of Trivia Challenge Questions
Eric Wallace | Jordan Boyd-Graber
Proceedings of ACL 2018, Student Research Workshop

Modern question answering systems have been touted as approaching human performance. However, existing question answering datasets are imperfect tests. Questions are written with humans in mind, not computers, and often do not properly expose model limitations. To address this, we develop an adversarial writing setting, where humans interact with trained models and try to break them. This annotation process yields a challenge set, which despite being easy for trivia players to answer, systematically stumps automated question answering systems. Diagnosing model errors on the evaluation data provides actionable insights to explore in developing robust and generalizable question answering systems.

Interpreting Neural Networks with Nearest Neighbors
Eric Wallace | Shi Feng | Jordan Boyd-Graber
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.
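
A sketch of the conformity score that stands in for softmax confidence, assuming hidden representations for the training set have been precomputed; the interpretation method then uses this score for leave-one-out word importance.

    import numpy as np

    def knn_conformity(h, train_H, train_y, label, k=25):
        # Fraction of the k nearest training representations that share
        # the predicted label: a more robust uncertainty estimate than
        # the softmax probability.
        dists = np.linalg.norm(train_H - h, axis=1)
        nearest = np.argsort(dists)[:k]
        return float(np.mean(train_y[nearest] == label))

    # Importance of word i = conformity(full input)
    #                      - conformity(input with word i removed)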

A dataset and baselines for sequential open-domain question answering
Ahmed Elgohary | Chen Zhao | Jordan Boyd-Graber
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Previous work on question-answering systems mainly focuses on answering individual questions, assuming they are independent and devoid of context. Instead, we investigate sequential question answering, asking multiple related questions. We present QBLink, a new dataset of fully human-authored questions. We extend existing strong question answering frameworks to include previous questions, improving overall accuracy in open-domain question answering. The dataset is publicly available at http://sequential.qanta.org.

Pathologies of Neural Models Make Interpretations Difficult
Shi Feng | Eric Wallace | Alvin Grissom II | Mohit Iyyer | Pedro Rodriguez | Jordan Boyd-Graber
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

One way to interpret neural model predictions is to highlight the most important input features—for example, a heatmap visualization over the words in an input sentence. In existing interpretation methods for NLP, a word’s importance is determined by either input perturbation—measuring the decrease in model confidence when that word is removed—or by the gradient with respect to that word. To understand the limitations of these methods, we use input reduction, which iteratively removes the least important word from the input. This exposes pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods. As we confirm with human experiments, the reduced examples lack information to support the prediction of any label, but models still make the same predictions with high confidence. To explain these counterintuitive results, we draw connections to adversarial examples and confidence calibration: pathological behaviors reveal difficulties in interpreting neural models trained with maximum likelihood. To mitigate their deficiencies, we fine-tune the models by encouraging high entropy outputs on reduced examples. Fine-tuned models become more interpretable under input reduction, without accuracy loss on regular examples.
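
A sketch of input reduction with a hypothetical predict(tokens) -> (label, confidence) black box. For brevity it picks the word to drop by exhaustive leave-one-out rather than the gradient-based importance used in the paper.

    def input_reduction(words, predict):
        label, _ = predict(words)
        while len(words) > 1:
            best_conf, best_reduced = -1.0, None
            for i in range(len(words)):
                candidate = words[:i] + words[i + 1:]
                lab, conf = predict(candidate)
                # The least important word is the one whose removal keeps
                # the original label with the highest confidence.
                if lab == label and conf > best_conf:
                    best_conf, best_reduced = conf, candidate
            if best_reduced is None:   # every removal flips the label
                break
            words = best_reduced
        return words  # often nonsensical, yet predicted with high confidence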

2017

Tandem Anchoring: a Multiword Anchor Approach for Interactive Topic Modeling
Jeffrey Lund | Connor Cook | Kevin Seppi | Jordan Boyd-Graber
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Interactive topic models are powerful tools for those seeking to understand large collections of text. However, existing sampling-based interactive topic modeling approaches scale poorly to large data sets. Anchor methods, which use a single word to uniquely identify a topic, offer the speed needed for interactive work but lack both a mechanism to inject prior knowledge and the intuitive semantics needed for user-facing applications. We propose combinations of words as anchors, going beyond existing single word anchor algorithms—an approach we call “Tandem Anchors”. We begin with a synthetic investigation of this approach, then apply it to interactive topic modeling in a user study and compare it to interactive and non-interactive approaches. Tandem anchors are faster and more intuitive than existing interactive approaches.
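
A sketch of one combination function for multiword anchors, assuming anchors are merged by an element-wise harmonic mean of their rows of the word co-occurrence matrix; the paper compares several such functions, and this is only an illustration of the idea.

    import numpy as np

    def tandem_anchor(Q, anchor_rows):
        # Q: (vocab, vocab) word co-occurrence matrix; anchor_rows: the
        # row indices of the words forming one multiword anchor.
        R = Q[anchor_rows] + 1e-12   # avoid division by zero
        return len(anchor_rows) / np.sum(1.0 / R, axis=0)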

Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Maja Popović | Jordan Boyd-Graber
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Why ADAGRAD Fails for Online Topic Modeling
You Lu | Jeffrey Lund | Jordan Boyd-Graber
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Online topic modeling, i.e., topic modeling with stochastic variational inference, is a powerful and efficient technique for analyzing large datasets, and ADAGRAD is a widely-used technique for tuning learning rates during online gradient optimization. However, these two techniques do not work well together. We show that this is because ADAGRAD uses the accumulation of previous gradients as the learning rates’ denominators. For online topic modeling, the magnitude of gradients is very large, which causes learning rates to shrink very quickly, so the parameters cannot fully converge by the time training ends.
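
The failure mode follows directly from the ADAGRAD update, whose effective learning rate at step t is eta / sqrt(sum of squared gradients through step t). A toy numpy illustration:

    import numpy as np

    def adagrad_rates(grads, eta=1.0, eps=1e-8):
        # Effective per-step ADAGRAD learning rate.
        return eta / (np.sqrt(np.cumsum(np.square(grads))) + eps)

    print(adagrad_rates(np.full(10, 0.1))[-1])    # ~3.16: rates stay usable
    print(adagrad_rates(np.full(10, 100.0))[-1])  # ~0.003: rates collapse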

Reinforcement Learning for Bandit Neural Machine Translation with Simulated Human Feedback
Khanh Nguyen | Hal Daumé III | Jordan Boyd-Graber
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Machine translation is a natural candidate problem for reinforcement learning from human feedback: users provide quick, dirty ratings on candidate translations to guide a system to improve. Yet, current neural machine translation training focuses on expensive human-generated reference translations. We describe a reinforcement learning algorithm that improves neural machine translation systems from simulated human feedback. Our algorithm combines the advantage actor-critic algorithm (Mnih et al., 2016) with the attention-based neural encoder-decoder architecture (Luong et al., 2015). This algorithm (a) is well-designed for problems with a large action space and delayed rewards, (b) effectively optimizes traditional corpus-level machine translation metrics, and (c) is robust to skewed, high-variance, granular feedback modeled after actual human behaviors.

Adapting Topic Models using Lexical Associations with Tree Priors
Weiwei Yang | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Models work best when they are optimized taking into account the evaluation criteria that people care about. For topic models, people often care about interpretability, which can be approximated using measures of lexical association. We integrate lexical association into topic optimization using tree priors, which provide a flexible framework that can take advantage of both first order word associations and the higher-order associations captured by word embeddings. Tree priors improve topic interpretability without hurting extrinsic performance.

Evaluating Visual Representations for Topic Understanding and Their Effects on Manually Generated Topic Labels
Alison Smith | Tak Yeon Lee | Forough Poursabzi-Sangdeh | Jordan Boyd-Graber | Niklas Elmqvist | Leah Findlater
Transactions of the Association for Computational Linguistics, Volume 5

Probabilistic topic models are important tools for indexing, summarizing, and analyzing large document collections by their themes. However, promoting end-user understanding of topics remains an open research problem. We compare labels generated by users given four topic visualization techniques—word lists, word lists with bars, word clouds, and network graphs—against each other and against automatically generated labels. Our basis of comparison is participant ratings of how well labels describe documents from the topic. Our study has two phases: a labeling phase where participants label visualized topics and a validation phase where different participants select which labels best describe the topics’ documents. Although all visualizations produce similar quality labels, simple visualizations such as word lists allow participants to quickly understand topics, while complex visualizations take longer but expose multi-word expressions that simpler visualizations obscure. Automatic labels lag behind user-created labels, but our dataset of manually labeled topics highlights linguistic patterns (e.g., hypernyms, phrases) that can be used to improve automatic topic labeling algorithms.

2016

A Discriminative Topic Model using Document Network Structure
Weiwei Yang | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling
Forough Poursabzi-Sangdeh | Jordan Boyd-Graber | Leah Findlater | Kevin Seppi
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning Text Pair Similarity with Context-sensitive Autoencoders
Hadi Amiri | Philip Resnik | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the Workshop on Human-Computer Question Answering
Mohit Iyyer | He He | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the Workshop on Human-Computer Question Answering

“A Distorted Skull Lies in the Bottom Center...” Identifying Paintings from Text Descriptions
Anupam Guha | Mohit Iyyer | Jordan Boyd-Graber
Proceedings of the Workshop on Human-Computer Question Answering

Using Confusion Graphs to Understand Classifier Error
Davis Yoshida | Jordan Boyd-Graber
Proceedings of the Workshop on Human-Computer Question Answering

Incremental Prediction of Sentence-final Verbs: Humans versus Machines
Alvin Grissom II | Naho Orita | Jordan Boyd-Graber
Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning

Bayesian Supervised Domain Adaptation for Short Text Similarity
Md Arafat Sultan | Jordan Boyd-Graber | Tamara Sumner
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Interpretese vs. Translationese: The Uniqueness of Human Strategies in Simultaneous Interpretation
He He | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Feuding Families and Former Friends: Unsupervised Learning for Dynamic Fictional Relationships
Mohit Iyyer | Anupam Guha | Snigdha Chaturvedi | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Leveraging VerbNet to build Corpus-Specific Verb Clusters
Daniel Peterson | Jordan Boyd-Graber | Martha Palmer | Daisuke Kawahara
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

2015

Syntax-based Rewriting for Simultaneous Machine Translation
He He | Alvin Grissom II | John Morgan | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Birds of a Feather Linked Together: A Discriminative Topic Model using Link-based Priors
Weiwei Yang | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Efficient Methods for Incorporating Knowledge into Topic Models
Yi Yang | Doug Downey | Jordan Boyd-Graber
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Beyond LDA: Exploring Supervised Topic Modeling for Depression-Related Language in Twitter
Philip Resnik | William Armstrong | Leonardo Claudino | Thang Nguyen | Viet-An Nguyen | Jordan Boyd-Graber
Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality

Tea Party in the House: A Hierarchical Ideal Point Topic Model and Its Application to Republican Legislators in the 112th Congress
Viet-An Nguyen | Jordan Boyd-Graber | Philip Resnik | Kristina Miler
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Linguistic Harbingers of Betrayal: A Case Study on an Online Strategy Game
Vlad Niculae | Srijan Kumar | Jordan Boyd-Graber | Cristian Danescu-Niculescu-Mizil
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Deep Unordered Composition Rivals Syntactic Methods for Text Classification
Mohit Iyyer | Varun Manjunatha | Jordan Boyd-Graber | Hal Daumé III
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Making the Most of Crowdsourced Document Annotations: Confused Supervised LDA
Paul Felt | Eric Ringger | Jordan Boyd-Graber | Kevin Seppi
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

Is Your Anchor Going Up or Down? Fast and Accurate Supervised Topic Models
Thang Nguyen | Jordan Boyd-Graber | Jeffrey Lund | Kevin Seppi | Eric Ringger
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Removing the Training Wheels: A Coreference Dataset that Entertains Humans and Challenges Computers
Anupam Guha | Mohit Iyyer | Danny Bouman | Jordan Boyd-Graber
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Speeding Document Annotation with Topic Models
Forough Poursabzi-Sangdeh | Jordan Boyd-Graber
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

2014

Online Adaptor Grammars with Hybrid Inference
Ke Zhai | Jordan Boyd-Graber | Shay B. Cohen
Transactions of the Association for Computational Linguistics, Volume 2

Adaptor grammars are a flexible, powerful formalism for defining nonparametric, unsupervised models of grammar productions. This flexibility comes at the cost of expensive inference. We address the difficulty of inference through an online algorithm which uses a hybrid of Markov chain Monte Carlo and variational inference. We show that this inference strategy improves scalability without sacrificing performance on unsupervised word segmentation and topic modeling tasks.

Quantifying the role of discourse topicality in speakers’ choices of referring expressions
Naho Orita | Naomi Feldman | Jordan Boyd-Graber | Eliana Vornov
Proceedings of the Fifth Workshop on Cognitive Modeling and Computational Linguistics

Concurrent Visualization of Relationships between Words and Topics in Topic Models
Alison Smith | Jason Chuang | Yuening Hu | Jordan Boyd-Graber | Leah Findlater
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces

A Neural Network for Factoid Question Answering over Paragraphs
Mohit Iyyer | Jordan Boyd-Graber | Leonardo Claudino | Richard Socher | Hal Daumé III
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Don’t Until the Final Verb Wait: Reinforcement Learning for Simultaneous Machine Translation
Alvin Grissom II | He He | Jordan Boyd-Graber | John Morgan | Hal Daumé III
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Sometimes Average is Best: The Importance of Averaging for Prediction using MCMC Inference in Topic Modeling
Viet-An Nguyen | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Anchors Regularized: Adding Robustness and Extensibility to Scalable Topic-Modeling Algorithms
Thang Nguyen | Yuening Hu | Jordan Boyd-Graber
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Political Ideology Detection Using Recursive Neural Networks
Mohit Iyyer | Peter Enns | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Polylingual Tree-Based Topic Models for Translation Domain Adaptation
Yuening Hu | Ke Zhai | Vladimir Eidelman | Jordan Boyd-Graber
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the ACL 2014 Student Research Workshop
Ekaterina Kochmar | Annie Louis | Svitlana Volkova | Jordan Boyd-Graber | Bill Byrne
Proceedings of the ACL 2014 Student Research Workshop

2013

Argviz: Interactive Visualization of Topic Dynamics in Multi-party Conversations
Viet-An Nguyen | Yuening Hu | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2013 NAACL HLT Demonstration Session

2012

Besting the Quiz Master: Crowdsourcing Incremental Classification Games
Jordan Boyd-Graber | Brianna Satinoff | He He | Hal Daumé III
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Grammatical structures for word-level sentiment detection
Asad Sayeed | Jordan Boyd-Graber | Bryan Rusk | Amy Weinberg
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

SITS: A Hierarchical Nonparametric Model using Speaker Identity for Topic Segmentation in Multiparty Conversations
Viet-An Nguyen | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Topic Models for Dynamic Translation Model Adaptation
Vladimir Eidelman | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Efficient Tree-Based Topic Modeling
Yuening Hu | Jordan Boyd-Graber
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

Interactive Topic Modeling
Yuening Hu | Jordan Boyd-Graber | Brianna Satinoff
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

Measuring Transitivity Using Untrained Annotators
Nitin Madnani | Jordan Boyd-Graber | Philip Resnik
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

Holistic Sentiment Analysis Across Languages: Multilingual Supervised Latent Dirichlet Allocation
Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Modeling Perspective Using Adaptor Grammars
Eric Hardisty | Jordan Boyd-Graber | Philip Resnik
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

2007

PUTOP: Turning Predominant Senses into a Topic Model for Word Sense Disambiguation
Jordan Boyd-Graber | David Blei
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

A Topic Model for Word Sense Disambiguation
Jordan Boyd-Graber | David Blei | Xiaojin Zhu
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)
