Noah A. Smith - ACL Anthology

Noah A. Smith

Also published as: Noah Smith

2025

Hybrid Preferences: Learning to Route Instances for Human vs. AI Feedback
Lester James Validad Miranda | Yizhong Wang | Yanai Elazar | Sachin Kumar | Valentina Pyatkin | Faeze Brahman | Noah A. Smith | Hannaneh Hajishirzi | Pradeep Dasigi
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning from human feedback has enabled the alignment of language models (LMs) with human preferences. However, collecting human preferences is expensive and time-consuming, with highly variable annotation quality. An appealing alternative is to distill preferences from LMs as a source of synthetic annotations, offering a cost-effective and scalable alternative, albeit susceptible to other biases and errors. In this work, we introduce HyPER, a Hybrid Preference routER that defers an annotation to either humans or LMs, achieving better annotation quality while reducing the cost of human-only annotation. We formulate this as an optimization problem: given a preference dataset and an evaluation metric, we (1) train a performance prediction model (PPM) to predict a reward model’s (RM) performance on an arbitrary combination of human and LM annotations and (2) employ a routing strategy that selects a combination that maximizes predicted performance. We train the PPM on MultiPref, a new preference dataset with 10K instances paired with human and LM labels. We show that the selected hybrid mixture of synthetic and direct human preferences using HyPER achieves better RM performance compared to using either one exclusively by 7-13% on RewardBench and generalizes across unseen preference datasets and other base models. We also observe the same trend in other benchmarks using Best-of-N reranking, where the hybrid mix has 2-3% better performance. Finally, we analyze features from HyPER and find that prompts with moderate safety concerns or complexity benefit the most from human feedback.

We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.

Infini-gram mini: Exact n-gram Search at the Internet Scale with FM-Index
Hao Xu | Jiacheng Liu | Yejin Choi | Noah A. Smith | Hannaneh Hajishirzi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Language models are trained mainly on massive text data from the Internet, and it becomes increasingly important to understand this data source. Exact-match search engines enable searching in large text corpora – counting string appearances and retrieving the enclosing documents – yet the high storage overhead hinders their application on Internet-scale data. We present Infini-gram mini, an efficient and scalable system that can make petabyte-level text corpora searchable. Based on the FM-index data structure (Ferragina and Manzini, 2000), which simultaneously indexes and compresses text, our system creates indexes with size only 44% of the corpus. Infini-gram mini greatly improves upon the best existing implementation of FM-index in terms of indexing speed (18×) and memory use during both indexing (3.2× reduction) and querying (down to a negligible amount). We index 83TB of Internet text in 99 days with a single 128-core CPU node (or 19 hours if using 137 such nodes). We show one important use case of Infini-gram mini in a large-scale analysis of benchmark contamination. We find several core LM evaluation benchmarks to be heavily contaminated in Internet crawls (up to 74.2% in GSM8K), which could lead to overestimating the capabilities of language models if trained on such data. We host a benchmark contamination bulletin to share the contamination rate of many core and community-contributed benchmarks. We also release a web interface and an API endpoint to serve general search queries on Infini-gram mini indexes.

RewardBench: Evaluating Reward Models for Language Modeling
Nathan Lambert | Valentina Pyatkin | Jacob Morrison | LJ Miranda | Bill Yuchen Lin | Khyathi Chandu | Nouha Dziri | Sachin Kumar | Tom Zick | Yejin Choi | Noah A. Smith | Hannaneh Hajishirzi
Findings of the Association for Computational Linguistics: NAACL 2025

Reward models (RMs) are at the crux of successfully using RLHF to align pretrained models to human preferences, yet there has been relatively little study that focuses on evaluation of those models. Evaluating reward models presents an opportunity to understand the opaque technologies used for alignment of language models and which values are embedded in them. Resources for reward model training and understanding are sparse in the nascent open-source community around them. To enhance scientific understanding of reward models, we present RewardBench, a benchmark dataset and code-base for evaluation. The RewardBench dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety, to benchmark how reward models perform on challenging, structured and out-of-distribution queries. We create specific comparison datasets for RMs that have subtle, but verifiable reasons (e.g. bugs, incorrect facts) why one answer should be preferred to another. On the RewardBench leaderboard, we evaluate RMs trained with a variety of methods, such as the direct MLE training of classifiers and the implicit reward modeling of Direct Preference Optimization (DPO). We present many findings on propensity for refusals, reasoning limitations, and instruction following shortcomings of various reward models towards a better understanding of the RLHF process.

LlamaPIE: Proactive In-Ear Conversation Assistants
Tuochao Chen | Nicholas Scott Batchelder | Alisa Liu | Noah A. Smith | Shyamnath Gollakota
Findings of the Association for Computational Linguistics: ACL 2025

We introduce LlamaPIE, the first real-time proactive assistant designed to enhance human conversations through discreet, concise guidance delivered via hearable devices. Unlike traditional language models that require explicit user invocation, this assistant operates in the background, anticipating user needs without interrupting conversations. We address several challenges, including determining when to respond, crafting concise responses that enhance conversations, leveraging knowledge of the user for context-aware assistance, and real-time, on-device processing. To achieve this, we construct a semi-synthetic dialogue dataset and propose a two-model pipeline: a small model that decides when to respond and a larger model that generates the response. We evaluate our approach on real-world datasets, demonstrating its effectiveness in providing helpful, unobtrusive assistance. User studies with our assistant, implemented on Apple Silicon M2 hardware, show a strong preference for the proactive assistant over both a baseline with no assistance and a reactive AI assistant, highlighting the potential of LlamaPIE to enhance live conversations.

Does Liking Yellow Imply Driving a School Bus? Semantic Leakage in Language Models
Hila Gonen | Terra Blevins | Alisa Liu | Luke Zettlemoyer | Noah A. Smith
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Despite their wide adoption, the biases and unintended behaviors of language models remain poorly understood. In this paper, we identify and characterize a phenomenon never discussed before, which we call semantic leakage, where models leak irrelevant information from the prompt into the generation in unexpected ways. We propose an evaluation setting to detect semantic leakage both by humans and automatically, curate a diverse test suite for diagnosing this behavior, and measure significant semantic leakage in 13 flagship models. We also show that models exhibit semantic leakage in languages besides English and across different settings and generation scenarios. This discovery highlights yet another type of bias in language models that affects their generation patterns and behavior.

ComPO: Community Preferences for Language Model Personalization
Sachin Kumar | Chan Young Park | Yulia Tsvetkov | Noah A. Smith | Hannaneh Hajishirzi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Conventional algorithms for training language models (LMs) with human feedback rely on preferences that are assumed to account for an “average” user, disregarding subjectivity and finer-grained variations. Recent studies have raised concerns that aggregating such diverse and often contradictory human feedback to finetune models results in generic models that generate outputs not preferred by many user groups, as they tend to average out styles and norms. To address this issue, we draw inspiration from recommendation systems and propose ComPO, a method to personalize preference optimization in LMs by contextualizing the probability distribution of model outputs with the preference provider. Focusing on group-level preferences rather than individuals, we collect and release ComPRed, a question answering dataset with community-level preferences from Reddit. This dataset facilitates studying diversity in preferences without incurring privacy concerns associated with individual feedback. Our experiments reveal that conditioning language models on a community identifier (i.e., subreddit name) during preference tuning substantially enhances model performance. Conversely, replacing this context with random subreddit identifiers significantly diminishes performance, highlighting the effectiveness of our approach in tailoring responses to communities’ preferences.

Learning Syntax Without Planting Trees: Understanding Hierarchical Generalization in Transformers
Kabir Ahuja | Vidhisha Balachandran | Madhur Panwar | Tianxing He | Noah A. Smith | Navin Goyal | Yulia Tsvetkov
Transactions of the Association for Computational Linguistics, Volume 13

Transformers trained on natural language data have been shown to exhibit hierarchical generalization without explicitly encoding any structural bias. In this work, we investigate sources of inductive bias in transformer models and their training that could cause such preference for hierarchical generalization. We extensively experiment with transformers trained on five synthetic, controlled datasets using several training objectives and show that, while objectives such as sequence-to-sequence modeling, classification, etc., often fail to lead to hierarchical generalization, the language modeling objective consistently leads to transformers generalizing hierarchically. We then study how different generalization behaviors emerge during the training by conducting pruning experiments that reveal the joint existence of subnetworks within the model implementing different generalizations. Finally, we take a Bayesian perspective to understand transformers’ preference for hierarchical generalization: We establish a correlation between whether transformers generalize hierarchically on a dataset and if the simplest explanation of that dataset is provided by a hierarchical grammar compared to regular grammars exhibiting linear generalization. Overall, our work presents new insights on the origins of hierarchical generalization in transformers and provides a theoretical framework for studying generalization in language models.

2024

Time is Encoded in the Weights of Finetuned Language Models
Kai Nylund | Suchin Gururangan | Noah Smith
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present time vectors, a simple tool to customize language models to new time periods. Time vectors are created by finetuning a language model on data from a single time (e.g., a year or month), and then subtracting the weights of the original pretrained model. This vector specifies a direction in weight space that, as our experiments show, improves performance on text from that time period. Time vectors specialized to adjacent time periods appear to be positioned closer together in a manifold. Using this structure, we interpolate between time vectors to induce new models that perform better on intervening and future time periods, without any additional training. We demonstrate the consistency of our findings across different tasks, domains, model sizes, and time scales. Our results suggest that time is encoded in the weight space of finetuned models.

Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld | Iz Beltagy | Evan Walsh | Akshita Bhagia | Rodney Kinney | Oyvind Tafjord | Ananya Jha | Hamish Ivison | Ian Magnusson | Yizhong Wang | Shane Arora | David Atkinson | Russell Authur | Khyathi Chandu | Arman Cohan | Jennifer Dumas | Yanai Elazar | Yuling Gu | Jack Hessel | Tushar Khot | William Merrill | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Valentina Pyatkin | Abhilasha Ravichander | Dustin Schwenk | Saurabh Shah | William Smith | Emma Strubell | Nishant Subramani | Mitchell Wortsman | Pradeep Dasigi | Nathan Lambert | Kyle Richardson | Luke Zettlemoyer | Jesse Dodge | Kyle Lo | Luca Soldaini | Noah Smith | Hannaneh Hajishirzi
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.

Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Sachin Kumar | Vidhisha Balachandran | Chan Young Park | Weijia Shi | Shirley Anugrah Hayati | Yulia Tsvetkov | Noah Smith | Hannaneh Hajishirzi | Dongyeop Kang | David Jurgens
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
Nikita Haduong | Irene Wang | Bo-Ru Lu | Prithviraj Ammanabrolu | Noah A. Smith
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

Teams can outperform individuals; could adding AI teammates further bolster performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI), but team research literature finds that, for complex tasks, larger teams are more effective. Progress in studying collaboration with more than two agents, through textual records of team interactions, is hindered by a major data challenge: available CPS corpora are predominantly dyadic, and adapting pre-existing CPS tasks to more agents is non-trivial. We address this data challenge by developing a CPS task generator, CPS-TaskForge, that can produce environments for studying CPS under a wide array of conditions, and releasing a CPS task design checklist grounded in the theoretical PISA 2015 CPS framework to help facilitate the development of CPS corpora with more agents. CPS-TaskForge takes the form of a resource management (tower defense) game, and different CPS tasks can be studied by manipulating game design parameters. We conduct a case study with groups of 3–4 humans to validate production of diverse natural language CPS communication in a game instance produced by CPS-TaskForge. We discuss opportunities for advancing research in CPS (both with human-only and human-AI teams) using different task configurations. We release all data and code.

Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Orevaoghene Ahia | Anuoluwapo Aremu | Diana Abagyan | Hila Gonen | David Ifeoluwa Adelani | Daud Abolade | Noah A. Smith | Yulia Tsvetkov
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Yoruba—an African language with roughly 47 million speakers—encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus; YORULECT across three domains and four regional yoruba dialects. To develop this corpus, we engaged native speakers, traveling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard yoruba and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yoruba and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We will release YORULECT dataset and models publicly under an open license.

Breaking the Curse of Multilinguality with Cross-lingual Expert Language Models
Terra Blevins | Tomasz Limisiewicz | Suchin Gururangan | Margaret Li | Hila Gonen | Noah A. Smith | Luke Zettlemoyer
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Despite their popularity in non-English NLP, multilingual language models often underperform monolingual ones due to inter-language competition for model parameters. We propose Cross-lingual Expert Language Models (X-ELM), which mitigate this competition by independently training language models on subsets of the multilingual corpus. This process specializes X-ELMs to different languages while remaining effective as a multilingual ensemble. Our experiments show that when given the same compute budget, X-ELM outperforms jointly trained multilingual models across all 16 considered languages and that these gains transfer to downstream tasks. X-ELM provides additional benefits over performance improvements: new experts can be iteratively added, adapting X-ELM to new languages without catastrophic forgetting. Furthermore, training is asynchronous, reducing the hardware requirements for multilingual training and democratizing multilingual modeling.

Evaluating n-Gram Novelty of Language Models Using Rusty-DAWG
William Merrill | Noah A. Smith | Yanai Elazar
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate n-grams from their training data, evaluating both (i) the probability LMs assign to complete training n-grams and (ii) n-novelty, the proportion of n-grams generated by an LM that did not appear in the training data (for arbitrarily large n). To enable arbitrary-length n-gram search over a corpus in constant time w.r.t. corpus size, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for n > 4, LM-generated text is less novel than human-written text, though it is more novel for smaller n. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete n-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.

Set the Clock: Temporal Alignment of Pretrained Language Models
Bowen Zhao | Zander Brumbaugh | Yizhong Wang | Hannaneh Hajishirzi | Noah Smith
Findings of the Association for Computational Linguistics: ACL 2024

Language models (LMs) are trained on web text originating from many points in time and, in general, without any explicit temporal grounding. This work investigates the temporal chaos of pretrained LMs and explores various methods to align their internal knowledge to a target time, which we call “temporal alignment.” To do this, we first automatically construct a dataset containing 20K time-sensitive questions and their answers for each year from 2000 to 2023. Based on this dataset, we empirically show that pretrained LMs (e.g., LLaMa2), despite having a recent pretraining cutoff (e.g., 2022), mostly answer questions using earlier knowledge (e.g., in 2019). We then develop several methods, from prompting to finetuning, to align LMs to use their most recent knowledge when answering questions, and investigate various factors in this alignment. Our experiments demonstrate that aligning LLaMa2 to the year 2022 can enhance its performance by up to 62% according to that year’s answers. This improvement occurs even without explicitly mentioning time information, indicating the possibility of aligning models’ internal sense of time after pretraining. Finally, we find that alignment to a historical time is also possible, with up to 2.8× the performance of the unaligned LM in 2010 if finetuning models to that year. These findings hint at the sophistication of LMs’ internal knowledge organization and the necessity of tuning them properly.

Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals
Yanai Elazar | Bhargavi Paranjape | Hao Peng | Sarah Wiegreffe | Khyathi Chandu | Vivek Srikumar | Sameer Singh | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2024

The inevitable appearance of spurious correlations in training datasets hurts the generalization of NLP models on unseen data. Previous work has found that datasets with paired inputs are prone to correlations between a specific part of the input (e.g., the hypothesis in NLI) and the label; consequently, models trained only on those outperform chance. Are these correlations picked up by models trained on the full input data? To address this question, we propose a new evaluation method, Counterfactual Attentiveness Test (CAT). CAT uses counterfactuals by replacing part of the input with its counterpart from a different example (subject to some restrictions), expecting an attentive model to change its prediction. Using CAT, we systematically investigate established supervised and in-context learning models on ten datasets spanning four tasks: natural language inference, reading comprehension, paraphrase detection, and visual & language reasoning. CAT reveals that reliance on such correlations is mainly data-dependent. Surprisingly, we find that GPT3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves. Our results demonstrate that augmenting training or demonstration data with counterfactuals is effective in improving models’ attentiveness. We show that models’ attentiveness measured by CAT reveals different conclusions from solely measuring correlations in data.

Merge to Learn: Efficiently Adding Skills to Language Models with Model Merging
Jacob Morrison | Noah A. Smith | Hannaneh Hajishirzi | Pang Wei Koh | Jesse Dodge | Pradeep Dasigi
Findings of the Association for Computational Linguistics: EMNLP 2024

Adapting general-purpose language models to new skills is currently an expensive process that must be repeated as new instruction datasets targeting new skills are created, or can cause the models to forget older skills. In this work, we investigate the effectiveness of adding new skills to preexisting models by training on the new skills in isolation and later merging with the general model (e.g. using task vectors). In experiments focusing on scientific literature understanding, safety, and coding, we find that the parallel-train-then-merge procedure, which is significantly cheaper than retraining the models on updated data mixtures, is often comparably effective. Our experiments also show that parallel training is especially well-suited for enabling safety features in LMs relative to continued finetuning and retraining, as it dramatically improves model compliance with safe prompts while preserving its ability to refuse dangerous or harmful prompts.

A Call for Clarity in Beam Search: How It Works and When It Stops
Jungo Kasai | Keisuke Sakaguchi | Ronan Le Bras | Dragomir Radev | Yejin Choi | Noah A. Smith
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Text generation with beam search has proven successful in a wide range of applications. We point out that, though largely overlooked in the literature, the commonly-used implementation of beam decoding (e.g., Hugging Face Transformers and fairseq) uses a first come, first served heuristic: it keeps a set of already completed sequences over time steps and stops when the size of this set reaches the beam size. Based on this finding, we introduce a patience factor, a simple modification to this beam decoding implementation, that generalizes the stopping criterion and provides flexibility to the depth of search. Empirical results demonstrate that adjusting this patience factor improves decoding performance of strong pretrained models on news text summarization and machine translation over diverse language pairs, with a negligible inference slowdown. Our approach only modifies one line of code and can be thus readily incorporated in any implementation. Further, we find that different versions of beam decoding result in large performance differences in summarization, demonstrating the need for clarity in specifying the beam search implementation in research work. Our code will be available upon publication.

Summarization-Based Document IDs for Generative Retrieval with Language Models
Alan Li | Daniel Cheng | Phillip Keung | Jungo Kasai | Noah A. Smith
Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia

Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular approach for end-to-end document retrieval that directly generates document identifiers given an input query. We introduce summarization-based document IDs, in which each document’s ID is composed of an extractive summary or abstractive keyphrases generated by a language model, rather than an integer ID sequence or bags of n-grams as proposed in past work. We find that abstractive, content-based IDs (ACID) and an ID based on the first 30 tokens are very effective in direct comparisons with previous approaches to ID creation. We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively versus the cluster-based integer ID baseline on the MSMARCO 100k retrieval task, and 9.8% and 9.9% respectively on the Wikipedia-based NQ 100k retrieval task. Our results demonstrate the effectiveness of human-readable, natural-language IDs created through summarization for generative retrieval. We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO, which suggests that document characteristics affect generative retrieval performance.

2023

Elaboration-Generating Commonsense Question Answering at Scale
Wenya Wang | Vivek Srikumar | Hannaneh Hajishirzi | Noah A. Smith
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge that helps improve performance. Yet the cost of working with such models is very high; in this work, we finetune smaller language models to generate useful intermediate context, referred to here as elaborations. Our framework alternates between updating two language models—an elaboration generator and an answer predictor—allowing each to influence the other. Using less than 0.5% of the parameters of GPT-3, our model outperforms alternatives with similar sizes and closes the gap with GPT-3 on four commonsense question answering benchmarks. Human evaluations show that the quality of the generated elaborations is high.

Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang | Yeganeh Kordi | Swaroop Mishra | Alisa Liu | Noah A. Smith | Daniel Khashabi | Hannaneh Hajishirzi
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Large “instruction-tuned” language models (i.e., finetuned to respond to instructions) have demonstrated a remarkable ability to generalize zero-shot to new tasks. Nevertheless, they depend heavily on human-written instruction data that is often limited in quantity, diversity, and creativity, therefore hindering the generality of the tuned model. We introduce Self-Instruct, a framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations. Our pipeline generates instructions, input, and output samples from a language model, then filters invalid or similar ones before using them to finetune the original model. Applying our method to the vanilla GPT3, we demonstrate a 33% absolute improvement over the original model on Super-NaturalInstructions, on par with the performance of InstructGPT-001, which was trained with private user data and human annotations. For further evaluation, we curate a set of expert-written instructions for novel tasks, and show through human evaluation that tuning GPT3 with Self-Instruct outperforms using existing public instruction datasets by a large margin, leaving only a 5% absolute gap behind InstructGPT-001. Self-Instruct provides an almost annotation-free method for aligning pre-trained language models with instructions, and we release our large synthetic dataset to facilitate future studies on instruction tuning.

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference
Haoxin Li | Phillip Keung | Daniel Cheng | Jungo Kasai | Noah A. Smith
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Large-scale language model pretraining is a very successful form of self-supervised learning in natural language processing, but it is increasingly expensive to perform as the models and pretraining corpora have become larger over time. We propose NarrowBERT, a modified transformer encoder that increases the throughput for masked language model pretraining by more than 2x. NarrowBERT sparsifies the transformer model such that the self-attention queries and feedforward layers only operate on the masked tokens of each sentence during pretraining, rather than all of the tokens as with the usual transformer encoder. We also show that NarrowBERT increases the throughput at inference time by as much as 3.5x with minimal (or no) performance degradation on sentence encoding tasks like MNLI. Finally, we examine the performance of NarrowBERT on the IMDB and Amazon reviews classification and CoNLL NER tasks and show that it is also comparable to standard BERT performance.

Proceedings of the Big Picture Workshop
Yanai Elazar | Allyson Ettinger | Nora Kassner | Sebastian Ruder | Noah A. Smith
Proceedings of the Big Picture Workshop

We’re Afraid Language Models Aren’t Modeling Ambiguity
Alisa Liu | Zhaofeng Wu | Julian Michael | Alane Suhr | Peter West | Alexander Koller | Swabha Swayamdipta | Noah Smith | Yejin Choi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Ambiguity is an intrinsic feature of natural language. Managing ambiguity is a key part of human language understanding, allowing us to anticipate misunderstanding as communicators and revise our interpretations as listeners. As language models are increasingly employed as dialogue interfaces and writing aids, handling ambiguous language is critical to their success. We capture ambiguity in a sentence through its effect on entailment relations with another sentence, and collect AmbiEnt, a linguist-annotated benchmark of 1,645 examples with diverse kinds of ambiguity. We design a suite of tests based on AmbiEnt, presenting the first evaluation of pretrained LMs to recognize ambiguity and disentangle possible meanings. We find that the task remains extremely challenging, including for GPT-4, whose generated disambiguations are considered correct only 32% of the time in crowdworker evaluation, compared to 90% for disambiguations in our dataset. Finally, to illustrate the value of ambiguity-sensitive tools, we show that a multilabel NLI model can flag political claims in the wild that are misleading due to ambiguity. We encourage the field to rediscover the importance of ambiguity for NLP.

Vera: A General-Purpose Plausibility Estimation Model for Commonsense Statements
Jiacheng Liu | Wenya Wang | Dianzhuo Wang | Noah Smith | Yejin Choi | Hannaneh Hajishirzi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Today’s language models can be remarkably intelligent yet still produce text that contains trivial commonsense errors. Therefore, we seek a retrospective verification approach that can reflect on the commonsense plausibility of the machine text, and introduce Vera, a general-purpose model that learns to estimate the commonsense plausibility of declarative statements. To support diverse commonsense domains, Vera is trained on ~7M commonsense statements that are automatically converted from 19 QA datasets and two commonsense knowledge bases, and using a combination of three training objectives. When applied to solving commonsense problems in the verification format, Vera substantially outperforms existing models that can be repurposed for commonsense verification, even including GPT-3.5/ChatGPT/GPT-4, and it further exhibits generalization capabilities to unseen tasks and provides well-calibrated outputs. We find that Vera excels at filtering machine-generated commonsense knowledge and is useful in detecting erroneous commonsense statements generated by models like ChatGPT in real-world settings.

Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models
Orevaoghene Ahia | Sachin Kumar | Hila Gonen | Jungo Kasai | David Mortensen | Noah Smith | Yulia Tsvetkov
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Language models have graduated from being research prototypes to commercialized products offered as web APIs, and recent works have highlighted the multilingual capabilities of these products. The API vendors charge their users based on usage, more specifically on the number of “tokens” processed or generated by the underlying language models. What constitutes a token, however, is training data and model dependent with a large variance in the number of tokens required to convey the same information in different languages. In this work, we analyze the effect of this non-uniformity on the fairness of an API’s pricing policy across languages. We conduct a systematic analysis of the cost and utility of OpenAI’s language model API on multilingual benchmarks in 22 typologically diverse languages. We show evidence that speakers of a large number of the supported languages are overcharged while obtaining poorer results. These speakers tend to also come from regions where the APIs are less affordable, to begin with. Through these analyses, we aim to increase transparency around language model APIs’ pricing policies and encourage the vendors to make them more equitable.

One Embedder, Any Task: Instruction-Finetuned Text Embeddings
Hongjin Su | Weijia Shi | Jungo Kasai | Yizhong Wang | Yushi Hu | Mari Ostendorf | Wen-tau Yih | Noah A. Smith | Luke Zettlemoyer | Tao Yu
Findings of the Association for Computational Linguistics: ACL 2023

We introduce INSTRUCTOR, a new method for computing text embeddings given task instructions: every text input is embedded together with instructions explaining the use case (e.g., task and domain descriptions). Unlike encoders from prior work that are more specialized, INSTRUCTOR is a single embedder that can generate text embeddings tailored to different downstream tasks and domains, without any further training. We first annotate instructions for 330 diverse tasks and train INSTRUCTOR on this multitask mixture with a contrastive loss. We evaluate INSTRUCTOR on 70 embedding evaluation tasks (66 of which are unseen during training), ranging from classification and information retrieval to semantic textual similarity and text generation evaluation. INSTRUCTOR, while having an order of magnitude fewer parameters than the previous best model, achieves state-of-the-art performance, with an average improvement of 3.4% compared to the previous best results on the 70 diverse datasets. Our analysis suggests that INSTRUCTOR is robust to changes in instructions, and that instruction finetuning mitigates the challenge of training a single model on diverse datasets. Our model, code, and data are available at https://instructor-embedding.github.io.

Risks and NLP Design: A Case Study on Procedural Document QA
Nikita Haduong | Alice Gao | Noah A. Smith
Findings of the Association for Computational Linguistics: ACL 2023

As NLP systems are increasingly deployed at scale, concerns about their potential negative impacts have attracted the attention of the research community, yet discussions of risk have mostly been at an abstract level and focused on generic AI or NLP applications. We argue that clearer assessments of risks and harms to users—and concrete strategies to mitigate them—will be possible when we specialize the analysis to more concrete applications and their plausible users. As an illustration, this paper is grounded in cooking recipe procedural document question answering (ProcDocQA), where there are well-defined risks to users such as injuries or allergic reactions. Our case study shows that an existing language model, applied in “zero-shot” mode, quantitatively answers real-world questions about recipes as well or better than the humans who have answered the questions on the web. Using a novel questionnaire informed by theoretical work on AI risk, we conduct a risk-oriented error analysis that could then inform the design of a future system to be deployed with lower risk of harm and better performance.

Stubborn Lexical Bias in Data and Models
Sofia Serrano | Jesse Dodge | Noah A. Smith
Findings of the Association for Computational Linguistics: ACL 2023

In NLP, recent work has seen increased focus on spurious correlations between various features and labels in training data, and how these influence model behavior. However, the presence and effect of such correlations are typically examined feature by feature. We investigate the cumulative impact on a model of many such intersecting features. Using a new statistical method, we examine whether such spurious patterns in data appear in models trained on the data. We select two tasks— natural language inference and duplicate-question detection— for which any unigram feature on its own should ideally be uninformative, which gives us a large pool of automatically extracted features with which to experiment. The large size of this pool allows us to investigate the intersection of features spuriously associated with (potentially different) labels. We then apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations, and examine how doing so affects models trained on the reweighted data. Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models, including worsened bias for slightly more complex features (bigrams). We close with discussion about the implications of our results on what it means to “debias” training data, and how issues of data quality can affect model bias.

Data-Efficient Finetuning Using Cross-Task Nearest Neighbors
Hamish Ivison | Noah A. Smith | Hannaneh Hajishirzi | Pradeep Dasigi
Findings of the Association for Computational Linguistics: ACL 2023

Obtaining labeled data to train a model for a task of interest is often expensive. Prior work shows training models on multitask data augmented with task descriptions (prompts) effectively transfers knowledge to new tasks. Towards efficiently building task-specific models, we assume access to a small number (32-1000) of unlabeled target-task examples and use those to retrieve the most similar labeled examples from a large pool of multitask data augmented with prompts. Compared to the current practice of finetuning models on uniformly sampled prompted multitask data (e.g.: FLAN, T0), our approach of finetuning on cross-task nearest neighbors is significantly more data-efficient. Using only 2% of the data from the P3 pool without any labeled target-task data, our models outperform strong baselines trained on all available data by 3-30% on 12 out of 14 datasets representing held-out tasks including legal and scientific document QA. Similarly, models trained on cross-task nearest neighbors from SuperNaturalInstructions, representing about 5% of the pool, obtain comparable performance to state-of-the-art models on 12 held-out tasks from that pool. Moreover, the models produced by our approach also provide a better initialization than single multitask finetuned models for few-shot finetuning on target-task data, as shown by a 2-23% relative improvement over few-shot finetuned T0-3B models on 8 datasets.

Reproducibility in NLP: What Have We Learned from the Checklist?
Ian Magnusson | Noah A. Smith | Jesse Dodge
Findings of the Association for Computational Linguistics: ACL 2023

Scientific progress in NLP rests on the reproducibility of researchers’ claims. The *CL conferences created the NLP Reproducibility Checklist in 2020 to be completed by authors at submission to remind them of key information to include. We provide the first analysis of the Checklist by examining 10,405 anonymous responses to it. First, we find evidence of an increase in reporting of information on efficiency, validation performance, summary statistics, and hyperparameters after the Checklist’s introduction. Further, we show acceptance rate grows for submissions with more Yes responses. We find that the 44% of submissions that gather new data are 5% less likely to be accepted than those that did not; the average reviewer-rated reproducibility of these submissions is also 2% lower relative to the rest. We find that only 46% of submissions claim to open-source their code, though submissions that do have 8% higher reproducibility score relative to those that do not, the most for any item. We discuss what can be inferred about the state of reproducibility in NLP, and provide a set of recommendations for future conferences, including: a) allowing submitting code and appendices one week after the deadline, and b) measuring dataset reproducibility by a checklist of data collection practices.

That was the last straw, we need more: Are Translation Systems Sensitive to Disambiguating Context?
Jaechan Lee | Alisa Liu | Orevaoghene Ahia | Hila Gonen | Noah Smith
Findings of the Association for Computational Linguistics: EMNLP 2023

The translation of ambiguous text presents a challenge for translation systems, as it requires using the surrounding context to disambiguate the intended meaning as much as possible. While prior work has studied ambiguities that result from different grammatical features of the source and target language, we study semantic ambiguities that exist in the source (English in this work) itself. In particular, we focus on idioms that are open to both literal and figurative interpretations (e.g., goose egg), and collect TIDE, a dataset of 512 pairs of English sentences containing idioms with disambiguating context such that one is literal (it laid a goose egg) and another is figurative (they scored a goose egg, as in a score of zero). In experiments, we compare MT-specific models and language models for (i) their preference when given an ambiguous subsentence, (ii) their sensitivity to disambiguating context, and (iii) the performance disparity between figurative and literal source sentences. We find that current MT models consistently translate English idioms literally, even when the context suggests a figurative interpretation. On the other hand, LMs are far more context-aware, although there remain disparities across target languages. Our findings underline the potential of LMs as a strong backbone for context-aware translation.

Measuring and Narrowing the Compositionality Gap in Language Models
Ofir Press | Muru Zhang | Sewon Min | Ludwig Schmidt | Noah Smith | Mike Lewis
Findings of the Association for Computational Linguistics: EMNLP 2023

We investigate the ability of language models to perform compositional reasoning tasks where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems but not generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions with answers that require composing multiple facts unlikely to have been observed together during pretraining. In the GPT-3 family of models, as model size increases we show that the single-hop question answering performance improves faster than the multi-hop performance does, therefore the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly instead of implicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and then answers) follow-up questions before answering the initial question. We finally show that self-ask’s structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.

Demystifying Prompts in Language Models via Perplexity Estimation
Hila Gonen | Srini Iyer | Terra Blevins | Noah Smith | Luke Zettlemoyer
Findings of the Association for Computational Linguistics: EMNLP 2023

Language models can be prompted to perform a wide variety of tasks with zero- and few-shot in-context learning. However, performance varies significantly with the choice of prompt, and we do not yet understand why this happens. In this paper, we analyze the factors that contribute to this variance and establish a new empirical hypothesis: the performance of a prompt is predicted by the extent to which the model is familiar with the language it contains. Over a wide range of tasks, we show that the lower the perplexity of the prompt, the better it is able to perform the task, when considering reasonable prompts that are related to it. As part of our analysis, we also devise a method to automatically extend a small seed set of manually written prompts by paraphrasing with GPT3 and backtranslation. This larger set allows us to verify that perplexity is a strong predictor of the success of a prompt and we show that the lowest perplexity prompts are consistently effective.

LEXPLAIN: Improving Model Explanations via Lexicon Supervision
Orevaoghene Ahia | Hila Gonen | Vidhisha Balachandran | Yulia Tsvetkov | Noah A. Smith
Proceedings of the 12th Joint Conference on Lexical and Computational Semantics (*SEM 2023)

Model explanations that shed light on the model’s predictions are becoming a desired additional output of NLP models, alongside their predictions. Challenges in creating these explanations include making them trustworthy and faithful to the model’s predictions. In this work, we propose a novel framework for guiding model explanations by supervising them explicitly. To this end, our method, LEXplain, uses task-related lexicons to directly supervise model explanations. This approach consistently improves the model’s explanations without sacrificing performance on the task, as we demonstrate on sentiment analysis and toxicity detection. Our analyses show that our method also demotes spurious correlations (i.e., with respect to African American English dialect) when performing the task, improving fairness.

Transparency Helps Reveal When Language Models Learn Meaning
Zhaofeng Wu | William Merrill | Hao Peng | Iz Beltagy | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 11

Many current NLP systems are built from language models trained to optimize unsupervised objectives on large amounts of raw text. Under what conditions might such a procedure acquire meaning? Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations (i.e., languages with strong transparency), both autoregressive and masked language models successfully learn to emulate semantic relations between expressions. However, when denotations are changed to be context-dependent with the language otherwise unmodified, this ability degrades. Turning to natural language, our experiments with a specific phenomenon—referential opacity—add to the growing body of evidence that current language models do not represent natural language semantics well. We show this failure relates to the context-dependent nature of natural language form-meaning mappings.

2022

Is GPT-3 Text Indistinguishable from Human Text? Scarecrow: A Framework for Scrutinizing Machine Text
Yao Dou | Maxwell Forbes | Rik Koncel-Kedziorski | Noah A. Smith | Yejin Choi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Modern neural language models can produce remarkably fluent and grammatical text. So much, in fact, that recent work by Clark et al. (2021) has reported that conventional crowdsourcing can no longer reliably distinguish between machine-authored (GPT-3) and human-authored writing. As errors in machine generations become ever subtler and harder to spot, it poses a new challenge to the research community for robust machine text evaluation. We propose a new framework called Scarecrow for scrutinizing machine text via crowd annotation. To support the broad range of real machine errors that can be identified by laypeople, the ten error categories of Scarecrow—such as redundancy, commonsense errors, and incoherence—are identified through several rounds of crowd annotation experiments without a predefined ontology. We then use Scarecrow to collect over 41k error spans in human-written and machine-generated paragraphs of English language news text. We isolate factors for detailed analysis, including parameter count, training data, and various decoding-time configurations. Our approach successfully quantifies measurable gaps between human authored text and generations from models of several sizes, including fourteen configurations of GPT-3. In addition, our analysis unveils new insights, with detailed rationales provided by laypeople, e.g., that the commonsense capabilities have been improving with larger models while math capabilities have not, and that the choices of simple decoding hyperparameters can make remarkable differences on the perceived quality of machine text. We release our training material, annotation toolkit and dataset at https://yao-dou.github.io/scarecrow/.

ABC: Attention with Bounded-memory Control
Hao Peng | Jungo Kasai | Nikolaos Pappas | Dani Yogatama | Zhaofeng Wu | Lingpeng Kong | Roy Schwartz | Noah A. Smith
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Transformer architectures have achieved state- of-the-art results on a variety of natural language processing (NLP) tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from it. One way to improve the efficiency is to bound the memory size. We show that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and they vary in their organization of the memory. ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart. Second, this abstraction gives new insights—an established approach (Wang et al., 2020b) previously thought to not be applicable in causal attention, actually is. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their heuristic memory-organizing functions with a learned, contextualized one. Our experiments on language modeling, machine translation, and masked language model finetuning show that our approach outperforms previous efficient attention models; compared to the strong transformer baselines, it significantly improves the inference time and space efficiency with no or negligible accuracy loss.

Generating Scientific Definitions with Controllable Complexity
Tal August | Katharina Reinecke | Noah A. Smith
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Unfamiliar terminology and complex language can present barriers to understanding science. Natural language processing stands to help address these issues by automatically defining unfamiliar terms. We introduce a new task and dataset for defining scientific terms and controlling the complexity of generated definitions as a way of adapting to a specific reader’s background knowledge. We test four definition generation methods for this new task, finding that a sequence-to-sequence approach is most successful. We then explore the version of the task in which definitions are generated at a target complexity level. We introduce a novel reranking approach and find in human evaluations that it offers superior fluency while also controlling complexity, compared to several controllable generation baselines.

Structured knowledge grounding (SKG) leverages structured knowledge to complete user requests, such as semantic parsing over databases and question answering over knowledge bases. Since the inputs and outputs of SKG tasks are heterogeneous, they have been studied separately by different communities, which limits systematic and compatible research on SKG. In this paper, we overcome this limitation by proposing the UnifiedSKG framework, which unifies 21 SKG tasks into a text-to-text format, aiming to promote systematic SKG research, instead of being exclusive to a single task, domain, or dataset. We use UnifiedSKG to benchmark T5 with different sizes and show that T5, with simple modifications when necessary, achieves state-of-the-art performance on almost all of the 21 tasks. We further demonstrate that multi-task prefix-tuning improves the performance on most tasks, largely improving the overall performance. UnifiedSKG also facilitates the investigation of zero-shot and few-shot learning, and we show that T0, GPT-3, and Codex struggle in zero-shot and few-shot learning for SKG. We also use UnifiedSKG to conduct a series of controlled experiments on structured knowledge encoding variants across SKG tasks. UnifiedSKG is easily extensible to more tasks, and it is open-sourced at https://github.com/hkunlp/unifiedskg.

Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection
Suchin Gururangan | Dallas Card | Sarah Dreier | Emily Gade | Leroy Wang | Zeyu Wang | Luke Zettlemoyer | Noah A. Smith
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Language models increasingly rely on massive web crawls for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and news often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles—written by students from across the country—we investigate whose language is preferred by the quality filter used for GPT-3. We find that newspapers from larger schools, located in wealthier, educated, and urban zones (ZIP codes) are more likely to be classified as high quality. We also show that this quality measurement is unaligned with other sensible metrics, such as factuality or literary acclaim. We argue that privileging any corpus as high quality entails a language ideology, and more care is needed to construct training corpora for language models, with better transparency and justification for the inclusion or exclusion of various texts.

Twist Decoding: Diverse Generators Guide Each Other
Jungo Kasai | Keisuke Sakaguchi | Ronan Le Bras | Hao Peng | Ximing Lu | Dragomir Radev | Yejin Choi | Noah A. Smith
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Many language generation models are now available for a wide range of generation tasks, including machine translation and summarization. Combining such diverse models may lead to further progress, but ensembling generation models is challenging during inference: conventional ensembling methods (e.g., shallow fusion) require that the models share vocabulary/tokenization schemes. We introduce Twist decoding, a simple and general text generation algorithm that benefits from diverse models at inference time. Our method does not assume the vocabulary, tokenization or even generation order is shared. Our extensive evaluations on machine translation and scientific paper summarization demonstrate that Twist decoding substantially outperforms each model decoded in isolation over various scenarios, including cases where domain-specific and general-purpose models are both available. Twist decoding also consistently outperforms the popular reranking heuristic where output candidates from one model are rescored by another. We hope that our work will encourage researchers and practitioners to examine generation models collectively, not just independently, and to seek out models with complementary strengths to the currently available models.

GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation
Daniel Khashabi | Gabriel Stanovsky | Jonathan Bragg | Nicholas Lourie | Jungo Kasai | Yejin Choi | Noah A. Smith | Daniel Weld
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

While often assumed a gold standard, effective human evaluation of text generation remains an important, open area for research.We revisit this problem with a focus on producing consistent evaluations that are reproducible—over time and across different populations. We study this goal in different stages of the human evaluation pipeline. In particular, we consider design choices for the annotation interface used to elicit human judgments and their impact on reproducibility. Furthermore, we develop an automated mechanism for maintaining annotator quality via a probabilistic model that detects and excludes noisy annotators. Putting these lessons together, we introduce GENIE: a system for running standardized human evaluations across different generation tasks.We instantiate GENIE with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension.For each task, GENIE offers a leaderboard that automatically crowdsources annotations for submissions, evaluating them along axes such as correctness, conciseness, and fluency.We have made the GENIE leaderboards publicly available, and have already ranked 50 submissions from 10 different research groups. We hope GENIE encourages further progress toward effective, standardized evaluations for text generation.

How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers
Michael Hassid | Hao Peng | Daniel Rotem | Jungo Kasai | Ivan Montero | Noah A. Smith | Roy Schwartz
Findings of the Association for Computational Linguistics: EMNLP 2022

The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones—the average attention weights over multiple inputs. We use PAPA to analyze several established pretrained Transformers on six downstream tasks. We find that without any input-dependent attention, all models achieve competitive performance—an average relative drop of only 8% from the probing baseline. Further, little or no performance drop is observed when replacing half of the input-dependent attention matrices with constant (input-independent) ones. Interestingly, we show that better-performing models lose more from applying our method than weaker models, suggesting that the utilization of the input-dependent attention mechanism might be a factor in their success. Our results motivate research on simpler alternatives to input-dependent attention, as well as on methods for better utilization of this mechanism in the Transformer architecture.

In-Context Learning for Few-Shot Dialogue State Tracking
Yushi Hu | Chia-Hsuan Lee | Tianbao Xie | Tao Yu | Noah A. Smith | Mari Ostendorf
Findings of the Association for Computational Linguistics: EMNLP 2022

Collecting and annotating task-oriented dialogues is time-consuming and costly. Thus, zero and few shot learning for dialogue tasks presents an exciting opportunity. In this work, we propose an in-context (IC) learning framework for zero-shot and few-shot learning dialogue state tracking (DST), where a large pretrained language model (LM) takes a test instance and a few exemplars as input, and directly decodes the dialogue state without any parameter updates. This approach is more flexible and scalable than prior DST work when adapting to new domains and scenarios. To better leverage a tabular domain description in the LM prompt, we reformulate DST into a text-to-SQL problem. We also propose a novel approach to retrieve annotated dialogues as exemplars. Empirical results on MultiWOZ show that our method IC-DST substantially outperforms previous fine-tuned state-of-the-art models in few-shot settings. In addition, we test IC-DST in zero-shot settings, in which the model only takes a fixed task instruction as input, finding that it outperforms previous zero-shot methods by a large margin.

Unsupervised Learning of Hierarchical Conversation Structure
Bo-Ru Lu | Yushi Hu | Hao Cheng | Noah A. Smith | Mari Ostendorf
Findings of the Association for Computational Linguistics: EMNLP 2022

Human conversations can evolve in many different ways, creating challenges for automatic understanding and summarization. Goal-oriented conversations often have meaningful sub-dialogue structure, but it can be highly domain-dependent. This work introduces an unsupervised approach to learning hierarchical conversation structure, including turn and sub-dialogue segment labels, corresponding roughly to dialogue acts and sub-tasks, respectively. The decoded structure is shown to be useful in enhancing neural models of language for three conversation-level understanding tasks. Further, the learned finite-state sub-dialogue network is made interpretable through automatic summarization.

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation
Alisa Liu | Swabha Swayamdipta | Noah A. Smith | Yejin Choi
Findings of the Association for Computational Linguistics: EMNLP 2022

A recurring challenge of crowdsourcing NLP datasets at scale is that human writers often rely on repetitive patterns when crafting examples, leading to a lack of linguistic diversity. We introduce a novel approach for dataset creation based on worker and AI collaboration, which brings together the generative strength of language models and the evaluative strength of humans. Starting with an existing dataset, MultiNLI for natural language inference (NLI), our approach uses dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instructs GPT-3 to compose new examples with similar patterns. Machine generated examples are then automatically filtered, and finally revised and labeled by human crowdworkers. The resulting dataset, WANLI, consists of 107,885 NLI examples and presents unique empirical strengths over existing NLI datasets. Remarkably, training a model on WANLI improves performance on eight out-of-domain test sets we consider, including by 11% on HANS and 9% on Adversarial NLI, compared to training on the 4x larger MultiNLI. Moreover, it continues to be more effective than MultiNLI augmented with other NLI datasets. Our results demonstrate the promise of leveraging natural language generation techniques and re-imagining the role of humans in the dataset creation process.

Modeling Context With Linear Attention for Scalable Document-Level Translation
Zhaofeng Wu | Hao Peng | Nikolaos Pappas | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2022

Document-level machine translation leverages inter-sentence dependencies to produce more coherent and consistent translations. However, these models, predominantly based on transformers, are difficult to scale to long documents as their attention layers have quadratic complexity in the sequence length. Recent efforts on efficient attention improve scalability, but their effect on document translation remains unexplored. In this work, we investigate the efficacy of a recent linear attention model by Peng et al. (2021) on document translation and augment it with a sentential gate to promote a recency inductive bias. We evaluate the model on IWSLT 2015 and OpenSubtitles 2018 against the transformer, demonstrating substantially increased decoding speed on long sequences with similar or better BLEU scores. We show that sentential gating further improves translation quality on IWSLT.

Domain Mismatch Doesn’t Always Prevent Cross-lingual Transfer Learning
Daniel Edmiston | Phillip Keung | Noah A. Smith
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Cross-lingual transfer learning without labeled target language data or parallel text has been surprisingly effective in zero-shot cross-lingual classification, question answering, unsupervised machine translation, etc. However, some recent publications have claimed that domain mismatch prevents cross-lingual transfer, and their results show that unsupervised bilingual lexicon induction (UBLI) and unsupervised neural machine translation (UNMT) do not work well when the underlying monolingual corpora come from different domains (e.g., French text from Wikipedia but English text from UN proceedings). In this work, we show how a simple initialization regimen can overcome much of the effect of domain mismatch in cross-lingual transfer. We pre-train word and contextual embeddings on the concatenated domain-mismatched corpora, and use these as initializations for three tasks: MUSE UBLI, UN Parallel UNMT, and the SemEval 2017 cross-lingual word similarity task. In all cases, our results challenge the conclusions of prior work by showing that proper initialization can recover a large portion of the losses incurred by domain mismatch.

The Engage Corpus: A Social Media Dataset for Text-Based Recommender Systems
Daniel Cheng | Kyle Yan | Phillip Keung | Noah A. Smith
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Social media platforms play an increasingly important role as forums for public discourse. Many platforms use recommendation algorithms that funnel users to online groups with the goal of maximizing user engagement, which many commentators have pointed to as a source of polarization and misinformation. Understanding the role of NLP in recommender systems is an interesting research area, given the role that social media has played in world events. However, there are few standardized resources which researchers can use to build models that predict engagement with online groups on social media; each research group constructs datasets from scratch without releasing their version for reuse. In this work, we present a dataset drawn from posts and comments on the online message board Reddit. We develop baseline models for recommending subreddits to users, given the user’s post and comment history. We also study the behavior of our recommender models on subreddits that were banned in June 2020 as part of Reddit’s efforts to stop the dissemination of hate speech.

NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics
Ximing Lu | Sean Welleck | Peter West | Liwei Jiang | Jungo Kasai | Daniel Khashabi | Ronan Le Bras | Lianhui Qin | Youngjae Yu | Rowan Zellers | Noah A. Smith | Yejin Choi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The dominant paradigm for neural text generation is left-to-right decoding from autoregressive language models. Constrained or controllable generation under complex lexical constraints, however, requires foresight to plan ahead feasible future paths. Drawing inspiration from the A^* search algorithm, we propose NeuroLogic A*esque, a decoding algorithm that incorporates heuristic estimates of future cost. We develop lookahead heuristics that are efficient for large-scale language models, making our method a drop-in replacement for common techniques such as beam search and top-k sampling. To enable constrained generation, we build on NeuroLogic decoding (Lu et al., 2021), combining its flexibility in incorporating logical constraints with A*esque estimates of future constraint satisfaction. Our approach outperforms competitive baselines on five generation tasks, and achieves new state-of-the-art performance on table-to-text generation, constrained machine translation, and keyword-constrained generation. The improvements are particularly notable on tasks that require complex constraint satisfaction or in few-shot or zero-shot settings. NeuroLogic A*esque illustrates the power of decoding for improving and enabling new capabilities of large-scale language models.

Transparent Human Evaluation for Image Captioning
Jungo Kasai | Keisuke Sakaguchi | Lavinia Dunagan | Jacob Morrison | Ronan Le Bras | Yejin Choi | Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We establish THumB, a rubric-based human evaluation protocol for image captioning models. Our scoring rubrics and their definitions are carefully developed based on machine- and human-generated captions on the MSCOCO dataset. Each caption is evaluated along two main dimensions in a tradeoff (precision and recall) as well as other aspects that measure the text quality (fluency, conciseness, and inclusive language). Our evaluations demonstrate several critical problems of the current evaluation practice. Human-generated captions show substantially higher quality than machine-generated ones, especially in coverage of salient information (i.e., recall), while most automatic metrics say the opposite. Our rubric-based results reveal that CLIPScore, a recent metric that uses image features, better correlates with human judgments than conventional text-only metrics because it is more sensitive to recall. We hope that this work will promote a more transparent evaluation protocol for image captioning and its automatic metrics.

Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand
Jungo Kasai | Keisuke Sakaguchi | Ronan Le Bras | Lavinia Dunagan | Jacob Morrison | Alexander Fabbri | Yejin Choi | Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Natural language processing researchers have identified limitations of evaluation methodology for generation tasks, with new questions raised about the validity of automatic metrics and of crowdworker judgments. Meanwhile, efforts to improve generation models tend to depend on simple n-gram overlap metrics (e.g., BLEU, ROUGE). We argue that new advances on models and metrics should each more directly benefit and inform the other. We therefore propose a generalization of leaderboards, bidimensional leaderboards (Billboards), that simultaneously tracks progress in language generation models and metrics for their evaluation. Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries. A Billboard automatically creates an ensemble metric that selects and linearly combines a few metrics based on a global analysis across generators. Further, metrics are ranked based on their correlation with human judgments. We release four Billboards for machine translation, summarization, and image captioning. We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation. Our mixed-effects model analysis shows that most automatic metrics, especially the reference-based ones, overrate machine over human generation, demonstrating the importance of updating metrics as generation models become stronger (and perhaps more similar to humans) in the future.

DEMix Layers: Disentangling Domains for Modular Language Modeling
Suchin Gururangan | Mike Lewis | Ari Holtzman | Noah A. Smith | Luke Zettlemoyer
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We introduce a new domain expert mixture (DEMix) layer that enables conditioning a language model (LM) on the domain of the input text. A DEMix layer includes a collection of expert feedforward networks, each specialized to a domain, that makes the LM modular: experts can be mixed, added, or removed after initial training. Extensive experiments with autoregressive transformer LMs (up to 1.3B parameters) show that DEMix layers reduce test-time perplexity (especially for out-of-domain data), increase training efficiency, and enable rapid adaptation. Mixing experts during inference, using a parameter-free weighted ensemble, enables better generalization to heterogeneous or unseen domains. We also show it is possible to add experts to adapt to new domains without forgetting older ones, and remove experts to restrict access to unwanted domains. Overall, these results demonstrate benefits of domain modularity in language models.

Annotators with Attitudes: How Annotator Beliefs And Identities Bias Toxic Language Detection
Maarten Sap | Swabha Swayamdipta | Laura Vianna | Xuhui Zhou | Yejin Choi | Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The perceived toxicity of language can vary based on someone’s identity and beliefs, but this variation is often ignored when collecting toxic language datasets, resulting in dataset and model biases. We seek to understand the *who*, *why*, and *what* behind biases in toxicity annotations. In two online studies with demographically and politically diverse participants, we investigate the effect of annotator identities (*who*) and beliefs (*why*), drawing from social psychology research about hate speech, free speech, racist beliefs, political leaning, and more. We disentangle *what* is annotated as toxic by considering posts with three characteristics: anti-Black language, African American English (AAE) dialect, and vulgarity. Our results show strong associations between annotator identity and beliefs and their ratings of toxicity. Notably, more conservative annotators and those who scored highly on our scale for racist beliefs were less likely to rate anti-Black language as toxic, but more likely to rate AAE as toxic. We additionally present a case study illustrating how a popular toxicity detection system’s ratings inherently reflect only specific beliefs and perspectives. Our findings call for contextualizing toxicity labels in social variables, which raises immense implications for toxic language annotation and detection.

Time Waits for No One! Analysis and Challenges of Temporal Misalignment
Kelvin Luu | Daniel Khashabi | Suchin Gururangan | Karishma Mandyam | Noah A. Smith
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

When an NLP model is trained on text data from one time period and tested or deployed on data from another, the resulting temporal misalignment can degrade end-task performance. In this work, we establish a suite of eight diverse tasks across different domains (social media, science papers, news, and reviews) and periods of time (spanning five years or more) to quantify the effects of temporal misalignment. Our study is focused on the ubiquitous setting where a pretrained model is optionally adapted through continued domain-specific pretraining, followed by task-specific finetuning. We establish a suite of tasks across multiple domains to study temporal misalignment in modern NLP systems. We find stronger effects of temporal misalignment on task performance than have been previously reported. We also find that, while temporal adaptation through continued pretraining can help, these gains are small compared to task-specific finetuning on data from the target time period. Our findings motivate continued research to improve temporal robustness of NLP models.

Saturated Transformers are Constant-Depth Threshold Circuits
William Merrill | Ashish Sabharwal | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 10

Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND/OR circuits (Hao et al., 2022). However, hard attention is a strong assumption, which may complicate the relevance of these results in practice. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We first show that saturated transformers transcend the known limitations of hard-attention transformers. We then prove saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class TC0 as an upper bound on the formal languages they recognize.

2021

Explaining Relationships Between Scientific Documents
Kelvin Luu | Xinyi Wu | Rik Koncel-Kedziorski | Kyle Lo | Isabel Cachola | Noah A. Smith
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We address the task of explaining relationships between two scientific documents using natural language text. This task requires modeling the complex content of long technical documents, deducing a relationship between these documents, and expressing the details of that relationship in text. In addition to the theoretical interest of this task, successful solutions can help improve researcher efficiency in search and review. In this paper we establish a dataset of 622K examples from 154K documents. We pretrain a large language model to serve as the foundation for autoregressive approaches to the task. We explore the impact of taking different views on the two documents, including the use of dense representations extracted with scientific IE systems. We provide extensive automatic and human evaluations which show the promise of such models, but make clear challenges for future work.

Shortformer: Better Language Modeling using Shorter Inputs
Ofir Press | Noah A. Smith | Mike Lewis
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.

DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Alisa Liu | Maarten Sap | Ximing Lu | Swabha Swayamdipta | Chandra Bhagavatula | Noah A. Smith | Yejin Choi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Despite recent advances in natural language generation, it remains challenging to control attributes of generated text. We propose DExperts: Decoding-time Experts, a decoding-time method for controlled text generation that combines a pretrained language model with “expert” LMs and/or “anti-expert” LMs in a product of experts. Intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts, and unlikely by the anti-experts. We apply DExperts to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. Moreover, because DExperts operates only on the output of the pretrained LM, it is effective with (anti-)experts of smaller size, including when operating on GPT-3. Our work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.

All That’s ‘Human’ Is Not Gold: Evaluating Human Evaluation of Generated Text
Elizabeth Clark | Tal August | Sofia Serrano | Nikita Haduong | Suchin Gururangan | Noah A. Smith
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Human evaluations are typically considered the gold standard in natural language generation, but as models’ fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts’ ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators’ accuracy improved up to 55%, it did not significantly improve across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.

Challenges in Automated Debiasing for Toxic Language Detection
Xuhui Zhou | Maarten Sap | Swabha Swayamdipta | Yejin Choi | Noah Smith
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.

Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent
William Merrill | Vivek Ramanujan | Yoav Goldberg | Roy Schwartz | Noah A. Smith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The capacity of neural networks like the widely adopted transformer is known to be very high. Evidence is emerging that they learn successfully due to inductive bias in the training routine, typically a variant of gradient descent (GD). To better understand this bias, we study the tendency for transformer parameters to grow in magnitude (ℓ₂ norm) during training, and its implications for the emergent representations within self attention layers. Empirically, we document norm growth in the training of transformer language models, including T5 during its pretraining. As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions. Such “saturated” networks are known to have a reduced capacity compared to the full network family that can be described in terms of formal languages and automata. Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP. We leverage the emergent discrete structure in a saturated transformer to analyze the role of different attention heads, finding that some focus locally on a small number of positions, while other heads compute global averages, allowing counting. We believe understanding the interplay between these two capabilities may shed further light on the structure of computation within large transformers.

Competency Problems: On Finding and Removing Artifacts in Language Data
Matt Gardner | William Merrill | Jesse Dodge | Matthew Peters | Alexis Ross | Sameer Singh | Noah A. Smith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Much recent work in NLP has documented dataset artifacts, bias, and spurious correlations between input features and output labels. However, how to tell which features have “spurious” instead of legitimate correlations is typically left unspecified. In this work we argue that for complex language understanding tasks, all simple feature correlations are spurious, and we formalize this notion into a class of problems which we call competency problems. For example, the word “amazing” on its own should not give information about a sentiment label independent of the context in which it appears, which could include negation, metaphor, sarcasm, etc. We theoretically analyze the difficulty of creating data for competency problems when human bias is taken into account, showing that realistic datasets will increasingly deviate from competency problems as dataset size increases. This analysis gives us a simple statistical test for dataset artifacts, which we use to show more subtle biases than were described in prior work, including demonstrating that models are inappropriately affected by these less extreme biases. Our theoretical treatment of this problem also allows us to analyze proposed solutions, such as making local edits to dataset instances, and to give recommendations for future data collection and model design efforts that target competency problems.

Sentence Bottleneck Autoencoders from Transformer Language Models
Ivan Montero | Nikolaos Pappas | Noah A. Smith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Representation learning for text via pretraining a language model on a large corpus has become a standard starting point for building NLP systems. This approach stands in contrast to autoencoders, also trained on raw text, but with the objective of learning to encode each input as a vector that allows full reconstruction. Autoencoders are attractive because of their latent space structure and generative properties. We therefore explore the construction of a sentence-level autoencoder from a pretrained, frozen transformer language model. We adapt the masked language modeling objective as a generative, denoising one, while only training a sentence bottleneck and a single-layer modified transformer decoder. We demonstrate that the sentence representations discovered by our model achieve better quality than previous methods that extract representations from pretrained transformers on text similarity tasks, style transfer (an example of controlled generation), and single-sentence classification tasks in the GLUE benchmark, while using fewer parameters than large pretrained models.

Measuring Association Between Labels and Free-Text Rationales
Sarah Wiegreffe | Ana Marasović | Noah A. Smith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

In interpretable NLP, we require faithful rationales that reflect the model’s decision-making process for an explained instance. While prior work focuses on extractive rationales (a subset of the input words), we investigate their less-studied counterpart: free-text natural language rationales. We demonstrate that *pipelines*, models for faithful rationalization on information-extraction style tasks, do not work as well on “reasoning” tasks requiring free-text rationales. We turn to models that *jointly* predict and rationalize, a class of widely used high-performance models for free-text rationalization. We investigate the extent to which the labels and rationales predicted by these models are associated, a necessary property of faithful explanation. Via two tests, *robustness equivalence* and *feature importance agreement*, we find that state-of-the-art T5-based joint models exhibit desirable properties for explaining commonsense question-answering and natural language inference, indicating their potential for producing faithful free-text rationales.

Finetuning Pretrained Transformers into RNNs
Jungo Kasai | Hao Peng | Yizhe Zhang | Dani Yogatama | Gabriel Ilharco | Nikolaos Pappas | Yi Mao | Weizhu Chen | Noah A. Smith
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Transformers have outperformed recurrent neural networks (RNNs) in natural language generation. But this comes with a signifi- cant computational cost, as the attention mechanism’s complexity scales quadratically with sequence length. Efficient transformer variants have received increasing interest in recent works. Among them, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but can be difficult to train and may yield suboptimal accuracy. This work aims to convert a pretrained transformer into its efficient recurrent counterpart, improving efficiency while maintaining accuracy. Specifically, we propose a swap-then-finetune procedure: in an off-the-shelf pretrained transformer, we replace the softmax attention with its linear-complexity recurrent alternative and then finetune. With a learned feature map, our approach provides an improved tradeoff between efficiency and accuracy over the standard transformer and other recurrent variants. We also show that the finetuning process has lower training cost relative to training these recurrent variants from scratch. As many models for natural language tasks are increasingly dependent on large-scale pretrained transformers, this work presents a viable approach to improving inference efficiency without repeating the expensive pretraining process.

Promoting Graph Awareness in Linearized Graph-to-Text Generation
Alexander Hoyle | Ana Marasović | Noah A. Smith
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Probing Across Time: What Does RoBERTa Know and When?
Leo Z. Liu | Yizhong Wang | Jungo Kasai | Hannaneh Hajishirzi | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2021

Models of language trained on very large corpora have been demonstrated useful for natural language processing. As fixed artifacts, they have become the object of intense study, with many researchers “probing” the extent to which they acquire and readily demonstrate linguistic abstractions, factual and commonsense knowledge, and reasoning abilities. Recent work applied several probes to intermediate training stages to observe the developmental process of a large-scale model (Chiang et al., 2020). Following this effort, we systematically answer a question: for various types of knowledge a language model learns, when during (pre)training are they acquired? Using RoBERTa as a case study, we find: linguistic knowledge is acquired fast, stably, and robustly across domains. Facts and commonsense are slower and more domain-sensitive. Reasoning abilities are, in general, not stably acquired. As new datasets, pretraining protocols, and probes emerge, we believe that probing-across-time analyses can help researchers understand the complex, intermingled learning that these models undergo and guide us toward more efficient approaches that accomplish necessary learning faster.

Expected Validation Performance and Estimation of a Random Variable’s Maximum
Jesse Dodge | Suchin Gururangan | Dallas Card | Roy Schwartz | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2021

Research in NLP is often supported by experimental results, and improved reporting of such results can lead to better understanding and more reproducible science. In this paper we analyze three statistical estimators for expected validation performance, a tool used for reporting performance (e.g., accuracy) as a function of computational budget (e.g., number of hyperparameter tuning experiments). Where previous work analyzing such estimators focused on the bias, we also examine the variance and mean squared error (MSE). In both synthetic and realistic scenarios, we evaluate three estimators and find the unbiased estimator has the highest variance, and the estimator with the smallest variance has the largest bias; the estimator with the smallest MSE strikes a balance between bias and variance, displaying a classic bias-variance tradeoff. We use expected validation performance to compare between different models, and analyze how frequently each estimator leads to drawing incorrect conclusions about which of two models performs best. We find that the two biased estimators lead to the fewest incorrect conclusions, which hints at the importance of minimizing variance and MSE.

Specializing Multilingual Language Models: An Empirical Study
Ethan C. Chau | Noah A. Smith
Proceedings of the 1st Workshop on Multilingual Representation Learning

Pretrained multilingual language models have become a common tool in transferring NLP capabilities to low-resource languages, often with adaptations. In this work, we study the performance, extensibility, and interaction of two such adaptations: vocabulary augmentation and script transliteration. Our evaluations on part-of-speech tagging, universal dependency parsing, and named entity recognition in nine diverse low-resource languages uphold the viability of these approaches while raising new questions around how to optimally adapt multilingual models to low-resource settings.

Choose Your Own Adventure: Paired Suggestions in Collaborative Writing for Evaluating Story Generation Models
Elizabeth Clark | Noah A. Smith
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Story generation is an open-ended and subjective task, which poses a challenge for evaluating story generation models. We present Choose Your Own Adventure, a collaborative writing setup for pairwise model evaluation. Two models generate suggestions to people as they write a short story; we ask writers to choose one of the two suggestions, and we observe which model’s suggestions they prefer. The setup also allows further analysis based on the revisions people make to the suggestions. We show that these measures, combined with automatic metrics, provide an informative picture of the models’ performance, both in cases where the differences in generation methods are small (nucleus vs. top-k sampling) and large (GPT2 vs. Fusion models).

A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers
Pradeep Dasigi | Kyle Lo | Iz Beltagy | Arman Cohan | Noah A. Smith | Matt Gardner
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present Qasper, a dataset of 5049 questions over 1585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.

Infusing Finetuning with Semantic Dependencies
Zhaofeng Wu | Hao Peng | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 9

For natural language processing systems, two kinds of evidence support the use of text representations from neural language models “pretrained” on large unannotated corpora: performance on application-inspired benchmarks (Peters et al., 2018, inter alia), and the emergence of syntactic abstractions in those representations (Tenney et al., 2019, inter alia). On the other hand, the lack of grounded supervision calls into question how well these representations can ever capture meaning (Bender and Koller, 2020). We apply novel probes to recent language models— specifically focusing on predicate-argument structure as operationalized by semantic dependencies (Ivanova et al., 2012)—and find that, unlike syntax, semantics is not brought to the surface by today’s pretrained models. We then use convolutional graph encoders to explicitly incorporate semantic parses into task-specific finetuning, yielding benefits to natural language understanding (NLU) tasks in the GLUE benchmark. This approach demonstrates the potential for general-purpose (rather than task-specific) linguistic supervision, above and beyond conventional pretraining and finetuning. Several diagnostics help to localize the benefits of our approach.1

Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?
William Merrill | Yoav Goldberg | Roy Schwartz | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 9

Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever “understand” raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of “assertions”: textual contexts that provide indirect clues about the underlying semantics. We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence. We find that assertions enable semantic emulation of languages that satisfy a strong notion of semantic transparency. However, for classes of languages where the same expression can take different values in different contexts, we show that emulation can become uncomputable. Finally, we discuss differences between our formal model and natural language, exploring how our results generalize to a modal setting and other semantic relations. Together, our results suggest that assertions in code or language do not provide sufficient signal to fully emulate semantic representations. We formalize ways in which ungrounded language models appear to be fundamentally limited in their ability to “understand”.

2020

A Formal Hierarchy of RNN Architectures
William Merrill | Gail Weiss | Yoav Goldberg | Roy Schwartz | Noah A. Smith | Eran Yahav
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We develop a formal hierarchy of the expressive capacity of RNN architectures. The hierarchy is based on two formal properties: space complexity, which measures the RNN’s memory, and rational recurrence, defined as whether the recurrent update can be described by a weighted finite-state machine. We place several RNN variants within this hierarchy. For example, we prove the LSTM is not rational, which formally separates it from the related QRNN (Bradbury et al., 2016). We also show how these models’ expressive capacity is expanded by stacking multiple layers or composing them with different pooling functions. Our results build on the theory of “saturated” RNNs (Merrill, 2019). While formally extending these findings to unsaturated RNNs is left to future work, we hypothesize that the practical learnable capacity of unsaturated RNNs obeys a similar hierarchy. We provide empirical results to support this conjecture. Experimental findings from training unsaturated networks on formal languages support this conjecture.

Recollection versus Imagination: Exploring Human Memory and Cognition via Neural Language Models
Maarten Sap | Eric Horvitz | Yejin Choi | Noah A. Smith | James Pennebaker
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We investigate the use of NLP as a measure of the cognitive processes involved in storytelling, contrasting imagination and recollection of events. To facilitate this, we collect and release Hippocorpus, a dataset of 7,000 stories about imagined and recalled events. We introduce a measure of narrative flow and use this to examine the narratives for imagined and recalled events. Additionally, we measure the differential recruitment of knowledge attributed to semantic memory versus episodic memory (Tulving, 1972) for imagined and recalled storytelling by comparing the frequency of descriptions of general commonsense events with more specific realis events. Our analyses show that imagined stories have a substantially more linear narrative flow, compared to recalled stories in which adjacent sentences are more disconnected. In addition, while recalled stories rely more on autobiographical events based on episodic memory, imagined stories express more commonsense knowledge based on semantic memory. Finally, our measures reveal the effect of narrativization of memories in stories (e.g., stories about frequently recalled memories flow more linearly; Bartlett, 1932). Our findings highlight the potential of using NLP tools to study the traces of human cognition in language.

Improving Transformer Models by Reordering their Sublayers
Ofir Press | Noah A. Smith | Omer Levy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Multilayer transformer networks consist of interleaved self-attention and feedforward sublayers. Could ordering the sublayers in a different pattern lead to better performance? We generate randomly ordered transformers and train them with the language modeling objective. We observe that some of these models are able to achieve better performance than the interleaved baseline, and that those successful variants tend to have more self-attention at the bottom and more feedforward sublayers at the top. We propose a new transformer pattern that adheres to this property, the sandwich transformer, and show that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time. However, the sandwich reordering pattern does not guarantee performance gains across every task, as we demonstrate on machine translation models. Instead, we suggest that further exploration of task-specific sublayer reorderings is needed in order to unlock additional gains.

Social Bias Frames: Reasoning about Social and Power Implications of Language
Maarten Sap | Saadia Gabriel | Lianhui Qin | Dan Jurafsky | Noah A. Smith | Yejin Choi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Warning: this paper contains content that may be offensive or upsetting. Language has the power to reinforce stereotypes and project social biases onto others. At the core of the challenge is that it is rarely what is stated explicitly, but rather the implied meanings, that frame people’s judgments about others. For example, given a statement that “we shouldn’t lower our standards to hire more women,” most listeners will infer the implicature intended by the speaker - that “women (candidates) are less qualified.” Most semantic formalisms, to date, do not capture such pragmatic implications in which people express social biases and power differentials in language. We introduce Social Bias Frames, a new conceptual formalism that aims to model the pragmatic frames in which people project social biases and stereotypes onto others. In addition, we introduce the Social Bias Inference Corpus to support large-scale modelling and evaluation with 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (80% F1), they are not effective at spelling out more detailed explanations in terms of Social Bias Frames. Our study motivates future work that combines structured pragmatic inference with commonsense reasoning on social implications.

A Mixture of h - 1 Heads is Better than h Heads
Hao Peng | Roy Schwartz | Dianqi Li | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Multi-head attentive neural architectures have achieved state-of-the-art results on a variety of natural language processing tasks. Evidence has shown that they are overparameterized; attention heads can be pruned without significant performance loss. In this work, we instead “reallocate” them—the model learns to activate different heads on different inputs. Drawing connections between multi-head attention and mixture of experts, we propose the mixture of attentive experts model (MAE). MAE is trained using a block coordinate descent algorithm that alternates between updating (1) the responsibilities of the experts and (2) their parameters. Experiments on machine translation and language modeling show that MAE outperforms strong baselines on both tasks. Particularly, on the WMT14 English to German translation dataset, MAE improves over “transformer-base” by 0.8 BLEU, with a comparable number of parameters. Our analysis shows that our model learns to specialize different experts to different inputs.

The Right Tool for the Job: Matching Model and Instance Complexities
Roy Schwartz | Gabriel Stanovsky | Swabha Swayamdipta | Jesse Dodge | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) “exit” from neural network calculations for simple instances, and late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their calibrated confidence scores to make early exit decisions. We test our proposed modification on five different datasets in two tasks: three text classification datasets and two natural language inference benchmarks. Our method presents a favorable speed/accuracy tradeoff in almost all cases, producing models which are up to five times faster than the state of the art, while preserving their accuracy. Our method also requires almost no additional training resources (in either time or parameters) compared to the baseline BERT model. Finally, our method alleviates the need for costly retraining of multiple models at different levels of efficiency; we allow users to control the inference speed/accuracy tradeoff using a single trained model, by setting a single variable at inference time. We publicly release our code.

Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks
Suchin Gururangan | Ana Marasović | Swabha Swayamdipta | Kyle Lo | Iz Beltagy | Doug Downey | Noah A. Smith
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

Multilingual and Interlingual Semantic Representations for Natural Language Processing: A Brief Introduction
Marta R. Costa-jussà | Cristina España-Bonet | Pascale Fung | Noah A. Smith
Computational Linguistics, Volume 46, Issue 2 - June 2020

We introduce the Computational Linguistics special issue on Multilingual and Interlingual Semantic Representations for Natural Language Processing. We situate the special issue’s five articles in the context of our fast-changing field, explaining our motivation for this project. We offer a brief summary of the work in the issue, which includes developments on lexical and sentential semantic representations, from symbolic and neural perspectives.

Grounded Compositional Outputs for Adaptive Language Modeling
Nikolaos Pappas | Phoebe Mulcaire | Noah A. Smith
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Language models have emerged as a central component across NLP, and a great deal of progress depends on the ability to cheaply adapt them (e.g., through finetuning) to new domains and tasks. A language model’s vocabulary—typically selected before training and permanently fixed later—affects its size and is part of what makes it resistant to such adaptation. Prior work has used compositional input embeddings based on surface forms to ameliorate this issue. In this work, we go one step beyond and propose a fully compositional output embedding layer for language models, which is further grounded in information from a structured lexicon (WordNet), namely semantically related words and free-text definitions. To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary. We evaluate the model on conventional language modeling as well as challenging cross-domain settings with an open vocabulary, finding that it matches or outperforms previous state-of-the-art output embedding methods and adaptation approaches. Our analysis attributes the improvements to sample efficiency: our model is more accurate for low-frequency words.

The Multilingual Amazon Reviews Corpus
Phillip Keung | Yichao Lu | György Szarvas | Noah A. Smith
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present the Multilingual Amazon Reviews Corpus (MARC), a large-scale collection of Amazon reviews for multilingual text classification. The corpus contains reviews in English, Japanese, German, French, Spanish, and Chinese, which were collected between 2015 and 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID, and the coarse-grained product category (e.g., ‘books’, ‘appliances’, etc.) The corpus is balanced across the 5 possible star ratings, so each rating constitutes 20% of the reviews in each language. For each language, there are 200,000, 5,000, and 5,000 reviews in the training, development, and test sets, respectively. We report baseline results for supervised text classification and zero-shot cross-lingual transfer learning by fine-tuning a multilingual BERT model on reviews data. We propose the use of mean absolute error (MAE) instead of classification accuracy for this task, since MAE accounts for the ordinal nature of the ratings.

Multilevel Text Alignment with Cross-Document Attention
Xuhui Zhou | Nikolaos Pappas | Noah A. Smith
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Text alignment finds application in tasks such as citation recommendation and plagiarism detection. Existing alignment methods operate at a single, predefined level and cannot learn to align texts at, for example, sentence and document levels. We propose a new learning approach that equips previously established hierarchical attention encoders for representing documents with a cross-document attention component, enabling structural comparisons across different levels (document-to-document and sentence-to-document). Our component is weakly supervised from document pairs and can align at multiple levels. Our evaluation on predicting document-to-document relationships and sentence-to-document relationships on the tasks of citation recommendation and plagiarism detection shows that our approach outperforms previously established hierarchical, attention encoders based on recurrent and transformer contextualization that are unaware of structural correspondence between documents.

Writing Strategies for Science Communication: Data and Computational Analysis
Tal August | Lauren Kim | Katharina Reinecke | Noah A. Smith
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Communicating complex scientific ideas without misleading or overwhelming the public is challenging. While science communication guides exist, they rarely offer empirical evidence for how their strategies are used in practice. Writing strategies that can be automatically recognized could greatly support science communication efforts by enabling tools to detect and suggest strategies for writers. We compile a set of writing strategies drawn from a wide range of prescriptive sources and develop an annotation scheme allowing humans to recognize them. We collect a corpus of 128k science writing documents in English and annotate a subset of this corpus. We use the annotations to train transformer-based classifiers and measure the strategies’ use in the larger corpus. We find that the use of strategies, such as storytelling and emphasizing the most important findings, varies significantly across publications with different reader audiences.

Plug and Play Autoencoders for Conditional Text Generation
Florian Mai | Nikolaos Pappas | Ivan Montero | Noah A. Smith | James Henderson
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Text autoencoders are commonly used for conditional generation tasks such as style transfer. We propose methods which are plug and play, where any pretrained autoencoder can be used, and only require learning a mapping within the autoencoder’s embedding space, training embedding-to-embedding (Emb2Emb). This reduces the need for labeled training data for the task and makes the training procedure more efficient. Crucial to the success of this method is a loss term for keeping the mapped embedding on the manifold of the autoencoder and a mapping which is trained to navigate the manifold by learning offset vectors. Evaluations on style transfer tasks both with and without sequence-to-sequence supervision show that our method performs better than or comparable to strong baselines while being up to four times faster.

Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics
Swabha Swayamdipta | Roy Schwartz | Nicholas Lourie | Yizhong Wang | Hannaneh Hajishirzi | Noah A. Smith | Yejin Choi
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps—a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example—the model’s confidence in the true class, and the variability of this confidence across epochs—obtained in a single run of training. Experiments on four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of “ambiguous” regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are “easy to learn” for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds “hard to learn”; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture the abilities a dataset is intended to test. We propose a more rigorous annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model’s decision boundary, which can be used to more accurately evaluate a model’s true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, and IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets—up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.

Parsing with Multilingual BERT, a Small Corpus, and a Small Treebank
Ethan C. Chau | Lucy H. Lin | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2020

Pretrained multilingual contextual representations have shown great success, but due to the limits of their pretraining data, their benefits do not apply equally to all language varieties. This presents a challenge for language varieties unfamiliar to these models, whose labeled and unlabeled data is too limited to train a monolingual model effectively. We propose the use of additional language-specific pretraining and vocabulary augmentation to adapt multilingual models to low-resource settings. Using dependency parsing of four diverse low-resource language varieties as a case study, we show that these methods significantly improve performance over baselines, especially in the lowest-resource cases, and demonstrate the importance of the relationship between such models’ pretraining data and target language varieties.

Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs
Ana Marasović | Chandra Bhagavatula | Jae sung Park | Ronan Le Bras | Noah A. Smith | Yejin Choi
Findings of the Association for Computational Linguistics: EMNLP 2020

Natural language rationales could provide intuitive, higher-level explanations that are easily understandable by humans, complementing the more broadly studied lower-level explanations based on gradients or attention weights. We present the first study focused on generating natural language rationales across several complex visual reasoning tasks: visual commonsense reasoning, visual-textual entailment, and visual question answering. The key challenge of accurate rationalization is comprehensive image understanding at all levels: not just their explicit content at the pixel level, but their contextual contents at the semantic and pragmatic levels. We present RationaleˆVT Transformer, an integrated model that learns to generate free-text rationales by combining pretrained language models with object recognition, grounded visual semantic frames, and visual commonsense graphs. Our experiments show that free-text rationalization is a promising research direction to complement model interpretability for complex visual-textual reasoning tasks. In addition, we find that integration of richer semantic and pragmatic visual features improves visual fidelity of rationales.

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models
Samuel Gehman | Suchin Gururangan | Maarten Sap | Yejin Choi | Noah A. Smith
Findings of the Association for Computational Linguistics: EMNLP 2020

Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning “bad” words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et. al, 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.

Thinking Like a Skeptic: Defeasible Inference in Natural Language
Rachel Rudinger | Vered Shwartz | Jena D. Hwang | Chandra Bhagavatula | Maxwell Forbes | Ronan Le Bras | Noah A. Smith | Yejin Choi
Findings of the Association for Computational Linguistics: EMNLP 2020

Defeasible inference is a mode of reasoning in which an inference (X is a bird, therefore X flies) may be weakened or overturned in light of new evidence (X is a penguin). Though long recognized in classical AI and philosophy, defeasible inference has not been extensively studied in the context of contemporary data-driven research on natural language inference and commonsense reasoning. We introduce Defeasible NLI (abbreviated 𝛿-NLI), a dataset for defeasible inference in natural language. Defeasible NLI contains extensions to three existing inference datasets covering diverse modes of reasoning: common sense, natural language inference, and social norms. From Defeasible NLI, we develop both a classification and generation task for defeasible inference, and demonstrate that the generation task is much more challenging. Despite lagging human performance, however, generative models trained on this data are capable of writing sentences that weaken or strengthen a specified inference up to 68% of the time.

Exploring the Effect of Author and Reader Identity in Online Story Writing: the STORIESINTHEWILD Corpus.
Tal August | Maarten Sap | Elizabeth Clark | Katharina Reinecke | Noah A. Smith
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

Current story writing or story editing systems rely on human judgments of story quality for evaluating performance, often ignoring the subjectivity in ratings. We analyze the effect of author and reader characteristics and story writing setup on the quality of stories in a short storytelling task. To study this effect, we create and release STORIESINTHEWILD, containing 1,630 stories collected on a volunteer-based crowdsourcing platform. Each story is rated by three different readers, and comes paired with the author’s and reader’s age, gender, and personality. Our findings show significant effects of authors’ and readers’ identities, as well as writing setup, on story writing and ratings. Notably, compared to younger readers, readers age 45 and older consider stories significantly less creative and less entertaining. Readers also prefer stories written all at once, rather than in chunks, finding them more coherent and creative. We also observe linguistic differences associated with authors’ demographics (e.g., older authors wrote more vivid and emotional stories). Our findings suggest that reader and writer demographics, as well as writing setup, should be accounted for in story writing evaluations.

Unsupervised Bitext Mining and Translation via Self-Trained Contextual Embeddings
Phillip Keung | Julian Salazar | Yichao Lu | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 8

We describe an unsupervised method to create pseudo-parallel corpora for machine translation (MT) from unaligned text. We use multilingual BERT to create source and target sentence embeddings for nearest-neighbor search and adapt the model via self-training. We validate our technique by extracting parallel sentence pairs on the BUCC 2017 bitext mining task and observe up to a 24.5 point increase (absolute) in F1 scores over previous unsupervised methods. We then improve an XLM-based unsupervised neural MT system pre-trained on Wikipedia by supplementing it with pseudo-parallel text mined from the same corpus, boosting unsupervised translation performance by up to 3.5 BLEU on the WMT’14 French-English and WMT’16 German-English tasks and outperforming the previous state-of-the-art. Finally, we enrich the IWSLT’15 English-Vietnamese corpus with pseudo-parallel Wikipedia sentence pairs, yielding a 1.2 BLEU improvement on the low-resource MT task. We demonstrate that unsupervised bitext mining is an effective way of augmenting MT datasets and complements existing techniques like initializing with pre-trained contextual embeddings.

2019

Knowledge Enhanced Contextual Word Representations
Matthew E. Peters | Mark Neumann | Robert Logan | Roy Schwartz | Vidur Joshi | Sameer Singh | Noah A. Smith
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Contextual word representations, typically trained on unstructured, unlabeled text, do not contain any explicit grounding to real world entities and are often unable to remember facts about those entities. We propose a general method to embed multiple knowledge bases (KBs) into large scale models, and thereby enhance their representations with structured, human-curated knowledge. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of word-to-entity attention. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text. After integrating WordNet and a subset of Wikipedia into BERT, the knowledge enhanced BERT (KnowBert) demonstrates improved perplexity, ability to recall facts as measured in a probing task and downstream performance on relationship extraction, entity typing, and word sense disambiguation. KnowBert’s runtime is comparable to BERT’s and it scales to large KBs.

RNN Architecture Learning with Sparse Regularization
Jesse Dodge | Roy Schwartz | Hao Peng | Noah A. Smith
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Neural models for NLP typically use large numbers of parameters to reach state-of-the-art performance, which can lead to excessive memory usage and increased runtime. We present a structure learning method for learning sparse, parameter-efficient NLP models. Our method applies group lasso to rational RNNs (Peng et al., 2018), a family of models that is closely connected to weighted finite-state automata (WFSAs). We take advantage of rational RNNs’ natural grouping of the weights, so the group lasso penalty directly removes WFSA states, substantially reducing the number of parameters in the model. Our experiments on a number of sentiment analysis datasets, using both GloVe and BERT embeddings, show that our approach learns neural structures which have fewer parameters without sacrificing performance relative to parameter-rich baselines. Our method also highlights the interpretable properties of rational RNNs. We show that sparsifying such models makes them easier to visualize, and we present models that rely exclusively on as few as three WFSAs after pruning more than 90% of the weights. We publicly release our code.

Robust Navigation with Language Pretraining and Stochastic Sampling
Xiujun Li | Chunyuan Li | Qiaolin Xia | Yonatan Bisk | Asli Celikyilmaz | Jianfeng Gao | Noah A. Smith | Yejin Choi
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Core to the vision-and-language navigation (VLN) challenge is building robust instruction representations and action decoding schemes, which can generalize well to previously unseen instructions and environments. In this paper, we report two simple but highly effective methods to address these challenges and lead to a new state-of-the-art performance. First, we adapt large-scale pretrained language models to learn text representations that generalize better to previously unseen instructions. Second, we propose a stochastic sampling scheme to reduce the considerable gap between the expert actions in training and sampled actions in test, so that the agent can learn to correct its own mistakes during long sequential action decoding. Combining the two techniques, we achieve a new state of the art on the Room-to-Room benchmark with 6% absolute gain over the previous best result (47% -> 53%) on the Success Rate weighted by Path Length metric.

Show Your Work: Improved Reporting of Experimental Results
Jesse Dodge | Suchin Gururangan | Dallas Card | Roy Schwartz | Noah A. Smith
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Research in natural language processing proceeds, in part, by demonstrating that new models achieve superior performance (e.g., accuracy) on held-out test data, compared to previous results. In this paper, we demonstrate that test-set performance scores alone are insufficient for drawing accurate conclusions about which model performs best. We argue for reporting additional details, especially performance on validation data obtained during model development. We present a novel technique for doing so: expected validation performance of the best-found model as a function of computation budget (i.e., the number of hyperparameter search trials or the overall training time). Using our approach, we find multiple recent model comparisons where authors would have reached a different conclusion if they had used more (or less) computation. Our approach also allows us to estimate the amount of computation required to obtain a given accuracy; applying it to several recently published results yields massive variation across papers, from hours to weeks. We conclude with a set of best practices for reporting experimental results which allow for robust future comparisons, and provide code to allow researchers to use our technique.

PaLM: A Hybrid Parser and Language Model
Hao Peng | Roy Schwartz | Noah A. Smith
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We present PaLM, a hybrid parser and neural language model. Building on an RNN language model, PaLM adds an attention layer over text spans in the left context. An unsupervised constituency parser can be derived from its attention weights, using a greedy decoding algorithm. We evaluate PaLM on language modeling, and empirically show that it outperforms strong baselines. If syntactic annotations are available, the attention component can be trained in a supervised manner, providing syntactically-informed representations of the context, and further improving language modeling performance.

Topics to Avoid: Demoting Latent Confounds in Text Classification
Sachin Kumar | Shuly Wintner | Noah A. Smith | Yulia Tsvetkov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author’s native language is Swedish). We propose a method that represents the latent topical confounds and a model which “unlearns” confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning
Pradeep Dasigi | Nelson F. Liu | Ana Marasović | Noah A. Smith | Matt Gardner
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark—the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.

Low-Resource Parsing with Crosslingual Contextualized Representations
Phoebe Mulcaire | Jungo Kasai | Noah A. Smith
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Despite advances in dependency parsing, languages with small treebanks still present challenges. We assess recent approaches to multilingual contextual word representations (CWRs), and compare them for crosslingual transfer from a language with a large treebank to a language with a small or nonexistent treebank, by sharing parameters between languages in the parser itself. We experiment with a diverse selection of languages in both simulated and truly low-resource scenarios, and show that multilingual CWRs greatly facilitate low-resource dependency parsing even without crosslingual supervision such as dictionaries or parallel text. Furthermore, we examine the non-contextual part of the learned language models (which we call a “decontextual probe”) to demonstrate that polyglot language models better encode crosslingual lexical correspondence compared to aligned monolingual language models. This analysis provides further evidence that polyglot training is an effective approach to crosslingual transfer.

Linguistic Knowledge and Transferability of Contextual Representations
Nelson F. Liu | Matt Gardner | Yonatan Belinkov | Matthew E. Peters | Noah A. Smith
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Contextual word representations derived from large-scale neural language models are successful across a diverse set of NLP tasks, suggesting that they encode useful and transferable features of language. To shed light on the linguistic knowledge they capture, we study the representations produced by several recent pretrained contextualizers (variants of ELMo, the OpenAI transformer language model, and BERT) with a suite of sixteen diverse probing tasks. We find that linear models trained on top of frozen contextual representations are competitive with state-of-the-art task-specific models in many cases, but fail on tasks requiring fine-grained linguistic knowledge (e.g., conjunct identification). To investigate the transferability of contextual word representations, we quantify differences in the transferability of individual layers within contextualizers, especially between recurrent neural networks (RNNs) and transformers. For instance, higher layers of RNNs are more task-specific, while transformer layers do not exhibit the same monotonic trend. In addition, to better understand what makes contextual word representations transferable, we compare language model pretraining with eleven supervised pretraining tasks. For any given task, pretraining on a closely related task yields better performance than language model pretraining (which is better on average) when the pretraining dataset is fixed. However, language model pretraining on more data gives the best results.

Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets
Nelson F. Liu | Roy Schwartz | Noah A. Smith
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Several datasets have recently been constructed to expose brittleness in models trained on existing benchmarks. While model performance on these challenge datasets is significantly lower compared to the original benchmark, it is unclear what particular weaknesses they reveal. For example, a challenge dataset may be difficult because it targets phenomena that current models cannot capture, or because it simply exploits blind spots in a model’s specific training set. We introduce inoculation by fine-tuning, a new analysis method for studying challenge datasets by exposing models (the metaphorical patient) to a small amount of data from the challenge dataset (a metaphorical pathogen) and assessing how well they can adapt. We apply our method to analyze the NLI “stress tests” (Naik et al., 2018) and the Adversarial SQuAD dataset (Jia and Liang, 2017). We show that after slight exposure, some of these datasets are no longer challenging, while others remain difficult. Our results indicate that failures on challenge datasets may lead to very different conclusions about models, training datasets, and the challenge datasets themselves.

Polyglot Contextual Representations Improve Crosslingual Transfer
Phoebe Mulcaire | Jungo Kasai | Noah A. Smith
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce Rosita, a method to produce multilingual contextual word representations by training a single language model on text from multiple languages. Our method combines the advantages of contextual word representations with those of multilingual representation learning. We produce language models from dissimilar language pairs (English/Arabic and English/Chinese) and use them in dependency parsing, semantic role labeling, and named entity recognition, with comparisons to monolingual and non-contextual variants. Our results provide further evidence for the benefits of polyglot learning, in which representations are shared across multiple languages.

The Risk of Racial Bias in Hate Speech Detection
Maarten Sap | Dallas Card | Saadia Gabriel | Yejin Choi | Noah A. Smith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We investigate how annotators’ insensitivity to differences in dialect can lead to racial bias in automatic hate speech detection models, potentially amplifying harm against minority populations. We first uncover unexpected correlations between surface markers of African American English (AAE) and ratings of toxicity in several widely-used hate speech datasets. Then, we show that models trained on these corpora acquire and propagate these biases, such that AAE tweets and tweets by self-identified African Americans are up to two times more likely to be labelled as offensive compared to others. Finally, we propose *dialect* and *race priming* as ways to reduce the racial bias in annotation, showing that when annotators are made explicitly aware of an AAE tweet’s dialect they are significantly less likely to label the tweet as offensive.

Evaluating Gender Bias in Machine Translation
Gabriel Stanovsky | Noah A. Smith | Luke Zettlemoyer
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present the first challenge set and evaluation protocol for the analysis of gender bias in machine translation (MT). Our approach uses two recent coreference resolution datasets composed of English sentences which cast participants into non-stereotypical gender roles (e.g., “The doctor asked the nurse to help her in the operation”). We devise an automatic gender bias evaluation method for eight target languages with grammatical gender, based on morphological analysis (e.g., the use of female inflection for the word “doctor”). Our analyses show that four popular industrial MT systems and two recent state-of-the-art academic MT models are significantly prone to gender-biased translation errors for all tested target languages. Our data and code are publicly available at https://github.com/gabrielStanovsky/mt_gender.

Sentence Mover’s Similarity: Automatic Evaluation for Multi-Sentence Texts
Elizabeth Clark | Asli Celikyilmaz | Noah A. Smith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

For evaluating machine-generated texts, automatic methods hold the promise of avoiding collection of human judgments, which can be expensive and time-consuming. The most common automatic metrics, like BLEU and ROUGE, depend on exact word matching, an inflexible approach for measuring semantic similarity. We introduce methods based on sentence mover’s similarity; our automatic metrics evaluate text in a continuous space using word and sentence embeddings. We find that sentence-based metrics correlate with human judgments significantly better than ROUGE, both on machine-generated summaries (average length of 3.4 sentences) and human-authored essays (average length of 7.5). We also show that sentence mover’s similarity can be used as a reward when learning a generation model via reinforcement learning; we present both automatic and human evaluations of summaries learned in this way, finding that our approach outperforms ROUGE.

Is Attention Interpretable?
Sofia Serrano | Noah A. Smith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Attention mechanisms have recently boosted performance on a range of NLP tasks. Because attention layers explicitly weight input components’ representations, it is also often assumed that attention can be used to identify information that models found important (e.g., specific contextualized word tokens). We test whether that assumption holds by manipulating attention weights in already-trained text classification models and analyzing the resulting differences in their predictions. While we observe some ways in which higher attention weights correlate with greater impact on model predictions, we also find many ways in which this does not hold, i.e., where gradient-based rankings of attention weights better predict their effects than their magnitudes. We conclude that while attention noisily predicts input components’ overall importance to a model, it is by no means a fail-safe indicator.

Variational Pretraining for Semi-supervised Text Classification
Suchin Gururangan | Tam Dang | Dallas Card | Noah A. Smith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We introduce VAMPIRE, a lightweight pretraining framework for effective text classification when data and computing resources are limited. We pretrain a unigram document model as a variational autoencoder on in-domain, unlabeled data and use its internal states as features in a downstream classifier. Empirically, we show the relative strength of VAMPIRE against computationally expensive contextual embeddings and other popular semi-supervised baselines under low resource settings. We also find that fine-tuning to in-domain data is crucial to achieving decent performance from contextual embeddings when working with limited supervision. We accompany this paper with code to pretrain and use VAMPIRE embeddings in downstream tasks.

Measuring Online Debaters’ Persuasive Skill from Text over Time
Kelvin Luu | Chenhao Tan | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 7

Online debates allow people to express their persuasive abilities and provide exciting opportunities for understanding persuasion. Prior studies have focused on studying persuasion in debate content, but without accounting for each debater’s history or exploring the progression of a debater’s persuasive ability. We study debater skill by modeling how participants progress over time in a collection of debates from Debate.org. We build on a widely used model of skill in two-player games and augment it with linguistic features of a debater’s content. We show that online debaters’ skill levels do tend to improve over time. Incorporating linguistic profiles leads to more robust skill estimation than winning records alone. Notably, we find that an interaction feature combining uncertainty cues (hedging) with terms strongly associated with either side of a particular debate (fightin’ words) is more predictive than either feature on its own, indicating the importance of fine- grained linguistic features.

To Tune or Not to Tune? Adapting Pretrained Representations to Diverse Tasks
Matthew E. Peters | Sebastian Ruder | Noah A. Smith
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

While most previous work has focused on different pretraining objectives and architectures for transfer learning, we ask how to best adapt the pretrained model to a given target task. We focus on the two most common forms of adaptation, feature extraction (where the pretrained weights are frozen), and directly fine-tuning the pretrained model. Our empirical results across diverse NLP tasks with two state-of-the-art models show that the relative performance of fine-tuning vs. feature extraction depends on the similarity of the pretraining and target tasks. We explore possible explanations for this finding and provide a set of adaptation guidelines for the NLP practitioner.

2018

Neural Cross-Lingual Named Entity Recognition with Minimal Resources
Jiateng Xie | Zhilin Yang | Graham Neubig | Noah A. Smith | Jaime Carbonell
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

For languages with no annotated resources, unsupervised transfer of natural language processing models such as named-entity recognition (NER) from resource-rich languages would be an appealing capability. However, differences in words and word order across languages make it a challenging problem. To improve mapping of lexical items across languages, we propose a method that finds translations based on bilingual word embeddings. To improve robustness to word order differences, we propose to use self-attention, which allows for a degree of flexibility with respect to word order. We demonstrate that these methods achieve state-of-the-art or competitive NER performance on commonly tested languages under a cross-lingual setting, with much lower resource requirements than past approaches. We also evaluate the challenges of applying these methods to Uyghur, a low-resource language.

Rational Recurrences
Hao Peng | Roy Schwartz | Sam Thomson | Noah A. Smith
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Despite the tremendous empirical success of neural models in natural language processing, many of them lack the strong intuitions that accompany classical machine learning approaches. Recently, connections have been shown between convolutional neural networks (CNNs) and weighted finite state automata (WFSAs), leading to new interpretations and insights. In this work, we show that some recurrent neural networks also share this connection to WFSAs. We characterize this connection formally, defining rational recurrences to be recurrent hidden state update functions that can be written as the Forward calculation of a finite set of WFSAs. We show that several recent neural models use rational recurrences. Our analysis provides a fresh view of these models and facilitates devising new neural architectures that draw inspiration from WFSAs. We present one such model, which performs better than two recent baselines on language modeling and text classification. Our results demonstrate that transferring intuitions from classical models like WFSAs can be an effective approach to designing and understanding neural models.

Syntactic Scaffolds for Semantic Structures
Swabha Swayamdipta | Sam Thomson | Kenton Lee | Luke Zettlemoyer | Chris Dyer | Noah A. Smith
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce the syntactic scaffold, an approach to incorporating syntactic information into semantic tasks. Syntactic scaffolds avoid expensive syntactic processing at runtime, only making use of a treebank during training, through a multitask objective. We improve over strong baselines on PropBank semantics, frame semantics, and coreference resolution, achieving competitive performance on all three tasks.

Parsing Tweets into Universal Dependencies
Yijia Liu | Yi Zhu | Wanxiang Che | Bing Qin | Nathan Schneider | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We study the problem of analyzing tweets with universal dependencies (UD). We extend the UD guidelines to cover special constructions in tweets that affect tokenization, part-of-speech tagging, and labeled dependencies. Using the extended guidelines, we create a new tweet treebank for English (Tweebank v2) that is four times larger than the (unlabeled) Tweebank v1 introduced by Kong et al. (2014). We characterize the disagreements between our annotators and show that it is challenging to deliver consistent annotation due to ambiguity in understanding and explaining tweets. Nonetheless, using the new treebank, we build a pipeline system to parse raw tweets into UD. To overcome the annotation noise without sacrificing computational efficiency, we propose a new method to distill an ensemble of 20 transition-based parsers into a single one. Our parser achieves an improvement of 2.2 in LAS over the un-ensembled baseline and outperforms parsers that are state-of-the-art on other treebanks in both accuracy and speed.

Learning Joint Semantic Parsers from Disjoint Data
Hao Peng | Sam Thomson | Swabha Swayamdipta | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present a new approach to learning a semantic parser from multiple datasets, even when the target semantic formalisms are drastically different and the underlying corpora do not overlap. We handle such “disjoint” data by treating annotations for unobserved formalisms as latent structured variables. Building on state-of-the-art baselines, we show improvements both in frame-semantic parsing and semantic dependency parsing by modeling them jointly.

The Importance of Calibration for Estimating Proportions from Annotations
Dallas Card | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Estimating label proportions in a target corpus is a type of measurement that is useful for answering certain types of social-scientific questions. While past work has described a number of relevant approaches, nearly all are based on an assumption which we argue is invalid for many problems, particularly when dealing with human annotations. In this paper, we identify and differentiate between two relevant data generating scenarios (intrinsic vs. extrinsic labels), introduce a simple but novel method which emphasizes the importance of calibration, and then analyze and experimentally validate the appropriateness of various methods for each of the two scenarios.

Neural Text Generation in Stories Using Entity Representations as Context
Elizabeth Clark | Yangfeng Ji | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We introduce an approach to neural text generation that explicitly represents entities mentioned in the text. Entity representations are vectors that are updated as the text proceeds; they are designed specifically for narrative text like fiction or news stories. Our experiments demonstrate that modeling entities offers a benefit in two automatic evaluations: mention generation (in which a model chooses which entity to mention next and which words to use in the mention) and selection between a correct next sentence and a distractor from later in the same story. We also conduct a human evaluation on automatically generated text in story contexts; this study supports our emphasis on entities and suggests directions for further research.

Annotation Artifacts in Natural Language Inference Data
Suchin Gururangan | Swabha Swayamdipta | Omer Levy | Roy Schwartz | Samuel Bowman | Noah A. Smith
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Large-scale datasets for natural language inference are created by presenting crowd workers with a sentence (premise), and asking them to generate three new sentences (hypotheses) that it entails, contradicts, or is logically neutral with respect to. We show that, in a significant portion of such data, this protocol leaves clues that make it possible to identify the label by looking only at the hypothesis, without observing the premise. Specifically, we show that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI (Bowman et. al, 2015) and 53% of MultiNLI (Williams et. al, 2017). Our analysis reveals that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes. Our findings suggest that the success of natural language inference models to date has been overestimated, and that the task remains a hard open problem.

Sounding Board: A User-Centric and Content-Driven Social Chatbot
Hao Fang | Hao Cheng | Maarten Sap | Elizabeth Clark | Ari Holtzman | Yejin Choi | Noah A. Smith | Mari Ostendorf
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present Sounding Board, a social chatbot that won the 2017 Amazon Alexa Prize. The system architecture consists of several components including spoken language processing, dialogue management, language generation, and content management, with emphasis on user-centric and content-driven design. We also share insights gained from large-scale online logs based on 160,000 conversations with real-world users.

Bridging CNNs, RNNs, and Weighted Finite-State Machines
Roy Schwartz | Sam Thomson | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Recurrent and convolutional neural networks comprise two distinct families of models that have proven to be useful for encoding natural language utterances. In this paper we present SoPa, a new model that aims to bridge these two approaches. SoPa combines neural representation learning with weighted finite-state automata (WFSAs) to learn a soft version of traditional surface patterns. We show that SoPa is an extension of a one-layer CNN, and that such CNNs are equivalent to a restricted version of SoPa, and accordingly, to a restricted form of WFSA. Empirically, on three text classification tasks, SoPa is comparable or better than both a BiLSTM (RNN) baseline and a CNN baseline, and is particularly useful in small data settings.

Event2Mind: Commonsense Inference on Events, Intents, and Reactions
Hannah Rashkin | Maarten Sap | Emily Allaway | Noah A. Smith | Yejin Choi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate a new commonsense inference task: given an event described in a short free-form text (“X drinks coffee in the morning”), a system reasons about the likely intents (“X wants to stay awake”) and reactions (“X feels alert”) of the event’s participants. To support this study, we construct a new crowdsourced corpus of 25,000 event phrases covering a diverse range of everyday events and situations. We report baseline performance on this task, demonstrating that neural encoder-decoder models can successfully compose embedding representations of previously unseen events and reason about the likely intents and reactions of the event participants. In addition, we demonstrate how commonsense inference on people’s intents and reactions can help unveil the implicit gender inequality prevalent in modern movie scripts.

Backpropagating through Structured Argmax using a SPIGOT
Hao Peng | Sam Thomson | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce structured projection of intermediate gradients (SPIGOT), a new method for backpropagating through neural networks that include hard-decision structured predictions (e.g., parsing) in intermediate layers. SPIGOT requires no marginal inference, unlike structured attention networks and reinforcement learning-inspired solutions. Like so-called straight-through estimators, SPIGOT defines gradient-like quantities associated with intermediate nondifferentiable operations, allowing backpropagation before and after them; SPIGOT’s proxy aims to ensure that, after a parameter update, the intermediate structure will remain well-formed. We experiment on two structured NLP pipelines: syntactic-then-semantic dependency parsing, and semantic parsing followed by sentiment classification. We show that training with SPIGOT leads to a larger improvement on the downstream task than a modularly-trained pipeline, the straight-through estimator, and structured attention, reaching a new state of the art on semantic dependency parsing.

Neural Models for Documents with Metadata
Dallas Card | Chenhao Tan | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Most real-world document collections involve various types of metadata, such as author, source, and date, and yet the most commonly-used approaches to modeling text corpora ignore this information. While specialized models have been developed for particular applications, few are widely used in practice, as customization typically requires derivation of a custom inference algorithm. In this paper, we build on recent advances in variational inference methods and propose a general neural framework, based on topic models, to enable flexible incorporation of metadata and allow for rapid exploration of alternative models. Our approach achieves strong performance, with a manageable tradeoff between perplexity, coherence, and sparsity. Finally, we demonstrate the potential of our framework through an exploration of a corpus of articles about US immigration.

Polyglot Semantic Role Labeling
Phoebe Mulcaire | Swabha Swayamdipta | Noah A. Smith
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Previous approaches to multilingual semantic dependency parsing treat languages independently, without exploiting the similarities between semantic structures across languages. We experiment with a new approach where we combine resources from different languages in the CoNLL 2009 shared task to build a single polyglot semantic dependency parser. Notwithstanding the absence of parallel data, and the dissimilarity in annotations between languages, our approach results in improvement in parsing performance on several languages over a monolingual baseline. Analysis of the polyglot models’ performance provides a new understanding of the similarities and differences between languages in the shared task.

Discovering Phonesthemes with Sparse Regularization
Nelson F. Liu | Gina-Anne Levow | Noah A. Smith
Proceedings of the Second Workshop on Subword/Character LEvel Models

We introduce a simple method for extracting non-arbitrary form-meaning representations from a collection of semantic vectors. We treat the problem as one of feature selection for a model trained to predict word vectors from subword features. We apply this model to the problem of automatically discovering phonesthemes, which are submorphemic sound clusters that appear in words with similar meaning. Many of our model-predicted phonesthemes overlap with those proposed in the linguistics literature, and we validate our approach with human judgments.

LSTMs Exploit Linguistic Attributes of Data
Nelson F. Liu | Omer Levy | Roy Schwartz | Chenhao Tan | Noah A. Smith
Proceedings of the Third Workshop on Representation Learning for NLP

While recurrent neural networks have found success in a variety of natural language processing applications, they are general models of sequential data. We investigate how the properties of natural language data affect an LSTM’s ability to learn a nonlinguistic task: recalling elements from its input. We find that models trained on natural language data are able to recall tokens from much longer sequences than models trained on non-language sequential data. Furthermore, we show that the LSTM learns to solve the memorization task by explicitly using a subset of its neurons to count timesteps in the input. We hypothesize that the patterns and structure in natural language data enable LSTMs to learn by providing approximate ways of reducing loss, but understanding the effect of different training data on the learnability of LSTMs remains an open question.

2017

Dynamic Entity Representations in Neural Language Models
Yangfeng Ji | Chenhao Tan | Sebastian Martschat | Yejin Choi | Noah A. Smith
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Understanding a long document requires tracking how entities are introduced and evolve over time. We present a new type of language model, EntityNLM, that can explicitly model entities, dynamically update their representations, and contextually generate their mentions. Our model is generative and flexible; it can model an arbitrary number of entities in context while generating each entity mention at an arbitrary length. In addition, it can be used for several different tasks such as language modeling, coreference resolution, and entity prediction. Experimental results with all these tasks demonstrate that our model consistently outperforms strong baselines and prior work.

What Do Recurrent Neural Network Grammars Learn About Syntax?
Adhiguna Kuncoro | Miguel Ballesteros | Lingpeng Kong | Chris Dyer | Graham Neubig | Noah A. Smith
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Recurrent neural network grammars (RNNG) are a recently proposed probablistic generative modeling family for natural language. They show state-of-the-art language modeling and parsing performance. We investigate what information they learn, from a linguistic perspective, through various ablations to the model and the data, and by augmenting the model with an attention mechanism (GA-RNNG) to enable closer inspection. We find that explicit modeling of composition is crucial for achieving the best performance. Through the attention mechanism, we find that headedness plays a central role in phrasal representation (with the model’s latent attention largely agreeing with predictions made by hand-crafted head rules, albeit with some important differences). By training grammars without nonterminal labels, we find that phrasal representations depend minimally on nonterminals, providing support for the endocentricity hypothesis.

Greedy Transition-Based Dependency Parsing with Stack LSTMs
Miguel Ballesteros | Chris Dyer | Yoav Goldberg | Noah A. Smith
Computational Linguistics, Volume 43, Issue 2 - June 2017

We introduce a greedy transition-based parser that learns to represent parser states using recurrent neural networks. Our primary innovation that enables us to do this efficiently is a new control structure for sequential neural networks—the stack long short-term memory unit (LSTM). Like the conventional stack data structures used in transition-based parsers, elements can be pushed to or popped from the top of the stack in constant time, but, in addition, an LSTM maintains a continuous space embedding of the stack contents. Our model captures three facets of the parser’s state: (i) unbounded look-ahead into the buffer of incoming words, (ii) the complete history of transition actions taken by the parser, and (iii) the complete contents of the stack of partially built tree fragments, including their internal structures. In addition, we compare two different word representations: (i) standard word vectors based on look-up tables and (ii) character-based models of words. Although standard word embedding models work well in all languages, the character-based models improve the handling of out-of-vocabulary words, particularly in morphologically rich languages. Finally, we discuss the use of dynamic oracles in training the parser. During training, dynamic oracles alternate between sampling parser states from the training data and from the model as it is being learned, making the model more robust to the kinds of errors that will be made at test time. Training our model with dynamic oracles yields a linear-time greedy parser with very competitive performance.

The Effect of Different Writing Tasks on Linguistic Style: A Case Study of the ROC Story Cloze Task
Roy Schwartz | Maarten Sap | Ioannis Konstas | Leila Zilles | Yejin Choi | Noah A. Smith
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

A writer’s style depends not just on personal traits but also on her intent and mental state. In this paper, we show how variants of the same writing task can lead to measurable differences in writing style. We present a case study based on the story cloze task (Mostafazadeh et al., 2016a), where annotators were assigned similar writing tasks with different constraints: (1) writing an entire story, (2) adding a story ending for a given story context, and (3) adding an incoherent ending to a story. We show that a simple linear classifier informed by stylistic features is able to successfully distinguish among the three cases, without even looking at the story context. In addition, combining our stylistic features with language model predictions reaches state of the art performance on the story cloze challenge. Our results demonstrate that different task framings can dramatically affect the way people write.

Friendships, Rivalries, and Trysts: Characterizing Relations between Ideas in Texts
Chenhao Tan | Dallas Card | Noah A. Smith
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Understanding how ideas relate to each other is a fundamental question in many domains, ranging from intellectual history to public communication. Because ideas are naturally embedded in texts, we propose the first framework to systematically characterize the relations between ideas based on their occurrence in a corpus of documents, independent of how these ideas are represented. Combining two statistics—cooccurrence within documents and prevalence correlation over time—our approach reveals a number of different ways in which ideas can cooperate and compete. For instance, two ideas can closely track each other’s prevalence over time, and yet rarely cooccur, almost like a “cold war” scenario. We observe that pairwise cooccurrence and prevalence correlation exhibit different distributions. We further demonstrate that our approach is able to uncover intriguing relations between ideas through in-depth case studies on news articles and research papers.

Neural Discourse Structure for Text Categorization
Yangfeng Ji | Noah A. Smith
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We show that discourse structure, as defined by Rhetorical Structure Theory and provided by an existing discourse parser, benefits text categorization. Our approach uses a recursive neural network and a newly proposed attention mechanism to compute a representation of the text that focuses on salient content, from the perspective of both RST and the task. Experiments consider variants of the approach and illustrate its strengths and weaknesses.

Deep Multitask Learning for Semantic Dependency Parsing
Hao Peng | Sam Thomson | Noah A. Smith
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a deep neural architecture that parses sentences into three semantic dependency graph formalisms. By using efficient, nearly arc-factored inference and a bidirectional-LSTM composed with a multi-layer perceptron, our base system is able to significantly improve the state of the art for semantic dependency parsing, without using hand-engineered features or syntax. We then explore two multitask learning approaches—one that shares parameters across formalisms, and one that uses higher-order structures to predict the graphs jointly. We find that both approaches improve performance across formalisms on average, achieving a new state of the art. Our code is open-source and available at https://github.com/Noahs-ARK/NeurboParser.

Story Cloze Task: UW NLP System
Roy Schwartz | Maarten Sap | Ioannis Konstas | Leila Zilles | Yejin Choi | Noah A. Smith
Proceedings of the 2nd Workshop on Linking Models of Lexical, Sentential and Discourse-level Semantics

This paper describes University of Washington NLP’s submission for the Linking Models of Lexical, Sentential and Discourse-level Semantics (LSDSem 2017) shared task—the Story Cloze Task. Our system is a linear classifier with a variety of features, including both the scores of a neural language model and style features. We report 75.2% accuracy on the task. A further discussion of our results can be found in Schwartz et al. (2017).

2016

Semi-Supervised Learning of Sequence Models with Method of Moments
Zita Marinho | André F. T. Martins | Shay B. Cohen | Noah A. Smith
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Analyzing Framing through the Casts of Characters in the News
Dallas Card | Justin Gross | Amber Boydstun | Noah A. Smith
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Friends with Motives: Using Text to Infer Influence on SCOTUS
Yanchuan Sim | Bryan Routledge | Noah A. Smith
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Distilling an Ensemble of Greedy Dependency Parsers into One MST Parser
Adhiguna Kuncoro | Miguel Ballesteros | Lingpeng Kong | Chris Dyer | Noah A. Smith
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Character Sequence Models for Colorful Words
Kazuya Kawakami | Chris Dyer | Bryan Routledge | Noah A. Smith
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Training with Exploration Improves a Greedy Stack LSTM Parser
Miguel Ballesteros | Yoav Goldberg | Chris Dyer | Noah A. Smith
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Greedy, Joint Syntactic-Semantic Parsing with Stack LSTMs
Swabha Swayamdipta | Miguel Ballesteros | Chris Dyer | Noah A. Smith
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

Recurrent Neural Network Grammars
Chris Dyer | Adhiguna Kuncoro | Miguel Ballesteros | Noah A. Smith
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Generation from Abstract Meaning Representation using Tree Transducers
Jeffrey Flanigan | Chris Dyer | Noah A. Smith | Jaime Carbonell
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Katrin Erk | Noah A. Smith
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Katrin Erk | Noah A. Smith
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Many Languages, One Parser
Waleed Ammar | George Mulcaire | Miguel Ballesteros | Chris Dyer | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 4

We train one multilingual model for dependency parsing and use it to parse sentences in several languages. The parsing model uses (i) multilingual word clusters and embeddings; (ii) token-level language information; and (iii) language-specific features (fine-grained POS tags). This input representation enables the parser not only to parse effectively in multiple languages, but also to generalize across languages based on linguistic universals and typological similarities, making it more effective to learn from limited annotations. Our parser’s performance compares favorably to strong baselines in a range of data scenarios, including when the target language has a large treebank, a small treebank, or no treebank for training.

UW-CSE at SemEval-2016 Task 10: Detecting Multiword Expressions and Supersenses using Double-Chained Conditional Random Fields
Mohammad Javad Hosseini | Noah A. Smith | Su-In Lee
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

CMU at SemEval-2016 Task 8: Graph-based AMR Parsing with Infinite Ramp Loss
Jeffrey Flanigan | Chris Dyer | Noah A. Smith | Jaime Carbonell
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

A Neural Model for Language Identification in Code-Switched Tweets
Aaron Jaech | George Mulcaire | Mari Ostendorf | Noah A. Smith
Proceedings of the Second Workshop on Computational Approaches to Code Switching

Hierarchical Character-Word Models for Language Identification
Aaron Jaech | George Mulcaire | Shobhit Hathi | Mari Ostendorf | Noah A. Smith
Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media

2015

Open Extraction of Fine-Grained Political Statements
David Bamman | Noah A. Smith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Improved Transition-based Parsing by Modeling Characters instead of Words with LSTMs
Miguel Ballesteros | Chris Dyer | Noah A. Smith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

A Utility Model of Authors in the Scientific Community
Yanchuan Sim | Bryan Routledge | Noah A. Smith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Extractive Summarization by Maximizing Semantic Volume
Dani Yogatama | Fei Liu | Noah A. Smith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Bayesian Optimization of Text Representations
Dani Yogatama | Lingpeng Kong | Noah A. Smith
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

A Supertag-Context Model for Weakly-Supervised CCG Parser Learning
Dan Garrette | Chris Dyer | Jason Baldridge | Noah A. Smith
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

Transforming Dependencies into Phrase Structures
Lingpeng Kong | Alexander M. Rush | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Toward Abstractive Summarization Using Semantic Representations
Fei Liu | Jeffrey Flanigan | Sam Thomson | Norman Sadeh | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Corpus and Model Integrating Multiword Expressions and Supersenses
Nathan Schneider | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Retrofitting Word Vectors to Semantic Lexicons
Manaal Faruqui | Jesse Dodge | Sujay Kumar Jauhar | Chris Dyer | Eduard Hovy | Noah A. Smith
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Transition-Based Dependency Parsing with Stack Long Short-Term Memory
Chris Dyer | Miguel Ballesteros | Wang Ling | Austin Matthews | Noah A. Smith
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Sparse Overcomplete Word Vector Representations
Manaal Faruqui | Yulia Tsvetkov | Dani Yogatama | Chris Dyer | Noah A. Smith
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Frame-Semantic Role Labeling with Heterogeneous Annotations
Meghana Kshirsagar | Sam Thomson | Nathan Schneider | Jaime Carbonell | Noah A. Smith | Chris Dyer
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

The Media Frames Corpus: Annotations of Frames Across Issues
Dallas Card | Amber E. Boydstun | Justin H. Gross | Philip Resnik | Noah A. Smith
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

2014

A Step Towards Usable Privacy Policy: Automatic Alignment of Privacy Statements
Fei Liu | Rohan Ramanath | Norman Sadeh | Noah A. Smith
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

A Dependency Parser for Tweets
Lingpeng Kong | Nathan Schneider | Swabha Swayamdipta | Archna Bhatia | Chris Dyer | Noah A. Smith
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Frame-Semantic Parsing
Dipanjan Das | Desai Chen | André F. T. Martins | Nathan Schneider | Noah A. Smith
Computational Linguistics, Volume 40, Issue 1 - March 2014

Phrase Dependency Machine Translation with Quasi-Synchronous Tree-to-Tree Features
Kevin Gimpel | Noah A. Smith
Computational Linguistics, Volume 40, Issue 2 - June 2014

Comprehensive Annotation of Multiword Expressions in a Social Web Corpus
Nathan Schneider | Spencer Onuffer | Nora Kazour | Emily Danchik | Michael T. Mordowanec | Henrietta Conrad | Noah A. Smith
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Multiword expressions (MWEs) are quite frequent in languages such as English, but their diversity, the scarcity of individual MWE types, and contextual ambiguity have presented obstacles to corpus-based studies and NLP systems addressing them as a class. Here we advocate for a comprehensive annotation approach: proceeding sentence by sentence, our annotators manually group tokens into MWEs according to guidelines that cover a broad range of multiword phenomena. Under this scheme, we have fully annotated an English web corpus for multiword expressions, including those containing gaps.

A Bayesian Mixed Effects Model of Literary Character
David Bamman | Ted Underwood | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Linguistic Structured Sparsity in Text Categorization
Dani Yogatama | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

A Discriminative Graph-Based Parser for the Abstract Meaning Representation
Jeffrey Flanigan | Sam Thomson | Jaime Carbonell | Chris Dyer | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Unsupervised Alignment of Privacy Policies using Hidden Markov Models
Rohan Ramanath | Fei Liu | Norman Sadeh | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Distributed Representations of Geographically Situated Language
David Bamman | Chris Dyer | Noah A. Smith
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Simplified Dependency Annotations with GFL-Web
Michael T. Mordowanec | Nathan Schneider | Chris Dyer | Noah A. Smith
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

Dynamic Language Models for Streaming Text
Dani Yogatama | Chong Wang | Bryan R. Routledge | Noah A. Smith | Eric P. Xing
Transactions of the Association for Computational Linguistics, Volume 2

We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.

Discriminative Lexical Semantic Segmentation with Gaps: Running the MWE Gamut
Nathan Schneider | Emily Danchik | Chris Dyer | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 2

We present a novel representation, evaluation measure, and supervised models for the task of identifying the multiword expressions (MWEs) in a sentence, resulting in a lexical semantic segmentation. Our approach generalizes a standard chunking representation to encode MWEs containing gaps, thereby enabling efficient sequence tagging algorithms for feature-rich discriminative models. Experiments on a new dataset of English web text offer the first linguistically-driven evaluation of MWE identification with truly heterogeneous expression types. Our statistical sequence model greatly outperforms a lookup-based segmentation procedure, achieving nearly 60% F1 for MWE identification.

Unsupervised Discovery of Biographical Structure from Text
David Bamman | Noah A. Smith
Transactions of the Association for Computational Linguistics, Volume 2

We present a method for discovering abstract event classes in biographies, based on a probabilistic latent-variable model. Taking as input timestamped text, we exploit latent correlations among events to learn a set of event classes (such as Born, Graduates High School, and Becomes Citizen), along with the typical times in a person’s life when those events occur. In a quantitative evaluation at the task of predicting a person’s age for a given event, we find that our generative model outperforms a strong linear regression baseline, along with simpler variants of the model that ablate some features. The abstract event classes that we learn allow us to perform a large-scale analysis of 242,970 Wikipedia biographies. Though it is known that women are greatly underrepresented on Wikipedia—not only as editors (Wikipedia, 2011) but also as subjects of articles (Reagle and Rhue, 2011)—we find that there is a bias in their characterization as well, with biographies of women containing significantly more emphasis on events of marriage and divorce than biographies of men.

CMU: Arc-Factored, Discriminative Semantic Dependency Parsing
Sam Thomson | Brendan O’Connor | Jeffrey Flanigan | David Bamman | Jesse Dodge | Swabha Swayamdipta | Nathan Schneider | Chris Dyer | Noah A. Smith
Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014)

Weakly-Supervised Bayesian Learning of a CCG Supertagger
Dan Garrette | Chris Dyer | Jason Baldridge | Noah A. Smith
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science
Cristian Danescu-Niculescu-Mizil | Jacob Eisenstein | Kathleen McKeown | Noah A. Smith
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science

Overview of the 2014 NLP Unshared Task in PoliInformatics
Noah A. Smith | Claire Cardie | Anne Washington | John Wilkerson
Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science

2013

Proceedings of the Workshop on Twenty Years of Bitext
Chris Dyer | Noah A. Smith | Phil Blunsom
Proceedings of the Workshop on Twenty Years of Bitext

Discrete Log-Linear Autoencoders for Unsupervised Learning of Linguistic Structure
Waleed Ammar | Chris Dyer | Noah A. Smith
Proceedings of the Workshop on Twenty Years of Bitext

Measuring Ideological Proportions in Political Speeches
Yanchuan Sim | Brice D. L. Acree | Justin H. Gross | Noah A. Smith
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Translating into Morphologically Rich Languages with Synthetic Phrases
Victor Chahuneau | Eva Schlinger | Noah A. Smith | Chris Dyer
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Learning Topics and Positions from Debatepedia
Swapna Gottipati | Minghui Qiu | Yanchuan Sim | Jing Jiang | Noah A. Smith
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters
Olutobi Owoputi | Brendan O’Connor | Chris Dyer | Kevin Gimpel | Nathan Schneider | Noah A. Smith
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Simple, Fast, and Effective Reparameterization of IBM Model 2
Chris Dyer | Victor Chahuneau | Noah A. Smith
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Supersense Tagging for Arabic: the MT-in-the-Middle Attack
Nathan Schneider | Behrang Mohit | Chris Dyer | Kemal Oflazer | Noah A. Smith
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Knowledge-Rich Morphological Priors for Bayesian Language Models
Victor Chahuneau | Noah A. Smith | Chris Dyer
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Learning Latent Personas of Film Characters
David Bamman | Brendan O’Connor | Noah A. Smith
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Learning to Extract International Relations from Political Context
Brendan O’Connor | Brandon M. Stewart | Noah A. Smith
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers
André Martins | Miguel Almeida | Noah A. Smith
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

A Framework for (Under)specifying Dependency Syntax without Overloading Annotators
Nathan Schneider | Brendan O’Connor | Naomi Saphra | David Bamman | Manaal Faruqui | Noah A. Smith | Chris Dyer | Jason Baldridge
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

2012

Word Salad: Relating Food Prices and Descriptions
Victor Chahuneau | Kevin Gimpel | Bryan R. Routledge | Lily Scherlis | Noah A. Smith
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

Recall-Oriented Learning of Named Entities in Arabic Wikipedia
Behrang Mohit | Nathan Schneider | Rishav Bhowmick | Kemal Oflazer | Noah A. Smith
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Empirical Risk Minimization for Probabilistic Grammars: Sample Complexity and Hardness of Learning
Shay B. Cohen | Noah A. Smith
Computational Linguistics, Volume 38, Issue 3 - September 2012

Structured Ramp Loss Minimization for Machine Translation
Kevin Gimpel | Noah A. Smith
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Concavity and Initialization for Unsupervised Dependency Parsing
Kevin Gimpel | Noah A. Smith
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Graph-Based Lexicon Expansion with Sparsity-Inducing Penalties
Dipanjan Das | Noah A. Smith
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Textual Predictors of Bill Survival in Congressional Committees
Tae Yano | Noah A. Smith | John D. Wilkerson
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Structured Sparsity in Natural Language Processing: Models, Algorithms and Applications
André F. T. Martins | Mário A. T. Figueiredo | Noah A. Smith
Tutorial Abstracts at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

A Probabilistic Model for Canonicalizing Named Entity Mentions
Dani Yogatama | Yanchuan Sim | Noah A. Smith
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study
Nathan Schneider | Behrang Mohit | Kemal Oflazer | Noah A. Smith
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

An Exact Dual Decomposition Algorithm for Shallow Semantic Parsing with Constraints
Dipanjan Das | André F. T. Martins | Noah A. Smith
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

Discovering Factions in the Computational Linguistics Community
Yanchuan Sim | Noah A. Smith | David A. Smith
Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries

Transliteration by Sequence Labeling with Lattice Encodings and Reranking
Waleed Ammar | Chris Dyer | Noah Smith
Proceedings of the 4th Named Entity Workshop (NEWS) 2012

2011

Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance
Shay B. Cohen | Dipanjan Das | Noah A. Smith
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Dual Decomposition with Many Overlapping Components
André Martins | Noah Smith | Mário Figueiredo | Pedro Aguiar
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Quasi-Synchronous Phrase Dependency Grammars for Machine Translation
Kevin Gimpel | Noah A. Smith
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Predicting a Scientific Community’s Response to an Article
Dani Yogatama | Michael Heilman | Brendan O’Connor | Chris Dyer | Bryan R. Routledge | Noah A. Smith
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Structured Sparsity in Structured Prediction
André Martins | Noah Smith | Mário Figueiredo | Pedro Aguiar
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

Unsupervised Word Alignment with Arbitrary Features
Chris Dyer | Jonathan H. Clark | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Discovering Sociolinguistic Associations with Structured Sparsity
Jacob Eisenstein | Noah A. Smith | Eric P. Xing
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Semi-Supervised Frame-Semantic Parsing for Unknown Predicates
Dipanjan Das | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
Kevin Gimpel | Nathan Schneider | Brendan O’Connor | Dipanjan Das | Daniel Mills | Jacob Eisenstein | Michael Heilman | Dani Yogatama | Jeffrey Flanigan | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability
Jonathan H. Clark | Chris Dyer | Alon Lavie | Noah A. Smith
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Author Age Prediction from Text using Linear Regression
Dong Nguyen | Noah A. Smith | Carolyn P. Rosé
Proceedings of the 5th ACL-HLT Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

The CMU-ARK German-English Translation System
Chris Dyer | Kevin Gimpel | Jonathan H. Clark | Noah A. Smith
Proceedings of the Sixth Workshop on Statistical Machine Translation

Generative Models of Monolingual and Bilingual Gappy Patterns
Kevin Gimpel | Noah A. Smith
Proceedings of the Sixth Workshop on Statistical Machine Translation

Structured Databases of Named Entities from Bayesian Nonparametrics
Jacob Eisenstein | Tae Yano | William Cohen | Noah Smith | Eric Xing
Proceedings of the First workshop on Unsupervised Learning in NLP

Unsupervised Bilingual POS Tagging with Markov Random Fields
Desai Chen | Chris Dyer | Shay Cohen | Noah Smith
Proceedings of the First workshop on Unsupervised Learning in NLP

2010

Nonparametric Word Segmentation for Machine Translation
ThuyLinh Nguyen | Stephan Vogel | Noah A. Smith
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Turbo Parsers: Dependency Parsing by Approximate Variational Inference
André Martins | Noah Smith | Eric Xing | Pedro Aguiar | Mário Figueiredo
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

A Latent Variable Model for Geographic Lexical Variation
Jacob Eisenstein | Brendan O’Connor | Noah A. Smith | Eric P. Xing
Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing

Movie Reviews and Revenues: An Experiment in Text Regression
Mahesh Joshi | Dipanjan Das | Kevin Gimpel | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Variational Inference for Adaptor Grammars
Shay B. Cohen | David M. Blei | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Good Question! Statistical Ranking for Question Generation
Michael Heilman | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Softmax-Margin CRFs: Training Log-Linear Models with Cost Functions
Kevin Gimpel | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Probabilistic Frame-Semantic Parsing
Dipanjan Das | Nathan Schneider | Desai Chen | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Tree Edit Models for Recognizing Textual Entailments, Paraphrases, and Answers to Questions
Michael Heilman | Noah A. Smith
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Viterbi Training for PCFGs: Hardness Results and Competitiveness of Uniform Initialization
Shay Cohen | Noah A. Smith
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

SEMAFOR: Frame Argument Resolution with Log-Linear Models
Desai Chen | Nathan Schneider | Dipanjan Das | Noah A. Smith
Proceedings of the 5th International Workshop on Semantic Evaluation

Rating Computer-Generated Questions with Mechanical Turk
Michael Heilman | Noah A. Smith
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

Shedding (a Thousand Points of) Light on Biased Language
Tae Yano | Philip Resnik | Noah A. Smith
Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon’s Mechanical Turk

Distributed Asynchronous Online Learning for Natural Language Processing
Kevin Gimpel | Dipanjan Das | Noah A. Smith
Proceedings of the Fourteenth Conference on Computational Natural Language Learning

2009

Feature-Rich Translation by Quasi-Synchronous Lattice Parsing
Kevin Gimpel | Noah A. Smith
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Cube Summing, Approximate Inference with Non-Local Features, and Dynamic Programming without Semirings
Kevin Gimpel | Noah A. Smith
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction
Shay Cohen | Noah A. Smith
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Preference Grammars: Softening Syntactic Constraints to Improve Statistical Machine Translation
Ashish Venugopal | Andreas Zollmann | Noah A. Smith | Stephan Vogel
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Predicting Risk from Financial Reports with Regression
Shimon Kogan | Dimitry Levin | Bryan R. Routledge | Jacob S. Sagi | Noah A. Smith
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Predicting Response to Political Blog Posts with Topic Models
Tae Yano | William W. Cohen | Noah A. Smith
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Concise Integer Linear Programming Formulations for Dependency Parsing
André Martins | Noah Smith | Eric Xing
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Paraphrase Identification as Probabilistic Quasi-Synchronous Recognition
Dipanjan Das | Noah A. Smith
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

Variational Inference for Grammar Induction with Prior Knowledge
Shay Cohen | Noah A. Smith
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Leveraging Structural Relations for Fluent Compressions at Multiple Compression Rates
Sourish Chaudhuri | Naman K. Gupta | Noah A. Smith | Carolyn P. Rosé
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Summarization with a Joint Model for Sentence Extraction and Compression
André Martins | Noah A. Smith
Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing

2008

Wider Pipelines: N-Best Alignments and Parses in MT Training
Ashish Venugopal | Andreas Zollmann | Noah A. Smith | Stephan Vogel
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

State-of-the-art statistical machine translation systems use hypotheses from several maximum a posteriori inference steps, including word alignments and parse trees, to identify translational structure and estimate the parameters of translation models. While this approach leads to a modular pipeline of independently developed components, errors made in these “single-best” hypotheses can propagate to downstream estimation steps that treat these inputs as clean, trustworthy training data. In this work we integrate N-best alignments and parses by using a probability distribution over these alternatives to generate posterior fractional counts for use in downstream estimation. Using these fractional counts in a DOP-inspired syntax-based translation system, we show significant improvements in translation quality over a single-best trained baseline.

Stacking Dependency Parsers
André F. T. Martins | Dipanjan Das | Noah A. Smith | Eric P. Xing
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

Book Reviews: Computational Approaches to Morphology and Syntax by Brian Roark and Richard Sproat
Noah A. Smith
Computational Linguistics, Volume 34, Number 3, September 2008

Competitive Grammar Writing
Jason Eisner | Noah A. Smith
Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics

Rich Source-Side Context for Statistical Machine Translation
Kevin Gimpel | Noah A. Smith
Proceedings of the Third Workshop on Statistical Machine Translation

2007

What is the Jeopardy Model? A Quasi-Synchronous Grammar for QA
Mengqiu Wang | Noah A. Smith | Teruko Mitamura
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Probabilistic Models of Nonprojective Dependency Trees
David A. Smith | Noah A. Smith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Joint Morphological and Syntactic Disambiguation
Shay B. Cohen | Noah A. Smith
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

Weighted and Probabilistic Context-Free Grammars Are Equally Expressive
Noah A. Smith | Mark Johnson
Computational Linguistics, Volume 33, Number 4, December 2007

Computationally Efficient M-Estimation of Log-Linear Structure Models
Noah A. Smith | Douglas L. Vail | John D. Lafferty
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

2006

Annealing Structural Bias in Multilingual Weighted Grammar Induction
Noah A. Smith | Jason Eisner
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Vine Parsing and Minimum Risk Reranking for Speed and Precision
Markus Dreyer | David A. Smith | Noah A. Smith
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

2005

Compiling Comp Ling: Weighted Dynamic Programming and the Dyna Language
Jason Eisner | Eric Goldlust | Noah A. Smith
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Context-Based Morphological Disambiguation with Random Fields
Noah A. Smith | David A. Smith | Roy W. Tromble
Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing

Contrastive Estimation: Training Log-Linear Models on Unlabeled Data
Noah A. Smith | Jason Eisner
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

Parsing with Soft and Hard Constraints on Dependency Length
Jason Eisner | Noah A. Smith
Proceedings of the Ninth International Workshop on Parsing Technology

2004

Annealing Techniques For Unsupervised Statistical Language Learning
Noah A. Smith | Jason Eisner
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)

Dyna: A Language for Weighted Dynamic Programming
Jason Eisner | Eric Goldlust | Noah A. Smith
Proceedings of the ACL Interactive Poster and Demonstration Sessions

Bilingual Parsing with Factored Estimation: Using English to Parse Korean
David A. Smith | Noah A. Smith
Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing

2003

The Web as a Parallel Corpus
Philip Resnik | Noah A. Smith
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus

2002

From Words to Corpora: Recognizing Translation
Noah A. Smith
Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing (EMNLP 2002)

2000

Cairo: An Alignment Visualization Tool
Noah A. Smith | Michael E. Jahr
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)

Co-authors

Kevin Gimpel 15

Swabha Swayamdipta 15

Dipanjan Das 12

Suchin Gururangan 12

André F. T. Martins 11

Luke Zettlemoyer 11

Dani Yogatama 10

Miguel Ballesteros 9

Shay B. Cohen 9

William Merrill 8

Brendan O’Connor 8

Yulia Tsvetkov 8

Pradeep Dasigi 7

Lingpeng Kong 7

Ronan Le Bras 7

Nikolaos Pappas 7

Bryan R. Routledge 7

Elizabeth Clark 6

Jeffrey Flanigan 6

Phillip Keung 6

Nelson F. Liu 6

Jacob Morrison 6

Mari Ostendorf 6

Matthew E. Peters 6

Jaime G. Carbonell 5

Jacob Eisenstein 5

Yoav Goldberg 5

Michael Heilman 5

Daniel Khashabi 5

Ana Marasović 5

Phoebe Mulcaire 5

David A. Smith 5

Orevaoghene Ahia 4

Victor Chahuneau 4

Khyathi Chandu 4

Keisuke Sakaguchi 4

Vidhisha Balachandran 3

Jason Baldridge 3

Chandra Bhagavatula 3

Terra Blevins 3

Jonathan H. Clark 3

Manaal Faruqui 3

Mário Figueiredo 3

Dirk Groeneveld 3

Justin H. Gross 3

Nikita Haduong 3

Adhiguna Kuncoro 3

Nathan Lambert 3

Ian Magnusson 3

Behrang Mohit 3

George Mulcaire 3

Kemal Oflazer 3

Valentina Pyatkin 3

Dragomir Radev 3

Katharina Reinecke 3

Philip Resnik 3

Sofia Serrano 3

Luca Soldaini 3

Gabriel Stanovsky 3

Stephan Vogel 3

David Atkinson 2

Russell Authur 2

Akshita Bhagia 2

Amber Boydstun 2

Asli Celikyilmaz 2

Ethan C. Chau 2

William Cohen 2

Emily Danchik 2

Jennifer Dumas 2

Lavinia Dunagan 2

Maxwell Forbes 2

Saadia Gabriel 2

Eric Goldlust 2

Gabriel Ilharco 2

Hamish Ivison 2

Rodney Kinney 2

Rik Koncel-Kedziorski 2

Ioannis Konstas 2

Nicholas Lourie 2

Michael T. Mordowanec 2

Niklas Muennighoff 2

Aakanksha Naik 2

Graham Neubig 2

Chan Young Park 2

Rohan Ramanath 2

Abhilasha Ravichander 2

Kyle Richardson 2

Sebastian Ruder 2

Dustin Schwenk 2

Vivek Srikumar 2

Emma Strubell 2

Nishant Subramani 2

Oyvind Tafjord 2

Ashish Venugopal 2

Sarah Wiegreffe 2

John Wilkerson 2

Andreas Zollmann 2

Diana Abagyan 1

Brice D. L. Acree 1

David Ifeoluwa Adelani 1

David Albright 1

Emily Allaway 1

Miguel B. Almeida 1

Prithviraj Ammanabrolu 1

Taira Anderson 1

Anuoluwapo Aremu 1

Victoria Basmov 1

Nicholas Scott Batchelder 1

Yonatan Belinkov 1

Jonathan Berant 1

Archna Bhatia 1

Rishav Bhowmick 1

Byron Bischoff 1

Taylor Blanton 1

Jon Borchardt 1

Samuel Bowman 1

Jonathan Bragg 1

Faeze Brahman 1

Zander Brumbaugh 1

Isabel Cachola 1

Claire Cardie 1

Sourish Chaudhuri 1

Yen-Sung Chen 1

Evie Yu-Yen Cheng 1

Arnavi Chheda-Kothary 1

Henrietta Conrad 1

Marta R. Costa-jussà 1

Cristian Danescu-Niculescu-Mizil 1

Markus Dreyer 1

Daniel Edmiston 1

Cristina España-Bonet 1

Allyson Ettinger 1

Mário A. T. Figueiredo 1

Samuel Gehman 1

Shyamnath Gollakota 1

Swapna Gottipati 1

Ananth Gottumukkala 1

Naman K. Gupta 1

Michael Hassid 1

Shobhit Hathi 1

Shirley Anugrah Hayati 1

James Henderson 1

Valentin Hofmann 1

Mohammad Javad Hosseini 1

Alexander Miserlis Hoyle 1

Jena D. Hwang 1

Michael E. Jahr 1

Sujay Kumar Jauhar 1

David Jurgens 1

Dongyeop Kang 1

Kazuya Kawakami 1

Alexander Koller 1

Yeganeh Kordi 1

Meghana Kshirsagar 1

John Lafferty 1

Sophie Lebrecht 1

Chia-Hsuan Lee 1

Dimitry Levin 1

Gina-Anne Levow 1

Tomasz Limisiewicz 1

Bill Yuchen Lin 1

Jiangming Liu 1

Karishma Mandyam 1

Sebastian Martschat 1

Austin Matthews 1

Kathleen McKeown 1

Julian Michael 1

Daniel P. Mills 1

Lester James Validad Miranda 1

Swaroop Mishra 1

Teruko Mitamura 1

David R. Mortensen 1

ThuyLinh Nguyen 1

Spencer Onuffer 1

Olutobi Owoputi 1

Rock Yuren Pang 1

Madhur Panwar 1

Bhargavi Paranjape 1

Jae Sung Park 1

James Pennebaker 1

Bing Qin (秦兵) 1

Vivek Ramanujan 1

Hannah Rashkin 1

Alexander Richard Fabbri 1

Rachel Rudinger 1

Alexander M. Rush 1

Ashish Sabharwal 1

Jacob S. Sagi 1

Julian Salazar 1

Lily Scherlis 1

Eva Schlinger 1

Ludwig Schmidt 1

Michael Schmitz 1

Carissa Schoenick 1

Torsten Scholak 1

Vered Shwartz 1

William Smith 1

Brandon M. Stewart 1

Sanjay Subramanian 1

György Szarvas 1

Cassidy Trier 1

Reut Tsarfaty 1

Ted Underwood 1

Douglas L. Vail 1

Dianzhuo Wang 1

Anne Washington 1

Daniel S. Weld 1

Shuly Wintner 1

Mitchell Wortsman 1

Chen Henry Wu 1

Chien-Sheng Wu 1

Caiming Xiong 1

Michihiro Yasunaga 1

Pengcheng Yin 1

Rowan Zellers 1

Venues

ws12