Christopher D. Manning

Also published as: Christopher Manning, Chris Manning, Christopher D. Manning, Christopher D Manning

2024

Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
Róbert Csordás | Christopher Potts | Christopher D Manning | Atticus Geiger
Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP

The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.

MSCAW-coref: Multilingual, Singleton and Conjunction-Aware Word-Level Coreference Resolution
Houjun Liu | John Bauer | Karel D’Oosterlinck | Christopher Potts | Christopher D. Manning
Proceedings of the Seventh Workshop on Computational Models of Reference, Anaphora and Coreference

Statistical Uncertainty in Word Embeddings: GloVe-V
Andrea Vallebueno | Cassandra Handan-Nader | Christopher D Manning | Daniel E. Ho
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Static word embeddings are ubiquitous in computational social science applications and contribute to practical decision-making in a variety of fields including law and healthcare. However, assessing the statistical uncertainty in downstream conclusions drawn from word embedding statistics has remained challenging. When using only point estimates for embeddings, researchers have no streamlined way of assessing the degree to which their model selection criteria or scientific conclusions are subject to noise due to sparsity in the underlying data used to generate the embeddings. We introduce a method to obtain approximate, easy-to-use, and scalable reconstruction error variance estimates for GloVe, one of the most widely used word embedding models, using an analytical approximation to a multivariate normal model. To demonstrate the value of embeddings with variance (GloVe-V), we illustrate how our approach enables principled hypothesis testing in core word embedding tasks, such as comparing the similarity between different word pairs in vector space, assessing the performance of different models, and analyzing the relative degree of ethnic or gender bias in a corpus using different word lists.

Predicting Narratives of Climate Obstruction in Social Media Advertising
Harri Rowlands | Gaku Morio | Dylan Tanner | Christopher Manning
Findings of the Association for Computational Linguistics: ACL 2024

Social media advertising offers a platform for fossil fuel value chain companies and their agents to reinforce their narratives, often emphasizing economic, labor market, and energy security benefits to promote oil and gas policy and products. Our research questions are whether such narratives can be detected automatically and to what extent the cost of human annotation can be reduced. We introduce a task of classifying narratives into seven categories, based on existing definitions and data. Experiments showed that RoBERTa-large outperforms other methods, while GPT-4 Turbo can serve as a viable annotator for the task, thereby reducing human annotation costs. Our findings and insights provide guidance to automate climate-related ad analysis and lead to more scalable ad scrutiny.

pyvene: A Library for Understanding and Improving PyTorch Models via Interventions
Zhengxuan Wu | Atticus Geiger | Aryaman Arora | Jing Huang | Zheng Wang | Noah Goodman | Christopher Manning | Christopher Potts
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)

Interventions on model-internal states are fundamental operations in many areas of AI, including model editing, steering, robustness, and interpretability. To facilitate such research, we introduce pyvene, an open-source Python library that supports customizable interventions on a range of different PyTorch modules. pyvene supports complex intervention schemes with an intuitive configuration format, and its interventions can be static or include trainable parameters. We show how pyvene provides a unified and extensible framework for performing interventions on neural models and sharing the intervened-upon models with others. We illustrate the power of the library via interpretability analyses using causal abstraction and knowledge localization. We publish our library through the Python Package Index (PyPI) and provide code, documentation, and tutorials at https://github.com/stanfordnlp/pyvene.

2023

Backpack Language Models
John Hewitt | John Thickstun | Christopher Manning | Percy Liang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present Backpacks: a new neural architecture that marries strong modeling performance with an interface for interpretability and control. Backpacks learn multiple non-contextual sense vectors for each word in a vocabulary, and represent a word in a sequence as a context-dependent, non-negative linear combination of sense vectors in this sequence. We find that, after training, sense vectors specialize, each encoding a different aspect of a word. We can interpret a sense vector by inspecting its (non-contextual, linear) projection onto the output space, and intervene on these interpretable hooks to change the model’s behavior in predictable ways. We train a 170M-parameter Backpack language model on OpenWebText, matching the loss of a GPT-2 small (124M-parameter) Transformer. On lexical similarity evaluations, we find that Backpack sense vectors outperform even a 6B-parameter Transformer LM’s word embeddings. Finally, we present simple algorithms that intervene on sense vectors to perform controllable text generation and debiasing. For example, we can edit the sense vocabulary to tend more towards a topic, or localize a source of gender bias to a sense vector and globally suppress that sense.
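
The core representation in this abstract, a non-negative linear combination of sense vectors, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `backpack_output`, the nested-list layout, and the idea of passing the contextual weights in as ready-made numbers are all hypothetical, since the abstract does not say how the weights are computed.

```python
def backpack_output(sense_vectors, weights, position):
    """Combine sense vectors into one representation for `position`.

    sense_vectors[j][l] is the l-th sense vector (a list of floats) of the
    word at sequence position j. weights[i][j][l] is the non-negative,
    context-dependent weight that position i places on that sense vector.
    Both layouts are assumptions made for this sketch.
    """
    dim = len(sense_vectors[0][0])
    out = [0.0] * dim
    for j, senses in enumerate(sense_vectors):
        for l, vec in enumerate(senses):
            w = weights[position][j][l]
            if w < 0:
                raise ValueError("weights must be non-negative")
            # Accumulate the weighted sense vector, dimension by dimension.
            for d in range(dim):
                out[d] += w * vec[d]
    return out


# Two positions, two sense vectors each, two dimensions (toy numbers).
senses = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0], [0.0, -1.0]]]
weights = [[[1.0, 0.0], [0.0, 0.0]], [[0.5, 0.0], [0.25, 1.0]]]
print(backpack_output(senses, weights, position=1))
```

Because the output is linear in the sense vectors, scaling or zeroing one sense vector changes the output in a predictable way, which is the property the abstract relies on for interventions.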

Grokking of Hierarchical Structure in Vanilla Transformers
Shikhar Murty | Pratyusha Sharma | Jacob Andreas | Christopher Manning
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

For humans, language production and comprehension is sensitive to the hierarchical structure of sentences. In natural language processing, past work has questioned how effectively neural sequence models like transformers capture this hierarchical structure when generalizing to structurally novel inputs. We show that transformer language models can learn to generalize hierarchically after training for extremely long periods—far beyond the point when in-domain accuracy has saturated. We call this phenomenon structural grokking. On multiple datasets, structural grokking exhibits inverted U-shaped scaling in model depth: intermediate-depth models generalize better than both very deep and very shallow transformers. When analyzing the relationship between model-internal properties and grokking, we find that optimal depth for grokking can be identified using the tree-structuredness metric of CITATION. Overall, our work provides strong evidence that, with extended training, vanilla transformers discover and use hierarchical structure.

Multilingual self-supervised speech representations improve the speech recognition of low-resource African languages with codeswitching
Tolulope Ogunremi | Christopher Manning | Dan Jurafsky
Proceedings of the 6th Workshop on Computational Approaches to Linguistic Code-Switching

While many speakers of low-resource languages regularly code-switch between their languages and other regional languages or English, datasets of code-switched speech are too small to train bespoke acoustic models from scratch or do language model rescoring. Here we propose finetuning self-supervised speech representations such as wav2vec 2.0 XLSR to recognize code-switched data. We find that finetuning self-supervised multilingual representations and augmenting them with n-gram language models trained from transcripts reduces absolute word error rates by up to 20% compared to baselines of hybrid models trained from scratch on code-switched data. Our findings suggest that in circumstances with limited training data, finetuning self-supervised representations is a better performing and viable solution.

Pushdown Layers: Encoding Recursive Structure in Transformer Language Models
Shikhar Murty | Pratyusha Sharma | Jacob Andreas | Christopher Manning
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recursion is a prominent feature of human language, and fundamentally challenging for self-attention due to the lack of an explicit recursive-state tracking mechanism. Consequently, Transformer language models poorly capture long-tail recursive structure and exhibit sample-inefficient syntactic generalization. This work introduces Pushdown Layers, a new self-attention layer that models recursive state via a stack tape that tracks estimated depths of every token in an incremental parse of the observed prefix. Transformer LMs with Pushdown Layers are syntactic language models that autoregressively and synchronously update this stack tape as they predict new tokens, in turn using the stack tape to softly modulate attention over tokens—for instance, learning to “skip” over closed constituents. When trained on a corpus of strings annotated with silver constituency parses, Transformers equipped with Pushdown Layers achieve dramatically better and 3-5x more sample-efficient syntactic generalization, while maintaining similar perplexities. Pushdown Layers are a drop-in replacement for standard self-attention. We illustrate this by finetuning GPT2-medium with Pushdown Layers on an automatically parsed WikiText-103, leading to improvements on several GLUE text classification tasks.

Meta-Learning Online Adaptation of Language Models
Nathan Hu | Eric Mitchell | Christopher Manning | Chelsea Finn
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models encode impressively broad world knowledge in their parameters. However, the knowledge in static language models falls out of date, limiting the model’s effective “shelf life.” While online fine-tuning can reduce this degradation, we find that naively fine-tuning on a stream of documents leads to a low level of information uptake. We hypothesize that online fine-tuning does not sufficiently attend to important information. That is, the gradient signal from important tokens representing factual information is drowned out by the gradient from inherently noisy tokens, suggesting that a dynamic, context-aware learning rate may be beneficial. We therefore propose learning which tokens to upweight. We meta-train a small, autoregressive model to reweight the language modeling loss for each token during online fine-tuning, with the objective of maximizing the out-of-date base question-answering model’s ability to answer questions about a document after a single weighted gradient step. We call this approach Context-aware Meta-learned Loss Scaling (CaMeLS). Across three different distributions of documents, our experiments find that CaMeLS provides substantially improved information uptake on streams of thousands of documents compared with standard fine-tuning and baseline heuristics for reweighting token losses.

Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
Katherine Tian | Eric Mitchell | Allan Zhou | Archit Sharma | Rafael Rafailov | Huaxiu Yao | Chelsea Finn | Christopher Manning
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

A trustworthy real-world prediction system should produce well-calibrated confidence scores; that is, its confidence in an answer should be indicative of the likelihood that the answer is correct, enabling deferral to an expert in cases of low-confidence predictions. Recent studies have shown that unsupervised pre-training produces large language models (LMs) whose conditional probabilities are remarkably well-calibrated. However, the most widely-used LMs are fine-tuned with reinforcement learning from human feedback (RLHF-LMs), and some studies have suggested that RLHF-LMs produce conditional probabilities that are very poorly calibrated. In light of this perceived weakness, we conduct a broad evaluation of methods for extracting confidence scores from RLHF-LMs. For RLHF-LMs such as ChatGPT, GPT-4, and Claude, we find that verbalized confidences emitted as output tokens are typically better-calibrated than the model’s conditional probabilities on the TriviaQA, SciQ, and TruthfulQA benchmarks, often reducing the expected calibration error by a relative 50%.
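
The expected calibration error mentioned in this abstract is commonly computed by binning predictions by confidence and averaging the gap between accuracy and mean confidence in each bin. The sketch below shows that standard binned form. The function name, the choice of 10 equal-width bins, and the toy inputs are assumptions for illustration, not details taken from the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans, True where the prediction was right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(accuracy - avg_conf)
    return ece


# Four answers stated at 90% confidence, of which three are right:
# the single occupied bin has accuracy 0.75 against confidence 0.9.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

The same function applies whether the confidence comes from a model's conditional probabilities or from a verbalized number parsed out of its output, which is what makes the two approaches directly comparable.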

MQuAKE: Assessing Knowledge Editing in Language Models via Multi-Hop Questions
Zexuan Zhong | Zhengxuan Wu | Christopher Manning | Christopher Potts | Danqi Chen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The information stored in large language models (LLMs) falls out of date quickly, and retraining from scratch is often not an option. This has recently given rise to a range of techniques for injecting new facts through updating model weights. Current evaluation paradigms are extremely limited, mainly validating the recall of edited facts, but changing one fact should cause rippling changes to the model’s related beliefs. If we edit the UK Prime Minister to now be Rishi Sunak, then we should get a different answer to “Who is married to the British Prime Minister?” In this work, we present a benchmark MQuAKE (Multi-hop Question Answering for Knowledge Editing) comprising multi-hop questions that assess whether edited models correctly answer questions where the answer should change as an entailed consequence of edited facts. While we find that current knowledge-editing approaches can recall edited facts accurately, they fail catastrophically on the constructed multi-hop questions. We thus propose a simple memory-based approach, MeLLo, which stores all edited facts externally while prompting the language model iteratively to generate answers that are consistent with the edited facts. While MQuAKE remains challenging, we show that MeLLo scales well with LLMs (up to 175B) and outperforms previous model editors by a large margin.

Mini But Mighty: Efficient Multilingual Pretraining with Linguistically-Informed Data Selection
Tolulope Ogunremi | Dan Jurafsky | Christopher Manning
Findings of the Association for Computational Linguistics: EACL 2023

With the prominence of large pretrained language models, low-resource languages are rarely modelled monolingually and become victims of the “curse of multilinguality” in massively multilingual models. Recently, AfriBERTa showed that training transformer models from scratch on 1GB of data from many unrelated African languages outperforms massively multilingual models on downstream NLP tasks. Here we extend this direction, focusing on the use of related languages. We propose that training on smaller amounts of data but from related languages could match the performance of models trained on large, unrelated data. We test our hypothesis on the Niger-Congo family and its Bantu and Volta-Niger sub-families, pretraining models with data solely from Niger-Congo languages and finetuning on 4 downstream tasks: NER, part-of-speech tagging, sentiment analysis and text classification. We find that models trained on genetically related languages achieve equal performance on downstream tasks in low-resource languages despite using less training data. We recommend selecting training data based on language-relatedness when pretraining language models for low-resource languages.

PragmatiCQA: A Dataset for Pragmatic Question Answering in Conversations
Peng Qi | Nina Du | Christopher Manning | Jing Huang
Findings of the Association for Computational Linguistics: ACL 2023

Pragmatic reasoning about another speaker’s unspoken intent and state of mind is crucial to efficient and effective human communication. It is virtually omnipresent in conversations between humans, e.g., when someone asks “do you have a minute?”, instead of interpreting it literally as a query about your schedule, you understand that the speaker might have requests that take time, and respond accordingly. In this paper, we present PragmatiCQA, the first large-scale open-domain question answering (QA) dataset featuring 6873 QA pairs that explores pragmatic reasoning in conversations over a diverse set of topics. We designed innovative crowdsourcing mechanisms for interest-based and task-driven data collection to address the common issue of incentive misalignment between crowdworkers and potential users. To compare computational models’ capability at pragmatic reasoning, we also propose several quantitative metrics to evaluate question answering systems on PragmatiCQA. We find that state-of-the-art systems still struggle to perform human-like pragmatic reasoning, and highlight their limitations for future research.

Do “English” Named Entity Recognizers Work Well on Global Englishes?
Alexander Shan | John Bauer | Riley Carlson | Christopher Manning
Findings of the Association for Computational Linguistics: EMNLP 2023

The vast majority of the popular English named entity recognition (NER) datasets contain American or British English data, despite the existence of many global varieties of English. As such, it is unclear whether they generalize for analyzing use of English globally. To test this, we build a newswire dataset, the Worldwide English NER Dataset, to analyze NER model performance on low-resource English variants from around the world. We test widely used NER toolkits and transformer models, including models using the pre-trained contextual models RoBERTa and ELECTRA, on three datasets: a commonly used British English newswire dataset, CoNLL 2003, a more American focused dataset OntoNotes, and our global dataset. All models trained on the CoNLL or OntoNotes datasets experienced significant performance drops—over 10 F1 in some cases—when tested on the Worldwide English dataset. Upon examination of region-specific errors, we observe the greatest performance drops for Oceania and Africa, while Asia and the Middle East had comparatively strong performance. Lastly, we find that a combined model trained on the Worldwide dataset and either CoNLL or OntoNotes lost only 1-2 F1 on both test sets.

ReCOGS: How Incidental Details of a Logical Form Overshadow an Evaluation of Semantic Interpretation
Zhengxuan Wu | Christopher D. Manning | Christopher Potts
Transactions of the Association for Computational Linguistics, Volume 11

Compositional generalization benchmarks for semantic parsing seek to assess whether models can accurately compute meanings for novel sentences, but operationalize this in terms of logical form (LF) prediction. This raises the concern that semantically irrelevant details of the chosen LFs could shape model performance. We argue that this concern is realized for the COGS benchmark (Kim and Linzen, 2020). COGS poses generalization splits that appear impossible for present-day models, which could be taken as an indictment of those models. However, we show that the negative results trace to incidental features of COGS LFs. Converting these LFs to semantically equivalent ones and factoring out capabilities unrelated to semantic interpretation, we find that even baseline models get traction. A recent variable-free translation of COGS LFs suggests similar conclusions, but we observe this format is not semantically equivalent; it is incapable of accurately representing some COGS meanings. These findings inform our proposal for ReCOGS, a modified version of COGS that comes closer to assessing the target semantic capabilities while remaining very challenging. Overall, our results reaffirm the importance of compositional generalization and careful benchmark task design.

Semgrex and Ssurgeon, Searching and Manipulating Dependency Graphs
John Bauer | Chloé Kiddon | Eric Yeh | Alex Shan | Christopher D. Manning
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

Searching dependency graphs and manipulating them can be a time-consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command line or API processing of dependencies. Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.

Currently, there is no usable machine translation system for Nko, a language spoken by tens of millions of people across multiple West African countries, which holds significant cultural and educational value. To address this issue, we present a set of tools, resources, and baseline results aimed towards the development of usable machine translation systems for Nko and other languages that do not currently have sufficiently large parallel text corpora available. (1) Fria∥el: A novel collaborative parallel text curation software that incorporates quality control through copyedit-based workflows. (2) Expansion of the FLoRes-200 and NLLB-Seed corpora with 2,009 and 6,193 high-quality Nko translations in parallel with 204 and 40 other languages. (3) nicolingua-0005: A collection of trilingual and bilingual corpora with 130,850 parallel segments and monolingual corpora containing over 3 million Nko words. (4) Baseline bilingual and multilingual neural machine translation results with the best model scoring 30.83 English-Nko chrF++ on FLoRes-devtest.

2022

While large pre-trained language models are powerful, their predictions often lack logical consistency across test inputs. For example, a state-of-the-art Macaw question-answering (QA) model answers “Yes” to “Is a sparrow a bird?” and “Does a bird have feet?” but answers “No” to “Does a sparrow have feet?” To address this failure mode, we propose a framework, Consistency Correction through Relation Detection, or ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models using pre-trained natural language inference (NLI) models without fine-tuning or re-training. Given a batch of test inputs, ConCoRD samples several candidate outputs for each input and instantiates a factor graph that accounts for both the model’s belief about the likelihood of each answer choice in isolation and the NLI model’s beliefs about pair-wise answer choice compatibility. We show that a weighted MaxSAT solver can efficiently compute high-quality answer choices under this factor graph, improving over the raw model’s predictions. Our experiments demonstrate that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models using off-the-shelf NLI models, notably increasing accuracy of LXMERT on ConVQA by 5% absolute. See the project website (https://ericmitchell.ai/emnlp-2022-concord/) for code and data.

Recent approaches to Open-domain Question Answering refer to an external knowledge base using a retriever model, optionally rerank passages with a separate reranker model and generate an answer using another reader model. Despite performing related tasks, the models have separate parameters and are weakly-coupled during training. We propose casting the retriever and the reranker as internal passage-wise attention mechanisms applied sequentially within the transformer architecture and feeding computed representations to the reader, with the hidden representations progressively refined at each stage. This allows us to use a single question answering model trained end-to-end, which is a more efficient use of model capacity and also leads to better gradient flow. We present a pre-training method to effectively train this architecture and evaluate our model on the Natural Questions and TriviaQA open datasets. For a fixed parameter budget, our model outperforms the previous state-of-the-art model by 1.0 and 0.7 exact match scores.

On Measuring the Intrinsic Few-Shot Hardness of Datasets
Xinran Zhao | Shikhar Murty | Christopher Manning
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

While advances in pre-training have led to dramatic improvements in few-shot learning of NLP tasks, there is limited understanding of what drives successful few-shot adaptation in datasets. In particular, given a new dataset and a pre-trained model, what properties of the dataset make it few-shot learnable, and are these properties independent of the specific adaptation techniques used? We consider an extensive set of recent few-shot learning methods and show that their performance across a large number of datasets is highly correlated, showing that few-shot hardness may be intrinsic to datasets, for a given pre-trained model. To estimate intrinsic few-shot hardness, we then propose a simple and lightweight metric called Spread that captures the intuition that few-shot learning is made possible by exploiting feature-space invariances between training and test samples. Our metric better accounts for few-shot hardness compared to existing notions of hardness and is ~8-100x faster to compute.

Detecting Label Errors by Using Pre-Trained Language Models
Derek Chong | Jenny Hong | Christopher Manning
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We show that large pre-trained language models are inherently highly capable of identifying label errors in natural language datasets: simply examining out-of-sample data points in descending order of fine-tuned task loss significantly outperforms more complex error-detection mechanisms proposed in previous work. To this end, we contribute a novel method for introducing realistic, human-originated label noise into existing crowdsourced datasets such as SNLI and TweetNLP. We show that this noise has similar properties to real, hand-verified label errors, and is harder to detect than existing synthetic noise, creating challenges for model robustness. We argue that human-originated noise is a better standard for evaluation than synthetic noise. Finally, we use crowdsourced verification to evaluate the detection of real errors on IMDB, Amazon Reviews, and Recon, and confirm that pre-trained models perform at a 9–36% higher absolute Area Under the Precision-Recall Curve than existing models.
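
The ranking step described in this abstract, examining data points in descending order of task loss, can be sketched as follows. This is a minimal illustration, assuming the fine-tuned model's class probabilities are already available for held-out examples. The function name `rank_by_loss` and the use of cross-entropy as the task loss are assumptions, not details confirmed by the abstract.

```python
import math


def rank_by_loss(probabilities, labels):
    """Return example indices sorted by descending loss of the given label.

    probabilities[i] is the model's predicted class distribution for
    example i, produced by a model that did not train on that example
    (out-of-sample). labels[i] is the possibly noisy dataset label.
    Cross-entropy is assumed as the task loss.
    """
    losses = []
    for i, (probs, label) in enumerate(zip(probabilities, labels)):
        # Clamp to avoid log(0) when the model gives the label zero mass.
        p = max(probs[label], 1e-12)
        losses.append((-math.log(p), i))
    losses.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in losses]


probs = [[0.9, 0.1], [0.2, 0.8], [0.05, 0.95]]
labels = [0, 0, 1]
# Example 1 is labeled 0 although the model strongly prefers class 1,
# so it comes first in the review queue.
print(rank_by_loss(probs, labels))
```

A reviewer would then inspect the top of this list by hand; the ranking only proposes candidates and does not decide which labels are actually wrong.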

Fixing Model Bugs with Natural Language Patches
Shikhar Murty | Christopher Manning | Scott Lundberg | Marco Tulio Ribeiro
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Current approaches for fixing systematic problems in NLP models (e.g., regex patches, finetuning on more data) are either brittle, or labor-intensive and liable to shortcuts. In contrast, humans often provide corrections to each other through natural language. Taking inspiration from this, we explore natural language patches—declarative statements that allow developers to provide corrective feedback at the right level of abstraction, either overriding the model (“if a review gives 2 stars, the sentiment is negative”) or providing additional information the model may lack (“if something is described as the bomb, then it is good”). We model the task of determining if a patch applies separately from the task of integrating patch information, and show that with a small amount of synthetic data, we can teach models to effectively use real patches on real data—1 to 7 patches improve accuracy by ~1–4 accuracy points on different slices of a sentiment analysis dataset, and F1 by 7 points on a relation extraction dataset. Finally, we show that finetuning on as many as 100 labeled examples may be needed to match the performance of a small set of language patches.

Truncation Sampling as Language Model Desmoothing
John Hewitt | Christopher Manning | Percy Liang
Findings of the Association for Computational Linguistics: EMNLP 2022

Long samples of text from neural language models can be of poor quality. Truncation sampling algorithms, like top-p or top-k, address this by setting some words’ probabilities to zero at each step. This work investigates why these methods are important, and how to improve them. We propose thinking of a neural language model as a mixture of a true distribution and a smoothing distribution that avoids infinite perplexity. In this light, truncation algorithms aim to perform desmoothing, estimating a subset of the support of the true distribution. Finding a good subset is crucial: we show that top-p unnecessarily truncates high-probability words, for example causing it to truncate all words but “Trump” for a document that starts with “Donald”. We introduce eta-sampling, which truncates words below an entropy-dependent probability threshold. Compared to previous algorithms, our eta-sampling generates more plausible long documents according to humans, is better at breaking out of repetition, and behaves more reasonably on a battery of test distributions.
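
The contrast drawn in this abstract between top-p and an entropy-dependent threshold can be made concrete with a small sketch. The top-p function follows the usual definition (keep the smallest set of most probable words whose mass reaches p). The second function is only an assumed form of an entropy-dependent threshold, min(epsilon, sqrt(epsilon) * exp(-entropy)); the abstract does not state the formula, and the function names, the default epsilon, and the toy distribution are all hypothetical.

```python
import math


def top_p_keep(probs, p=0.9):
    """Indices kept by top-p: the smallest set of most probable words
    whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


def entropy_threshold_keep(probs, epsilon=0.01):
    """Indices whose probability exceeds an entropy-dependent threshold.

    ASSUMPTION: the threshold is min(epsilon, sqrt(epsilon) * exp(-entropy)).
    This form is chosen for illustration and is not given in the abstract.
    """
    entropy = -sum(q * math.log(q) for q in probs if q > 0)
    eta = min(epsilon, math.sqrt(epsilon) * math.exp(-entropy))
    return [i for i, q in enumerate(probs) if q > eta]


# A peaked toy distribution: one dominant word and a few plausible ones.
probs = [0.92, 0.05, 0.02, 0.005, 0.005]
print(top_p_keep(probs))              # only the dominant word survives
print(entropy_threshold_keep(probs))  # moderately likely words survive too
```

On this toy distribution, top-p with p=0.9 keeps a single word because the first word alone already carries more than 90% of the mass, while the threshold rule also keeps the words at 5% and 2%.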

JamPatoisNLI: A Jamaican Patois Natural Language Inference Dataset
Ruth-Ann Armstrong | John Hewitt | Christopher Manning
Findings of the Association for Computational Linguistics: EMNLP 2022

JamPatoisNLI provides the first dataset for natural language inference in a creole language, Jamaican Patois. Many of the most-spoken low-resource languages are creoles. These languages commonly have a lexicon derived from a major world language and a distinctive grammar reflecting the languages of the original speakers and the process of language birth by creolization. This gives them a distinctive place in exploring the effectiveness of transfer from large monolingual or multilingual pretrained models. While our work, along with previous work, shows that transfer from these models to low-resource languages that are unrelated to languages in their training set is not very effective, we would expect stronger results from transfer to creoles. Indeed, our experiments show considerably better results from few-shot learning of JamPatoisNLI than for such unrelated languages, and help us begin to understand how the unique relationship between creoles and their high-resource base languages affects cross-lingual transfer. JamPatoisNLI, which consists of naturally-occurring premises and expert-written hypotheses, is a step towards steering research into a traditionally underserved language and a useful benchmark for understanding cross-lingual NLP.

When can I Speak? Predicting initiation points for spoken dialogue agents
Siyan Li | Ashwin Paranjape | Christopher Manning
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Current spoken dialogue systems initiate their turns after a long period of silence (700-1000ms), which leads to little real-time feedback, sluggish responses, and an overall stilted conversational flow. Humans typically respond within 200ms and successfully predicting initiation points in advance would allow spoken dialogue agents to do the same. In this work, we predict the lead-time to initiation using prosodic features from a pre-trained speech representation model (wav2vec 1.0) operating on user audio and word features from a pre-trained language model (GPT-2) operating on incremental transcriptions. To evaluate errors, we propose two metrics w.r.t. predicted and true lead times. We train and evaluate the models on the Switchboard Corpus and find that our method outperforms features from prior work on both metrics and vastly outperforms the common approach of waiting for 700ms of silence.

We present Chirpy Cardinal, an open-domain social chatbot. Aiming to be both informative and conversational, our bot chats with users in an authentic, emotionally intelligent way. By integrating controlled neural generation with scaffolded, hand-written dialogue, we let both the user and bot take turns driving the conversation, producing an engaging and socially fluent experience. Deployed in the fourth iteration of the Alexa Prize Socialbot Grand Challenge, Chirpy Cardinal handled thousands of conversations per day, placing second out of nine bots with an average user rating of 3.58/5.

2021

pdf bib abs
Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering
Siddharth Karamcheti | Ranjay Krishna | Li Fei-Fei | Christopher Manning
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Active learning promises to alleviate the massive data needs of supervised machine learning: it has successfully improved sample efficiency by an order of magnitude on traditional tasks like topic classification and object recognition. However, we uncover a striking contrast to this promise: across 5 models and 4 datasets on the task of visual question answering, a wide variety of active learning approaches fail to outperform random selection. To understand this discrepancy, we profile 8 active learning methods on a per-example basis, and identify the problem as collective outliers – groups of examples that active learning methods prefer to acquire but models fail to learn (e.g., questions that ask about text in images or require external knowledge). Through systematic ablation experiments and qualitative visualizations, we verify that collective outliers are a general phenomenon responsible for degrading pool-based active learning. Notably, we show that active learning sample efficiency increases significantly as the number of collective outliers in the active learning pool decreases. We conclude with a discussion and prescriptive recommendations for mitigating the effects of these outliers in future work.

pdf bib abs
Universal Dependencies
Marie-Catherine de Marneffe | Christopher D. Manning | Joakim Nivre | Daniel Zeman
Computational Linguistics, Volume 47, Issue 2 - June 2021

Universal dependencies (UD) is a framework for morphosyntactic annotation of human language, which to date has been used to create treebanks for more than 100 languages. In this article, we outline the linguistic theory of the UD framework, which draws on a long tradition of typologically oriented grammatical theories. Grammatical relations between words are centrally used to explain how predicate–argument structures are encoded morphosyntactically in different languages while morphological features and part-of-speech classes give the properties of words. We argue that this theory is a good basis for crosslinguistically consistent annotation of typologically diverse languages in a way that supports computational natural language understanding as well as broader linguistic studies.

pdf bib abs
Conditional probing: measuring usable information beyond a baseline
John Hewitt | Kawin Ethayarajh | Percy Liang | Christopher Manning
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Probing experiments investigate the extent to which neural representations make properties—like part-of-speech—predictable. One suggests that a representation encodes a property if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. Instead of using baselines as a point of comparison, we’re interested in measuring information that is contained in the representation but not in the baseline. For example, current methods can detect when a representation is more useful than the word identity (a baseline) for predicting part-of-speech; however, they cannot detect when the representation is predictive of just the aspects of part-of-speech not explainable by the word identity. In this work, we extend a theory of usable information called V-information and propose conditional probing, which explicitly conditions on the information in the baseline. In a case study, we find that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.

pdf bib abs
Answering Open-Domain Questions of Varying Reasoning Steps from Text
Peng Qi | Haejun Lee | Tg Sido | Christopher Manning
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We develop a unified system to answer directly from text open-domain questions that may require a varying number of retrieval steps. We employ a single multi-task transformer model to perform all the necessary subtasks—retrieving supporting facts, reranking them, and predicting the answer from all retrieved documents—in an iterative fashion. We avoid crucial assumptions of previous work that do not transfer well to real-world settings, including exploiting knowledge of the fixed number of retrieval steps required to answer each question or using structured metadata like knowledge bases or web links that have limited availability. Instead, we design a system that can answer open-domain questions on any text collection without prior knowledge of reasoning complexity. To emulate this setting, we construct a new benchmark, called BeerQA, by combining existing one- and two-step datasets with a new collection of 530 questions that require three Wikipedia pages to answer, unifying Wikipedia corpora versions in the process. We show that our model demonstrates competitive performance on both existing benchmarks and this new benchmark. We make the new benchmark available at https://beerqa.github.io/.

pdf bib abs
ContractNLI: A Dataset for Document-level Natural Language Inference for Contracts
Yuta Koreeda | Christopher Manning
Findings of the Association for Computational Linguistics: EMNLP 2021

Reviewing contracts is a time-consuming procedure that incurs large expenses to companies and social inequality to those who cannot afford it. In this work, we propose “document-level natural language inference (NLI) for contracts”, a novel, real-world application of NLI that addresses such problems. In this task, a system is given a set of hypotheses (such as “Some obligations of Agreement may survive termination.”) and a contract, and it is asked to classify whether each hypothesis is “entailed by”, “contradicting to” or “not mentioned by” (neutral to) the contract as well as identifying “evidence” for the decision as spans in the contract. We annotated and release the largest corpus to date consisting of 607 annotated contracts. We then show that existing models fail badly on our task and introduce a strong baseline, which (a) models evidence identification as multi-label classification over spans instead of trying to predict start and end tokens, and (b) employs more sophisticated context segmentation for dealing with long documents. We also show that linguistic characteristics of contracts, such as negations by exceptions, are contributing to the difficulty of this task and that there is much room for improvement.

pdf bib abs
Human-like informative conversations: Better acknowledgements using conditional mutual information
Ashwin Paranjape | Christopher Manning
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This work aims to build a dialogue agent that can weave new factual content into conversations as naturally as humans. We draw insights from linguistic principles of conversational analysis and annotate human-human conversations from the Switchboard Dialog Act Corpus to examine humans’ strategies for acknowledgement, transition, detail selection and presentation. When current chatbots (explicitly provided with new factual content) introduce facts into a conversation, their generated responses do not acknowledge the prior turns. This is because models trained with two contexts - new factual content and conversational history - generate responses that are non-specific w.r.t. one of the contexts, typically the conversational history. We show that specificity w.r.t. conversational history is better captured by pointwise conditional mutual information (pcmi_h) than by the established use of pointwise mutual information (pmi). Our proposed method, Fused-PCMI, trades off pmi for pcmi_h and is preferred by humans for overall quality over the Max-PMI baseline 60% of the time. Human evaluators also judge responses with higher pcmi_h better at acknowledgement 74% of the time. The results demonstrate that systems mimicking human conversational traits (in this case acknowledgement) improve overall quality and more broadly illustrate the utility of linguistic principles in improving dialogue agents.
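For reference, the two quantities contrasted above can be written using the standard definitions of pointwise (conditional) mutual information. The symbols are assumed notation, not taken from the abstract: y is a candidate response, h the conversational history, and k the new factual content.

```latex
\mathrm{pmi}(y; h) = \log \frac{p(y \mid h)}{p(y)},
\qquad
\mathrm{pcmi}_h = \mathrm{pcmi}(y; h \mid k) = \log \frac{p(y \mid h, k)}{p(y \mid k)}
```

Under this reading, pcmi_h is high only when the history makes the response more likely than the factual content alone already does, which is one way to see why it would track acknowledgement of prior turns.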

pdf bib abs
DReCa: A General Task Augmentation Strategy for Few-Shot Natural Language Inference
Shikhar Murty | Tatsunori B. Hashimoto | Christopher Manning
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Meta-learning promises few-shot learners that can adapt to new distributions by repurposing knowledge acquired from previous training. However, we believe meta-learning has not yet succeeded in NLP due to the lack of a well-defined task distribution, leading to attempts that treat datasets as tasks. Such an ad hoc task distribution causes problems of quantity and quality. Since there’s only a handful of datasets for any NLP problem, meta-learners tend to overfit their adaptation mechanism and, since NLP datasets are highly heterogeneous, many learning episodes have poor transfer between their support and query sets, which discourages the meta-learner from adapting. To alleviate these issues, we propose DReCA (Decomposing datasets into Reasoning Categories), a simple method for discovering and using latent reasoning categories in a dataset, to form additional high quality tasks. DReCA works by splitting examples into label groups, embedding them with a finetuned BERT model and then clustering each group into reasoning categories. Across four few-shot NLI problems, we demonstrate that using DReCA improves the accuracy of meta-learners by 1.5-4%.

pdf bib abs
Capturing Logical Structure of Visually Structured Documents with Multimodal Transition Parser
Yuta Koreeda | Christopher Manning
Proceedings of the Natural Legal Language Processing Workshop 2021

While many NLP pipelines assume raw, clean texts, many texts we encounter in the wild, including a vast majority of legal documents, are not so clean, with many of them being visually structured documents (VSDs) such as PDFs. Conventional preprocessing tools for VSDs mainly focused on word segmentation and coarse layout analysis, whereas fine-grained logical structure analysis (such as identifying paragraph boundaries and their hierarchies) of VSDs is underexplored. To that end, we proposed to formulate the task as prediction of “transition labels” between text fragments that maps the fragments to a tree, and developed a feature-based machine learning system that fuses visual, textual and semantic cues. Our system is easily customizable to different types of VSDs and it significantly outperformed baselines in identifying different structures in VSDs. For example, our system obtained a paragraph boundary detection F1 score of 0.953 which is significantly better than a popular PDF-to-text tool with an F1 score of 0.739.

pdf bib abs
Learning from Limited Labels for Long Legal Dialogue
Jenny Hong | Derek Chong | Christopher Manning
Proceedings of the Natural Legal Language Processing Workshop 2021

We study attempting to achieve high accuracy information extraction of case factors from a challenging dataset of parole hearings, which, compared to other legal NLP datasets, has longer texts, with fewer labels. On this corpus, existing work directly applying pretrained neural models has failed to extract all but a few relatively basic items with little improvement over rule-based extraction. We address two challenges posed by existing work: training on long documents and reasoning over complex speech patterns. We use a similar approach to the two-step open-domain question answering approach by using a Reducer to extract relevant text segments and a Producer to generate both extractive answers and non-extractive classifications. In a context like ours, with limited labeled data, we show that a superior approach for strong performance within limited development time is to use a combination of a rule-based Reducer and a neural Producer. We study four representative tasks from the parole dataset. On all four, we improve extraction from the previous benchmark of 0.41–0.63 to 0.83–0.89 F1.

pdf bib abs
Challenges for Information Extraction from Dialogue in Criminal Law
Jenny Hong | Catalin Voss | Christopher Manning
Proceedings of the 1st Workshop on NLP for Positive Impact

Information extraction and question answering have the potential to introduce a new paradigm for how machine learning is applied to criminal law. Existing approaches generally use tabular data for predictive metrics. An alternative approach is needed for matters of equitable justice, where individuals are judged on a case-by-case basis, in a process involving verbal or written discussion and interpretation of case factors. Such discussions are individualized, but they nonetheless rely on underlying facts. Information extraction can play an important role in surfacing these facts, which are still important to understand. We analyze unsupervised, weakly supervised, and pre-trained models’ ability to extract such factual information from the free-form dialogue of California parole hearings. With a few exceptions, most F1 scores are below 0.85. We use this opportunity to highlight some opportunities for further research for information extraction and question answering. We encourage new developments in NLP to enable analysis and review of legal cases to be done in a post-hoc, not predictive, manner.

pdf bib abs
Understanding and predicting user dissatisfaction in a neural generative chatbot
Abigail See | Christopher Manning
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Neural generative dialogue agents have shown an increasing ability to hold short chitchat conversations, when evaluated by crowdworkers in controlled settings. However, their performance in real-life deployment – talking to intrinsically-motivated users in noisy environments – is less well-explored. In this paper, we perform a detailed case study of a neural generative model deployed as part of Chirpy Cardinal, an Alexa Prize socialbot. We find that unclear user utterances are a major source of generative errors such as ignoring, hallucination, unclearness and repetition. However, even in unambiguous contexts the model frequently makes reasoning errors. Though users express dissatisfaction in correlation with these errors, certain dissatisfaction types (such as offensiveness and privacy objections) depend on additional factors – such as the user’s personal attitudes, and prior unaddressed dissatisfaction in the conversation. Finally, we show that dissatisfied user utterances can be used as a semi-supervised learning signal to improve the dialogue system. We train a model to predict next-turn dissatisfaction, and show through human evaluation that as a ranking function, it selects higher-quality neural-generated utterances.

pdf bib abs
Effective Social Chatbot Strategies for Increasing User Initiative
Amelia Hardy | Ashwin Paranjape | Christopher Manning
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Many existing chatbots do not effectively support mixed initiative, forcing their users to either respond passively or lead constantly. We seek to improve this experience by introducing new mechanisms to encourage user initiative in social chatbot conversations. Since user initiative in this setting is distinct from initiative in human-human or task-oriented dialogue, we first propose a new definition that accounts for the unique behaviors users take in this context. Drawing from linguistics, we propose three mechanisms to promote user initiative: back-channeling, personal disclosure, and replacing questions with statements. We show that simple automatic metrics of utterance length, number of noun phrases, and diversity of user responses correlate with human judgement of initiative. Finally, we use these metrics to suggest that these strategies do result in statistically significant increases in user initiative, where frequent, but not excessive, back-channeling is the most effective strategy.

pdf bib abs
Large-Scale Quantitative Evaluation of Dialogue Agents’ Response Strategies against Offensive Users
Haojun Li | Dilara Soylu | Christopher Manning
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

As voice assistants and dialogue agents grow in popularity, so does the abuse they receive. We conducted a large-scale quantitative evaluation of the effectiveness of 4 response types (avoidance, why, empathetic, and counter), and 2 additional factors (using a redirect or a voluntarily provided name) that have not been tested by prior work. We measured their direct effectiveness on real users in-the-wild by the re-offense ratio, length of conversation after the initial response, and number of turns until the next re-offense. Our experiments confirm prior lab studies in showing that empathetic responses perform better than generic avoidance responses as well as counter responses. We show that dialogue agents should almost always guide offensive users to a new topic through the use of redirects and use the user’s name if provided. As compared to a baseline avoidance strategy employed by commercial agents, our best strategy is able to reduce the re-offense ratio from 92% to 43%.

2020

pdf bib abs
Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports
Yuhao Zhang | Derek Merck | Emily Tsai | Christopher D. Manning | Curtis Langlotz
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Neural abstractive summarization models are able to generate summaries which have high overlap with human references. However, existing models are not optimized for factual correctness, a critical metric in real-world applications. In this work, we develop a general framework where we evaluate the factual correctness of a generated summary by fact-checking it automatically against its reference using an information extraction module. We further propose a training strategy which optimizes a neural summarization model with a factual correctness reward via reinforcement learning. We apply the proposed method to the summarization of radiology reports, where factual correctness is a key requirement. On two separate datasets collected from hospitals, we show via both automatic and human evaluation that the proposed approach substantially improves the factual correctness and overall quality of outputs over a competitive neural summarization system, producing radiology summaries that approach the quality of human-authored ones.

pdf bib abs
Finding Universal Grammatical Relations in Multilingual BERT
Ethan A. Chi | John Hewitt | Christopher D. Manning
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent work has found evidence that Multilingual BERT (mBERT), a transformer-based multilingual masked language model, is capable of zero-shot cross-lingual transfer, suggesting that some aspects of its representations are shared cross-lingually. To better understand this overlap, we extend recent work on finding syntactic trees in neural networks’ internal representations to the multilingual setting. We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English, and that these subspaces are approximately shared across languages. Motivated by these results, we present an unsupervised analysis method that provides evidence mBERT learns representations of syntactic dependency labels, in the form of clusters which largely agree with the Universal Dependencies taxonomy. This evidence suggests that even without explicit supervision, multilingual masked language models learn certain linguistic universals.

pdf bib abs
Stanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi | Yuhao Zhang | Yuhui Zhang | Jason Bolton | Christopher D. Manning
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at https://stanfordnlp.github.io/stanza/.

pdf bib abs
The EOS Decision and Length Extrapolation
Benjamin Newman | John Hewitt | Percy Liang | Christopher D. Manning
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS in the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.

pdf bib abs
Pre-Training Transformers as Energy-Based Cloze Models
Kevin Clark | Minh-Thang Luong | Quoc Le | Christopher D. Manning
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce Electric, an energy-based cloze model for representation learning over text. Like BERT, it is a conditional generative model of tokens given their contexts. However, Electric does not use masking or output a full distribution over tokens that could occur in a context. Instead, it assigns a scalar energy score to each input token indicating how likely it is given its context. We train Electric using an algorithm based on noise-contrastive estimation and elucidate how this learning objective is closely related to the recently proposed ELECTRA pre-training method. Electric performs well when transferred to downstream tasks and is particularly effective at producing likelihood scores for text: it re-ranks speech recognition n-best lists better than language models and much faster than masked language models. Furthermore, it offers a clearer and more principled view of what ELECTRA learns during pre-training.

pdf bib abs
SLM: Learning a Discourse Language Representation with Sentence Unshuffling
Haejun Lee | Drew A. Hudson | Kangwook Lee | Christopher D. Manning
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We introduce Sentence-level Language Modeling, a new pre-training objective for learning a discourse language representation in a fully self-supervised manner. Recent pre-training methods in NLP focus on learning either bottom or top-level language representations: contextualized word representations derived from language model objectives at one extreme and a whole sequence representation learned by order classification of two given textual segments at the other. However, these models are not directly encouraged to capture representations of intermediate-size structures that exist in natural languages such as sentences and the relationships among them. To that end, we propose a new approach to encourage learning of a contextualized sentence-level representation by shuffling the sequence of input sentences and training a hierarchical transformer model to reconstruct the original ordering. Through experiments on downstream tasks such as GLUE, SQuAD, and DiscoEval, we show that this feature of our model improves the performance of the original BERT by large margins.

pdf bib abs
RNNs can generate bounded hierarchical languages with optimal memory
John Hewitt | Michael Hahn | Surya Ganguli | Percy Liang | Christopher D. Manning
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Recurrent neural networks empirically generate natural language with high syntactic fidelity. However, their success is not well-understood theoretically. We provide theoretical insight into this success, proving in a finite-precision setting that RNNs can efficiently generate bounded hierarchical languages that reflect the scaffolding of natural language syntax. We introduce Dyck-(k,m), the language of well-nested brackets (of k types) and m-bounded nesting depth, reflecting the bounded memory needs and long-distance dependencies of natural language syntax. The best known results use O(k^{m/2}) memory (hidden units) to generate these languages. We prove that an RNN with O(m log k) hidden units suffices, an exponential reduction in memory, by an explicit construction. Finally, we show that no algorithm, even with unbounded computation, can suffice with o(m log k) hidden units.
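The definition of Dyck-(k,m) can be made concrete with a small membership checker. This is only a sketch of the language itself, not of the RNN construction; the integer token encoding (+i opens bracket type i, -i closes it) and the function name are assumptions chosen for illustration.

```python
def is_dyck_km(tokens, k, m):
    """Return True if tokens form a well-nested string over k bracket types
    whose nesting depth never exceeds m.

    Encoding (assumed): +i opens a bracket of type i, -i closes it, 1 <= i <= k.
    """
    stack = []
    for t in tokens:
        kind = abs(t)
        if t == 0 or kind > k:
            return False              # not a valid bracket symbol
        if t > 0:
            if len(stack) == m:
                return False          # opening here would exceed depth m
            stack.append(kind)
        else:
            if not stack or stack.pop() != kind:
                return False          # unmatched or mismatched close
    return not stack                  # every bracket must be closed

print(is_dyck_km([1, 2, -2, -1], k=2, m=2))  # nested to depth 2
print(is_dyck_km([1, 2, -2, -1], k=2, m=1))  # same string, depth bound 1
```

The checker's stack never holds more than m entries, each drawn from k types, which gives some intuition for why a memory budget on the order of m log k is the natural target.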

pdf bib abs
Stay Hungry, Stay Focused: Generating Informative and Specific Questions in Information-Seeking Conversations
Peng Qi | Yuhao Zhang | Christopher D. Manning
Findings of the Association for Computational Linguistics: EMNLP 2020

We investigate the problem of generating informative questions in information-asymmetric conversations. Unlike previous work on question generation which largely assumes knowledge of what the answer might be, we are interested in the scenario where the questioner is not given the context from which answers are drawn, but must reason pragmatically about how to acquire new information, given the shared conversation history. We identify two core challenges: (1) formally defining the informativeness of potential questions, and (2) exploring the prohibitively large space of potential questions to find the good candidates. To generate pragmatic questions, we use reinforcement learning to optimize an informativeness metric we propose, combined with a reward function designed to promote more specific questions. We demonstrate that the resulting pragmatic questioner substantially improves the informativeness and specificity of questions generated over a baseline model, as evaluated by our metrics as well as humans.

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

2019

pdf bib abs
Answering Complex Open-domain Questions Through Iterative Query Generation
Peng Qi | Xiaowen Lin | Leo Mehr | Zijian Wang | Christopher D. Manning
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

It is challenging for current one-step retrieve-and-read question answering (QA) systems to answer questions like “Which novel by the author of ‘Armada’ will be adapted as a feature film by Steven Spielberg?” because the question seldom contains retrievable clues about the missing entity (here, the author). Answering such a question requires multi-hop reasoning where one must gather information about the missing entity (or facts) to proceed with further reasoning. We present GoldEn (Gold Entity) Retriever, which iterates between reading context and retrieving more supporting documents to answer open-domain multi-hop questions. Instead of using opaque and computationally expensive neural retrieval models, GoldEn Retriever generates natural language search queries given the question and available context, and leverages off-the-shelf information retrieval systems to query for missing entities. This allows GoldEn Retriever to scale up efficiently for open-domain multi-hop reasoning while maintaining interpretability. We evaluate GoldEn Retriever on the recently proposed open-domain multi-hop QA dataset, HotpotQA, and demonstrate that it outperforms the best previously published model despite not using pretrained language models such as BERT.

pdf bib abs
Do Massively Pretrained Language Models Make Better Storytellers?
Abigail See | Aneesh Pappu | Rohun Saxena | Akhila Yerukola | Christopher D. Manning
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Large neural language models trained on massive amounts of text have emerged as a formidable strategy for Natural Language Understanding tasks. However, the strength of these models as Natural Language Generators is less clear. Though anecdotal evidence suggests that these models generate better quality text, there has been no detailed study characterizing their generation abilities. In this work, we compare the performance of an extensively pretrained model, OpenAI GPT2-117 (Radford et al., 2019), to a state-of-the-art neural story generation model (Fan et al., 2018). By evaluating the generated text across a wide variety of automatic metrics, we characterize the ways in which pretrained models do, and do not, make better storytellers. We find that although GPT2-117 conditions more strongly on context, is more sensitive to ordering of events, and uses more unusual words, it is just as likely to produce repetitive and under-diverse text when using likelihood-maximizing decoding algorithms.

pdf bib abs
A Structural Probe for Finding Syntax in Word Representations
John Hewitt | Christopher D. Manning
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recent work has improved our ability to detect linguistic knowledge in word representations. However, current methods for detecting syntactic knowledge do not test whether syntax trees are represented in their entirety. In this work, we propose a structural probe, which evaluates whether syntax trees are embedded in a linear transformation of a neural network’s word representation space. The probe identifies a linear transformation under which squared L2 distance encodes the distance between words in the parse tree, and one in which squared L2 norm encodes depth in the parse tree. Using our probe, we show that such transformations exist for both ELMo and BERT but not in baselines, providing evidence that entire syntax trees are embedded implicitly in deep models’ vector geometry.
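
As a rough illustration of the quantities the probe measures, the sketch below computes the squared L2 distance between two word vectors, and the squared L2 norm of one vector, after applying a linear map. The matrix `B`, the toy vectors, and the function names are hypothetical and not taken from the authors' code. In the paper the linear map is learned so that these values approximate parse-tree distance and depth; here it is fixed by hand.

```python
def matvec(B, v):
    """Multiply matrix B (a list of rows) by vector v."""
    return [sum(b * x for b, x in zip(row, v)) for row in B]


def probe_squared_distance(B, h_i, h_j):
    """Squared L2 distance between h_i and h_j after the linear map B."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    transformed = matvec(B, diff)
    return sum(x * x for x in transformed)


def probe_squared_norm(B, h):
    """Squared L2 norm of h after the linear map B."""
    transformed = matvec(B, h)
    return sum(x * x for x in transformed)


# Hypothetical rank-2 probe over 3-dimensional word vectors.
# The third input dimension is ignored by this particular B.
B = [[1.0, 0.0, 0.0],
     [0.0, 2.0, 0.0]]
h_a = [1.0, 1.0, 5.0]
h_b = [0.0, 0.0, -3.0]

print(probe_squared_distance(B, h_a, h_b))  # 5.0
print(probe_squared_norm(B, h_a))           # 5.0
```

The example shows that the map can discard directions of the representation space (the third coordinate here), so only the subspace it keeps contributes to the distance.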

pdf bib abs
BAM! Born-Again Multi-Task Networks for Natural Language Understanding
Kevin Clark | Minh-Thang Luong | Urvashi Khandelwal | Christopher D. Manning | Quoc V. Le
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark. Our method consistently improves over standard single-task and multi-task training.
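
The transition from distillation to supervised learning can be sketched as a training target that interpolates between the teacher's prediction and the gold label. The linear schedule, the function name `annealed_target`, and the toy values below are assumptions for illustration; the abstract states only that the transition is gradual.

```python
def annealed_target(gold_one_hot, teacher_probs, step, total_steps):
    """Mix the gold label and the teacher distribution.

    lam goes from 0 (pure distillation) to 1 (pure supervised learning).
    A linear schedule is assumed here for illustration.
    """
    lam = min(max(step / total_steps, 0.0), 1.0)
    return [lam * g + (1.0 - lam) * t
            for g, t in zip(gold_one_hot, teacher_probs)]


gold = [0.0, 1.0, 0.0]      # hypothetical gold label (class 1)
teacher = [0.2, 0.5, 0.3]   # hypothetical single-task teacher output

start_target = annealed_target(gold, teacher, step=0, total_steps=100)
mid_target = annealed_target(gold, teacher, step=50, total_steps=100)
end_target = annealed_target(gold, teacher, step=100, total_steps=100)
```

At step 0 the target equals the teacher distribution, and at the final step it equals the gold label, so late in training the student is no longer tied to its teacher's mistakes.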

pdf bib abs
CoQA: A Conversational Question Answering Challenge
Siva Reddy | Danqi Chen | Christopher D. Manning
Transactions of the Association for Computational Linguistics, Volume 7

Humans gather information through conversations involving a series of interconnected questions and answers. For machines to assist in information gathering, it is therefore essential to enable them to answer conversational questions. We introduce CoQA, a novel dataset for building Conversational Question Answering systems. Our dataset contains 127k questions with answers, obtained from 8k conversations about text passages from seven diverse domains. The questions are conversational, and the answers are free-form text with their corresponding evidence highlighted in the passage. We analyze CoQA in depth and show that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning). We evaluate strong dialogue and reading comprehension models on CoQA. The best system obtains an F1 score of 65.4%, which is 23.4 points behind human performance (88.8%), indicating that there is ample room for improvement. We present CoQA as a challenge to the community at https://stanfordnlp.github.io/coqa.

pdf bib abs
What Does BERT Look at? An Analysis of BERT’s Attention
Kevin Clark | Urvashi Khandelwal | Omer Levy | Christopher D. Manning
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Large pre-trained neural networks such as BERT have had great recent success in NLP, motivating a growing body of research investigating what aspects of language they are able to learn from unlabeled data. Most recent analysis has focused on model outputs (e.g., language model surprisal) or internal vector representations (e.g., probing classifiers). Complementary to these works, we propose methods for analyzing the attention mechanisms of pre-trained models and apply them to BERT. BERT’s attention heads exhibit patterns such as attending to delimiter tokens, specific positional offsets, or broadly attending over the whole sentence, with heads in the same layer often exhibiting similar behaviors. We further show that certain attention heads correspond well to linguistic notions of syntax and coreference. For example, we find heads that attend to the direct objects of verbs, determiners of nouns, objects of prepositions, and coreferent mentions with remarkably high accuracy. Lastly, we propose an attention-based probing classifier and use it to further demonstrate that substantial syntactic information is captured in BERT’s attention.

2018

pdf bib abs
Textual Analogy Parsing: What’s Shared and What’s Compared among Analogous Facts
Matthew Lamm | Arun Chaganty | Christopher D. Manning | Dan Jurafsky | Percy Liang
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

To understand a sentence like “whereas only 10% of White Americans live at or below the poverty line, 28% of African Americans do” it is important not only to identify individual facts, e.g., poverty rates of distinct demographic groups, but also the higher-order relations between them, e.g., the disparity between them. In this paper, we propose the task of Textual Analogy Parsing (TAP) to model this higher-order meaning. Given a sentence such as the one above, TAP outputs a frame-style meaning representation which explicitly specifies what is shared (e.g., poverty rates) and what is compared (e.g., White Americans vs. African Americans, 10% vs. 28%) between its component facts. Such a meaning representation can enable new applications that rely on discourse understanding such as automated chart generation from quantitative text. We present a new dataset for TAP, baselines, and a model that successfully uses an ILP to enforce the structural constraints of the problem.

pdf bib abs
Semi-Supervised Sequence Modeling with Cross-View Training
Kevin Clark | Minh-Thang Luong | Christopher D. Manning | Quoc Le
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Unsupervised representation learning algorithms such as word2vec and ELMo improve the accuracy of many supervised NLP models, mainly because they can take advantage of large amounts of unlabeled text. However, the supervised models only learn from task-specific labeled data during the main training phase. We therefore propose Cross-View Training (CVT), a semi-supervised learning algorithm that improves the representations of a Bi-LSTM sentence encoder using a mix of labeled and unlabeled data. On labeled examples, standard supervised learning is used. On unlabeled examples, CVT teaches auxiliary prediction modules that see restricted views of the input (e.g., only part of a sentence) to match the predictions of the full model seeing the whole input. Since the auxiliary modules and the full model share intermediate representations, this in turn improves the full model. Moreover, we show that CVT is particularly effective when combined with multi-task learning. We evaluate CVT on five sequence tagging tasks, machine translation, and dependency parsing, achieving state-of-the-art results.

pdf bib abs
Graph Convolution over Pruned Dependency Trees Improves Relation Extraction
Yuhao Zhang | Peng Qi | Christopher D. Manning
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Dependency trees help relation extraction models capture long-range relations between words. However, existing dependency-based models either neglect crucial information (e.g., negation) by pruning the dependency trees too aggressively, or are computationally inefficient because it is difficult to parallelize over different tree structures. We propose an extension of graph convolutional networks that is tailored for relation extraction, which pools information over arbitrary dependency structures efficiently in parallel. To incorporate relevant information while maximally removing irrelevant content, we further apply a novel pruning strategy to the input trees by keeping words immediately around the shortest path between the two entities among which a relation might hold. The resulting model achieves state-of-the-art performance on the large-scale TACRED dataset, outperforming existing sequence and dependency-based neural models. We also show through detailed analysis that this model has complementary strengths to sequence models, and combining them further improves the state of the art.
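
The pruning idea can be sketched as follows: find the shortest path between the two entity tokens in the dependency tree, then keep every token within a small distance of that path. This is a minimal sketch under simplifying assumptions (an unlabeled tree given as a list of head indices, single-token entities, and a hypothetical distance parameter `k`); it is not the authors' implementation.

```python
from collections import deque


def build_adjacency(heads):
    """heads[i] is the index of token i's head, or -1 for the root."""
    adj = {i: set() for i in range(len(heads))}
    for child, head in enumerate(heads):
        if head >= 0:
            adj[child].add(head)
            adj[head].add(child)
    return adj


def shortest_path(adj, start, goal):
    """Breadth-first search; assumes the tree is connected."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def prune(heads, e1, e2, k):
    """Keep tokens at most k edges away from the e1-e2 shortest path."""
    adj = build_adjacency(heads)
    dist = {n: 0 for n in shortest_path(adj, e1, e2)}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue  # do not expand beyond distance k
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return sorted(dist)


# Toy sentence: She(0) did(1) not(2) visit(3) Paris(4) last(5) year(6)
heads = [3, 3, 3, -1, 3, 6, 3]
print(prune(heads, 0, 4, 0))  # [0, 3, 4]
print(prune(heads, 0, 4, 1))  # [0, 1, 2, 3, 4, 6]
```

In the toy sentence, pruning to the path alone (`k=0`) drops the negation "not", while `k=1` keeps it and still removes "last", which matches the trade-off the abstract describes.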

Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems’ ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

pdf bib abs
Universal Dependency Parsing from Scratch
Peng Qi | Timothy Dozat | Yuhao Zhang | Christopher D. Manning
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes Stanford’s system at the CoNLL 2018 UD Shared Task. We introduce a complete neural pipeline system that takes raw text as input, and performs all tasks required by the shared task, ranging from tokenization and sentence segmentation, to POS tagging and dependency parsing. Our single system submission achieved very competitive performance on big treebanks. Moreover, after fixing an unfortunate bug, our corrected system would have placed 2nd, 1st, and 3rd on the official evaluation metrics LAS, MLAS, and BLEX, and would have outperformed all submitted systems on low-resource treebank categories on all metrics by a large margin. We further show the effectiveness of different model components through extensive ablation studies.

pdf bib abs
Sentences with Gapping: Parsing and Reconstructing Elided Predicates
Sebastian Schuster | Joakim Nivre | Christopher D. Manning
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Sentences with gapping, such as “Paul likes coffee and Mary tea”, lack an overt predicate to indicate the relation between two or more arguments. Surface syntax representations of such sentences are often produced poorly by parsers and, even if correct, are not well suited to downstream natural language understanding tasks such as relation extraction that are typically designed to extract information from sentences with canonical clause structure. In this paper, we present two methods for parsing to a Universal Dependencies graph representation that explicitly encodes the elided material with additional nodes and edges. We find that both methods can reconstruct elided material from dependency trees with high accuracy when the parser correctly predicts the existence of a gap. We further demonstrate that one of our methods can be applied to other languages based on a case study on Swedish.

pdf bib abs
Simpler but More Accurate Semantic Dependency Parsing
Timothy Dozat | Christopher D. Manning
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

While syntactic dependency annotations concentrate on the surface or functional structure of a sentence, semantic dependency annotations aim to capture between-word relationships that are more closely related to the meaning of a sentence, using graph-structured representations. We extend the LSTM-based syntactic parser of Dozat and Manning (2017) to train on and generate these graph structures. The resulting system on its own achieves state-of-the-art performance, beating the previous, substantially more complex state-of-the-art system by 0.6% labeled F1. Adding linguistically richer input representations pushes the margin even higher, allowing us to beat it by 1.9% labeled F1.

pdf bib abs
Learning to Summarize Radiology Findings
Yuhao Zhang | Daisy Yi Ding | Tianpei Qian | Christopher D. Manning | Curtis P. Langlotz
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

The Impression section of a radiology report summarizes crucial radiology findings in natural language and plays a central role in communicating these findings to physicians. However, the process of generating impressions by summarizing findings is time-consuming for radiologists and prone to errors. We propose to automate the generation of radiology impressions with neural sequence-to-sequence learning. We further propose a customized neural model for this task which learns to encode the study background information and use this information to guide the decoding process. On a large dataset of radiology reports collected from actual hospital studies, our model outperforms existing non-neural and neural baselines under the ROUGE metrics. In a blind experiment, a board-certified radiologist indicated that 67% of sampled system summaries are at least as good as the corresponding human-written summaries, suggesting significant clinical validity. To our knowledge our work represents the first attempt in this direction.

2017

pdf bib abs
Position-aware Attention and Supervised Data Improve Slot Filling
Yuhao Zhang | Victor Zhong | Danqi Chen | Gabor Angeli | Christopher D. Manning
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Organized relational knowledge in the form of “knowledge graphs” is important for many applications. However, the ability to populate knowledge bases with facts automatically extracted from documents has improved frustratingly slowly. This paper simultaneously addresses two issues that have held back prior work. We first propose an effective new model, which combines an LSTM sequence model with a form of entity position-aware attention that is better suited to relation extraction. Then we build TACRED, a large (119,474 examples) supervised relation extraction dataset obtained via crowdsourcing and targeted towards TAC KBP relations. The combination of better supervised data and a more appropriate high-capacity model enables much better relation extraction performance. When the model trained on this new dataset replaces the previous relation extraction component of the best TAC KBP 2015 slot filling system, its F1 score increases markedly from 22.2% to 26.7%.

pdf bib abs
Importance sampling for unbiased on-demand evaluation of knowledge base population
Arun Chaganty | Ashwin Paranjape | Percy Liang | Christopher D. Manning
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Knowledge base population (KBP) systems take in a large document corpus and extract entities and their relations. Thus far, KBP evaluation has relied on judgements on the pooled predictions of existing systems. We show that this evaluation is problematic: when a new system predicts a previously unseen relation, it is penalized even if it is correct. This leads to significant bias against new systems, which counterproductively discourages innovation in the field. Our first contribution is a new importance-sampling based evaluation which corrects for this bias by annotating a new system’s predictions on-demand via crowdsourcing. We show this eliminates bias and reduces variance using data from the 2015 TAC KBP task. Our second contribution is an implementation of our method made publicly available as an online KBP evaluation service. We pilot the service by testing diverse state-of-the-art systems on the TAC KBP 2016 corpus and obtain accurate scores in a cost-effective manner.

pdf bib abs
A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue
Mihail Eric | Christopher Manning
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Task-oriented dialogue focuses on conversational agents that participate in dialogues with user goals on domain-specific topics. In contrast to chatbots, which simply seek to sustain open-ended meaningful discourse, existing task-oriented agents usually explicitly model user intent and belief states. This paper examines bypassing such an explicit representation by depending on a latent neural embedding of state and learning selective attention to dialogue history together with copying to incorporate relevant prior context. We complement recent work by showing the effectiveness of simple sequence-to-sequence neural architectures with a copy mechanism. Our model outperforms more complex memory-augmented models by 7% in per-response generation and is on par with the current state-of-the-art on DSTC2, a real-world task-oriented dialogue dataset.

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

pdf bib abs
Stanford’s Graph-based Neural Dependency Parser at the CoNLL 2017 Shared Task
Timothy Dozat | Peng Qi | Christopher D. Manning
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes the neural dependency parser submitted by Stanford to the CoNLL 2017 Shared Task on parsing Universal Dependencies. Our system uses relatively simple LSTM networks to produce part of speech tags and labeled dependency parses from segmented and tokenized sequences of words. In order to address the rare word problem that abounds in languages with complex morphology, we include a character-based word representation that uses an LSTM to produce embeddings from sequences of characters. Our system was ranked first according to all five relevant metrics for the system: UPOS tagging (93.09%), XPOS tagging (82.27%), unlabeled attachment score (81.30%), labeled attachment score (76.30%), and content word labeled attachment score (72.57%).

pdf bib abs
Naturalizing a Programming Language via Interactive Learning
Sida I. Wang | Samuel Ginn | Percy Liang | Christopher D. Manning
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Our goal is to create a convenient natural language interface for performing well-specified but complex actions such as analyzing data, manipulating text, and querying databases. However, existing natural language interfaces for such tasks are quite primitive compared to the power one wields with a programming language. To bridge this gap, we start with a core programming language and allow users to “naturalize” the core language incrementally by defining alternative, more natural syntax and increasingly complex concepts in terms of compositions of simpler ones. In a voxel world, we show that a community of users can simultaneously teach a common system a diverse language and use it to build hundreds of complex voxel structures. Over the course of three days, these users went from using only the core language to using the naturalized language in 85.9% of the last 10K utterances.

pdf bib abs
Get To The Point: Summarization with Pointer-Generator Networks
Abigail See | Peter J. Liu | Christopher D. Manning
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Neural sequence-to-sequence models have provided a viable new approach for abstractive text summarization (meaning they are not restricted to simply selecting and rearranging passages from the original text). However, these models have two shortcomings: they are liable to reproduce factual details inaccurately, and they tend to repeat themselves. In this work we propose a novel architecture that augments the standard sequence-to-sequence attentional model in two orthogonal ways. First, we use a hybrid pointer-generator network that can copy words from the source text via pointing, which aids accurate reproduction of information, while retaining the ability to produce novel words through the generator. Second, we use coverage to keep track of what has been summarized, which discourages repetition. We apply our model to the CNN / Daily Mail summarization task, outperforming the current abstractive state-of-the-art by at least 2 ROUGE points.
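
The hybrid of generating and pointing can be sketched as a mixture of two distributions: the generator's vocabulary distribution and the attention distribution over source tokens, weighted by a generation probability. The function name `final_distribution`, the dictionary representation, and the toy numbers are assumptions for illustration, and the coverage mechanism is not shown.

```python
def final_distribution(p_gen, vocab_probs, attention, source_tokens):
    """Mix generating from the vocabulary with copying from the source.

    p_gen:         probability of generating rather than copying
    vocab_probs:   dict mapping vocabulary words to generator probabilities
    attention:     attention weight for each source position
    source_tokens: the source word at each position
    """
    final = {word: p_gen * p for word, p in vocab_probs.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mass goes to the source word, even if it is out of vocabulary.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final


vocab_probs = {"the": 0.6, "cat": 0.4}   # hypothetical generator output
source_tokens = ["cat", "zyx"]           # "zyx" is out of vocabulary
attention = [0.5, 0.5]

final = final_distribution(0.5, vocab_probs, attention, source_tokens)
```

In this toy case the out-of-vocabulary source word "zyx" receives nonzero probability purely through copying, which is how pointing helps reproduce rare names and figures from the source.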

pdf bib abs
Arc-swift: A Novel Transition System for Dependency Parsing
Peng Qi | Christopher D. Manning
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Transition-based dependency parsers often need sequences of local shift and reduce operations to produce certain attachments. Correct individual decisions hence require global information about the sentence context and mistakes cause error propagation. This paper proposes a novel transition system, arc-swift, that enables direct attachments between tokens farther apart with a single transition. This allows the parser to leverage lexical information more directly in transition decisions. Hence, arc-swift can achieve significantly better performance with a very small beam size. Our parsers reduce error by 3.7–7.6% relative to those using existing transition systems on the Penn Treebank dependency parsing task and English Universal Dependencies.

pdf bib
Gapping Constructions in Universal Dependencies v2
Sebastian Schuster | Matthew Lamm | Christopher D. Manning
Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017)

pdf bib abs
Key-Value Retrieval Networks for Task-Oriented Dialogue
Mihail Eric | Lakshmi Krishnan | Francois Charette | Christopher D. Manning
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

Neural task-oriented dialogue systems often struggle to smoothly interface with a knowledge base. In this work, we seek to address this problem by proposing a new neural dialogue agent that is able to effectively sustain grounded, multi-domain discourse through a novel key-value retrieval mechanism. The model is end-to-end differentiable and does not need to explicitly model dialogue state or belief trackers. We also release a new dataset of 3,031 dialogues that are grounded through underlying knowledge bases and span three distinct tasks in the in-car personal assistant space: calendar scheduling, weather information retrieval, and point-of-interest navigation. Our architecture is simultaneously trained on data from all domains and significantly outperforms a competitive rule-based system and other existing neural dialogue architectures on the provided domains according to both automatic and human evaluation metrics.

2016

pdf bib
Deep Reinforcement Learning for Mention-Ranking Coreference Models
Kevin Clark | Christopher D. Manning
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf bib
Compression of Neural Machine Translation Models via Pruning
Abigail See | Minh-Thang Luong | Christopher D. Manning
Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning

pdf bib abs
A comparison of Named-Entity Disambiguation and Word Sense Disambiguation
Angel Chang | Valentin I. Spitkovsky | Christopher D. Manning | Eneko Agirre
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Named Entity Disambiguation (NED) is the task of linking a named-entity mention to an instance in a knowledge-base, typically Wikipedia-derived resources like DBpedia. This task is closely related to word-sense disambiguation (WSD), where the mention of an open-class word is linked to a concept in a knowledge-base, typically WordNet. This paper analyzes the relation between two annotated datasets on NED and WSD, highlighting the commonalities and differences. We detail the methods to construct a NED system following the WSD word-expert approach, where we need a dictionary and one classifier is built for each target entity mention string. Constructing a dictionary for NED proved challenging, and although similarity and ambiguity are higher for NED, the results are also higher due to the larger amount of training data and the crisper and more skewed meaning differences.

Cross-linguistically consistent annotation is necessary for sound comparative evaluation and cross-lingual learning experiments. It is also useful for multilingual system development and comparative linguistic studies. Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. In this paper, we describe v1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages.

pdf bib abs
Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks
Sebastian Schuster | Christopher D. Manning
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Many shallow natural language understanding tasks use dependency trees to extract relations between content words. However, strict surface-structure dependency trees tend to follow the linguistic structure of sentences too closely and frequently fail to provide direct relations between content words. To mitigate this problem, the original Stanford Dependencies representation also defines two dependency graph representations which contain additional and augmented relations that explicitly capture otherwise implicit relations between content words. In this paper, we revisit and extend these dependency graph representations in light of the recent Universal Dependencies (UD) initiative and provide a detailed account of an enhanced and an enhanced++ English UD representation. We further present a converter from constituency to basic, i.e., strict surface structure, UD trees, and a converter from basic UD trees to enhanced and enhanced++ English UD graphs. We release both converters as part of Stanford CoreNLP and the Stanford Parser.

pdf bib
Combining Natural Logic and Shallow Reasoning for Question Answering
Gabor Angeli | Neha Nayak Kennard | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Improving Coreference Resolution by Learning Entity-Level Distributed Representations
Kevin Clark | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models
Minh-Thang Luong | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Fast Unified Model for Parsing and Sentence Understanding
Samuel R. Bowman | Jon Gauthier | Abhinav Rastogi | Raghav Gupta | Christopher D. Manning | Christopher Potts
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
Danqi Chen | Jason Bolton | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Learning Language Games through Interaction
Sida I. Wang | Percy Liang | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

bib abs
Neural Machine Translation
Thang Luong | Kyunghyun Cho | Christopher D. Manning
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

Neural Machine Translation (NMT) is a simple new architecture for getting machines to learn to translate. Despite being relatively new (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014), NMT has already shown promising results, achieving state-of-the-art performance for various language pairs (Luong et al., 2015a; Jean et al., 2015; Luong et al., 2015b; Sennrich et al., 2016; Luong and Manning, 2016). While many of these NMT papers were presented to the ACL community, research and practice of NMT are only at their beginning stage. This tutorial would be a great opportunity for the whole community of machine translation and natural language processing to learn more about a very promising new approach to MT. This tutorial has four parts. In the first part, we start with an overview of MT approaches, including: (a) traditional methods that have been dominant over the past twenty years and (b) recent hybrid models with the use of neural network components. From these, we motivate why an end-to-end approach like neural machine translation is needed. The second part introduces a basic instance of NMT. We start out with a discussion of recurrent neural networks, including the back-propagation-through-time algorithm and stochastic gradient descent optimizers, as these are the foundation on which NMT builds. We then describe in detail the basic sequence-to-sequence architecture of NMT (Cho et al., 2014; Sutskever et al., 2014), the maximum likelihood training approach, and a simple beam-search decoder to produce translations. The third part of our tutorial describes techniques to build state-of-the-art NMT. We start with approaches to extend the vocabulary coverage of NMT (Luong et al., 2015a; Jean et al., 2015; Chitnis and DeNero, 2015). We then introduce the idea of jointly learning both translations and alignments through an attention mechanism (Bahdanau et al., 2015); other variants of attention (Luong et al., 2015b; Tu et al., 2016) are discussed too. We describe a recent trend in NMT, that is to translate at the sub-word level (Chung et al., 2016; Luong and Manning, 2016; Sennrich et al., 2016), so that language variations can be effectively handled. We then give tips on training and testing NMT systems such as batching and ensembling. In the final part of the tutorial, we briefly describe promising approaches, such as (a) how to combine multiple tasks to help translation (Dong et al., 2015; Luong et al., 2016; Firat et al., 2016; Zoph and Knight, 2016) and (b) how to utilize monolingual corpora (Sennrich et al., 2016). Lastly, we conclude with challenges that remain to be solved for future NMT. PS: we would also like to acknowledge the very first paper by Forcada and Ñeco (1997) on sequence-to-sequence models for translation!

pdf bib
Evaluating Word Embeddings Using a Representative Suite of Practical Tasks
Neha Nayak Kennard | Gabor Angeli | Christopher D. Manning
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

2015

pdf bib
Stanford neural machine translation systems for spoken language domains
Minh-Thang Luong | Christopher Manning
Proceedings of the 12th International Workshop on Spoken Language Translation: Evaluation Campaign

pdf bib
A large annotated corpus for learning natural language inference
Samuel R. Bowman | Gabor Angeli | Christopher Potts | Christopher D. Manning
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Effective Approaches to Attention-based Neural Machine Translation
Thang Luong | Hieu Pham | Christopher D. Manning
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf bib
Last Words: Computational Linguistics and Deep Learning
Christopher D. Manning
Computational Linguistics, Volume 41, Issue 4 - December 2015

pdf bib
Deep Neural Language Models for Machine Translation
Thang Luong | Michael Kayser | Christopher D. Manning
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf bib
Distributed Representations of Words to Guide Bootstrapped Entity Classifiers
Sonal Gupta | Christopher D. Manning
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Text to 3D Scene Generation with Rich Lexical Grounding
Angel Chang | Will Monroe | Manolis Savva | Christopher Potts | Christopher D. Manning
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Leveraging Linguistic Structure For Open Domain Information Extraction
Gabor Angeli | Melvin Jose Johnson Premkumar | Christopher D. Manning
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Robust Subgraph Generation Improves Abstract Meaning Representation Parsing
Keenon Werling | Gabor Angeli | Christopher D. Manning
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Entity-Centric Coreference Resolution with Model Stacking
Kevin Clark | Christopher D. Manning
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
Kai Sheng Tai | Richard Socher | Christopher D. Manning
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Learning Distributed Representations for Multilingual Text Sequences
Hieu Pham | Thang Luong | Christopher Manning
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

pdf bib
Bilingual Word Representations with Monolingual Quality in Mind
Thang Luong | Hieu Pham | Christopher D. Manning
Proceedings of the 1st Workshop on Vector Space Modeling for Natural Language Processing

pdf bib
Invited Talk: The Case for Universal Dependencies
Christopher Manning
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

pdf bib
Does Universal Dependencies need a parsing representation? An investigation of English
Natalia Silveira | Christopher Manning
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

pdf bib
Generating Semantically Precise Scene Graphs from Textual Descriptions for Improved Image Retrieval
Sebastian Schuster | Ranjay Krishna | Angel Chang | Li Fei-Fei | Christopher D. Manning
Proceedings of the Fourth Workshop on Vision and Language

pdf bib
Recursive Neural Networks Can Learn Logical Semantics
Samuel R. Bowman | Christopher Potts | Christopher D. Manning
Proceedings of the 3rd Workshop on Continuous Vector Space Models and their Compositionality

2014

pdf bib
NaturalLI: Natural Logic Inference for Common Sense Reasoning
Gabor Angeli | Christopher D. Manning
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
A Fast and Accurate Dependency Parser using Neural Networks
Danqi Chen | Christopher Manning
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Human Effort and Machine Learnability in Computer Aided Translation
Spence Green | Sida I. Wang | Jason Chuang | Jeffrey Heer | Sebastian Schuster | Christopher D. Manning
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
GloVe: Global Vectors for Word Representation
Jeffrey Pennington | Richard Socher | Christopher Manning
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Combining Distant and Partial Supervision for Relation Extraction
Gabor Angeli | Julie Tibshirani | Jean Wu | Christopher D. Manning
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib
Learning Spatial Knowledge for Text to 3D Scene Generation
Angel Chang | Manolis Savva | Christopher D. Manning
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf bib abs
Universal Stanford dependencies: A cross-linguistic typology
Marie-Catherine de Marneffe | Timothy Dozat | Natalia Silveira | Katri Haverinen | Filip Ginter | Joakim Nivre | Christopher D. Manning
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Revisiting the now de facto standard Stanford dependency representation, we propose an improved taxonomy to capture grammatical relations across languages, including morphologically rich ones. We suggest a two-layered taxonomy: a set of broadly attested universal grammatical relations, to which language-specific relations can be added. We emphasize the lexicalist stance of the Stanford Dependencies, which leads to a particular, partially new treatment of compounding, prepositions, and morphology. We show how existing dependency schemes for several languages map onto the universal taxonomy proposed here and close with consideration of practical implications of dependency representation choices for NLP applications, in particular parsing.

We present a gold standard annotation of syntactic dependencies in the English Web Treebank corpus using the Stanford Dependencies formalism. This resource addresses the lack of a gold standard dependency treebank for English, as well as the limited availability of gold standard syntactic annotations for English informal text genres. We also present experiments on the use of this resource, both for training dependency parsers and for evaluating the quality of different versions of the Stanford Parser, which includes a converter tool to produce dependency annotation from constituency trees. We show that training a dependency parser on a mix of newswire and web data leads to better performance on that type of data without hurting performance on newswire text, and therefore gold standard annotations for non-canonical text can be a valuable resource for parsing. Furthermore, the systematic annotation effort has informed both the SD formalism and its implementation in the Stanford Parser’s dependency converter. In response to the challenges encountered by annotators in the EWT corpus, the formalism has been revised and extended, and the converter has been improved.

pdf bib abs
Event Extraction Using Distant Supervision
Kevin Reschke | Martin Jankowiak | Mihai Surdeanu | Christopher Manning | Daniel Jurafsky
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Distant supervision is a successful paradigm that gathers training data for information extraction systems by automatically aligning vast databases of facts with text. Previous work has demonstrated its usefulness for the extraction of binary relations such as a person’s employer or a film’s director. Here, we extend the distant supervision approach to template-based event extraction, focusing on the extraction of passenger counts, aircraft types, and other facts concerning airplane crash events. We present a new publicly available dataset and event extraction task in the plane crash domain based on Wikipedia infoboxes and newswire text. Using this dataset, we conduct a preliminary evaluation of four distantly supervised extraction models which assign named entity mentions in text to entries in the event template. Our results indicate that joint inference over sequences of candidate entity mentions is beneficial. Furthermore, we demonstrate that the Searn algorithm outperforms a linear-chain CRF and strong baselines with local inference.

pdf bib
Robust Logistic Regression using Shift Parameters
Julie Tibshirani | Christopher D. Manning
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Faster Phrase-Based Decoding by Refining Feature State
Kenneth Heafield | Michael Kayser | Christopher D. Manning
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Two Knives Cut Better Than One: Chinese Word Segmentation with Dual Decomposition
Mengqiu Wang | Rob Voigt | Christopher D. Manning
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Word Segmentation of Informal Arabic with Domain Adaptation
Will Monroe | Spence Green | Christopher D. Manning
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
The Stanford CoreNLP Natural Language Processing Toolkit
Christopher Manning | Mihai Surdeanu | John Bauer | Jenny Finkel | Steven Bethard | David McClosky
Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib abs
Cross-lingual Projected Expectation Regularization for Weakly Supervised Learning
Mengqiu Wang | Christopher D. Manning
Transactions of the Association for Computational Linguistics, Volume 2

We consider a multilingual weakly supervised learning scenario where knowledge from annotated corpora in a resource-rich language is transferred via bitext to guide the learning in other languages. Past approaches project labels across bitext and use them as features or gold labels for training. We propose a new method that projects model expectations rather than labels, which facilitates transfer of model uncertainty across language boundaries. We encode expectations as constraints and train a discriminative CRF model using Generalized Expectation Criteria (Mann and McCallum, 2010). Evaluated on standard Chinese-English and German-English NER datasets, our method demonstrates F1 scores of 64% and 60% when no labeled data is used. Attaining the same accuracy with supervised CRFs requires 12k and 1.5k labeled sentences. Furthermore, when combined with labeled examples, our method yields significant improvements over state-of-the-art supervised methods, achieving best reported numbers to date on Chinese OntoNotes and German CoNLL-03 datasets.

pdf bib abs
Grounded Compositional Semantics for Finding and Describing Images with Sentences
Richard Socher | Andrej Karpathy | Quoc V. Le | Christopher D. Manning | Andrew Y. Ng
Transactions of the Association for Computational Linguistics, Volume 2

Previous work on Recursive Neural Networks (RNNs) shows that these models can produce compositional feature vectors for accurately representing and classifying sentences or images. However, the sentence vectors of previous models cannot accurately represent visually grounded meaning. We introduce the DT-RNN model which uses dependency trees to embed sentences into a vector space in order to retrieve images that are described by those sentences. Unlike previous RNN-based models which use constituency trees, DT-RNNs naturally focus on the action and agents in a sentence. They are better able to abstract from the details of word order and syntactic expression. DT-RNNs outperform other recursive and recurrent neural networks, kernelized CCA and a bag-of-words baseline on the tasks of finding an image that fits a sentence description and vice versa. They also give more similar representations to sentences that describe the same image.

pdf bib
Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)
Alexandre Allauzen | Raffaella Bernardi | Edward Grefenstette | Hugo Larochelle | Christopher Manning | Scott Wen-tau Yih
Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)

pdf bib
Improved Pattern Learning for Bootstrapped Entity Extraction
Sonal Gupta | Christopher Manning
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

pdf bib
Semantic Parsing for Text to 3D Scene Generation
Angel Chang | Manolis Savva | Christopher Manning
Proceedings of the ACL 2014 Workshop on Semantic Parsing

pdf bib
Interactive Learning of Spatial Knowledge for Text to 3D Scene Generation
Angel Chang | Manolis Savva | Christopher Manning
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces

pdf bib
SPIED: Stanford Pattern based Information Extraction and Diagnostics
Sonal Gupta | Christopher Manning
Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces

pdf bib
Phrasal: A Toolkit for New Directions in Statistical Machine Translation
Spence Green | Daniel Cer | Christopher Manning
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
Stanford University’s Submissions to the WMT 2014 Translation Task
Julia Neidert | Sebastian Schuster | Spence Green | Kenneth Heafield | Christopher Manning
Proceedings of the Ninth Workshop on Statistical Machine Translation

pdf bib
An Empirical Comparison of Features and Tuning for Phrase-based Machine Translation
Spence Green | Daniel Cer | Christopher Manning
Proceedings of the Ninth Workshop on Statistical Machine Translation

2013

pdf bib
Feature Noising for Log-Linear Structured Prediction
Sida Wang | Mengqiu Wang | Stefan Wager | Percy Liang | Christopher D. Manning
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Bilingual Word Embeddings for Phrase-Based Machine Translation
Will Y. Zou | Richard Socher | Daniel Cer | Christopher D. Manning
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf bib
Learning a Product of Experts with Elitist Lasso
Mengqiu Wang | Christopher D. Manning
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Effect of Non-linear Deep Architecture in Sequence Labeling
Mengqiu Wang | Christopher D. Manning
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf bib
Parsing Models for Identifying Multiword Expressions
Spence Green | Marie-Catherine de Marneffe | Christopher D. Manning
Computational Linguistics, Volume 39, Issue 1 - March 2013

pdf bib
Named Entity Recognition with Bilingual Constraints
Wanxiang Che | Mengqiu Wang | Christopher D. Manning | Ting Liu
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Deep Learning for NLP (without Magic)
Richard Socher | Christopher D. Manning
NAACL HLT 2013 Tutorial Abstracts

pdf bib
Fast and Adaptive Online Training of Feature-Rich Translation Models
Spence Green | Sida Wang | Daniel Cer | Christopher D. Manning
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Parsing with Compositional Vector Grammars
Richard Socher | John Bauer | Christopher D. Manning | Andrew Y. Ng
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition
Mengqiu Wang | Wanxiang Che | Christopher D. Manning
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
SUTime: Evaluation in TempEval-3
Angel Chang | Christopher D. Manning
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

pdf bib
Positive Diversity Tuning for Machine Translation System Combination
Daniel Cer | Christopher D. Manning | Dan Jurafsky
Proceedings of the Eighth Workshop on Statistical Machine Translation

pdf bib
Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality
Alexandre Allauzen | Hugo Larochelle | Christopher Manning | Richard Socher
Proceedings of the Workshop on Continuous Vector Space Models and their Compositionality

pdf bib
Better Word Representations with Recursive Neural Networks for Morphology
Thang Luong | Richard Socher | Christopher Manning
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

pdf bib
Philosophers are Mortal: Inferring the Truth of Unseen Facts
Gabor Angeli | Christopher Manning
Proceedings of the Seventeenth Conference on Computational Natural Language Learning

pdf bib
More Constructions, More Genres: Extending Stanford Dependencies
Marie-Catherine de Marneffe | Miriam Connor | Natalia Silveira | Samuel R. Bowman | Timothy Dozat | Christopher D. Manning
Proceedings of the Second International Conference on Dependency Linguistics (DepLing 2013)

2012

pdf bib
Multi-instance Multi-label Learning for Relation Extraction
Mihai Surdeanu | Julie Tibshirani | Ramesh Nallapati | Christopher D. Manning
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Learning Constraints for Consistent Timeline Extraction
David McClosky | Christopher D. Manning
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Probabilistic Finite State Machines for Regression-based MT Evaluation
Mengqiu Wang | Christopher D. Manning
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Semantic Compositionality through Recursive Matrix-Vector Spaces
Richard Socher | Brody Huval | Christopher D. Manning | Andrew Y. Ng
Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning

pdf bib
Did It Happen? The Pragmatic Complexity of Veridicality Assessment
Marie-Catherine de Marneffe | Christopher D. Manning | Christopher Potts
Computational Linguistics, Volume 38, Issue 2 - June 2012

pdf bib abs
SUTime: A library for recognizing and normalizing time expressions
Angel X. Chang | Christopher Manning
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We describe SUTIME, a temporal tagger for recognizing and normalizing temporal expressions in English text. SUTIME is available as part of the Stanford CoreNLP pipeline and can be used to annotate documents with temporal information. It is a deterministic rule-based system designed for extensibility. Testing on the TempEval-2 evaluation corpus shows that this system outperforms state-of-the-art techniques.

pdf bib
Entity Clustering Across Languages
Spence Green | Nicholas Andrews | Matthew R. Gormley | Mark Dredze | Christopher D. Manning
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Parsing Time: Learning to Interpret Time Expressions
Gabor Angeli | Christopher Manning | Daniel Jurafsky
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Improving Word Representations via Global Context and Multiple Word Prototypes
Eric Huang | Richard Socher | Christopher Manning | Andrew Ng
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Baselines and Bigrams: Simple, Good Sentiment and Topic Classification
Sida Wang | Christopher Manning
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Deep Learning for NLP (without Magic)
Richard Socher | Yoshua Bengio | Christopher D. Manning
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib
SPEDE: Probabilistic Edit Distance Metrics for MT Evaluation
Mengqiu Wang | Christopher Manning
Proceedings of the Seventh Workshop on Statistical Machine Translation

pdf bib
Accurate Unsupervised Joint Named-Entity Extraction from Unaligned Parallel Text
Robert Munro | Christopher D. Manning
Proceedings of the 4th Named Entity Workshop (NEWS) 2012

2011

pdf bib
Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions
Richard Socher | Jeffrey Pennington | Eric H. Huang | Andrew Y. Ng | Christopher D. Manning
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French
Spence Green | Marie-Catherine de Marneffe | John Bauer | Christopher D. Manning
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf bib
Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers
Sonal Gupta | Christopher Manning
Proceedings of 5th International Joint Conference on Natural Language Processing

pdf bib
Event Extraction as Dependency Parsing
David McClosky | Mihai Surdeanu | Christopher Manning
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf bib
Proceedings of the Fifteenth Conference on Computational Natural Language Learning
Sharon Goldwater | Christopher Manning
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

pdf bib
Customizing an Information Extraction System to a New Domain
Mihai Surdeanu | David McClosky | Mason Smith | Andrey Gusev | Christopher Manning
Proceedings of the ACL 2011 Workshop on Relational Models of Semantics

pdf bib
Event Extraction as Dependency Parsing for BioNLP 2011
David McClosky | Mihai Surdeanu | Christopher Manning
Proceedings of BioNLP Shared Task 2011 Workshop

pdf bib
Model Combination for Event Extraction in BioNLP 2011
Sebastian Riedel | David McClosky | Mihai Surdeanu | Andrew McCallum | Christopher D. Manning
Proceedings of BioNLP Shared Task 2011 Workshop

2010

pdf bib
Better Arabic Parsing: Baselines, Evaluations, and Analysis
Spence Green | Christopher D. Manning
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib
Probabilistic Tree-Edit Models with Structured Latent Variables for Textual Entailment and Question Answering
Mengqiu Wang | Christopher Manning
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf bib abs
Parsing to Stanford Dependencies: Trade-offs between Speed and Accuracy
Daniel Cer | Marie-Catherine de Marneffe | Dan Jurafsky | Chris Manning
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We investigate a number of approaches to generating Stanford Dependencies, a widely used semantically-oriented dependency representation. We examine algorithms specifically designed for dependency parsing (Nivre, Nivre Eager, Covington, Eisner, and RelEx) as well as dependencies extracted from constituent parse trees created by phrase structure parsers (Charniak, Charniak-Johnson, Bikel, Berkeley and Stanford). We found that constituent parsers systematically outperform algorithms designed specifically for dependency parsing. The most accurate method for generating dependencies is the Charniak-Johnson reranking parser, with 89% (labeled) attachment F1 score. The fastest methods are Nivre, Nivre Eager, and Covington, used with a linear classifier to make local parsing decisions, which can parse the entire Penn Treebank development set (section 22) in less than 10 seconds on an Intel Xeon E5520. However, this speed comes with a substantial drop in F1 score (about 76% for labeled attachment) compared to competing methods. By tuning how much of the search space is explored by the Charniak-Johnson parser, we are able to arrive at a balanced configuration that is both fast and nearly as good as the most accurate approaches.

pdf bib
Subword Variation in Text Message Classification
Robert Munro | Christopher D. Manning
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
The Best Lexical Metric for Phrase-Based Statistical MT System Optimization
Daniel Cer | Christopher D. Manning | Daniel Jurafsky
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Ensemble Models for Dependency Parsing: Cheap and Good?
Mihai Surdeanu | Christopher D. Manning
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Improved Models of Distortion Cost for Statistical Machine Translation
Spence Green | Michel Galley | Christopher D. Manning
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Accurate Non-Hierarchical Phrase-Based Translation
Michel Galley | Christopher D. Manning
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Phrasal: A Statistical Machine Translation Toolkit for Exploring New Model Features
Daniel Cer | Michel Galley | Daniel Jurafsky | Christopher D. Manning
Proceedings of the NAACL HLT 2010 Demonstration Session

pdf bib
“Was It Good? It Was Provocative.” Learning the Meaning of Scalar Adjectives
Marie-Catherine de Marneffe | Christopher D. Manning | Christopher Potts
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data
Jenny Rose Finkel | Christopher D. Manning
Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics

pdf bib
Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading
Rutu Mulkar-Mehta | James Allen | Jerry Hobbs | Eduard Hovy | Bernardo Magnini | Chris Manning
Proceedings of the NAACL HLT 2010 First International Workshop on Formalisms and Methodology for Learning by Reading

pdf bib
Viterbi Training Improves Unsupervised Dependency Parsing
Valentin I. Spitkovsky | Hiyan Alshawi | Daniel Jurafsky | Christopher D. Manning
Proceedings of the Fourteenth Conference on Computational Natural Language Learning

2009

pdf bib
NP Subject Detection in Verb-initial Arabic Clauses
Spence Green | Conal Sathi | Christopher D. Manning
Proceedings of the Third Workshop on Computational Approaches to Arabic-Script-based Languages (CAASL3)

pdf bib
Nested Named Entity Recognition
Jenny Rose Finkel | Christopher D. Manning
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora
Daniel Ramage | David Hall | Ramesh Nallapati | Christopher D. Manning
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

pdf bib
Joint Parsing and Named Entity Recognition
Jenny Rose Finkel | Christopher D. Manning
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Hierarchical Bayesian Domain Adaptation
Jenny Rose Finkel | Christopher D. Manning
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf bib
Robust Machine Translation Evaluation with Entailment Features
Sebastian Padó | Michel Galley | Dan Jurafsky | Christopher D. Manning
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Quadratic-Time Dependency Parsing for Machine Translation
Michel Galley | Christopher D. Manning
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP

pdf bib
Machine Translation Evaluation with Textual Entailment Features
Sebastian Padó | Michel Galley | Daniel Jurafsky | Christopher D. Manning
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
Disambiguating “DE” for Chinese-English Machine Translation
Pi-Chuan Chang | Daniel Jurafsky | Christopher D. Manning
Proceedings of the Fourth Workshop on Statistical Machine Translation

pdf bib
Discriminative Reordering with Chinese Grammatical Relations Features
Pi-Chuan Chang | Huihsin Tseng | Dan Jurafsky | Christopher D. Manning
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009

pdf bib
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)
Chris Callison-Burch | Ido Dagan | Christopher Manning | Marco Pennacchiotti | Fabio Massimo Zanzotto
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)

pdf bib
Multi-word expressions in textual inference: Much ado about nothing?
Marie-Catherine de Marneffe | Sebastian Padó | Christopher D. Manning
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)

pdf bib
Presupposed Content and Entailments in Natural Language Inference
David Clausen | Christopher D. Manning
Proceedings of the 2009 Workshop on Applied Textual Inference (TextInfer)

pdf bib
Random Walks for Text Semantic Similarity
Daniel Ramage | Anna N. Rafferty | Christopher D. Manning
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)

pdf bib
WikiWalk: Random walks on Wikipedia for Semantic Relatedness
Eric Yeh | Daniel Ramage | Christopher D. Manning | Eneko Agirre | Aitor Soroa
Proceedings of the 2009 Workshop on Graph-based Methods for Natural Language Processing (TextGraphs-4)

pdf bib
An extended model of natural logic
Bill MacCartney | Christopher D. Manning
Proceedings of the Eight International Conference on Computational Semantics

2008

pdf bib
Modeling Semantic Containment and Exclusion in Natural Language Inference
Bill MacCartney | Christopher D. Manning
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf bib
Studying the History of Ideas Using Topic Models
David Hall | Daniel Jurafsky | Christopher D. Manning
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
Legal Docket Classification: Where Machine Learning Stumbles
Ramesh Nallapati | Christopher D. Manning
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Phrase-Based Alignment Model for Natural Language Inference
Bill MacCartney | Michel Galley | Christopher D. Manning
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Simple and Effective Hierarchical Phrase Reordering Model
Michel Galley | Christopher D. Manning
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

pdf bib
A Global Joint Model for Semantic Role Labeling
Kristina Toutanova | Aria Haghighi | Christopher D. Manning
Computational Linguistics, Volume 34, Number 2, June 2008 - Special Issue on Semantic Role Labeling

Lexicon schemas and their use are discussed in this paper from the perspective of lexicographers and field linguists. A variety of lexicon schemas have been developed, with goals ranging from computational lexicography (DATR) through archiving (LIFT, TEI) to standardization (LMF, FSR). A number of requirements for lexicon schemas are given. The lexicon schemas are introduced and compared to each other in terms of conversion and usability for this particular user group, using a common lexicon entry and providing examples for each schema under consideration. The formats are assessed and the final recommendation is given for the potential users, namely to request standard compliance from the developers of the tools used. This paper should foster a discussion between authors of standards, lexicographers and field linguists.

pdf bib
Which Words Are Hard to Recognize? Prosodic, Lexical, and Disfluency Factors that Increase ASR Error Rates
Sharon Goldwater | Dan Jurafsky | Christopher D. Manning
Proceedings of ACL-08: HLT

pdf bib
Efficient, Feature-based, Conditional Random Field Parsing
Jenny Rose Finkel | Alex Kleeman | Christopher D. Manning
Proceedings of ACL-08: HLT

pdf bib
Finding Contradictions in Text
Marie-Catherine de Marneffe | Anna N. Rafferty | Christopher D. Manning
Proceedings of ACL-08: HLT

pdf bib
Enforcing Transitivity in Coreference Resolution
Jenny Rose Finkel | Christopher D. Manning
Proceedings of ACL-08: HLT, Short Papers

pdf bib
Regularization and Search for Minimum Error Rate Training
Daniel Cer | Dan Jurafsky | Christopher D. Manning
Proceedings of the Third Workshop on Statistical Machine Translation

pdf bib
Optimizing Chinese Word Segmentation for Machine Translation Performance
Pi-Chuan Chang | Michel Galley | Christopher D. Manning
Proceedings of the Third Workshop on Statistical Machine Translation

pdf bib
Parsing Three German Treebanks: Lexicalized and Unlexicalized Baselines
Anna Rafferty | Christopher D. Manning
Proceedings of the Workshop on Parsing German

pdf bib
The Stanford Typed Dependencies Representation
Marie-Catherine de Marneffe | Christopher D. Manning
Coling 2008: Proceedings of the workshop on Cross-Framework and Cross-Domain Parser Evaluation

2007

pdf bib
The Infinite Tree
Jenny Rose Finkel | Trond Grenager | Christopher D. Manning
Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics

pdf bib
Natural Logic for Textual Inference
Bill MacCartney | Christopher D. Manning
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing

2006

pdf bib abs
Generating Typed Dependency Parses from Phrase Structure Parses
Marie-Catherine de Marneffe | Bill MacCartney | Christopher D. Manning
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes a system for extracting typed dependency parses of English sentences from phrase structure parses. In order to capture inherent relations occurring in corpus texts that can be critical in real-world applications, many NP relations are included in the set of grammatical relations used. We provide a comparison of our system with Minipar and the Link parser. The typed dependency extraction facility described here is integrated in the Stanford Parser, available for download.

pdf bib
Learning to recognize features of valid textual entailments
Bill MacCartney | Trond Grenager | Marie-Catherine de Marneffe | Daniel Cer | Christopher D. Manning
Proceedings of the Human Language Technology Conference of the NAACL, Main Conference

pdf bib
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts
Chris Manning | Doug Oard | Jim Glass
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts

pdf bib
An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition
Vijay Krishnan | Christopher D. Manning
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

pdf bib
Unsupervised Discovery of a Statistical Verb Lexicon
Trond Grenager | Christopher D. Manning
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

pdf bib
Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines
Jenny Rose Finkel | Christopher D. Manning | Andrew Y. Ng
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing

We introduce a novel parser based on a probabilistic version of a left-corner parser. The left-corner strategy is attractive because rule probabilities can be conditioned on both top-down goals and bottom-up derivations. We develop the underlying theory and explain how a grammar can be induced from analyzed data. We show that the left-corner approach provides an advantage over simple top-down probabilistic context-free grammars in parsing the Wall Street Journal using a grammar induced from the Penn Treebank. We also conclude that the Penn Treebank provides a fairly weak test bed due to the flatness of its bracketings and to the obvious overgeneration and undergeneration of its induced grammar.