Tal Linzen - ACL Anthology

Tal Linzen

2025

Findings of the Third BabyLM Challenge: Accelerating Language Modeling Research with Cognitively Plausible Data
Lucas Charpentier | Leshem Choshen | Ryan Cotterell | Mustafa Omer Gul | Michael Y. Hu | Jing Liu | Jaap Jumelet | Tal Linzen | Aaron Mueller | Candace Ross | Raj Sanjay Shah | Alex Warstadt | Ethan Gotlieb Wilcox | Adina Williams
Proceedings of the First BabyLM Workshop

This report summarizes the findings from the 3rd BabyLM Challenge. The BabyLM Challenge is a shared task aimed at closing the data efficiency gap between human and machine language learners. This year, the challenge was held as part of an expanded BabyLM Workshop that invited paper submissions on topics relevant to the BabyLM effort, including sample-efficient pretraining and cognitive modeling for LMs. For the challenge, we kept the text-only and text–image tracks from previous years, but also introduced a new interaction track, where student models are allowed to learn from feedback from larger teacher models. Furthermore, we introduce a new set of evaluation tasks to assess the “human likeness” of models on a cognitive and linguistic level, limit the total amount of training compute allowed, and measure performance on intermediate checkpoints. We observe that new training objectives and architectures tend to produce the best-performing approaches, and that interaction with teacher models can yield high-quality language models. The strict-small and interaction tracks saw submissions that outperformed the baselines. We do not observe a complete correlation between training FLOPs and performance. This year’s BabyLM Challenge shows that there is still room to innovate in a data-constrained setting, and that community-driven research can yield actionable insights for language modeling.

Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases
Michael Y. Hu | Jackson Petty | Chuan Shi | William Merrill | Tal Linzen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Pretraining language models on formal language can improve their acquisition of natural language. Which features of the formal language impart an inductive bias that leads to effective transfer? Drawing on insights from linguistics and complexity theory, we hypothesize that effective transfer occurs when two conditions are met: the formal language should capture the dependency structures present in natural language, and it should remain within the computational limitations of the model architecture. We experiment with pre-pretraining (training on formal language before natural languages) on transformers and find that formal languages capturing hierarchical dependencies indeed enable language models to achieve lower loss on natural language and better linguistic generalization compared to other formal languages. We also find modest support for the hypothesis that the formal language should fall within the computational limitations of the architecture. Strikingly, pre-pretraining reduces loss more efficiently than training on a matched amount of natural language. For a 1B-parameter language model trained on roughly 1.6B tokens of natural language, pre-pretraining achieves the same loss and better linguistic generalization with a 33% smaller token budget. Finally, we also give mechanistic evidence of transfer from formal tonatural language: attention heads acquired during pre-pretraining remain crucial for the model’s performance on syntactic evaluations.

What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length
Lindia Tjuatja | Graham Neubig | Tal Linzen | Sophie Hao
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

When comparing the linguistic capabilities of language models (LMs) with humans using LM probabilities, factors such as the length of the sequence and the unigram frequency of lexical items have a significant effect on LM probabilities in ways that humans are largely robust to. Prior works in comparing LM and human acceptability judgments treat these effects uniformly across models, making a strong assumption that models require the same degree of adjustment to control for length and unigram frequency effects. We propose MORCELA, a new linking theory between LM scores and acceptability judgments where the optimal level of adjustment for these effects is estimated from data via learned parameters for length and unigram frequency. We first show that MORCELA outperforms a commonly used linking theory for acceptability—SLOR (Pauls and Klein, 2012; Lau et al., 2017)—across two families of transformer LMs (Pythia and OPT). Furthermore, we demonstrate that the assumed degrees of adjustment in SLOR for length and unigram frequency overcorrect for these confounds, and that larger models require a lower relative degree of adjustment for unigram frequency, though a significant amount of adjustment is still necessary for all models. Finally, our subsequent analysis shows that larger LMs’ lower susceptibility to frequency effects can be explained by an ability to better predict rarer words in context.

Proceedings of the First BabyLM Workshop
Lucas Charpentier | Leshem Choshen | Ryan Cotterell | Mustafa Omer Gul | Michael Y. Hu | Jing Liu | Jaap Jumelet | Tal Linzen | Aaron Mueller | Candace Ross | Raj Sanjay Shah | Alex Warstadt | Ethan Gotlieb Wilcox | Adina Williams
Proceedings of the First BabyLM Workshop

Multilingual Prompting for Improving LLM Generation Diversity
Qihan Wang | Shidong Pan | Tal Linzen | Emily Black
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Large Language Models (LLMs) are known to lack cultural representation and overall diversity in their generations, from expressing opinions to answering factual questions. To mitigate this problem, we propose multilingual prompting: a prompting method which generates several variations of a base prompt with added cultural and linguistic cues from several cultures, generates responses, and then combines the results. Building on evidence that LLMs have language-specific knowledge, multilingual prompting seeks to increase diversity by activating a broader range of cultural knowledge embedded in model training data. Through experiments across multiple models (GPT-4o, GPT-4o-mini, LLaMA 70B, and LLaMA 8B), we show that multilingual prompting consistently outperforms existing diversity-enhancing techniques such as high-temperature sampling, step-by-step recall, and persona prompting. Further analyses show that the benefits of multilingual prompting vary between high and low resource languages and across model sizes, and that aligning the prompting language with cultural cues reduces hallucination about culturally-specific information.

Rapid Word Learning Through Meta In-Context Learning
Wentao Wang | Guangyuan Jiang | Tal Linzen | Brenden Lake
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Humans can quickly learn a new word from a few illustrative examples, and then systematically and flexibly use it in novel contexts. Yet the abilities of current language models for few-shot word learning, and methods for improving these abilities, are underexplored. In this study, we introduce a novel method, Meta-training for IN-context learNing Of Words (Minnow). This method trains language models to generate new examples of a word’s usage given a few in-context examples, using a special placeholder token to represent the new word. This training is repeated on many new words to develop a general word-learning ability. We find that training models from scratch with Minnow on human-scale child-directed language enables strong few-shot word learning, comparable to a large language model (LLM) pre-trained on orders of magnitude more data. Furthermore, through discriminative and generative evaluations, we demonstrate that finetuning pre-trained LLMs with Minnow improves their ability to discriminate between new words, identify syntactic categories of new words, and generate reasonable new usages and definitions for new words, based on one or a few in-context examples. These findings highlight the data efficiency of Minnow and its potential to improve language model performance in word learning tasks.

2024

Do Language Models’ Words Refer?
Matthew Mandelkern | Tal Linzen
Computational Linguistics, Volume 50, Issue 3 - September 2024

What do language models (LMs) do with language? They can produce sequences of (mostly) coherent strings closely resembling English. But do those sentences mean something, or are LMs simply babbling in a convincing simulacrum of language use? We address one aspect of this broad question: whether LMs’ words can refer, that is, achieve “word-to-world” connections. There is prima facie reason to think they do not, since LMs do not interact with the world in the way that ordinary language users do. Drawing on the externalist tradition in philosophy of language, we argue that those appearances are misleading: Even if the inputs to LMs are simply strings of text, they are strings of text with natural histories, and that may suffice for LMs’ words to refer.

A Systematic Comparison of Syllogistic Reasoning in Humans and Language Models
Tiwalayo Eisape | Michael Tessler | Ishita Dasgupta | Fei Sha | Sjoerd Steenkiste | Tal Linzen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

A central component of rational behavior is logical inference: the process of determining which conclusions follow from a set of premises. Psychologists have documented several ways in which humans’ inferences deviate from the rules of logic. Do language models, which are trained on text generated by humans, replicate such human biases, or are they able to overcome them? Focusing on the case of syllogisms—inferences from two simple premises—we show that, within the PaLM 2 family of transformer language models, larger models are more logical than smaller ones, and also more logical than humans. At the same time, even the largest models make systematic errors, some of which mirror human reasoning biases: they show sensitivity to the (irrelevant) ordering of the variables in the syllogism, and draw confident but incorrect inferences from particular syllogisms (syllogistic fallacies). Overall, we find that language models often mimic the human biases included in their training data, but are able to overcome them in some cases.

Can You Learn Semantics Through Next-Word Prediction? The Case of Entailment
William Merrill | Zhaofeng Wu | Norihito Naka | Yoon Kim | Tal Linzen
Findings of the Association for Computational Linguistics: ACL 2024

Do LMs infer the semantics of text from co-occurrence patterns in their training data? Merrill et al. (2022) argue that, in theory, sentence co-occurrence probabilities predicted by an optimal LM should reflect the entailment relationship of the constituent sentences, but it is unclear whether probabilities predicted by neural LMs encode entailment in this way because of strong assumptions made by Merrill et al. (namely, that humans always avoid redundancy). In this work, we investigate whether their theory can be used to decode entailment relations from neural LMs. We find that a test similar to theirs can decode entailment relations between natural sentences, well above random chance, though not perfectly, across many datasets and LMs. This suggests LMs implicitly model aspects of semantics to predict semantic effects on sentence co-occurrence patterns. However, we find the test that predicts entailment in practice works in the opposite direction to the theoretical test. We thus revisit the assumptions underlying the original test, finding its derivation did not adequately account for redundancy in human-written text. We argue that better accounting for redundancy related to *explanations* might derive the observed flipped test and, more generally, improve computational models of speakers in linguistics.

The Impact of Depth on Compositional Generalization in Transformer Language Models
Jackson Petty | Sjoerd Steenkiste | Ishita Dasgupta | Fei Sha | Dan Garrette | Tal Linzen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

To process novel sentences, language models (LMs) must generalize compositionally—combine familiar elements in new ways. What aspects of a model’s structure promote compositional generalization? Focusing on transformers, we test the hypothesis, motivated by theoretical and empirical work, that deeper transformers generalize more compositionally. Simply adding layers increases the total number of parameters; to address this confound between depth and size, we construct three classes of models which trade off depth for width such that the total number of parameters is kept constant (41M, 134M and 374M parameters). We pretrain all models as LMs and fine-tune them on tasks that test for compositional generalization. We report three main conclusions: (1) after fine-tuning, deeper models generalize more compositionally than shallower models do, but the benefit of additional layers diminishes rapidly; (2) within each family, deeper models show better language modeling performance, but returns are similarly diminishing; (3) the benefits of depth for compositional generalization cannot be attributed solely to better performance on language modeling. Because model latency is approximately linear in the number of layers, these results lead us to the recommendation that, with a given total parameter budget, transformers can be made shallower than is typical without sacrificing performance.

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Michael Y. Hu | Aaron Mueller | Candace Ross | Adina Williams | Tal Linzen | Chengxu Zhuang | Ryan Cotterell | Leshem Choshen | Alex Warstadt | Ethan Gotlieb Wilcox
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less. This year, we released improved text corpora, as well as a vision-and-language corpus to facilitate research into cognitively plausible vision language models. Submissions were compared on evaluation tasks targeting grammatical ability, (visual) question answering, pragmatic abilities, and grounding, among other abilities. Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track. From 31 submissions employing diverse methods, a hybrid causal-masked language model architecture outperformed other approaches. No submissions outperformed the baselines in the multimodal track. In follow-up analyses, we found a strong relationship between training FLOPs and average performance across tasks, and that the best-performing submissions proposed changes to the training data, training objective, and model architecture. This year’s BabyLM Challenge shows that there is still significant room for innovation in this setting, in particular for image-text modeling, but community-driven research can yield actionable insights about effective strategies for small-scale language modeling.

The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning
Michael Y. Hu | Aaron Mueller | Candace Ross | Adina Williams | Tal Linzen | Chengxu Zhuang | Leshem Choshen | Ryan Cotterell | Alex Warstadt | Ethan Gotlieb Wilcox
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

In-context Learning Generalizes, But Not Always Robustly: The Case of Syntax
Aaron Mueller | Albert Webson | Jackson Petty | Tal Linzen
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

In-context learning (ICL) is now a common method for teaching large language models (LLMs) new tasks: given labeled examples in the input context, the LLM learns to perform the task without weight updates. Do models guided via ICL infer the underlying structure of the task defined by the context, or do they rely on superficial heuristics that only generalize to identically distributed examples? We address this question using transformations tasks and an NLI task that assess sensitivity to syntax—a requirement for robust language understanding. We further investigate whether out-of-distribution generalization can be improved via chain-of-thought prompting, where the model is provided with a sequence of intermediate computation steps that illustrate how the task ought to be performed. In experiments with models from the GPT, PaLM, and Llama 2 families, we find large variance across LMs. The variance is explained more by the composition of the pre-training corpus and supervision methods than by model size; in particular, models pre-trained on code generalize better, and benefit more from chain-of-thought prompting.

SPAWNing Structural Priming Predictions from a Cognitively Motivated Parser
Grusha Prasad | Tal Linzen
Proceedings of the 28th Conference on Computational Natural Language Learning

Structural priming is a widely used psycholinguistic paradigm to study human sentence representations. In this work we introduce SPAWN, a cognitively motivated parser that can generate quantitative priming predictions from contemporary theories in syntax which assume a lexicalized grammar. By generating and testing priming predictions from competing theoretical accounts, we can infer which assumptions from syntactic theory are useful for characterizing the representations humans build when processing sentences. As a case study, we use SPAWN to generate priming predictions from two theories (Whiz-Deletion and Participial-Phase) which make different assumptions about the structure of English relative clauses. By modulating the reanalysis mechanism that the parser uses and strength of the parser’s prior knowledge, we generated nine sets of predictions from each of the two theories. Then, we tested these predictions using a novel web-based comprehension-to-production priming paradigm. We found that while the some of the predictions from the Participial-Phase theory aligned with human behavior, none of the predictions from the the Whiz-Deletion theory did, thus suggesting that the Participial-Phase theory might better characterize human relative clause representations.

2023

How Much Do Language Models Copy From Their Training Data? Evaluating Linguistic Novelty in Text Generation Using RAVEN
R. Thomas McCoy | Paul Smolensky | Tal Linzen | Jianfeng Gao | Asli Celikyilmaz
Transactions of the Association for Computational Linguistics, Volume 11

Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models trained on English (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure—e.g., individual dependencies—text generated with a standard sampling scheme is substantially less novel than our baseline of human-generated text from each model’s test set. For larger-scale structure—e.g., overall sentence structure—model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages over 1,000 words long from the training set. We also perform extensive manual analysis, finding evidence that GPT-2 uses both compositional and analogical generalization mechanisms and showing that GPT-2’s novel text is usually well-formed morphologically and syntactically but has reasonably frequent semantic issues (e.g., being self-contradictory).

Neural Networks Can Learn Patterns of Island-insensitivity in Norwegian
Anastasia Kobzeva | Suhas Arehalli | Tal Linzen | Dave Kush
Proceedings of the Society for Computation in Linguistics 2023

SLOG: A Structural Generalization Benchmark for Semantic Parsing
Bingzhi Li | Lucia Donatelli | Alexander Koller | Tal Linzen | Yuekun Yao | Najoung Kim
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The goal of compositional generalization benchmarks is to evaluate how well models generalize to new complex linguistic expressions. Existing benchmarks often focus on lexical generalization, the interpretation of novel lexical items in syntactic structures familiar from training; structural generalization tasks, where a model needs to interpret syntactic structures that are themselves unfamiliar from training, are often underrepresented, resulting in overly optimistic perceptions of how well models can generalize. We introduce SLOG, a semantic parsing dataset that extends COGS (Kim and Linzen, 2020) with 17 structural generalization cases. In our experiments, the generalization accuracy of Transformer models, including pretrained ones, only reaches 40.6%, while a structure-aware parser only achieves 70.8%. These results are far from the near-perfect accuracy existing models achieve on COGS, demonstrating the role of SLOG in foregrounding the large discrepancy between models’ lexical and structural generalization capacities.

Verb Conjugation in Transformers Is Determined by Linear Encodings of Subject Number
Sophie Hao | Tal Linzen
Findings of the Association for Computational Linguistics: EMNLP 2023

Deep architectures such as Transformers are sometimes criticized for having uninterpretable “black-box” representations. We use causal intervention analysis to show that, in fact, some linguistic features are represented in a linear, interpretable format. Specifically, we show that BERT’s ability to conjugate verbs relies on a linear encoding of subject number that can be manipulated with predictable effects on conjugation accuracy. This encoding is found in the subject position at the first layer and the verb position at the last layer, but distributed across positions at middle layers, particularly when there are multiple cues to subject number.

How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases
Aaron Mueller | Tal Linzen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Accurate syntactic representations are essential for robust generalization in natural language. Recent work has found that pre-training can teach language models to rely on hierarchical syntactic features—as opposed to incorrect linear features—when performing tasks after fine-tuning. We test what aspects of pre-training are important for endowing encoder-decoder Transformers with an inductive bias that favors hierarchical syntactic generalizations. We focus on architectural features (depth, width, and number of parameters), as well as the genre and size of the pre-training corpus, diagnosing inductive biases using two syntactic transformation tasks: question formation and passivization, both in English. We find that the number of parameters alone does not explain hierarchical generalization: model depth plays greater role than model width. We also find that pre-training on simpler language, such as child-directed speech, induces a hierarchical bias using an order-of-magnitude less data than pre-training on more typical datasets based on web text or Wikipedia; this suggests that in cognitively plausible language acquisition settings, neural language models may be more data-efficient than previously thought.

Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
Alex Warstadt | Aaron Mueller | Leshem Choshen | Ethan Wilcox | Chengxu Zhuang | Juan Ciro | Rafael Mosquera | Bhargavi Paranjabe | Adina Williams | Tal Linzen | Ryan Cotterell
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

How poor is the stimulus? Evaluating hierarchical generalization in neural networks trained on child-directed speech
Aditya Yedetore | Tal Linzen | Robert Frank | R. Thomas McCoy
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

When acquiring syntax, children consistently choose hierarchical rules over competing non-hierarchical possibilities. Is this preference due to a learning bias for hierarchical structure, or due to more general biases that interact with hierarchical cues in children’s linguistic input? We explore these possibilities by training LSTMs and Transformers - two types of neural networks without a hierarchical bias - on data similar in quantity and content to children’s linguistic input: text from the CHILDES corpus. We then evaluate what these models have learned about English yes/no questions, a phenomenon for which hierarchical structure is crucial. We find that, though they perform well at capturing the surface statistics of child-directed speech (as measured by perplexity), both model types generalize in a way more consistent with an incorrect linear rule than the correct hierarchical rule. These results suggest that human-like generalization from text alone requires stronger biases than the general sequence-processing biases of standard neural network architectures.

Language Models Can Learn Exceptions to Syntactic Rules
Cara Su-Yi Leong | Tal Linzen
Proceedings of the Society for Computation in Linguistics 2023

A Language Model with Limited Memory Capacity Captures Interference in Human Sentence Processing
William Timkey | Tal Linzen
Findings of the Association for Computational Linguistics: EMNLP 2023

Two of the central factors believed to underpin human sentence processing difficulty are expectations and retrieval from working memory. A recent attempt to create a unified cognitive model integrating these two factors have relied on the parallels between the self-attention mechanism of transformer language models and cue-based retrieval theories of working memory in human sentence processing (Ryu and Lewis 2021). While the authors show that attention patterns in specialized attention heads of GPT-2 are consistent with a key prediction of cue-based retrieval models, similarity-based interference effects, their method requires the identification of syntactically specialized attention heads, and makes an cognitively implausible implicit assumption that hundreds of memory retrieval operations take place in parallel. In the present work, we develop a recurrent neural language model with a single self-attention head, which more closely parallels the memory system assumed by cognitive theories. We show that our model’s single attention head can capture semantic and syntactic interference effects observed in human experiments.

Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning
Alex Warstadt | Aaron Mueller | Leshem Choshen | Ethan Wilcox | Chengxu Zhuang | Juan Ciro | Rafael Mosquera | Bhargavi Paranjabe | Adina Williams | Tal Linzen | Ryan Cotterell
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

2022

Evaluating Attribution in Dialogue Systems: The BEGIN Benchmark
Nouha Dziri | Hannah Rashkin | Tal Linzen | David Reitter
Transactions of the Association for Computational Linguistics, Volume 10

Knowledge-grounded dialogue systems powered by large language models often generate responses that, while fluent, are not attributable to a relevant source of information. Progress towards models that do not exhibit this issue requires evaluation metrics that can quantify its prevalence. To this end, we introduce the Benchmark for Evaluation of Grounded INteraction (Begin), comprising 12k dialogue turns generated by neural dialogue systems trained on three knowledge-grounded dialogue corpora. We collect human annotations assessing the extent to which the models’ responses can be attributed to the given background information. We then use Begin to analyze eight evaluation metrics. We find that these metrics rely on spurious correlations, do not reliably distinguish attributable abstractive responses from unattributable ones, and perform substantially worse when the knowledge source is longer. Our findings underscore the need for more sophisticated and robust evaluation metrics for knowledge-grounded dialogue. We make Begin publicly available at https://github.com/google/BEGIN-dataset.

Entailment Semantics Can Be Extracted from an Ideal Language Model
William Merrill | Alex Warstadt | Tal Linzen
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Language models are often trained on text alone, without additional grounding. There is debate as to how much of natural language semantics can be inferred from such a procedure. We prove that entailment judgments between sentences can be extracted from an ideal language model that has perfectly learned its target distribution, assuming the training sentences are generated by Gricean agents, i.e., agents who follow fundamental principles of communication from the linguistic theory of pragmatics. We also show entailment judgments can be decoded from the predictions of a language model trained on such Gricean data. Our results reveal a pathway for understanding the semantic information encoded in unlabeled linguistic data and a potential framework for extracting semantics from language models.

Characterizing Verbatim Short-Term Memory in Neural Language Models
Kristijan Armeni | Christopher Honey | Tal Linzen
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

When a language model is trained to predict natural language sequences, its prediction at each moment depends on a representation of prior context. What kind of information about the prior context can language models retrieve? We tested whether language models could retrieve the exact words that occurred previously in a text. In our paradigm, language models (transformers and an LSTM) processed English text in which a list of nouns occurred twice. We operationalized retrieval as the reduction in surprisal from the first to the second list. We found that the transformers retrieved both the identity and ordering of nouns from the first list. Further, the transformers’ retrieval was markedly enhanced when they were trained on a larger corpus and with greater model depth. Lastly, their ability to index prior tokens was dependent on learned attention patterns. In contrast, the LSTM exhibited less precise retrieval, which was limited to list-initial tokens and to short intervening texts. The LSTM’s retrieval was not sensitive to the order of nouns and it improved when the list was semantically coherent. We conclude that transformers implemented something akin to a working memory system that could flexibly retrieve individual token representations across arbitrary delays; conversely, the LSTM maintained a coarser and more rapidly-decaying semantic gist of prior tokens, weighted toward the earliest items.

Improving Compositional Generalization with Latent Structure and Data Augmentation
Linlu Qiu | Peter Shaw | Panupong Pasupat | Pawel Nowak | Tal Linzen | Fei Sha | Kristina Toutanova
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Generic unstructured neural networks have been shown to struggle on out-of-distribution compositional generalization. Compositional data augmentation via example recombination has transferred some prior knowledge about compositionality to such black-box neural models for several semantic parsing tasks, but this often required task-specific engineering or provided limited gains. We present a more powerful data recombination method using a model called Compositional Structure Learner (CSL). CSL is a generative model with a quasi-synchronous context-free grammar backbone, which we induce from the training data. We sample recombined examples from CSL and add them to the fine-tuning data of a pre-trained sequence-to-sequence model (T5). This procedure effectively transfers most of CSL’s compositional bias to T5 for diagnostic tasks, and results in a model even stronger than a T5-CSL ensemble on two real world compositional generalization tasks. This results in new state-of-the-art performance for these challenging semantic parsing tasks requiring generalization to both natural language variation and novel compositions of elements.

When a sentence does not introduce a discourse entity, Transformer-based models still sometimes refer to it
Sebastian Schuster | Tal Linzen
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Understanding longer narratives or participating in conversations requires tracking of discourse entities that have been mentioned. Indefinite noun phrases (NPs), such as ‘a dog’, frequently introduce discourse entities but this behavior is modulated by sentential operators such as negation. For example, ‘a dog’ in ‘Arthur doesn’t own a dog’ does not introduce a discourse entity due to the presence of negation. In this work, we adapt the psycholinguistic assessment of language models paradigm to higher-level linguistic phenomena and introduce an English evaluation suite that targets the knowledge of the interactions between sentential operators and indefinite NPs. We use this evaluation suite for a fine-grained investigation of the entity tracking abilities of the Transformer-based models GPT-2 and GPT-3. We find that while the models are to a certain extent sensitive to the interactions we investigate, they are all challenged by the presence of multiple NPs and their behavior is not systematic, which suggests that even models at the scale of GPT-3 do not fully acquire basic entity tracking abilities.

Causal Analysis of Syntactic Agreement Neurons in Multilingual Language Models
Aaron Mueller | Yu Xia | Tal Linzen
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Structural probing work has found evidence for latent syntactic information in pre-trained language models. However, much of this analysis has focused on monolingual models, and analyses of multilingual models have employed correlational methods that are confounded by the choice of probing tasks. In this study, we causally probe multilingual language models (XGLM and multilingual BERT) as well as monolingual BERT-based models across various languages; we do this by performing counterfactual perturbations on neuron activations and observing the effect on models’ subject-verb agreement probabilities. We observe where in the model and to what extent syntactic agreement is encoded in each language. We find significant neuron overlap across languages in autoregressive multilingual language models, but not masked language models. We also find two distinct layer-wise effect patterns and two distinct sets of neurons used for syntactic agreement, depending on whether the subject and verb are separated by other tokens. Finally, we find that behavioral analyses of language models are likely underestimating how sensitive masked language models are to syntactic information.

Coloring the Blank Slate: Pre-training Imparts a Hierarchical Inductive Bias to Sequence-to-sequence Models
Aaron Mueller | Robert Frank | Tal Linzen | Luheng Wang | Sebastian Schuster
Findings of the Association for Computational Linguistics: ACL 2022

Relations between words are governed by hierarchical structure rather than linear ordering. Sequence-to-sequence (seq2seq) models, despite their success in downstream NLP applications, often fail to generalize in a hierarchy-sensitive manner when performing syntactic transformations—for example, transforming declarative sentences into questions. However, syntactic evaluations of seq2seq models have only observed models that were not pre-trained on natural language data before being trained to perform syntactic transformations, in spite of the fact that pre-training has been found to induce hierarchical linguistic generalizations in language models; in other words, the syntactic capabilities of seq2seq models may have been greatly understated. We address this gap using the pre-trained seq2seq models T5 and BART, as well as their multilingual variants mT5 and mBART. We evaluate whether they generalize hierarchically on two transformations in two languages: question formation and passivization in English and German. We find that pre-trained seq2seq models generalize hierarchically when performing syntactic transformations, whereas models trained from scratch on syntactic transformations do not. This result presents evidence for the learnability of hierarchical syntactic information from non-annotated natural language text while also demonstrating that seq2seq models are capable of syntactic generalization, though only after exposure to much more language data than human learners receive.

Syntactic Surprisal From Neural Models Predicts, But Underestimates, Human Processing Difficulty From Syntactic Ambiguities
Suhas Arehalli | Brian Dillon | Tal Linzen
Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL)

Humans exhibit garden path effects: When reading sentences that are temporarily structurally ambiguous, they slow down when the structure is disambiguated in favor of the less preferred alternative. Surprisal theory (Hale, 2001; Levy, 2008), a prominent explanation of this finding, proposes that these slowdowns are due to the unpredictability of each of the words that occur in these sentences. Challenging this hypothesis, van Schijndel and Linzen (2021) find that estimates of the cost of word predictability derived from language models severely underestimate the magnitude of human garden path effects. In this work, we consider whether this underestimation is due to the fact that humans weight syntactic factors in their predictions more highly than language models do. We propose a method for estimating syntactic predictability from a language model, allowing us to weigh the cost of lexical and syntactic predictability independently. We find that treating syntactic predictability independently from lexical predictability indeed results in larger estimates of garden path. At the same time, even when syntactic predictability is independently weighted, surprisal still greatly underestimate the magnitude of human garden path effects. Our results support the hypothesis that predictability is not the only factor responsible for the processing cost associated with garden path sentences.

2021

The Language Model Understood the Prompt was Ambiguous: Probing Syntactic Uncertainty Through Generation
Laura Aina | Tal Linzen
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Temporary syntactic ambiguities arise when the beginning of a sentence is compatible with multiple syntactic analyses. We inspect to which extent neural language models (LMs) exhibit uncertainty over such analyses when processing temporarily ambiguous inputs, and how that uncertainty is modulated by disambiguating cues. We probe the LM’s expectations by generating from it: we use stochastic decoding to derive a set of sentence completions, and estimate the probability that the LM assigns to each interpretation based on the distribution of parses across completions. Unlike scoring-based methods for targeted syntactic evaluation, this technique makes it possible to explore completions that are not hypothesized in advance by the researcher. We apply this method to study the behavior of two LMs (GPT2 and an LSTM) on three types of temporary ambiguity, using materials from human sentence processing experiments. We find that LMs can track multiple analyses simultaneously; the degree of uncertainty varies across constructions and contexts. As a response to disambiguating cues, the LMs often select the correct interpretation, but occasional errors point to potential areas of improvement

Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models
Matthew Finlayson | Aaron Mueller | Sebastian Gehrmann | Stuart Shieber | Tal Linzen | Yonatan Belinkov
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Targeted syntactic evaluations have demonstrated the ability of language models to perform subject-verb agreement given difficult contexts. To elucidate the mechanisms by which the models accomplish this behavior, this study applies causal mediation analysis to pre-trained neural language models. We investigate the magnitude of models’ preferences for grammatical inflections, as well as whether neurons process subject-verb agreement similarly across sentences with different syntactic structures. We uncover similarities and differences across architectures and model sizes—notably, that larger models do not necessarily learn stronger preferences. We also observe two distinct mechanisms for producing subject-verb agreement depending on the syntactic structure of the input sentence. Finally, we find that language models rely on similar sets of neurons when given sentences with similar syntactic structure.

NOPE: A Corpus of Naturally-Occurring Presuppositions in English
Alicia Parrish | Sebastian Schuster | Alex Warstadt | Omar Agha | Soo-Hwan Lee | Zhuoye Zhao | Samuel R. Bowman | Tal Linzen
Proceedings of the 25th Conference on Computational Natural Language Learning

Understanding language requires grasping not only the overtly stated content, but also making inferences about things that were left unsaid. These inferences include presuppositions, a phenomenon by which a listener learns about new information through reasoning about what a speaker takes as given. Presuppositions require complex understanding of the lexical and syntactic properties that trigger them as well as the broader conversational context. In this work, we introduce the Naturally-Occurring Presuppositions in English (NOPE) Corpus to investigate the context-sensitivity of 10 different types of presupposition triggers and to evaluate machine learning models’ ability to predict human inferences. We find that most of the triggers we investigate exhibit moderate variability. We further find that transformer-based models draw correct inferences in simple cases involving presuppositions, but they fail to capture the minority of exceptional cases in which human judgments reveal complex interactions between context and triggers.

Frequency Effects on Syntactic Rule Learning in Transformers
Jason Wei | Dan Garrette | Tal Linzen | Ellie Pavlick
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pre-trained language models perform well on a variety of linguistic tasks that require symbolic reasoning, raising the question of whether such models implicitly represent abstract symbols and rules. We investigate this question using the case study of BERT’s performance on English subject–verb agreement. Unlike prior work, we train multiple instances of BERT from scratch, allowing us to perform a series of controlled interventions at pre-training time. We show that BERT often generalizes well to subject–verb pairs that never occurred in training, suggesting a degree of rule-governed behavior. We also find, however, that performance is heavily influenced by word frequency, with experiments showing that both the absolute frequency of a verb form, as well as the frequency relative to the alternate inflection, are causally implicated in the predictions BERT makes at inference time. Closer analysis of these frequency effects reveals that BERT’s behavior is consistent with a system that correctly applies the SVA rule in general but struggles to overcome strong training priors and to estimate agreement features (singular vs. plural) on infrequent lexical items.

Does Putting a Linguist in the Loop Improve NLU Data Collection?
Alicia Parrish | William Huang | Omar Agha | Soo-Hwan Lee | Nikita Nangia | Alexia Warstadt | Karmanya Aggarwal | Emily Allaway | Tal Linzen | Samuel R. Bowman
Findings of the Association for Computational Linguistics: EMNLP 2021

Many crowdsourced NLP datasets contain systematic artifacts that are identified only after data collection is complete. Earlier identification of these issues should make it easier to create high-quality training and evaluation data. We attempt this by evaluating protocols in which expert linguists work ‘in the loop’ during data collection to identify and address these issues by adjusting task instructions and incentives. Using natural language inference as a test case, we compare three data collection protocols: (i) a baseline protocol with no linguist involvement, (ii) a linguist-in-the-loop intervention with iteratively-updated constraints on the writing task, and (iii) an extension that adds direct interaction between linguists and crowdworkers via a chatroom. We find that linguist involvement does not lead to increased accuracy on out-of-domain test sets compared to baseline, and adding a chatroom has no effect on the data. Linguist involvement does, however, lead to more challenging evaluation data and higher accuracy on some challenge sets, demonstrating the benefits of integrating expert analysis during data collection.

Structure Here, Bias There: Hierarchical Generalization by Jointly Learning Syntactic Transformations
Karl Mulligan | Robert Frank | Tal Linzen
Proceedings of the Society for Computation in Linguistics 2021

Counterfactual Interventions Reveal the Causal Effect of Relative Clause Representations on Agreement Prediction
Shauli Ravfogel | Grusha Prasad | Tal Linzen | Yoav Goldberg
Proceedings of the 25th Conference on Computational Natural Language Learning

When language models process syntactically complex sentences, do they use their representations of syntax in a manner that is consistent with the grammar of the language? We propose AlterRep, an intervention-based method to address this question. For any linguistic feature of a given sentence, AlterRep generates counterfactual representations by altering how the feature is encoded, while leaving in- tact all other aspects of the original representation. By measuring the change in a model’s word prediction behavior when these counterfactual representations are substituted for the original ones, we can draw conclusions about the causal effect of the linguistic feature in question on the model’s behavior. We apply this method to study how BERT models of different sizes process relative clauses (RCs). We find that BERT variants use RC boundary information during word prediction in a manner that is consistent with the rules of English grammar; this RC boundary information generalizes to a considerable extent across different RC types, suggesting that BERT represents RCs as an abstract linguistic category.

2020

COGS: A Compositional Generalization Challenge Based on Semantic Interpretation
Najoung Kim | Tal Linzen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Natural language is characterized by compositionality: the meaning of a complex expression is constructed from the meanings of its constituent parts. To facilitate the evaluation of the compositional abilities of language processing architectures, we introduce COGS, a semantic parsing dataset based on a fragment of English. The evaluation portion of COGS contains multiple systematic gaps that can only be addressed by compositional generalization; these include new combinations of familiar syntactic structures, or new combinations of familiar words and familiar structures. In experiments with Transformers and LSTMs, we found that in-distribution accuracy on the COGS test set was near-perfect (96–99%), but generalization accuracy was substantially lower (16–35%) and showed high sensitivity to random seed (+-6–8%). These findings indicate that contemporary standard NLP models are limited in their compositional generalization capacity, and position COGS as a good way to measure progress.

Proceedings of the 24th Conference on Computational Natural Language Learning
Raquel Fernández | Tal Linzen
Proceedings of the 24th Conference on Computational Natural Language Learning

Discovering the Compositional Structure of Vector Representations with Role Learning Networks
Paul Soulos | R. Thomas McCoy | Tal Linzen | Paul Smolensky
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

How can neural networks perform so well on compositional tasks even though they lack explicit compositional representations? We use a novel analysis technique called ROLE to show that recurrent neural networks perform well on such tasks by converging to solutions which implicitly represent symbolic structure. This method uncovers a symbolic structure which, when properly embedded in vector space, closely approximates the encodings of a standard seq2seq network trained to perform the compositional SCAN task. We verify the causal importance of the discovered symbolic structure by showing that, when we systematically manipulate hidden embeddings based on this symbolic structure, the model’s output is changed in the way predicted by our analysis.

BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance
R. Thomas McCoy | Junghyun Min | Tal Linzen
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

If the same neural network architecture is trained multiple times on the same dataset, will it make similar linguistic generalizations across runs? To study this question, we fine-tuned 100 instances of BERT on the Multi-genre Natural Language Inference (MNLI) dataset and evaluated them on the HANS dataset, which evaluates syntactic generalization in natural language inference. On the MNLI development set, the behavior of all instances was remarkably consistent, with accuracy ranging between 83.6% and 84.8%. In stark contrast, the same models varied widely in their generalization performance. For example, on the simple case of subject-object swap (e.g., determining that “the doctor visited the lawyer” does not entail “the lawyer visited the doctor”), accuracy ranged from 0.0% to 66.2%. Such variation is likely due to the presence of many local minima in the loss surface that are equally attractive to a low-bias learner such as a neural network; decreasing the variability may therefore require models with stronger inductive biases.

Representations of Syntax [MASK] Useful: Effects of Constituency and Dependency Structure in Recursive LSTMs
Michael Lepori | Tal Linzen | R. Thomas McCoy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Sequence-based neural networks show significant sensitivity to syntactic structure, but they still perform less well on syntactic tasks than tree-based networks. Such tree-based networks can be provided with a constituency parse, a dependency parse, or both. We evaluate which of these two representational schemes more effectively introduces biases for syntactic structure that increase performance on the subject-verb agreement prediction task. We find that a constituency-based network generalizes more robustly than a dependency-based one, and that combining the two types of structure does not yield further improvement. Finally, we show that the syntactic robustness of sequential models can be substantially improved by fine-tuning on a small amount of constructed data, suggesting that data augmentation is a viable alternative to explicit constituency structure for imparting the syntactic biases that sequential models are lacking.

Does Syntax Need to Grow on Trees? Sources of Hierarchical Inductive Bias in Sequence-to-Sequence Networks
R. Thomas McCoy | Robert Frank | Tal Linzen
Transactions of the Association for Computational Linguistics, Volume 8

Learners that are exposed to the same training data might generalize differently due to differing inductive biases. In neural network models, inductive biases could in theory arise from any aspect of the model architecture. We investigate which architectural factors affect the generalization behavior of neural sequence-to-sequence models trained on two syntactic tasks, English question formation and English tense reinflection. For both tasks, the training set is consistent with a generalization based on hierarchical structure and a generalization based on linear order. All architectural factors that we investigated qualitatively affected how models generalized, including factors with no clear connection to hierarchical structure. For example, LSTMs and GRUs displayed qualitatively different inductive biases. However, the only factor that consistently contributed a hierarchical bias across tasks was the use of a tree-structured model rather than a model with sequential recurrence, suggesting that human-like syntactic generalization requires architectural syntactic structure.

How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
Tal Linzen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This position paper describes and critiques the Pretraining-Agnostic Identically Distributed (PAID) evaluation paradigm, which has become a central tool for measuring progress in natural language understanding. This paradigm consists of three stages: (1) pre-training of a word prediction model on a corpus of arbitrary size; (2) fine-tuning (transfer learning) on a training set representing a classification task; (3) evaluation on a test set drawn from the same distribution as that training set. This paradigm favors simple, low-bias architectures, which, first, can be scaled to process vast amounts of data, and second, can capture the fine-grained statistical properties of a particular data set, regardless of whether those properties are likely to generalize to examples of the task outside the data set. This contrasts with humans, who learn language from several orders of magnitude less data than the systems favored by this evaluation paradigm, and generalize to new tasks in a consistent way. We advocate for supplementing or replacing PAID with paradigms that reward architectures that generalize as quickly and robustly as humans.

Tensor Product Decomposition Networks: Uncovering Representations of Structure Learned by Neural Networks
R. Thomas McCoy | Tal Linzen | Ewan Dunbar | Paul Smolensky
Proceedings of the Society for Computation in Linguistics 2020

Syntactic Data Augmentation Increases Robustness to Inference Heuristics
Junghyun Min | R. Thomas McCoy | Dipanjan Das | Emily Pitler | Tal Linzen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretrained neural models such as BERT, when fine-tuned to perform natural language inference (NLI), often show high accuracy on standard datasets, but display a surprising lack of sensitivity to word order on controlled challenge sets. We hypothesize that this issue is not primarily caused by the pretrained model’s limitations, but rather by the paucity of crowdsourced NLI examples that might convey the importance of syntactic structure at the fine-tuning stage. We explore several methods to augment standard training sets with syntactically informative examples, generated by applying syntactic transformations to sentences from the MNLI corpus. The best-performing augmentation method, subject/object inversion, improved BERT’s accuracy on controlled examples that diagnose sensitivity to word order from 0.28 to 0.73, without affecting performance on the MNLI test set. This improvement generalized beyond the particular construction used for data augmentation, suggesting that augmentation causes BERT to recruit abstract syntactic representations.

Neural network learning of the Russian genitive of negation: optionality and structure sensitivity
Natalia Talmina | Tal Linzen
Proceedings of the Society for Computation in Linguistics 2020

Cross-Linguistic Syntactic Evaluation of Word Prediction Models
Aaron Mueller | Garrett Nicolai | Panayiota Petrou-Zeniou | Natalia Talmina | Tal Linzen
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

A range of studies have concluded that neural word prediction models can distinguish grammatical from ungrammatical sentences with high accuracy. However, these studies are based primarily on monolingual evidence from English. To investigate how these models’ ability to learn syntax varies by language, we introduce CLAMS (Cross-Linguistic Assessment of Models on Syntax), a syntactic evaluation suite for monolingual and multilingual models. CLAMS includes subject-verb agreement challenge sets for English, French, German, Hebrew and Russian, generated from grammars we develop. We use CLAMS to evaluate LSTM language models as well as monolingual and multilingual BERT. Across languages, monolingual LSTMs achieved high accuracy on dependencies without attractors, and generally poor accuracy on agreement across object relative clauses. On other constructions, agreement accuracy was generally higher in languages with richer morphology. Multilingual models generally underperformed monolingual models. Multilingual BERT showed high syntactic accuracy on English, but noticeable deficiencies in other languages.

2019

Can Entropy Explain Successor Surprisal Effects in Reading?
Marten van Schijndel | Tal Linzen
Proceedings of the Society for Computation in Linguistics (SCiL) 2019

Studying the Inductive Biases of RNNs with Synthetic Variations of Natural Languages
Shauli Ravfogel | Yoav Goldberg | Tal Linzen
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

How do typological properties such as word order and morphological case marking affect the ability of neural sequence models to acquire the syntax of a language? Cross-linguistic comparisons of RNNs’ syntactic performance (e.g., on subject-verb agreement prediction) are complicated by the fact that any two languages differ in multiple typological properties, as well as by differences in training corpus. We propose a paradigm that addresses these issues: we create synthetic versions of English, which differ from English in one or more typological parameters, and generate corpora for those languages based on a parsed English corpus. We report a series of experiments in which RNNs were trained to predict agreement features for verbs in each of those synthetic languages. Among other findings, (1) performance was higher in subject-verb-object order (as in English) than in subject-object-verb order (as in Japanese), suggesting that RNNs have a recency bias; (2) predicting agreement with both subject and object (polypersonal agreement) improves over predicting each separately, suggesting that underlying syntactic knowledge transfers across the two tasks; and (3) overt morphological case makes agreement prediction significantly easier, regardless of word order.

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
R. Thomas McCoy | Ellie Pavlick | Tal Linzen
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.

Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
Tal Linzen | Grzegorz Chrupała | Yonatan Belinkov | Dieuwke Hupkes
Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Probing What Different NLP Tasks Teach Machines about Function Word Comprehension
Najoung Kim | Roma Patel | Adam Poliak | Alex Wang | Patrick Xia | R. Thomas McCoy | Ian Tenney | Alexis Ross | Tal Linzen | Benjamin Van Durme | Samuel R. Bowman | Ellie Pavlick
Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019)

We introduce a set of nine challenge tasks that test for the understanding of function words. These tasks are created by structurally mutating sentences from existing datasets to target the comprehension of specific types of function words (e.g., prepositions, wh-words). Using these probing tasks, we explore the effects of various pretraining objectives for sentence encoders (e.g., language modeling, CCG supertagging and natural language inference (NLI)) on the learned representations. Our results show that pretraining on CCG—our most syntactic objective—performs the best on average across our probing tasks, suggesting that syntactic knowledge helps function word comprehension. Language modeling also shows strong performance, supporting its widespread use for pretraining state-of-the-art NLP models. Overall, no pretraining objective dominates across the board, and our function word probing tasks highlight several intuitive differences between pretraining objectives, e.g., that NLI helps the comprehension of negation.

Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni | Cassandra Jacobs | Alessandro Lenci | Tal Linzen | Laurent Prévot | Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

Using Priming to Uncover the Organization of Syntactic Representations in Neural Language Models
Grusha Prasad | Marten van Schijndel | Tal Linzen
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Neural language models (LMs) perform well on tasks that require sensitivity to syntactic structure. Drawing on the syntactic priming paradigm from psycholinguistics, we propose a novel technique to analyze the representations that enable such success. By establishing a gradient similarity metric between structures, this technique allows us to reconstruct the organization of the LMs’ syntactic representational space. We use this technique to demonstrate that LSTM LMs’ representations of different types of sentences with relative clauses are organized hierarchically in a linguistically interpretable manner, suggesting that the LMs track abstract properties of the sentence.

Quantity doesn’t buy quality syntax with neural language models
Marten van Schijndel | Aaron Mueller | Tal Linzen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recurrent neural networks can learn to predict upcoming words remarkably well on average; in syntactically complex contexts, however, they often assign unexpectedly high probabilities to ungrammatical words. We investigate to what extent these shortcomings can be mitigated by increasing the size of the network and the corpus on which it is trained. We find that gains from increasing network size are minimal beyond a certain point. Likewise, expanding the training corpus yields diminishing returns; we estimate that the training corpus would need to be unrealistically large for the models to match human performance. A comparison to GPT and BERT, Transformer-based models trained on billions of words, reveals that these models perform even more poorly than our LSTMs in some constructions. Our results make the case for more data efficient architectures.

2018

Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018)
Asad Sayeed | Cassandra Jacobs | Tal Linzen | Marten van Schijndel
Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018)

Colorless Green Recurrent Networks Dream Hierarchically
Kristina Gulordava | Piotr Bojanowski | Edouard Grave | Tal Linzen | Marco Baroni
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Recurrent neural networks (RNNs) achieved impressive results in a variety of linguistic processing tasks, suggesting that they can induce non-trivial properties of language. We investigate to what extent RNNs learn to track abstract hierarchical syntactic structure. We test whether RNNs trained with a generic language modeling objective in four languages (Italian, English, Hebrew, Russian) can predict long-distance number agreement in various constructions. We include in our evaluation nonsensical sentences where RNNs cannot rely on semantic or lexical cues (“The colorless green ideas I ate with the chair sleep furiously”), and, for Italian, we compare model performance to human intuitions. Our language-model-trained RNNs make reliable predictions about long-distance agreement, and do not lag much behind human performance. We thus bring support to the hypothesis that RNNs are not just shallow-pattern extractors, but they also acquire deeper grammatical competence.

Targeted Syntactic Evaluation of Language Models
Rebecca Marvin | Tal Linzen
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We present a data set for evaluating the grammaticality of the predictions of a language model. We automatically construct a large number of minimally different pairs of English sentences, each consisting of a grammatical and an ungrammatical sentence. The sentence pairs represent different variations of structure-sensitive phenomena: subject-verb agreement, reflexive anaphora and negative polarity items. We expect a language model to assign a higher probability to the grammatical sentence than the ungrammatical one. In an experiment using this data set, an LSTM language model performed poorly on many of the constructions. Multi-task training with a syntactic objective (CCG supertagging) improved the LSTM’s accuracy, but a large gap remained between its performance and the accuracy of human participants recruited online. This suggests that there is considerable room for improvement over LSTMs in capturing syntax in a language model.

Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP
Tal Linzen | Grzegorz Chrupała | Afra Alishahi
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

A Neural Model of Adaptation in Reading
Marten van Schijndel | Tal Linzen
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

It has been argued that humans rapidly adapt their lexical and syntactic expectations to match the statistics of the current linguistic context. We provide further support to this claim by showing that the addition of a simple adaptation mechanism to a neural language model improves our predictions of human reading times compared to a non-adaptive model. We analyze the performance of the model on controlled materials from psycholinguistic experiments and show that it adapts not only to lexical items but also to abstract syntactic structures.

Phonological (un)certainty weights lexical activation
Laura Gwilliams | David Poeppel | Alec Marantz | Tal Linzen
Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018)

2017

Comparing Character-level Neural Language Models Using a Lexical Decision Task
Gaël Le Godais | Tal Linzen | Emmanuel Dupoux
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

What is the information captured by neural network models of language? We address this question in the case of character-level recurrent neural language models. These models do not have explicit word representations; do they acquire implicit ones? We assess the lexical capacity of a network using the lexical decision task common in psycholinguistics: the system is required to decide whether or not a string of characters forms a word. We explore how accuracy on this task is affected by the architecture of the network, focusing on cell type (LSTM vs. SRN), depth and width. We also compare these architectural properties to a simple count of the parameters of the network. The overall number of parameters in the network turns out to be the most important predictor of accuracy; in particular, there is little evidence that deeper networks are beneficial for this task.

Proceedings of the 7th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2017)
Ted Gibson | Tal Linzen | Asad Sayeed | Marten van Schijndel | William Schuler
Proceedings of the 7th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2017)

Exploring the Syntactic Abilities of RNNs with Multi-task Learning
Émile Enguehard | Yoav Goldberg | Tal Linzen
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Recent work has explored the syntactic abilities of RNNs using the subject-verb agreement task, which diagnoses sensitivity to sentence structure. RNNs performed this task well in common cases, but faltered in complex sentences (Linzen et al., 2016). We test whether these errors are due to inherent limitations of the architecture or to the relatively indirect supervision provided by most agreement dependencies in a corpus. We trained a single RNN to perform both the agreement task and an additional task, either CCG supertagging or language modeling. Multi-task training led to significantly lower error rates, in particular on complex sentences, suggesting that RNNs have the ability to evolve more sophisticated syntactic representations than shown before. We also show that easily available agreement training data can improve performance on other syntactic tasks, in particular when only a limited amount of training data is available for those tasks. The multi-task paradigm can also be leveraged to inject grammatical knowledge into language models.

2016

Quantificational features in distributional word representations
Tal Linzen | Emmanuel Dupoux | Benjamin Spector
Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics

Evaluating vector space models using human semantic priming results
Allyson Ettinger | Tal Linzen
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

Issues in evaluating semantic spaces using word analogies
Tal Linzen
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

Assessing the Ability of LSTMs to Learn Syntax-Sensitive Dependencies
Tal Linzen | Emmanuel Dupoux | Yoav Goldberg
Transactions of the Association for Computational Linguistics, Volume 4

The success of long short-term memory (LSTM) neural networks in language processing is typically attributed to their ability to capture long-distance statistical regularities. Linguistic regularities are often sensitive to syntactic structure; can such dependencies be captured by LSTMs, which do not have explicit structural representations? We begin addressing this question using number agreement in English subject-verb dependencies. We probe the architecture’s grammatical competence both using training objectives with an explicit grammatical target (number prediction, grammaticality judgments) and using language models. In the strongly supervised settings, the LSTM achieved very high overall accuracy (less than 1% errors), but errors increased when sequential and structural information conflicted. The frequency of such errors rose sharply in the language-modeling setting. We conclude that LSTMs can capture a non-trivial amount of grammatical structure given targeted supervision, but stronger architectures may be required to further reduce errors; furthermore, the language modeling signal is insufficient for capturing syntax-sensitive dependencies, and should be supplemented with more direct supervision if such dependencies need to be captured.

2015

A model of rapid phonotactic generalization
Tal Linzen | Timothy O’Donnell
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Investigating the role of entropy in sentence processing
Tal Linzen | Florian Jaeger
Proceedings of the Fifth Workshop on Cognitive Modeling and Computational Linguistics

Co-authors

Adina Williams 6

Marten van Schijndel 6

Michael Y. Hu 5

Yoav Goldberg 4

Ethan Gotlieb Wilcox 4

Chengxu Zhuang 4

Samuel Bowman 3

Emmanuel Dupoux 3

William Merrill 3

Ellie Pavlick 3

Jackson Petty 3

Grusha Prasad 3

Sebastian Schuster 3

Paul Smolensky 3

Suhas Arehalli 2

Yonatan Belinkov 2

Grzegorz Chrupała 2

Ishita Dasgupta 2

Lucas Georges Gabriel Charpentier 2

Mustafa Omer Gul 2

Cassandra L. Jacobs 2

Rafael Mosquera 2

Bhargavi Paranjabe 2

Alicia Parrish 2

Shauli Ravfogel 2

Raj Sanjay Shah 2

Sjoerd Steenkiste 2

Natalia Talmina 2

Karmanya Aggarwal 1

Afra Alishahi 1

Emily Allaway 1

Kristijan Armeni 1

Piotr Bojanowski 1

Asli Celikyilmaz 1

Emmanuele Chersoni 1

Brian W. Dillon 1

Lucia Donatelli 1

Benjamin Van Durme 1

Tiwalayo Eisape 1

Émile Enguehard 1

Allyson Ettinger 1

Raquel Fernández 1

Matthew Finlayson 1

Sebastian Gehrmann 1

Édouard Grave 1

Kristina Gulordava 1

Laura Gwilliams 1

Christopher Honey 1

William Huang 1

Dieuwke Hupkes 1

T. Florian Jaeger 1

Guangyuan Jiang 1

Anastasia Kobzeva 1

Alexander Koller 1

Gaël Le Godais 1

Alessandro Lenci 1

Cara Su-Yi Leong 1

Michael Lepori 1

Matthew Mandelkern 1

Rebecca Marvin 1

Karl Mulligan 1

Norihito Naka 1

Nikita Nangia 1

Graham Neubig 1

Garrett Nicolai 1

Timothy O’Donnell 1

Panupong Pasupat 1

Panayiota Petrou-Zeniou 1

David Poeppel 1

Laurent Prévot 1

Hannah Rashkin 1

David Reitter 1

Enrico Santus 1

William Schuler 1

Stuart M. Shieber 1

Benjamin Spector 1

Michael Tessler 1

William Timkey 1

Lindia Tjuatja 1

Kristina Toutanova 1

Alexia Warstadt 1

Albert Webson 1

Aditya Yedetore 1

Venues