Takateru Yamakoshi
2023
Causal interventions expose implicit situation models for commonsense language understanding
Takateru Yamakoshi | James McClelland | Adele Goldberg | Robert Hawkins
Findings of the Association for Computational Linguistics: ACL 2023
Accounts of human language processing have long appealed to implicit “situation models” that enrich comprehension with relevant but unstated world knowledge. Here, we apply causal intervention techniques to recent transformer models to analyze performance on the Winograd Schema Challenge (WSC), where a single context cue shifts interpretation of an ambiguous pronoun. We identify a relatively small circuit of attention heads that are responsible for propagating information from the context word that guides which of the candidate noun phrases the pronoun ultimately attends to. We then compare how this circuit behaves in a closely matched “syntactic” control where the situation model is not strictly necessary. These analyses suggest a distinct pathway through which implicit situation models may be constructed to guide pronoun resolution.
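Below is a minimal activation-patching sketch in the spirit of the causal interventions described in this abstract. It is not the paper's code: the model (GPT-2), the probe prompt, and the patched head (layer 9, head 8) are illustrative assumptions.

```python
# Activation patching on a single attention head for a Winograd-style minimal pair.
# Model, prompt, and head choice are hypothetical, for illustration only.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Only the context cue ("big" vs "small") differs, flipping the pronoun's referent.
# Both cues are single tokens, so positions in the two runs stay aligned.
prompt = "The ball didn't fit in the box because it was too {}. It refers to the"
base_ids = tok(prompt.format("big"), return_tensors="pt").input_ids
alt_ids = tok(prompt.format("small"), return_tensors="pt").input_ids
cand_ids = [tok(" ball").input_ids[0], tok(" box").input_ids[0]]

LAYER, HEAD = 9, 8                      # hypothetical head to intervene on
attn = model.transformer.h[LAYER].attn  # GPT-2 attention block
head_dim = model.config.n_embd // model.config.n_head
sl = slice(HEAD * head_dim, (HEAD + 1) * head_dim)

cache = {}

def save_head(module, inputs):
    # inputs[0]: concatenated per-head outputs entering c_proj, shape (batch, seq, n_embd)
    cache["alt"] = inputs[0][..., sl].detach().clone()

def patch_head(module, inputs):
    hidden = inputs[0].clone()
    hidden[..., sl] = cache["alt"]      # overwrite this head with the counterfactual run
    return (hidden,) + inputs[1:]

def referent_logits(ids):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    return {tok.decode(i): logits[i].item() for i in cand_ids}

# 1) Cache the head's activation on the counterfactual ("small") input.
with torch.no_grad():
    h = attn.c_proj.register_forward_pre_hook(save_head)
    model(alt_ids)
    h.remove()

# 2) Re-run the original ("big") input while patching in the cached activation.
print("original:", referent_logits(base_ids))
h = attn.c_proj.register_forward_pre_hook(patch_head)
print("patched: ", referent_logits(base_ids))
h.remove()
```

A shift in the relative logits of the two candidate nouns after patching indicates how much that single head carries the cue information, which is the basic quantity such interventions track.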
2022
Probing BERT’s priors with serial reproduction chains
Takateru Yamakoshi | Thomas Griffiths | Robert Hawkins
Findings of the Association for Computational Linguistics: ACL 2022
Sampling is a promising bottom-up method for exposing what generative models have learned about language, but it remains unclear how to generate representative samples from popular masked language models (MLMs) like BERT. The MLM objective yields a dependency network with no guarantee of consistent conditional distributions, posing a problem for naive approaches. Drawing from theories of iterated learning in cognitive science, we explore the use of serial reproduction chains to sample from BERT’s priors. In particular, we observe that a unique and consistent estimator of the ground-truth joint distribution is given by a Generative Stochastic Network (GSN) sampler, which randomly selects which token to mask and reconstruct on each step. We show that the lexical and syntactic statistics of sentences from GSN chains closely match the ground-truth corpus distribution and perform better than other methods in a large corpus of naturalness judgments. Our findings establish a firmer theoretical foundation for bottom-up probing and highlight richer deviations from human priors.
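The GSN sampling step described above is simple to sketch: at each step a random position is masked and that token is resampled from the model's conditional distribution. The following is a minimal illustration assuming bert-base-uncased; the seed sentence and chain length are arbitrary choices, not the paper's settings.

```python
# Minimal GSN-style serial reproduction chain over a fixed-length sentence.
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

@torch.no_grad()
def gsn_chain(seed: str, n_steps: int = 500) -> str:
    ids = tok(seed, return_tensors="pt").input_ids  # includes [CLS] and [SEP]
    for _ in range(n_steps):
        # Pick a random position, excluding the special tokens at the ends.
        pos = torch.randint(1, ids.shape[1] - 1, (1,)).item()
        masked = ids.clone()
        masked[0, pos] = tok.mask_token_id
        logits = model(masked).logits[0, pos]
        # Resample the masked token from the full conditional distribution.
        ids[0, pos] = torch.multinomial(logits.softmax(-1), 1).item()
    return tok.decode(ids[0, 1:-1])

print(gsn_chain("the cat sat on the mat ."))
```

After a burn-in period, states of the chain can be treated as (approximate) samples from the model's joint distribution over sentences of that length, which is what makes the resulting corpus statistics comparable to ground-truth text.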
2020
Investigating representations of verb bias in neural language models
Robert Hawkins | Takateru Yamakoshi | Thomas Griffiths | Adele Goldberg
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Languages typically provide more than one grammatical construction to express certain types of messages. A speaker’s choice of construction is known to depend on multiple factors, including the choice of main verb – a phenomenon known as verb bias. Here we introduce DAIS, a large benchmark dataset containing 50K human judgments for 5K distinct sentence pairs in the English dative alternation. This dataset includes 200 unique verbs and systematically varies the definiteness and length of arguments. We use this dataset, as well as an existing corpus of naturally occurring data, to evaluate how well recent neural language models capture human preferences. Results show that larger models perform better than smaller models, and transformer architectures (e.g. GPT-2) tend to outperform recurrent architectures (e.g. LSTMs) even under comparable parameter and training settings. Additional analyses of internal feature representations suggest that transformers may better integrate specific lexical information with grammatical constructions.
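One common way to score such preferences, consistent with the evaluation described above, is to compare summed token log-probabilities for the two members of an alternation pair. The sketch below uses GPT-2 and an invented sentence pair for illustration; it is not the DAIS evaluation pipeline itself.

```python
# Score a double-object vs. prepositional-dative pair with GPT-2 log-probabilities.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt").input_ids
    # Teacher-forced log-probability of each token given its left context.
    logits = model(ids).logits[0, :-1]
    logps = logits.log_softmax(-1).gather(1, ids[0, 1:, None]).squeeze(1)
    return logps.sum().item()

# One alternation pair in the style of DAIS items (invented example).
do = "The teacher gave the student a book."
po = "The teacher gave a book to the student."
print(sentence_logprob(do) - sentence_logprob(po))  # > 0 means a double-object preference
```

Differences of this kind, computed per verb and per argument configuration, can then be correlated with the human ratings to quantify how well a model captures verb bias.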