Evaluating distillation methods for data-efficient syntax learning

Takateru Yamakoshi, Thomas L. Griffiths, R. Thomas McCoy, Robert D. Hawkins

Abstract
Data-efficient training requires strong inductive biases. To the extent that transformer attention matrices encode syntactic relationships, we would predict that knowledge distillation (KD) targeting attention should selectively accelerate syntax acquisition relative to conventional logit-based KD. To test this hypothesis, we train GPT-2 student models on datasets ranging from 10K to 5M sentences using both distillation methods, evaluating them on both syntactic benchmarks and perplexity. Surprisingly, while logit-based KD dramatically improves data-efficiency, attention-based KD provides minimal benefit even for syntactic tasks. This suggests that output distributions provide sufficient supervisory signal for syntax acquisition, indicating that syntactic knowledge may be distributed throughout the network rather than localized in attention patterns.
Anthology ID:
2025.findings-emnlp.801
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rosé, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
14834–14847
URL:
https://aclanthology.org/2025.findings-emnlp.801/
Cite (ACL):
Takateru Yamakoshi, Thomas L. Griffiths, R. Thomas McCoy, and Robert D. Hawkins. 2025. Evaluating distillation methods for data-efficient syntax learning. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 14834–14847, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Evaluating distillation methods for data-efficient syntax learning (Yamakoshi et al., Findings 2025)
PDF:
https://aclanthology.org/2025.findings-emnlp.801.pdf
Checklist:
 2025.findings-emnlp.801.checklist.pdf