Causal Distillation for Language Models

Zhengxuan Wu, Atticus Geiger, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, Noah Goodman


Abstract
Distillation efforts have led to language models that are more compact and efficient without serious drops in performance. The standard approach to distillation trains a student model against two objectives: a task-specific objective (e.g., language modeling) and an imitation objective that encourages the hidden states of the student model to be similar to those of the larger teacher model. In this paper, we show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal dynamics of the teacher through a distillation interchange intervention training objective (DIITO). DIITO pushes the student model to become a causal abstraction of the teacher model: a faithful model with simpler causal structure. DIITO is fully differentiable, easily implemented, and combines flexibly with other objectives. Compared against standard distillation under the same settings, DIITO results in lower perplexity on the WikiText-103 corpus (masked language modeling) and marked improvements on the GLUE benchmark (natural language understanding), SQuAD (question answering), and CoNLL-2003 (named entity recognition).
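
To make the third objective concrete, here is a minimal PyTorch sketch of an interchange-intervention distillation term in the spirit of DIITO; the authors' actual implementation is in the linked frankaging/Causal-Distill repository. The toy MLP models, the layer alignment (teacher layer 2 mapped to student layer 0), the temperature, and the unit loss weights below are illustrative assumptions, not the paper's configuration. The idea it sketches: a "source" run's hidden state is swapped into a "base" run at aligned layers of both teacher and student, and the student is trained so that its counterfactual predictions track the teacher's.

```python
# Minimal, self-contained sketch of a DIITO-style interchange-intervention loss.
# Toy models, layer alignment, temperature, and loss weights are assumptions
# made for illustration; they are not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyModel(nn.Module):
    def __init__(self, dim, n_layers, n_classes):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(n_layers)])
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x, intervene_at=None, source_hidden=None):
        """Ordinary forward pass; if `intervene_at` is given, overwrite the
        hidden state at that layer with `source_hidden` (the interchange)."""
        hiddens = []
        for i, layer in enumerate(self.layers):
            x = torch.relu(layer(x))
            if intervene_at is not None and i == intervene_at:
                x = source_hidden  # interchange intervention
            hiddens.append(x)
        return self.head(x), hiddens


def diito_loss(student, teacher, base, source, s_layer=0, t_layer=2, temperature=2.0):
    """Counterfactual-matching term: intervene on aligned layers of teacher and
    student, then push the student's counterfactual predictions toward the
    teacher's via KL divergence on temperature-scaled logits."""
    with torch.no_grad():
        # Teacher: cache the source-run hidden state, then run the base input
        # with that state swapped in to get the teacher's counterfactual output.
        _, t_src_hiddens = teacher(source)
        t_cf_logits, _ = teacher(base, intervene_at=t_layer,
                                 source_hidden=t_src_hiddens[t_layer])
    # Student: same interchange at its aligned layer (gradients flow through
    # both the source run and the intervened base run).
    _, s_src_hiddens = student(source)
    s_cf_logits, _ = student(base, intervene_at=s_layer,
                             source_hidden=s_src_hiddens[s_layer])
    return F.kl_div(
        F.log_softmax(s_cf_logits / temperature, dim=-1),
        F.softmax(t_cf_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2


# Illustrative usage: combine with a task objective and a standard imitation objective.
teacher = ToyModel(dim=16, n_layers=4, n_classes=10).eval()
student = ToyModel(dim=16, n_layers=2, n_classes=10)
base, source = torch.randn(8, 16), torch.randn(8, 16)
labels = torch.randint(0, 10, (8,))

s_logits, _ = student(base)
with torch.no_grad():
    t_logits, _ = teacher(base)
loss = (F.cross_entropy(s_logits, labels)                           # task objective
        + F.kl_div(F.log_softmax(s_logits, dim=-1),
                   F.softmax(t_logits, dim=-1),
                   reduction="batchmean")                            # imitation objective
        + diito_loss(student, teacher, base, source))                # causal (DIITO-style) objective
loss.backward()
```

In practice the three terms would be weighted and computed over masked language modeling batches with aligned transformer layers rather than the random tensors and toy MLPs used in this sketch.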
Anthology ID:
2022.naacl-main.318
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
4288–4295
URL:
https://aclanthology.org/2022.naacl-main.318
DOI:
10.18653/v1/2022.naacl-main.318
Cite (ACL):
Zhengxuan Wu, Atticus Geiger, Joshua Rozner, Elisa Kreiss, Hanson Lu, Thomas Icard, Christopher Potts, and Noah Goodman. 2022. Causal Distillation for Language Models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4288–4295, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
Causal Distillation for Language Models (Wu et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.318.pdf
Video:
https://aclanthology.org/2022.naacl-main.318.mp4
Code:
frankaging/Causal-Distill
Data:
CoNLL 2003, GLUE, SQuAD, WikiText-103, WikiText-2