Multiple Teacher Distillation for Robust and Greener Models

Artur Ilichev; Nikita Sorokin; Irina Piontkovskaya; Valentin Malykh

Abstract

Language models are now at the center of progress in natural language processing. Most of these models are very large. There have been successful attempts to reduce their size, but at least some of these attempts rely on randomness. We propose a novel distillation procedure that uses multiple teachers, which alleviates the dependency on the random seed and makes the models more robust. We show that this procedure, applied to the TinyBERT and DistilBERT models, improves their worst-case results by up to 2% while keeping almost the same best-case results. The latter holds even under a constraint on computational time, which is important for reducing the carbon footprint. In addition, we present results of applying the proposed procedure to a computer vision model, ResNet, which show that the finding also holds in this entirely different domain.
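The abstract does not spell out how the teachers are combined, so the following is only a minimal sketch of one common form of multi-teacher distillation, not the authors' procedure. It assumes that each teacher's logits are softened with a temperature, that the resulting distributions are averaged with equal weights, and that the student is trained to minimize the KL divergence to that average. The function names, the averaging rule, and the temperature value are all illustrative assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def average_teacher_targets(teacher_logits, temperature=2.0):
    """Equal-weight average of the teachers' softened distributions.

    Assumption: uniform averaging is an illustrative choice, not
    necessarily the combination rule used in the paper.
    """
    probs = [softmax(logits, temperature) for logits in teacher_logits]
    count = len(probs)
    return [sum(column) / count for column in zip(*probs)]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(averaged teachers || student) on temperature-softened outputs."""
    target = average_teacher_targets(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(target, student) if t > 0)


# Hypothetical logits for one example from two teachers and one student.
teachers = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5]]
loss = distillation_loss([1.0, 0.8, -0.2], teachers)
```

Under these assumptions, teachers trained with different random seeds contribute equally to the target, so a single unlucky seed has less influence on what the student learns, which is the intuition behind the robustness claim.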

Anthology ID: 2021.ranlp-1.68
Volume: Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Month: September
Year: 2021
Address: Held Online
Editors: Ruslan Mitkov, Galia Angelova
Venue: RANLP
Publisher: INCOMA Ltd.
Pages: 601–610
URL: https://aclanthology.org/2021.ranlp-1.68
Cite (ACL): Artur Ilichev, Nikita Sorokin, Irina Piontkovskaya, and Valentin Malykh. 2021. Multiple Teacher Distillation for Robust and Greener Models. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 601–610, Held Online. INCOMA Ltd.
Cite (Informal): Multiple Teacher Distillation for Robust and Greener Models (Ilichev et al., RANLP 2021)
PDF: https://aclanthology.org/2021.ranlp-1.68.pdf
Data: CIFAR-10, CoLA, GLUE, MRPC, QNLI, SST, SST-2
