Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, William L. Hamilton, Jimmy Lin


Abstract
Fine-tuned variants of BERT are able to achieve state-of-the-art accuracy on many natural language processing tasks, although at significant computational costs. In this paper, we verify BERT’s effectiveness for document classification and investigate the extent to which BERT-level effectiveness can be obtained by different baselines, combined with knowledge distillation—a popular model compression method. The results show that BERT-level effectiveness can be achieved by a single-layer LSTM with at least 40× fewer FLOPS and only ∼3% parameters. More importantly, this study analyzes the limits of knowledge distillation as we distill BERT’s knowledge all the way down to linear models—a relevant baseline for the task. We report substantial improvement in effectiveness for even the simplest models, as they capture the knowledge learnt by BERT.
Anthology ID:
2020.repl4nlp-1.10
Volume:
Proceedings of the 5th Workshop on Representation Learning for NLP
Month:
July
Year:
2020
Address:
Online
Venues:
ACL | RepL4NLP | WS
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Note:
Pages:
72–77
Language:
URL:
https://aclanthology.org/2020.repl4nlp-1.10
DOI:
10.18653/v1/2020.repl4nlp-1.10
Bibkey:
Copy Citation:
PDF:
https://aclanthology.org/2020.repl4nlp-1.10.pdf
Video:
 http://slideslive.com/38929776