Rank and run-time aware compression of NLP Applications

Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, Matthew Mattina


Abstract
Sequence model based NLP applications can be large. Yet, many applications that benefit from them run on small devices with very limited compute and storage capabilities, while still having run-time constraints. As a result, there is a need for a compression technique that can achieve significant compression without negatively impacting inference run-time and task accuracy. This paper proposes a new compression technique called Hybrid Matrix Factorization (HMF) that achieves this dual objective. HMF improves low-rank matrix factorization (LMF) techniques by doubling the rank of the matrix using an intelligent hybrid structure, leading to better accuracy than LMF. Further, by preserving dense matrices, it leads to faster inference run-time than pruning or structured-matrix-based compression techniques. We evaluate the impact of this technique on 5 NLP benchmarks across multiple tasks (Translation, Intent Detection, Language Modeling) and show that for similar accuracy values and compression factors, HMF can achieve more than 2.32x faster inference run-time than pruning and 16.77% better accuracy than LMF.
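The hybrid structure described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: a weight matrix is split into a block of rows that is kept dense (unfactorized) and a remaining block that is replaced by a rank-r factorization, so the reconstruction's effective rank (up to k + r) exceeds that of a plain LMF with the same parameter budget. The function names, the split point k, and the rank r below are all assumptions chosen for demonstration.

```python
import numpy as np

def lmf(W, r):
    """Plain low-rank matrix factorization: W ~= U @ V with rank r,
    computed here via truncated SVD."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U[:, :r] * s[:r], Vt[:r]

def hmf(W, k, r):
    """Hybrid matrix factorization (sketch): keep the first k rows of W
    dense and exact, and approximate the remaining rows with a rank-r
    factorization. The effective rank of the result is up to k + r."""
    dense = W[:k]            # unfactorized block, reproduced exactly
    U, V = lmf(W[k:], r)     # low-rank block for the remaining rows
    return dense, U, V

def hmf_reconstruct(dense, U, V):
    """Rebuild the full matrix from the hybrid representation."""
    return np.vstack([dense, U @ V])

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))  # toy stand-in for an RNN weight matrix

# Equal-parameter comparison: with k = 32 dense rows and rank r = 16,
# HMF stores k*n + (m-k)*r + r*n = 15872 parameters...
k, r = 32, 16
dense, U, V = hmf(W, k, r)
hmf_params = dense.size + U.size + V.size

# ...which matches an LMF of rank 31 (31 * (256 + 256) = 15872), yet the
# HMF reconstruction can reach rank k + r = 48 instead of 31.
U31, V31 = lmf(W, 31)
lmf_params = U31.size + V31.size
```

Because the dense block and the low-rank factors are all plain dense matrices, inference stays as ordinary GEMM calls, which is the source of the run-time advantage over pruning claimed in the abstract.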
Anthology ID:
2020.sustainlp-1.2
Volume:
Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing
Month:
November
Year:
2020
Address:
Online
Editors:
Nafise Sadat Moosavi, Angela Fan, Vered Shwartz, Goran Glavaš, Shafiq Joty, Alex Wang, Thomas Wolf
Venue:
sustainlp
Publisher:
Association for Computational Linguistics
Pages:
8–18
URL:
https://aclanthology.org/2020.sustainlp-1.2
DOI:
10.18653/v1/2020.sustainlp-1.2
Cite (ACL):
Urmish Thakker, Jesse Beu, Dibakar Gope, Ganesh Dasika, and Matthew Mattina. 2020. Rank and run-time aware compression of NLP Applications. In Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing, pages 8–18, Online. Association for Computational Linguistics.
Cite (Informal):
Rank and run-time aware compression of NLP Applications (Thakker et al., sustainlp 2020)
PDF:
https://aclanthology.org/2020.sustainlp-1.2.pdf
Video:
https://slideslive.com/38939420