Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production

Young Jin Kim, Rawn Henry, Raffy Fahim, Hany Hassan


Abstract
Mixture of Experts (MoE) models with conditional execution of sparsely activated layers have enabled training models with a much larger number of parameters. As a result, these models have achieved significantly better quality on various natural language processing tasks, including machine translation. However, it remains challenging to deploy such models in real-life scenarios due to their large memory requirements and inefficient inference. In this work, we introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models and cut down memory consumption significantly. While we achieve up to a 26x throughput speed-up, we also reduce the model size to almost one eighth of the original 32-bit float model by quantizing expert weights into 4-bit integers. As a result, we are able to deploy 136x larger models at 27% lower cost and with significantly better quality than existing solutions. This enables a paradigm shift: deploying a single large scale multilingual MoE transformer model instead of distilling it into dozens of smaller models per language or task.
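The roughly 8x size reduction follows directly from storing each expert weight in 4 bits instead of 32. As a minimal illustration of the idea, the sketch below implements a generic symmetric per-row int4 quantizer in NumPy; the function names, the per-row grouping, and the packing arithmetic are assumptions for illustration, not the paper's exact scheme.

```python
# A minimal sketch of symmetric 4-bit weight quantization (illustrative,
# not the paper's exact method). Each row of a float32 weight matrix is
# mapped to int4 codes in [-8, 7] plus one float32 scale per row.
import numpy as np

def quantize_4bit(w: np.ndarray):
    """Quantize a 2-D float32 matrix to int4 codes (stored in int8)."""
    # Use 7 as the positive bound so scale * 7 covers the largest
    # magnitude in each row.
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 matrix from codes and scales."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((4096, 4096)).astype(np.float32)
    q, s = quantize_4bit(w)
    w_hat = dequantize_4bit(q, s)
    # Two int4 codes pack into one byte, so storage drops from 4 bytes
    # to 0.5 bytes per weight, plus a small overhead for the scales.
    fp32_bytes = w.size * 4
    int4_bytes = w.size // 2 + s.size * 4
    print(f"compression: {fp32_bytes / int4_bytes:.1f}x")  # ~8.0x
    print(f"mean abs error: {np.abs(w - w_hat).mean():.4f}")
```

With one scale per 4096-element row, the scale overhead is negligible, which is why the achievable compression stays close to the full 8x bound.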
Anthology ID: 2022.sustainlp-1.6
Volume: Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP)
Month: December
Year: 2022
Address: Abu Dhabi, United Arab Emirates (Hybrid)
Editors: Angela Fan, Iryna Gurevych, Yufang Hou, Zornitsa Kozareva, Sasha Luccioni, Nafise Sadat Moosavi, Sujith Ravi, Gyuwan Kim, Roy Schwartz, Andreas Rücklé
Venue: sustainlp
Publisher: Association for Computational Linguistics
Pages: 36–43
URL: https://aclanthology.org/2022.sustainlp-1.6
DOI: 10.18653/v1/2022.sustainlp-1.6
Cite (ACL): Young Jin Kim, Rawn Henry, Raffy Fahim, and Hany Hassan. 2022. Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production. In Proceedings of The Third Workshop on Simple and Efficient Natural Language Processing (SustaiNLP), pages 36–43, Abu Dhabi, United Arab Emirates (Hybrid). Association for Computational Linguistics.
Cite (Informal): Who Says Elephants Can’t Run: Bringing Large Scale MoE Models into Cloud Scale Production (Kim et al., sustainlp 2022)
PDF: https://aclanthology.org/2022.sustainlp-1.6.pdf
Video: https://aclanthology.org/2022.sustainlp-1.6.mp4