Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules

Zhuocheng Gong, Ang Lv, Jian Guan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan


Abstract
Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted “yes”. In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by the intuition that any layer, regardless of its position, can be used to compute a token as long as it possesses the needed processing capabilities. The construction of MoM starts from a finite set of modules defined by multi-head attention and feed-forward networks, each distinguished by its unique parameterization. Two routers then iteratively select attention modules and feed-forward modules from the set to process a token. The selection dynamically expands the computation graph in the forward pass of the token, culminating in an assembly of modules. We show that MoM provides not only a unified framework for Transformers and their numerous variants but also a flexible and learnable approach to reducing redundancy in Transformer parameterization. We pre-train various MoMs on OpenWebText. Empirical results demonstrate that MoMs of different sizes consistently outperform vanilla Transformers. More interestingly, after removing 50% of the multi-head attention modules and 25% of the feed-forward modules, an MoM model still maintains comparable performance. Additionally, by properly adjusting the number of modules and compressing the model depth, one can obtain an MoM that matches the performance of GPT-2 (774M) while reducing TFLOPs by 16% and memory usage by 42% during forward computation.
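To make the routing idea concrete, below is a minimal PyTorch-style sketch of how two routers might dynamically assemble attention and feed-forward modules from shared pools at every computation step. It is an illustrative assumption, not the authors' implementation: all names (MoMSketch, ModuleRouter, the per-example hard routing, the fixed number of steps) are hypothetical simplifications of the per-token, learnable routing described in the paper.

```python
# Minimal, illustrative sketch of a Mixture-of-Modules forward pass
# (hypothetical names and design choices; not the paper's implementation).
import torch
import torch.nn as nn


class ModuleRouter(nn.Module):
    """Scores the modules in a pool from the current hidden states."""

    def __init__(self, d_model: int, num_modules: int):
        super().__init__()
        self.score = nn.Linear(d_model, num_modules)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Pool over the sequence, then pick one module index per example.
        # (Hard argmax for illustration; training would need a
        # differentiable routing scheme, e.g. a softmax relaxation.)
        logits = self.score(h.mean(dim=1))          # (batch, num_modules)
        return logits.argmax(dim=-1)                # (batch,)


class MoMSketch(nn.Module):
    """Assembles attention and feed-forward modules dynamically at each step."""

    def __init__(self, d_model=256, n_heads=4, num_attn=6, num_ffn=6, depth=6):
        super().__init__()
        self.attn_pool = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(num_attn)
        )
        self.ffn_pool = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_ffn)
        )
        self.attn_router = ModuleRouter(d_model, num_attn)
        self.ffn_router = ModuleRouter(d_model, num_ffn)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.depth = depth

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # At every step, the routers choose which attention and FFN module to
        # apply, so the computation graph is built on the fly rather than
        # following a fixed shallow-to-deep layer order.
        for _ in range(self.depth):
            attn_idx = self.attn_router(h)  # chosen attention module per example
            ffn_idx = self.ffn_router(h)    # chosen FFN module per example
            out = torch.empty_like(h)
            for b in range(h.size(0)):      # per-example routing, for brevity
                x = self.norm1(h[b:b + 1])
                attn = self.attn_pool[int(attn_idx[b])]
                a, _ = attn(x, x, x, need_weights=False)
                x = h[b:b + 1] + a
                ffn = self.ffn_pool[int(ffn_idx[b])]
                out[b:b + 1] = x + ffn(self.norm2(x))
            h = out
        return h


if __name__ == "__main__":
    model = MoMSketch()
    tokens = torch.randn(2, 16, 256)  # (batch, seq_len, d_model)
    print(model(tokens).shape)        # torch.Size([2, 16, 256])
```

Because the attention and FFN pools are shared across steps, pruning modules from the pools or shortening the number of routing steps changes compute and memory without retraining the overall layout, which is the kind of flexibility the abstract's pruning and depth-compression results exploit.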
Anthology ID:
2024.emnlp-main.1164
Volume:
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
20924–20938
URL:
https://aclanthology.org/2024.emnlp-main.1164
Cite (ACL):
Zhuocheng Gong, Ang Lv, Jian Guan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, and Rui Yan. 2024. Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 20924–20938, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules (Gong et al., EMNLP 2024)
PDF:
https://aclanthology.org/2024.emnlp-main.1164.pdf