RWKV: Reinventing RNNs for the Transformer Era

Bo Peng; Eric Alcaide; Quentin Anthony; Alon Albalak; Samuel Arcadinho; Stella Biderman; Huanqi Cao; Xin Cheng; Michael Chung; Leon Derczynski; Xingjian Du; Matteo Grella; Kranthi Gv; Xuzheng He; Haowen Hou; Przemyslaw Kazienko; Jan Kocoń; Jiaming Kong; Bartłomiej Koptyra; Hayden Lau; Jiaju Lin; Krishna Sri Ipsit Mantri; Ferdinand Mom; Atsushi Saito; Guangyu Song; Xiangru Tang; Johan Wind; Stanisław Woźniak; Zhenyuan Zhang; Qinghua Zhou; Jian Zhu; Rui-Jie Zhu

doi:10.18653/v1/2023.findings-emnlp.936

Abstract

Transformers have revolutionized almost all natural language processing (NLP) tasks but suffer from memory and computational complexity that scales quadratically with sequence length. In contrast, recurrent neural networks (RNNs) exhibit linear scaling in memory and computational requirements but struggle to match the performance of Transformers due to limitations in parallelization and scalability. We propose a novel model architecture, Receptance Weighted Key Value (RWKV), that combines the efficient parallelizable training of Transformers with the efficient inference of RNNs. Our approach leverages a linear attention mechanism and allows us to formulate the model as either a Transformer or an RNN, thus parallelizing computations during training and maintaining constant computational and memory complexity during inference. We scale our models to as many as 14 billion parameters, by far the largest dense RNN ever trained, and find that RWKV performs on par with similarly sized Transformers, suggesting that future work can leverage this architecture to create more efficient models. This work presents a significant step towards reconciling trade-offs between computational efficiency and model performance in sequence processing tasks.

Anthology ID: 2023.findings-emnlp.936
Volume: Findings of the Association for Computational Linguistics: EMNLP 2023
Month: December
Year: 2023
Address: Singapore
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 14048–14077
URL: https://aclanthology.org/2023.findings-emnlp.936
DOI: 10.18653/v1/2023.findings-emnlp.936
Cite (ACL): Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, Xingjian Du, Matteo Grella, Kranthi Gv, Xuzheng He, Haowen Hou, Przemyslaw Kazienko, Jan Kocoń, Jiaming Kong, Bartłomiej Koptyra, et al. 2023. RWKV: Reinventing RNNs for the Transformer Era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, Singapore. Association for Computational Linguistics.
Cite (Informal): RWKV: Reinventing RNNs for the Transformer Era (Peng et al., Findings 2023)
PDF: https://aclanthology.org/2023.findings-emnlp.936.pdf
