Slender-Mamba: Fully Quantized Mamba in 1.58 Bits From Head to Toe

Zhenxuan Yu, Takeshi Kojima, Yutaka Matsuo, Yusuke Iwasawa


Abstract
Large language models (LLMs) have achieved significant performance improvements in the natural language processing (NLP) domain. However, these models often require large computational resources for training and inference. Recently, Mamba, a language model architecture based on State-Space Models (SSMs), has achieved performance comparable to Transformer models while significantly reducing inference costs by compressing the context window. We explore the potential of the lightweight Mamba architecture by applying the BitNet quantization method to it. In addition, while prior BitNet methods generally quantize only the linear layers in the main model body, we also quantize the embedding and projection layers, given their significant share of the model parameters. In our experiments, we apply ternary quantization to the Mamba-2 (170M) architecture and pre-train the model from scratch on 150B tokens. Our method achieves an approximately 90.0% reduction in the bits used by all parameters, a significant improvement over the 48.4% reduction of the conventional BitNet quantization method. In addition, our method incurs minimal performance degradation in both pre-training perplexity and downstream tasks. These findings demonstrate the potential of incorporating lightweight language models into edge devices, where demand is expected to grow.
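
To make the quantization scheme concrete, below is a minimal Python/PyTorch sketch of the absmean ternary (1.58-bit) quantization popularized by BitNet b1.58, applied here to hypothetical embedding and projection matrices in the spirit of the paper's "head to toe" quantization. The function name, tensor shapes, and per-tensor scaling are illustrative assumptions, not the authors' implementation.

    import torch

    def absmean_ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
        """Quantize a weight tensor to {-1, 0, +1} (about 1.58 bits per weight)
        using an absmean scale, in the style of BitNet b1.58. Illustrative
        sketch only, not the Slender-Mamba reference implementation."""
        scale = w.abs().mean().clamp(min=eps)          # per-tensor absmean scale
        w_ternary = (w / scale).round().clamp(-1, 1)   # snap to {-1, 0, +1}
        return w_ternary, scale

    # Hypothetical embedding and output-projection matrices (shapes invented),
    # quantized in addition to the in-block linear layers ("head to toe").
    embedding = torch.randn(50257, 768)
    projection = torch.randn(768, 50257)

    for name, w in [("embedding", embedding), ("projection", projection)]:
        q, s = absmean_ternary_quantize(w)
        w_hat = q * s  # dequantized weights used in the forward pass
        print(name, sorted(q.unique().tolist()), round(float(s), 4))

As a rough sanity check on the reported figure, if nearly all parameters move from 16-bit to 1.58-bit storage, the bit budget shrinks by about 1 - 1.58/16 ≈ 90%, which is consistent with the reduction stated in the abstract.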
Anthology ID: 2025.coling-main.316
Volume: Proceedings of the 31st International Conference on Computational Linguistics
Month: January
Year: 2025
Address: Abu Dhabi, UAE
Editors: Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue: COLING
Publisher: Association for Computational Linguistics
Pages: 4715–4724
URL: https://aclanthology.org/2025.coling-main.316/
Cite (ACL): Zhenxuan Yu, Takeshi Kojima, Yutaka Matsuo, and Yusuke Iwasawa. 2025. Slender-Mamba: Fully Quantized Mamba in 1.58 Bits From Head to Toe. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4715–4724, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal): Slender-Mamba: Fully Quantized Mamba in 1.58 Bits From Head to Toe (Yu et al., COLING 2025)
PDF: https://aclanthology.org/2025.coling-main.316.pdf