Scaling Laws and Efficient Inference for Ternary Language Models

Tejas Vaidhya; Ayush Kaushal; Vineet Jain; Francis Couture-Harpin; Prashant Shishodia; Majid Behbahani; Yuriy Nevmyvaka; Irina Rish

doi:10.18653/v1/2025.acl-long.1294

Abstract

Large language models (LLMs) are increasingly used across research and industry applications, yet their inference efficiency remains a significant challenge. While the computational power of modern GPU architectures continues to improve, their memory bandwidth and capacity have not scaled proportionally, creating a critical bottleneck during inference. To address this, we investigate ternary language models (TriLMs) that employ quantization-aware training to significantly reduce memory requirements. We first study the scalability of TriLMs through a scaling law analysis, which reveals that TriLMs benefit more from increasing training data than from scaling model parameters. Based on this observation, we introduce TriTera, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights, which accelerate inference across various CPU architectures. Building on the 2-bit packing, we develop a GPU kernel called TriRun that accelerates end-to-end model inference by up to 5× compared to floating-point baselines. To encourage further exploration and development of TriLMs, we will release the TriTera suite and TriRun inference kernels. Overall, our work lays the foundation for building and deploying efficient LLMs, providing a valuable resource for the research community.
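To make the bit counts in the abstract concrete, the following is a minimal sketch of two generic ways to pack ternary weights into bytes. It is not the paper's packing layout or kernel code, which this page does not describe; the function names (`pack_2bit`, `pack_base3`, and their inverses) are hypothetical. It assumes weights take values in {-1, 0, 1}. Storing four weights per byte with two bits each gives 2 bits per weight, while storing five weights per byte as a base-3 number (possible because 3^5 = 243 fits in one byte) gives 8/5 = 1.6 bits per weight.

```python
from typing import List


def pack_2bit(trits: List[int]) -> bytes:
    """Pack ternary values (-1, 0, 1) four per byte, two bits each."""
    out = bytearray()
    for i in range(0, len(trits), 4):
        byte = 0
        for j, t in enumerate(trits[i:i + 4]):
            byte |= (t + 1) << (2 * j)  # codes 0, 1, 2; code 3 is unused
        out.append(byte)
    return bytes(out)


def unpack_2bit(data: bytes, n: int) -> List[int]:
    """Inverse of pack_2bit; n is the original number of weights."""
    trits = []
    for byte in data:
        for j in range(4):
            trits.append(((byte >> (2 * j)) & 0b11) - 1)
    return trits[:n]  # drop padding from a partially filled last byte


def pack_base3(trits: List[int]) -> bytes:
    """Pack ternary values five per byte as a base-3 number (max 242)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        byte = 0
        for t in reversed(trits[i:i + 5]):
            byte = byte * 3 + (t + 1)  # first weight ends up as lowest digit
        out.append(byte)
    return bytes(out)


def unpack_base3(data: bytes, n: int) -> List[int]:
    """Inverse of pack_base3; n is the original number of weights."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]  # drop padding from a partially filled last byte


weights = [-1, 0, 1, 1, 0, -1, 1, 0, 0, -1]
packed2 = pack_2bit(weights)    # 10 weights at 4 per byte
packed16 = pack_base3(weights)  # 10 weights at 5 per byte
```

The base-3 variant is denser but needs division and modulo to decode, whereas the 2-bit variant decodes with shifts and masks only; this is one plausible reason to prefer 2-bit storage where decode speed matters, though the sketch makes no claim about the trade-offs measured in the paper.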

Anthology ID: 2025.acl-long.1294
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 26679–26710
URL: https://aclanthology.org/2025.acl-long.1294/
DOI: 10.18653/v1/2025.acl-long.1294
Cite (ACL): Tejas Vaidhya, Ayush Kaushal, Vineet Jain, Francis Couture-Harpin, Prashant Shishodia, Majid Behbahani, Yuriy Nevmyvaka, and Irina Rish. 2025. Scaling Laws and Efficient Inference for Ternary Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26679–26710, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): Scaling Laws and Efficient Inference for Ternary Language Models (Vaidhya et al., ACL 2025)
PDF: https://aclanthology.org/2025.acl-long.1294.pdf
