Qian Tao
2025
AsymKV: Enabling 1-Bit Quantization of KV Cache with Layer-Wise Asymmetric Quantization Configurations
Qian Tao | Wenyuan Yu | Jingren Zhou
Proceedings of the 31st International Conference on Computational Linguistics
Large language models (LLMs) have shown exceptional capabilities in a wide range of tasks, such as text generation and video generation, among others. However, due to their massive parameter count, these models often require substantial storage space, imposing significant constraints on the machines deploying LLMs. To overcome this limitation, one research direction proposes to compress the models by replacing floating-point numbers with integers, a process known as quantization. Some recent studies suggest quantizing the key and value cache (KV cache) of LLMs, and design quantization techniques that treat the key and value matrices equivalently. This work delves deeper into the asymmetric structural roles of the KV cache, a phenomenon where the transformer’s output loss is more sensitive to the quantization of key matrices. We conduct a systematic examination of the attention output error resulting from key and value quantization. The phenomenon inspires us to propose an asymmetric quantization strategy. Our approach allows for 1-bit quantization of the KV cache by implementing distinct configurations for key and value matrices. We carry out experiments across a variety of datasets, demonstrating that our proposed model allows for the quantization of up to 75% of decoder layers with 1 bit, while simultaneously maintaining performance levels comparable to those of the models with floating-point parameters.
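The idea of giving key and value matrices distinct per-layer bit widths can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform min-max quantizer, the helper `layer_bits`, the thresholds `n_high_key` and `n_high_value`, and the choice of 4 bits for the higher-precision layers are all assumptions made for illustration. The only property taken from the abstract is that keys and values get different configurations, with keys treated as the more sensitive of the two.

```python
import numpy as np

def quantize(x, bits):
    """Uniform min-max quantization along the last axis (an assumed scheme)."""
    levels = 2 ** bits - 1
    lo = x.min(axis=-1, keepdims=True)
    hi = x.max(axis=-1, keepdims=True)
    # Guard against constant rows, where hi == lo would give a zero scale.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((x - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Map integer codes back to approximate floating-point values."""
    return codes * scale + lo

def layer_bits(layer, n_high_key, n_high_value, high_bits=4):
    """Hypothetical layer-wise config: the first n_high_* layers keep
    high_bits, every later layer drops to 1 bit."""
    key_bits = high_bits if layer < n_high_key else 1
    value_bits = high_bits if layer < n_high_value else 1
    return key_bits, value_bits

rng = np.random.default_rng(0)
num_layers = 8
# Keys keep higher precision in more layers than values (assumed thresholds).
configs = [layer_bits(i, n_high_key=4, n_high_value=2) for i in range(num_layers)]

for layer, (key_bits, value_bits) in enumerate(configs):
    k = rng.standard_normal((16, 64))  # toy key cache: 16 tokens, 64 dims
    v = rng.standard_normal((16, 64))  # toy value cache
    k_err = np.abs(k - dequantize(*quantize(k, key_bits))).mean()
    v_err = np.abs(v - dequantize(*quantize(v, value_bits))).mean()
    print(f"layer {layer}: K {key_bits}-bit err={k_err:.3f}, V {value_bits}-bit err={v_err:.3f}")
```

In this toy setup the reconstruction error jumps when a layer drops to 1 bit; which layers can tolerate that drop for keys versus values is the question the paper studies.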
2023
A Structure-Aware Generative Adversarial Network for Bilingual Lexicon Induction
Bocheng Han | Qian Tao | Lusi Li | Zhihao Xiong
Findings of the Association for Computational Linguistics: EMNLP 2023
Bilingual lexicon induction (BLI) is the task of inducing word translations with a learned mapping function that aligns monolingual word embedding spaces in two different languages. However, most previous methods treat word embeddings as isolated entities and fail to jointly consider both the intra-space and inter-space topological relations between words. This limitation makes it challenging to align words from embedding spaces with distinct topological structures, especially when the assumption of isomorphism may not hold. To this end, we propose a novel approach called the Structure-Aware Generative Adversarial Network (SA-GAN) model, which explicitly captures multiple kinds of topological structure information to achieve accurate BLI. Our model first incorporates two lightweight graph convolutional networks (GCNs) to leverage intra-space topological correlations between words for generating source and target embeddings. We then employ a GAN model to explore inter-space topological structures by learning a global mapping function that initially maps the source embeddings to the target embedding space. To further align the coarse-grained structures, we develop a pair-wised local mapping (PLM) strategy that enables word-specific transformations in an unsupervised manner. Extensive experiments conducted on public datasets, including languages with both distant and close etymological relationships, demonstrate the effectiveness of our proposed SA-GAN model.
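The basic BLI setup described above, a mapping function that aligns two embedding spaces followed by nearest-neighbor retrieval, can be sketched as follows. This is not SA-GAN: the GCN, GAN, and PLM components are omitted, and a supervised orthogonal Procrustes fit on a small seed dictionary is swapped in as the simplest stand-in for a learned global mapping. The function names `procrustes_map` and `translate` and the synthetic data are hypothetical.

```python
import numpy as np

def procrustes_map(src, tgt):
    """Orthogonal matrix W minimizing ||src @ W - tgt||_F (Procrustes solution)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

def translate(src_vecs, tgt_vecs, w):
    """Map source vectors with w, then return the index of the
    cosine-nearest target vector for each source word."""
    mapped = src_vecs @ w
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return (mapped @ tgt.T).argmax(axis=1)

rng = np.random.default_rng(0)
src = rng.standard_normal((50, 16))           # toy source-language embeddings
rotation, _ = np.linalg.qr(rng.standard_normal((16, 16)))
tgt = src @ rotation                          # toy target space: an exact rotation

w = procrustes_map(src[:30], tgt[:30])        # fit on 30 seed translation pairs
pred = translate(src, tgt, w)                 # induce translations for all 50 words
accuracy = float((pred == np.arange(50)).mean())
print(f"accuracy on toy data: {accuracy:.2f}")
```

The toy target space is perfectly isomorphic to the source, so a single global rotation recovers every pair. The abstract's point is that real embedding spaces often violate this assumption, which is the motivation for the structure-aware and word-specific components.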