@inproceedings{li-etal-2025-quickllama,
title = "{Q}uick{LL}a{MA}: Query-aware Inference Acceleration for Large Language Models",
author = "Li, Jingyao and
Shi, Han and
Wu, Sitong and
Zheng, Chuanyang and
Li, Zhenguo and
Jiang, Xin and
Xu, Hong and
Jia, Jiaya",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
month = jan,
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.34/",
pages = "508--528",
abstract = "The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still stuggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn`t require extra training and can be seamlessly integrated with any LLMs. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17{\%} compared to the current state-of-the-art on LLaMA3, and by 3.26{\%} on Mistral on the $\infty$-bench. In the Needle-in-a-Haystack and BABILong task, Q-LLM improved upon the current SOTA by 7.0{\%} and 6.1{\%}. Our code is in https://github.com/dvlab-research/Q-LLM."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="li-etal-2025-quickllama">
<titleInfo>
<title>QuickLLaMA: Query-aware Inference Acceleration for Large Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Jingyao</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Han</namePart>
<namePart type="family">Shi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sitong</namePart>
<namePart type="family">Wu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Chuanyang</namePart>
<namePart type="family">Zheng</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Zhenguo</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Xin</namePart>
<namePart type="family">Jiang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hong</namePart>
<namePart type="family">Xu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Jiaya</namePart>
<namePart type="family">Jia</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-01</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 31st International Conference on Computational Linguistics</title>
</titleInfo>
<name type="personal">
<namePart type="given">Owen</namePart>
<namePart type="family">Rambow</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Leo</namePart>
<namePart type="family">Wanner</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Marianna</namePart>
<namePart type="family">Apidianaki</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hend</namePart>
<namePart type="family">Al-Khalifa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Barbara</namePart>
<namePart type="given">Di</namePart>
<namePart type="family">Eugenio</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Steven</namePart>
<namePart type="family">Schockaert</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Abu Dhabi, UAE</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn't require extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the ∞-bench. In the Needle-in-a-Haystack and BABILong tasks, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code is available at https://github.com/dvlab-research/Q-LLM.</abstract>
<identifier type="citekey">li-etal-2025-quickllama</identifier>
<location>
<url>https://aclanthology.org/2025.coling-main.34/</url>
</location>
<part>
<date>2025-01</date>
<extent unit="page">
<start>508</start>
<end>528</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T QuickLLaMA: Query-aware Inference Acceleration for Large Language Models
%A Li, Jingyao
%A Shi, Han
%A Wu, Sitong
%A Zheng, Chuanyang
%A Li, Zhenguo
%A Jiang, Xin
%A Xu, Hong
%A Jia, Jiaya
%Y Rambow, Owen
%Y Wanner, Leo
%Y Apidianaki, Marianna
%Y Al-Khalifa, Hend
%Y Eugenio, Barbara Di
%Y Schockaert, Steven
%S Proceedings of the 31st International Conference on Computational Linguistics
%D 2025
%8 January
%I Association for Computational Linguistics
%C Abu Dhabi, UAE
%F li-etal-2025-quickllama
%X The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still struggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn't require extra training and can be seamlessly integrated with any LLM. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the ∞-bench. In the Needle-in-a-Haystack and BABILong tasks, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code is available at https://github.com/dvlab-research/Q-LLM.
%U https://aclanthology.org/2025.coling-main.34/
%P 508-528
Markdown (Informal):
[QuickLLaMA: Query-aware Inference Acceleration for Large Language Models](https://aclanthology.org/2025.coling-main.34/) (Li et al., COLING 2025)
ACL:
- Jingyao Li, Han Shi, Sitong Wu, Chuanyang Zheng, Zhenguo Li, Xin Jiang, Hong Xu, and Jiaya Jia. 2025. QuickLLaMA: Query-aware Inference Acceleration for Large Language Models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 508–528, Abu Dhabi, UAE. Association for Computational Linguistics.