Sitong Wu
2025
QuickLLaMA: Query-aware Inference Acceleration for Large Language Models
Jingyao Li
|
Han Shi
|
Sitong Wu
|
Chuanyang Zheng
|
Zhenguo Li
|
Xin Jiang
|
Hong Xu
|
Jiaya Jia
Proceedings of the 31st International Conference on Computational Linguistics
The capacity of Large Language Models (LLMs) to comprehend and reason over long contexts is pivotal for advancements in diverse fields. Yet, they still stuggle with capturing long-distance dependencies within sequences to deeply understand semantics. To address this issue, we introduce Query-aware Inference for LLMs (Q-LLM), a system designed to process extensive sequences akin to human cognition. By focusing on memory data relevant to a given query, Q-LLM can accurately capture pertinent information within a fixed window size and provide precise answers to queries. It doesn’t require extra training and can be seamlessly integrated with any LLMs. Q-LLM using LLaMA3 (QuickLLaMA) can read Harry Potter within 30s and accurately answer the questions. On widely recognized benchmarks, Q-LLM improved by 7.17% compared to the current state-of-the-art on LLaMA3, and by 3.26% on Mistral on the ∞-bench. In the Needle-in-a-Haystack and BABILong task, Q-LLM improved upon the current SOTA by 7.0% and 6.1%. Our code is in https://github.com/dvlab-research/Q-LLM.