Firewall Routing: Blocking Leads to Better Hybrid Inference for LLMs

Runyu Peng; Yunhua Zhou; Kai Lv; Yang Gao (扬 高); Qipeng Guo; Xipeng Qiu (邱锡鹏)

Abstract

The rapid advancement of Large Language Models (LLMs) has significantly enhanced performance across various natural language processing (NLP) tasks, yet the high computational cost and latency of deploying such models remain critical bottlenecks that limit their broader applicability. To mitigate these challenges, we propose a dynamic hybrid inference framework, Firewall Routing, which efficiently selects between a strong and a weak LLM based on the complexity of the query. A lightweight routing model is trained to optimize resource allocation by learning from response quality and by preventing long-tail queries, which are often too hard for LLMs to solve, from being routed to the stronger model. Moreover, our method uses multiple sampling to make query evaluation more reliable, and leverages Hard Blocking and Soft Blocking to handle long-tail queries and to refine the labels used for model selection. Extensive experiments show that our method outperforms existing routing strategies by up to 5.29% in APGR, demonstrating state-of-the-art performance across multiple benchmarks.
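The abstract does not spell out how the routing labels are built, so the following is only a rough sketch of the general idea, not the authors' implementation. It assumes that each query has several sampled responses per model scored in [0, 1], that a query counts as "long-tail" when even the strong model's mean score falls below a hypothetical `block_threshold`, that hard blocking sets the label to zero, and that soft blocking scales it down. The function names, the threshold values, and the scaling rule are all illustrative assumptions.

```python
from statistics import mean


def routing_label(weak_scores, strong_scores, block_threshold=0.2, soft=False):
    """Target probability of sending a query to the strong model.

    weak_scores / strong_scores: quality scores in [0, 1] for several
    sampled responses from each model (the "multiple sampling" step).
    block_threshold, soft, and the scaling rule are assumptions made
    for this sketch; the abstract does not define them.
    """
    weak_q = mean(weak_scores)
    strong_q = mean(strong_scores)
    # Benefit of the strong model over the weak one, never negative.
    gain = max(0.0, strong_q - weak_q)

    if strong_q < block_threshold:
        # Assumed long-tail case: even the strong model mostly fails.
        if soft:
            # Soft blocking (assumed form): shrink the label instead of zeroing it.
            return gain * strong_q / block_threshold
        # Hard blocking (assumed form): never route this query to the strong model.
        return 0.0
    return gain


def route(p_strong, threshold=0.5):
    """Pick a model from the router's predicted probability (hypothetical rule)."""
    return "strong" if p_strong >= threshold else "weak"
```

In this sketch, a lightweight router would be trained to predict `routing_label` from the query text alone, and `route` would apply a cost-dependent threshold at inference time; both steps are stated here as assumptions rather than as details taken from the paper.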

Anthology ID: 2025.emnlp-main.331
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 6540–6565
URL: https://aclanthology.org/2025.emnlp-main.331/
Cite (ACL): Runyu Peng, Yunhua Zhou, Kai Lv, Yang Gao, Qipeng Guo, and Xipeng Qiu. 2025. Firewall Routing: Blocking Leads to Better Hybrid Inference for LLMs. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6540–6565, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Firewall Routing: Blocking Leads to Better Hybrid Inference for LLMs (Peng et al., EMNLP 2025)
PDF: https://aclanthology.org/2025.emnlp-main.331.pdf
Checklist: 2025.emnlp-main.331.checklist.pdf
