LatentGate: Low-Latency Semantic Routing via Frozen-Backbone Probing of Small Language Models

Shivam Ratnakar; Abhiroop Talasila; Vinayak K Doifode

Abstract

As Multi-Agent Systems scale to hundreds of specialized agents, routing becomes a critical bottleneck. Prompt-based LLM routers deliver strong semantic reasoning but incur prohibitive latency (~1500–2000ms) and cost that scales with agent count, while embedding-based routers are fast (25–50ms) but collapse semantically similar yet functionally distinct agents. We identify *representation anisotropy*, the geometric collapse of hidden-state vectors into a narrow cone, as a key mechanism underlying embedding-based routing failure. We propose **LatentGate**, a non-generative router that extracts mean-pooled hidden states from a frozen small language model (SLM), applies PCA-whitening to resolve the anisotropy, and trains a lightweight linear probe for agent classification. Across 5 SLM backbones and 100 enterprise agents, LatentGate achieves 98.8% in-domain and 80.0% OOD accuracy on natural queries, 13–22 absolute points above embedding baselines, and 92.9% on CLINC150. It takes ~28ms to run on a T4 GPU, with the SLM forward pass independent of agent count and classification adding a negligible O(Ck) term. We demonstrate the potential of using a lightweight linear probe to enable sub-10ms warm-start retraining from user feedback, providing a foundation for continual learning in production environments. Benchmarking prompt-based routing with GPT-4.1, GPT-4.1-nano, and Gemini 2.5 Flash confirms degradation to 70–77% accuracy at 100 agents with 1500–2000ms latency, motivating non-generative alternatives.
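The pipeline described in the abstract (mean pooling, PCA-whitening, linear probe) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The frozen SLM is replaced by synthetic vectors that share a large common offset, which mimics anisotropy. All function names, the dimensions, and the hyperparameters are hypothetical. We also assume that C in the O(Ck) term is the number of agents and k is the number of retained whitened dimensions.

```python
import numpy as np

def mean_pool(hidden, mask):
    """Average token vectors over non-padding positions.
    hidden: (batch, tokens, dim); mask: (batch, tokens) with 1 = real token."""
    m = mask[..., None].astype(hidden.dtype)
    return (hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1.0, None)

def fit_pca_whitening(X, k, eps=1e-6):
    """Return the mean and a (dim, k) projection that decorrelates X
    and rescales the top-k principal directions to unit variance."""
    mu = X.mean(axis=0)
    Xc = X - mu
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return mu, eigvecs[:, top] / np.sqrt(eigvals[top] + eps)

def mean_pairwise_cosine(X):
    """Average cosine similarity over distinct pairs (near 1 = narrow cone)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    n = len(X)
    return ((Xn @ Xn.T).sum() - n) / (n * (n - 1))

def train_linear_probe(Z, y, num_classes, lr=0.5, epochs=200, init=None):
    """Softmax regression by full-batch gradient descent.
    Passing a previous (W, b) as `init` gives a warm start."""
    n, k = Z.shape
    W, b = init if init is not None else (np.zeros((k, num_classes)), np.zeros(num_classes))
    W, b = W.copy(), b.copy()
    Y = np.eye(num_classes)[y]
    for _ in range(epochs):
        logits = Z @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n                               # gradient w.r.t. logits
        W -= lr * Z.T @ G
        b -= lr * G.sum(axis=0)
    return W, b

def route(Z, W, b):
    """Pick the agent with the highest score: one (k x C) matrix product."""
    return (Z @ W + b).argmax(axis=1)

# Mean pooling on a toy batch: the third token is padding and is ignored.
pooled = mean_pool(np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]]),
                   np.array([[1, 1, 0]]))

# Synthetic stand-in for pooled SLM hidden states (assumption, not real data).
rng = np.random.default_rng(0)
C, d, k, n_per = 5, 32, 8, 40
common = 10.0 * rng.normal(size=d)        # shared offset -> anisotropic cone
centers = rng.normal(size=(C, d))         # small agent-specific signal

def sample(n_each):
    y = np.repeat(np.arange(C), n_each)
    X = common + centers[y] + 0.3 * rng.normal(size=(C * n_each, d))
    return X, y

X_train, y_train = sample(n_per)
X_test, y_test = sample(10)

mu, W_white = fit_pca_whitening(X_train, k)
Z_train = (X_train - mu) @ W_white
Z_test = (X_test - mu) @ W_white

cos_raw = mean_pairwise_cosine(X_train)      # high: vectors share one direction
cos_white = mean_pairwise_cosine(Z_train)    # close to zero after whitening

W_probe, b_probe = train_linear_probe(Z_train, y_train, C)
test_acc = (route(Z_test, W_probe, b_probe) == y_test).mean()
print(f"cosine raw={cos_raw:.3f} whitened={cos_white:.3f} test acc={test_acc:.2f}")
```

In this sketch the only per-query work after the backbone forward pass is the projection and the probe's matrix product, which is where a cost of order C times k would arise. Retraining from feedback would amount to calling `train_linear_probe` again with `init=(W_probe, b_probe)` and a few epochs; the timing figures quoted in the abstract are the authors' and are not reproduced here.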

Anthology ID: 2026.acl-industry.153
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month: July
Year: 2026
Address: San Diego, California, USA
Editors: Yunyao Li, Georg Rehm, Mei Tu
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 2284–2294
URL: https://aclanthology.org/2026.acl-industry.153/
Cite (ACL): Shivam Ratnakar, Abhiroop Talasila, and Vinayak K Doifode. 2026. LatentGate: Low-Latency Semantic Routing via Frozen-Backbone Probing of Small Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 2284–2294, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal): LatentGate: Low-Latency Semantic Routing via Frozen-Backbone Probing of Small Language Models (Ratnakar et al., ACL 2026)
PDF: https://aclanthology.org/2026.acl-industry.153.pdf
