Quick Dense Retrievers Consume KALE: Post-Training Kullback-Leibler Alignment of Embeddings for Asymmetrical Dual Encoders

In this paper, we consider the problem of improving the inference latency of language-model-based dense retrieval systems by introducing structural compression and model-size asymmetry between the context and query encoders. First, we investigate the impact of pre- and post-training compression on the MSMARCO, Natural Questions, TriviaQA, SQuAD, and SciFact datasets, finding that asymmetry in the dual encoders of dense retrieval can lead to improved inference efficiency. Building on this, we introduce Kullback-Leibler Alignment of Embeddings (KALE), an efficient and accurate method for increasing the inference efficiency of dense retrieval methods by pruning and aligning the query encoder after training. Specifically, KALE extends traditional knowledge distillation after bi-encoder training, allowing for effective query-encoder compression without full retraining or index regeneration. Using KALE and asymmetric training, we can generate models that exceed the performance of DistilBERT despite having 3x faster inference.


Introduction
A bi-encoder-based retriever, often called a dense retriever, is a retrieval function that leverages the vector representations of queries and documents as a proxy for relevance. Using two encoders, one for the query and one for the document, the input data is mapped into a common latent space where closeness becomes a proxy for relevance. Dense retrievers have become increasingly popular due to their ability to capture the semantic relationships between query and document terms. However, bi-encoder-based models can also be computationally expensive, particularly when dealing with large datasets. As a result, there has been growing interest in methods for compressing these models to reduce their computational complexity without sacrificing performance.

Figure 1: QPS vs. recall at 100 on the NQ dataset using KALE and asymmetric training. Using asymmetry and KALE, it is possible to gain 3x QPS with no loss in accuracy and 4.5x with a 1% loss in performance. We calculate QPS as the mean number of queries per second with a batch size of 1 and a max sequence length of 32 on a T4 GPU.

In this paper, we explore the role of asymmetry in the size of query and document encoders that leverage language models. Through experiments on several benchmarks, we demonstrate that our approach can significantly reduce the number of parameters in the bi-encoder model without sacrificing performance. As shown in figure 1, the combination of asymmetric bi-encoders and post-training KALE allows for 3x more QPS than an uncompressed bi-encoder with less than 1% loss in accuracy, and nearly 5x with less than 2%. Building on the favorable implications of asymmetry for efficient inference, we introduce a compression mechanism called Kullback-Leibler Alignment of Embeddings (KALE). KALE uses a divergence-based alignment of representations to compress models without requiring any form of retraining or index regeneration.
To ground our approaches, we evaluate the effectiveness of KALE and asymmetry on several benchmark datasets and compare the results to existing efficient inference approaches. The following research questions drive our work:
• Is the performance of dense retrieval methods driven more by the query encoder size or the document encoder size?
• Is it possible to compress query encoders without retraining and index regeneration?
• How can dense retrieval asymmetry and post-training alignment be leveraged to improve query encoder latency?
In answering these questions, we deliver the following contributions:
• We present the first robust study on the role of document-query encoder symmetry, demonstrating that the size of the document encoder dominates performance.
• We introduce KALE, a post-training compression and alignment approach, and demonstrate its effectiveness.
• We empirically demonstrate on various benchmarks how asymmetric compression can lead to 4.5x better QPS with a 1% loss in recall accuracy at 100.

Related Work
Transformer-based language models such as BERT (Devlin et al., 2019) provide contextual language representations built on the Transformer architecture (Vaswani et al., 2017), which can be specialized and adapted for specific tasks and domains (Lee et al., 2020). Using contextual word representations, it becomes relatively easy to excel at a broad range of natural language processing tasks such as question answering, text classification, and sentiment analysis.

Bi-encoders, commonly called dual encoders or dense retrievers, decompose ranking by leveraging the inner product of query and document representations to produce a relevance score for query-document pairs. While not as accurate as cross-encoders (Reimers and Gurevych, 2019), they are more efficient for inference and easier to deploy. Bi-encoder document representations are query-invariant, allowing them to be pre-computed and loaded into an Approximate Nearest Neighbor (ANN) index such as FAISS (Johnson et al., 2019). At runtime, a query is encoded into the latent space, and the top-k documents are retrieved using a nearest neighbor algorithm such as HNSW (Malkov and Yashunin, 2016). Since the entire document index has been pre-computed, the retrieval latency is limited to a single call of the query encoder.
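This deployment pattern (pre-computed, query-invariant document vectors plus a single query-encoder call per request) can be sketched in a few lines. The `encode` function below is a hypothetical stand-in for a language-model encoder, and brute-force inner product stands in for an ANN index such as FAISS:

```python
import numpy as np

def encode(texts, dim=8):
    # Hypothetical stand-in for a language-model encoder: a deterministic
    # bag-of-characters projection, L2-normalised so the inner product
    # behaves like cosine similarity.
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for ch in text.lower():
            vecs[i, ord(ch) % dim] += 1.0
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    return vecs / np.maximum(norms, 1e-9)

# "Index generation": encode the corpus once, offline. These vectors are
# query-invariant, so in a real system they would be loaded into an ANN
# index (e.g. FAISS with HNSW) rather than a plain array.
docs = ["kale is a leafy green vegetable",
        "dense retrieval with bi-encoders",
        "structural pruning of transformer layers"]
doc_index = encode(docs)

def retrieve(query, k=2):
    # At query time only the query encoder runs; retrieval is a
    # nearest-neighbour search over the pre-computed document vectors.
    q = encode([query])[0]
    scores = doc_index @ q
    top = np.argsort(-scores)[:k]
    return [(docs[i], float(scores[i])) for i in top]
```

Because `doc_index` is fixed, any speedup to `encode` on the query side pays off on every request, which is the asymmetry the paper exploits.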
Bi-encoders commonly leverage language models such as BERT (Devlin et al., 2019) to retrieve short passages of text, leading to the task descriptor of Dense Passage Retrieval (DPR) (Karpukhin et al., 2020). Driven by their efficiency in deployment and relevance performance, DPR-based models have rapidly become the building blocks for systems doing product search (Magnani et al., 2022), open-domain question answering (Karpukhin et al., 2020), and customer support (Mesquita et al., 2022).

Efficient inference studies methods and models that decrease model execution cost while minimizing losses in model performance. Knowledge distillation (Hinton et al., 2015) is a training method in which a model, called the student, learns to emulate a teacher model, which is commonly larger or better performing than the student. Unstructured pruning removes individual weights or groups of weights in a model by applying a mask or setting the weight values to 0. When paired with a sparsity-aware inference engine, it is possible to gain 3-5x speedups in inference throughput with little to no loss in accuracy (Kurtić et al., 2022).

Method
The use of representation models for retrieval begins with a document space d and a query space q, each of which is generated by some model m. The models do not need to share the same initialization, shape, or size, but their representation vectors must share the same dimensionality unless some projection is applied. The two models learn a notion of relevance by training to minimize the distance of positive query-document pairs, as shown in equation 1, where x is a query vector, y is a document vector, and · denotes the dot product of the vectors.
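Equation 1 is not reproduced here; a plausible reconstruction, consistent with the definitions above (x a query vector, y a document vector, · the dot product) and with the DPR-style training the paper builds on, is the dot-product similarity with a contrastive negative-log-likelihood objective over one positive document y⁺ and negatives yⱼ⁻:

```latex
\mathrm{sim}(x, y) = x \cdot y
\qquad (1)

\mathcal{L} = -\log
\frac{e^{\,x \cdot y^{+}}}
     {e^{\,x \cdot y^{+}} + \sum_{j} e^{\,x \cdot y_{j}^{-}}}
```

This should be read as a reading of the missing equation, not the paper's exact formula.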
The query and document encoder models are commonly initialized with a pre-trained language model such as BERT. Then, using pairs of queries and documents labeled with positive relevance, the models are trained to minimize the distance between queries and their relevant documents (Karpukhin et al., 2020).

Figure 2: Measuring the impact on recall at 20 on the NQ retrieval dataset of varying the number of transformer layers in the query encoder and document encoder.

While it is common practice to initialize the query encoder and document encoder with identical language models, this ignores the cost asymmetry of their usage patterns. The document encoder is usually used only once, during the large-scale batch generation of the index. Index generation happens in a latency-insensitive environment and can easily leverage many GPUs and large batch sizes to improve efficiency. The query encoder runs every time a user issues a query, which can happen irregularly and sporadically, and it responds to each user query independently. Thus, query encoders often use a batch size of 1 and commonly run on small inference-optimized hardware like a T4 GPU or small CPUs. Since the document encoder does not run often, any improvement in its latency produces a single fixed gain that depends entirely on the corpus size and index refresh cycle. The query encoder's user-facing nature means latency improvements accrue every time a user queries.

Role of Model Symmetry with Bi-encoders
The variation in latency sensitivity between the query and document encoder leads to the question: is there some form of asymmetry between the query encoder and the document encoder that can be exploited? Do the two encoders need to be compressed symmetrically?
To answer this question, we explore the impact on performance of pruning the query and document encoders on the NQ passage retrieval dataset (Kwiatkowski et al., 2019). Using a BERT-base uncased model with 12 transformer encoder layers, we generate structurally pruned models with 9, 6, 3, 2, and 1 layers. We also further pre-train the three- and six-layer models using knowledge distillation from a 12-layer model on the Wikipedia-book corpus, similar to DistilBERT (Sanh et al., 2019); these are denoted 6-KD and 3-KD. Then, using each of these seven models, we train dense retrieval models on the NQ passage retrieval dataset with variations of query and document models, resulting in 72 variants. With each of these models, we generate a full index and evaluate retrieval performance on the development portion of the dataset. We do not tune any parameters, to avoid overfitting and to explore asymmetry without over-optimizing. Each model's retrieval accuracy is evaluated with retrieval sets of depth 20, 100, and 200. We compare the impact of varying the encoders against the uncompressed baseline and a DistilBERT model (denoted 6-db). Looking at the impact of symmetric compression, as shown in table 1, we see that the impact of compression is more pronounced with a small recall set: the retrieval accuracy impact at 20 is 3x that at 200. We also observe major accuracy gains from fine-tuning the pruned models, with a 4% gap for the six-layer model and a 6% gap for the three-layer model on recall at 20. Looking at the results of asymmetrical training in table 2, we find that the document encoder drives retrieval accuracy. As shown in figure 2, retrieval accuracy is driven by the document encoder, consistent with prior work showing that out-of-domain performance depends on the document encoder (Li and Lin, 2021).
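The structural pruning used to produce the 9/6/3/2/1-layer variants simply truncates the stack of transformer blocks. A minimal dependency-free sketch, with simple callables standing in for transformer blocks (a real block would be attention plus a feed-forward network):

```python
def make_layer(i):
    # Hypothetical stand-in for one transformer block: an affine map
    # on a scalar hidden state.
    return lambda h: 0.9 * h + 0.01 * i

def prune_layers(layers, keep):
    # Structural pruning: keep only the first `keep` layers, mirroring
    # the 12 -> 9/6/3/2/1 truncation used in the experiments.
    return layers[:keep]

def forward(layers, h=1.0):
    # Run the hidden state through the (possibly pruned) layer stack.
    for layer in layers:
        h = layer(h)
    return h

full_encoder = [make_layer(i) for i in range(12)]
pruned_encoder = prune_layers(full_encoder, 3)
```

With a real checkpoint the same idea is typically one line, e.g. `model.bert.encoder.layer = model.bert.encoder.layer[:3]` (the module path varies by architecture), after which the truncated model is fine-tuned or aligned.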
The size of the document encoder sets the upper bound on a model's performance: a model with 12 layers in the query encoder and 9 in the document encoder performs worse than one with the numbers flipped. The dominance of the document encoder is a logical outcome, as the latent space of queries is simpler than the latent space of documents. As a result, system performance is governed by how well the document encoder can generate representations. Similar results can be seen with the introduction of the fine-tuned three- and six-layer models, as shown in table 6. Unsurprisingly, KD-optimized language models outperform non-distilled models, and any asymmetrical variant that leverages a distilled model outperforms the un-distilled variant. Without further optimization, a model with a distilled three-layer query encoder and a 12-layer document encoder will outperform a model with symmetrical six-layer encoders despite being 2x faster.

Inference Benchmarks
To evaluate the impact of structural pruning, we benchmark the inference speed of query encoding while varying the number of transformer layers. We perform benchmarking using an Intel Xeon Gold 6238R processor and an Nvidia T4 GPU. For each model, we evaluate the performance of encoding 6500 queries with a batch size of one and a max context length of 32. For CPU inference, we evaluate the performance of models using the ONNX library, and for GPU inference, we evaluate native PyTorch inference. We repeat each run five times to ensure consistency and report the mean. Summary statistics can be found in table 3, and full results, including percentiles, standard deviations, and confidence intervals, can be found in appendix section .5.
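This protocol (batch size 1, repeated runs, mean throughput) is straightforward to reproduce; a minimal harness, with the encoder call left abstract, might look like:

```python
import time
import statistics

def benchmark_qps(encode_fn, queries, repeats=5):
    """Mean and stdev of queries-per-second over `repeats` runs,
    encoding one query at a time (batch size 1), as in the paper's setup."""
    runs = []
    for _ in range(repeats):
        start = time.perf_counter()
        for q in queries:
            encode_fn(q)  # one query-encoder call per query
        elapsed = time.perf_counter() - start
        runs.append(len(queries) / elapsed)
    return statistics.mean(runs), statistics.stdev(runs)

# Hypothetical stand-in encoder; in practice this would wrap an ONNX
# Runtime session (CPU) or a PyTorch model (GPU, where the device must
# be synchronised before reading the clock).
dummy_encoder = lambda q: sum(ord(c) for c in q)
qps_mean, qps_std = benchmark_qps(dummy_encoder, ["example query"] * 500)
```

Reporting the standard deviation alongside the mean, as the paper's appendix does, guards against one-off scheduling noise in the timing runs.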

KL Alignment of Embeddings
While training asymmetric models can improve latency, it requires novel training regimes and experimentation, and existing workloads need to regenerate their entire index to take advantage of any inference speedups. Generation of the passage index can take longer than model training (Karpukhin et al., 2020), which makes regenerating a new index and retraining a model to meet changing latency requirements an inefficient experimentation pathway. Moreover, coupling asymmetry into training makes generating query-encoder variants more difficult, as each encoder requires its own index and document encoder.

Table 3: Variation in model throughput according to the serving method and the number of transformer layers. Structural pruning can lead to a 6x and 8x performance increase on GPU and CPU, and pruning a model to 3 layers allows a CPU to offer better inference performance than the GPU.

Motivated by this bottleneck, we introduce Kullback-Leibler Alignment of Embeddings (KALE), a simple method for improving bi-encoder latency by aligning the embeddings of compressed models. KALE is applied after model training and leverages large batch sizes to make compression computationally inexpensive and independent of training. Using a single V100 GPU, KALE can produce a compressed query encoder in less than 5 minutes. First, a bi-encoder model is trained with separate query and document encoders. When training is complete, the document encoder, e_document, is frozen, and a structurally pruned copy, e_q', of the query encoder e_q is made. Then, using a sample of queries, the e_q' model is fine-tuned to minimize the KL divergence of its query representations from those of e_q, as shown in equation 2.
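A plausible reading of the alignment objective in equation 2 is a batch-mean KL divergence between the softmaxed embeddings of the frozen encoder e_q and the pruned copy e_q'. The numpy sketch below illustrates that objective with hypothetical linear maps standing in for the two query encoders; the paper's actual method fine-tunes a transformer (with dropout and a loss-scaling factor of ten), so this only shows the divergence being minimized:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(p, q):
    # Batch-mean KL(p || q) between rows of two distribution matrices.
    return np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1).mean()

dim_in, dim_out = 16, 8
W_teacher = rng.normal(size=(dim_in, dim_out))   # frozen trained e_q (stand-in)
W_student = rng.normal(size=(dim_in, dim_out))   # structurally pruned e_q' (stand-in)
queries = rng.normal(size=(64, dim_in))          # sample of query inputs

p = softmax(queries @ W_teacher)                 # fixed target distributions
initial_kl = mean_kl(p, softmax(queries @ W_student))

lr = 0.2
for _ in range(300):
    logits = queries @ W_student
    q = softmax(logits)
    # d(mean KL)/d(logits) = (q - p) / batch; backprop through the linear map.
    grad_logits = (q - p) / len(queries)
    W_student -= lr * (queries.T @ grad_logits)

final_kl = mean_kl(p, softmax(queries @ W_student))
```

Because the document encoder and index are untouched, this alignment step can be rerun cheaply for each pruned query-encoder variant.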
We explored various distance functions, such as cosine similarity, Manhattan distance, and the KL divergence, but found little sensitivity to any metric besides KL divergence. We believe this is because we freeze the document representations; as a result, cosine distance allows the query embeddings to drift more than probability-distribution-matching methods do. To explore this further, we experimented with tuning the temperature for the KL divergence and adding a loss scaling factor, and found a temperature of one and a scaling factor of ten to be optimal. Additionally, we explored using a contrastive loss with random negatives and with hard negatives mined from the trained encoder but found no positive impact for either method. We leave further exploration of training-objective improvements for future work.

Experimental Results
We evaluate the effectiveness of KALE by taking uncompressed BERT-base models and pruning them with and without KALE on a variety of well-established passage retrieval benchmarks. First, models are trained and indexes are generated using un-optimized BERT-base models. Next, the document encoders are frozen, and the query encoders are structurally pruned to have 9, 6, 3, 2, or 1 transformer layers. Finally, the query encoders are aligned using KALE, and we compare the performance of the compressed models by measuring the impact on retrieval accuracy at 20, 100, and 200.
To aid reproducibility, each model is trained using the Tevatron (Gao et al., 2022) library, which makes use of Hugging Face's transformers to provide a simple interface for exploring neural ranking models. Our experiments focus on the plain BERT-base-uncased 12-layer transformer model. While newer, more capable models exist, the unaltered BERT model is widely used in the production workloads our experiments seek to emulate. Our aim is not to produce the highest possible retrieval accuracy for a dense encoder; instead, our goal is to characterize the role of asymmetry in bi-encoder models. As a result, we leverage well-established parameters in all of our experiments without using an advanced methodology like contrastive or curriculum learning. KALE has few parameters, and we deliberately optimize nothing but the loss between e_q and e_q'. In general, higher degrees of pruning require longer training with smaller batches.
Datasets. We use a wide variety of standard dense retrieval benchmarks, including MSMARCO V1. For each dataset, we evaluate performance by measuring recall accuracy at retrieval depths of 20, 100, and 200. Additionally, for the MSMARCO dataset we also report MRR@10, and for SciFact we also report NDCG@10 and RR@10.

Computational Experiments. Our fine-tuning of compressed models uses a 16 GB V100 GPU. Experiments in bi-encoder model training leverage one V100 for MSMARCO and four V100s for each other experiment. Due to the vast number of models and datasets we train on, each experiment uses the same fixed seed.

Evaluating KALE
We compare the performance of using KALE for post-training compression in figure 3 across the five datasets and see a fairly consistent trend. When the recall set is small and the query encoders are pruned to a high degree, the impact of KALE is most visible, often driving over 50% improvements in retrieval accuracy. Additionally, using KALE allows the models to have a steady and gradual drop in recall accuracy relative to speedup, instead of the sharp drop shown by regular structural pruning. Without KALE, post-training compression causes a 20-50% loss in retrieval accuracy. With KALE, these losses are cut to 1-10%. In practice, this allows one- or two-layer encoder models to run with CPU-based inference with only minor impacts on accuracy. We also notice a surprising performance improvement.

Figure 3: Impact of structural pruning with and without KALE on the NQ, MSMARCO, TriviaQA, SciFact, and SQuAD passage retrieval datasets with recall set sizes of 20, 100, and 200. Across datasets, we see a consistent trend where KALE is effective, but most effective when the network is heavily pruned and recall set sizes are small. When the model is pruned to 2 or 1 layers with a recall set size of 20, the difference between using KALE or not can be up to 10x the loss in recall accuracy.

Aiding Asymmetry with KALE
Seeking to optimize compression further, we combine KALE with asymmetrical fine-tuning and evaluate the results similarly to our earlier experiments. Results on the impact of KALE and asymmetry on recall accuracy at 100 across the five datasets can be found in table 5, where 3-KD–6-KD denotes a three-layer query encoder with a six-layer document encoder, and 3-KD–3-KD denotes dual three-layer encoders. Full results and metrics for each task can be found in appendix section .4. First, it is immediately observable that post-training compression via KALE performs worse than models natively trained at that size. We believe this is because the KALE models converge to some distance from the uncompressed model on account of dropout. We experimented with not using dropout in KALE, but model performance quickly suffered. Looking at the best retrieval accuracy vs. model speedup, shown in figure 4, we can see substantial variation in the impact of compression across datasets. On tasks like SciFact, it is possible to get over 4x speedup while improving accuracy, while on tasks like SQuAD, even minor speedups lead to major losses in accuracy. We believe this variation is driven by the relative difficulty of each dataset: easier tasks are more compressible than harder tasks. These variations in results highlight the utility of post-training compression methods like KALE: given the task-dependent impact of compression, iteration speed and cost are essential to effectively tuning the trade-off between inference speed and accuracy.

Conclusion and Future Work
In this work, we have demonstrated how asymmetry between the query and document encoders in bi-encoder models can be leveraged for improved inference efficiency across CPUs and GPUs. Using our post-training compression framework, KALE, we can compress models up to 6x with little loss in accuracy. Compressing models without regenerating the document index or the document encoder makes it practical to have many query encoders, each tailored to a use case's latency needs. In the future, we wish to study how asymmetry in retrieval can be implemented with models that are widely different and may have different hidden sizes, such as using MiniLM for the query model and RoBERTa-Large for the document model.

.1 Asymmetrical Dense Retrieval

The impact of structural pruning with asymmetrical dense retrieval can be found in table 6. Similar to other works studying knowledge distillation (Sanh et al., 2020), the use of distillation improves performance by a non-negligible margin. The impact of pruning and KALE is fairly consistent across datasets, but there are larger losses on some smaller datasets, such as SciFact and SQuAD.

.4 KALE and Asymmetric Training
Building on the impact of asymmetry and KALE, we compare them across various datasets, as shown in tables 14, 15, 16, 17, and 18.