Improving Embedding-based Large-scale Retrieval via Label Enhancement

Current embedding-based large-scale retrieval models are trained with 0-1 hard labels that indicate whether a query is relevant to a document, ignoring the rich information carried by graded relevance degrees. This paper proposes to improve embedding-based retrieval by better characterizing the query-document relevance degree, introducing label enhancement (LE) to this field for the first time. To generate label distributions in the retrieval scenario, we design a novel and effective supervised LE method that incorporates prior knowledge from dynamic term weighting methods into contextual embeddings. By training models with the generated label distributions as auxiliary supervision information, our method significantly outperforms four competitive existing retrieval models and its counterparts equipped with two alternative LE techniques. This superiority is consistently observed on English and Chinese large-scale retrieval tasks under both standard and cold-start settings.


Introduction
Retrieval systems such as search engines have become a vital tool in helping people access the vast amount of information online. As shown in Figure 1, existing methods for large-scale retrieval first utilize a less powerful but more efficient retrieval algorithm (the Retriever) to narrow down the candidate set, and then employ more powerful models (the Ranker) to re-rank the retrieved documents (Padaki et al., 2020; Mass and Roitman, 2020). This paper focuses on improving the Retriever.
With pre-trained word embeddings (Mikolov et al., 2013a,b; Pennington et al., 2014) and language models (e.g., BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019)) achieving great success in a wide variety of NLP tasks, researchers have begun to leverage BERT-style models to solve large-scale retrieval problems. These models treat the retrieval phase as a regression task trained with 0-1 hard labels, representing only two types of relevance degrees (relevant or irrelevant) between query-document pairs (Chang et al., 2020; Lu et al., 2020).
The relevance degrees between queries and documents, however, can take many more values. For example, we present a query and three actual results retrieved by the Google search engine in Figure 2. Though all three documents are relevant to the query, the relevance degrees can vary significantly if we assign a real-valued number indicating to what extent a query and a document relate. On the other hand, even if a query and a document are marked as irrelevant by the hard label, a weak relevance degree may still exist between them. In such scenarios, a label distribution (Geng, 2016), which encodes the relevance degrees between queries and documents, is a more reasonable description of an instance. This observation inspires us to explore label distributions to improve existing large-scale Retriever models trained with hard labels. Two natural LE methods for Retriever models come to mind.
• One straightforward LE method in our scenario is to exploit the semantic relevance between queries and documents based on classic term weighting methods (e.g., TF-IDF (Spärck Jones, 1972, 2004)). The problem with this method is that the term weights are static and context-free. For example, given the sentence "EMNLP 2021 is held after ACL 2021, accepted papers will be published in ACL Anthology.", the first "ACL" is a conference name while the second "ACL" refers to a professional society, so they should have different term weights. However, TF-IDF cannot distinguish them and will assign them unreasonably equal term weights, as the sketch after this list illustrates.
• Another way to generate label distributions is to train a contextual-embedding-based model with hard labels and then exploit its prediction scores as label distributions, a strategy widely used for knowledge distillation (Hinton et al., 2015) and performance improvement (Zhang et al., 2019). This kind of label distribution, called dark knowledge by Furlanello et al. (2018), is generated implicitly and lacks a clear physical interpretation. From this perspective, term weighting methods can bring complementary and more explainable prior knowledge that benefits the Retriever model.
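As a minimal illustration of the first option's limitation, the sketch below computes raw TF-IDF weights for the example sentence; the toy corpus and smoothing choice are hypothetical, but any context-free scheme collapses both occurrences of "ACL" into a single weight.

```python
import math
from collections import Counter

def tfidf_weights(tokens, corpus):
    """Context-free TF-IDF: a single weight per term type, regardless of context."""
    n_docs = len(corpus)
    tf = Counter(tokens)
    weights = {}
    for term in tf:
        df = sum(term in doc for doc in corpus)
        idf = math.log(n_docs / (1 + df))  # smoothing choice is hypothetical
        weights[term] = tf[term] * idf
    return weights

sentence = ("EMNLP 2021 is held after ACL 2021 , accepted papers "
            "will be published in ACL Anthology .").split()
corpus = [set(sentence), {"EMNLP", "papers"}, {"NLP", "research"}]  # toy corpus
weights = tfidf_weights(sentence, corpus)
# Both occurrences of "ACL" (the conference and the society) collapse into
# the single entry weights["ACL"]; a context-aware scorer would not.
print(weights["ACL"])
```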
To this end, we choose to generate label distributions based on term weighting methods in a way that integrates the merits of the two paradigms above. Specifically, we employ BERT to generate contextualized text representations and learn to predict a term weight for each word, with its TF-IDF value as the supervision signal. In this way, we obtain a dynamic term weight scorer, named BERT-Scorer. Based on BERT-Scorer, we can predict each word's contextual term weight in a query and a document. We then generate label distributions for the query-document pairs based on the term weights of their overlapping words and finally train Retriever models with the generated label distributions as auxiliary supervision information.
We have conducted extensive experiments on English and Chinese large-scale retrieval tasks under both standard and cold-start settings. Experimental results show that our approach significantly improves state-of-the-art models and has superiority over alternative label enhancement methods.
Our main contributions are as follows: 1. We propose to exploit query-document relevance degrees to improve embedding-based Retriever models. To the best of our knowledge, this work is the first investigation of leveraging label enhancement to characterize relevance degrees and incorporate them into Retriever models.
2. By designing a novel dynamic term-weight scorer that integrates contextual BERT representation and static TF-IDF information, we achieve a novel and effective label enhancement method that automatically generates label distributions for the retrieval tasks.
3. Our method significantly outperforms state-of-the-art models and its counterparts equipped with alternative label enhancement techniques on English and Chinese large-scale retrieval tasks under both standard and cold-start settings.

Background and Related Work

BERT-style Retriever and Ranker
Large-scale retrieval is usually solved in two steps. The retrieval phase (Retriever) first reduces the solution space, returning a subset of candidate documents. The ranking phase (Ranker) then re-ranks these documents (Chang et al., 2020). While the Ranker has recently witnessed significant advances thanks to BERT-style pre-training on cross-attention models (see the left side of Figure 3) (Padaki et al., 2020; Mass and Roitman, 2020), the retrieval phase, which is the focus of this paper, remains less well studied.
Existing BERT-style Rankers cannot be applied directly to large-scale retrieval problems: since the prediction function f(query, doc) with BERT is a deep bidirectional Transformer model (Vaswani et al., 2017), we cannot afford to run the prediction process for every possible document given a query. Therefore, a BERT-style Retriever employs a multi-tower architecture (see the right side of Figure 3), in which document embeddings can be precomputed offline and then fetched to calculate the final relevance score efficiently. For example, we can deploy inverted-index-based ANN (approximate nearest neighbor) search algorithms (Shrivastava and Li, 2014; Guo et al., 2016) in the Retriever, or employ the Faiss library (Johnson et al., 2017) to quantize the vectors and implement efficient embedding search.
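For concreteness, here is a minimal sketch of the offline-index/online-search split with Faiss; the embedding dimension, index type, and quantization parameters are illustrative choices, not the paper's configuration.

```python
import numpy as np
import faiss  # https://github.com/facebookresearch/faiss

d = 768            # embedding dimension (BERT-base, assumed)
nlist, m = 100, 8  # inverted-list count and product-quantization subquantizers (illustrative)

# Offline: encode all documents once and build a quantized inverted index.
doc_embeddings = np.random.rand(10000, d).astype("float32")  # placeholder for BERT outputs
faiss.normalize_L2(doc_embeddings)  # after L2 normalization, L2 ranking matches cosine ranking
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)
index.train(doc_embeddings)
index.add(doc_embeddings)

# Online: embed the query and fetch the top-k nearest documents.
query_embedding = np.random.rand(1, d).astype("float32")  # placeholder for BERT output
faiss.normalize_L2(query_embedding)
distances, doc_ids = index.search(query_embedding, 100)
```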
As a representative BERT-style Retriever, Reimers and Gurevych (2019) use siamese and triplet network structures based on BERT to derive semantically meaningful sentence embeddings that can be compared using cosine similarity. Some researchers further improve performance by introducing external knowledge or data. For example, Chang et al. (2020) build a two-tower Transformer model with more pre-training data, which significantly outperforms the widely used BM25 algorithm. Lu et al. (2020) distill knowledge from a larger BERT into a two-tower network for efficient retrieval. Liu et al. (2021) build a four-tower BERT model that leverages the distances between simple negative and hard negative instances for embedding-based large-scale retrieval.

Label Distribution and Label Enhancement
The process of generating label distributions from hard labels is defined as label enhancement (LE). LE has achieved remarkable results in many fields, e.g., computer vision (Gao et al., 2020; Xu et al., 2020) and biological information classification (Lv et al., 2019). Knowledge distillation from the deep learning community (Hinton et al., 2015) is another way to generate label distributions, also known as soft labels. The distillation process mainly refers to using the prediction scores (e.g., SoftMax logits) of pre-trained models as auxiliary objectives. We focus on embedding-based large-scale retrieval problems and are, to our knowledge, the first to incorporate label enhancement into this field. It is worth noting that the primary purpose of LE is to incorporate possibility (or uncertainty) into the original hard labels to facilitate model performance, rather than to recover the ground-truth label distribution.

The Proposed Approach
Formally, the training data consist of triples $(x_i, y_i, l_i)$, where the hard label $l_i \in \{0, 1\}$ denotes whether a query $x_i$ and a document $y_i$ are relevant or not. Our proposed LE method automatically generates a label distribution $d_i$ for each query-document pair $(x_i, y_i)$, which is further introduced to assist retrieval tasks. The details are presented in the following subsections.

Initial Term Weights
Given a positive training instance $(x_i, y_i, l_i = 1)$, where $x_i$ contains $n$ tokens $\{w_1, w_2, \ldots, w_n\}$, proper term weights should reflect whether a term $w_j$ is essential to the document or not. We propose to generate initial term weights by the TF-IDF method as follows:

$$t_{i,j} = \eta_{w_j, y_i} \cdot \log \frac{|Y|}{\eta_{w_j, Y}}$$

where $t_{i,j}$ is the term weight of $w_j$ in $x_i$ corresponding to $y_i$, and $\eta_{w_j, y_i}$ equals the number of times $w_j$ appears in document $y_i$. $Y$ is the set of all documents, and $\eta_{w_j, Y}$ equals the number of documents in $Y$ in which $w_j$ appears.

Figure 4: BERT is first adopted to generate contextualized representations. A linear regression layer is then used to estimate term weights for each token, with the corresponding TF-IDF scores as supervision signals. Two concrete queries are used as examples. Based on TF-IDF, the word "human" in $q_2$ can be easily identified as a critical term. Since the second "the$_2$" in $q_1$ has a similar context to "human", we can predict a more reasonable weight for "the$_2$" by incorporating TF-IDF into contextualized representations.
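A small sketch of this computation is given below (hypothetical helper names; the collection is assumed to be a list of token sets, one per document):

```python
import math
from collections import Counter

def initial_term_weights(query_tokens, doc_tokens, all_docs):
    """Target term weights t_{i,j} for a positive pair, following the
    TF-IDF form above: term frequency in the paired document times
    inverse document frequency over the whole collection."""
    n_docs = len(all_docs)
    doc_tf = Counter(doc_tokens)
    weights = []
    for w in query_tokens:
        tf = doc_tf[w]                          # eta_{w, y_i}
        df = sum(w in d for d in all_docs)      # eta_{w, Y}
        idf = math.log(n_docs / df) if df > 0 else 0.0
        weights.append(tf * idf)
    return weights
```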

BERT-Scorer
Traditional term weighting methods such as TF-IDF are based on statistical features of documents. They produce static, context-free term weights and fail to capture complex semantic features. To estimate the importance of a word in a specific text, the most critical problem is to generate features that characterize the word's relationship to its context. Recent contextualized neural language models like BERT have been shown to capture such properties effectively through a deep neural network (Dai and Callan, 2019).
As shown in Figure 4, for the example sentence $q_1$ "What does the word 'the' mean", the first "the$_1$" is a definite article while the second "the$_2$" is used as a noun. Another example sentence, $q_2$, is "What does the word 'human' mean", which has the same context as the first sentence except for the keyword "human". Although the TF-IDF scores of "the$_1$" and "the$_2$" are equal, most words that share a similar context with "the$_2$" (e.g., the word "human" in $q_2$) are given reasonable TF-IDF scores. BERT can generate contextualized representations that characterize a word's syntactic and semantic role in a given context. In this way, we obtain relatively similar contextual embeddings for these words and hence predict similar scores (e.g., 0.92 for "human" and 0.89 for "the$_2$" according to actual BERT-Scorer predictions).
Based on BERT, we build a regression model named BERT-Scorer to generate dynamic, context-aware term weights for queries and documents. Given the query $x$ with $n$ tokens $\{w_1, w_2, \ldots, w_n\}$, BERT is first adopted to encode the word sequence into a sequence of continuous representations:

$$\{h_1, h_2, \ldots, h_n\} = \mathrm{BERT}(\{w_1, w_2, \ldots, w_n\})$$

A linear regression layer is then used to estimate the term weight for each word $w_i$:

$$\hat{t}_i = W h_i + b$$

where $W$ and $b$ are model parameters. Under such circumstances, BERT-Scorer can effectively discriminate "the$_1$" from "the$_2$" according to the differences between $h_{\mathrm{the}_1}$ and $h_{\mathrm{the}_2}$: "human" and "the$_2$" receive similar weights, while the weight of "the$_1$" is much smaller.
During training, the initial TF-IDF term weights are used as supervision signals. The optimization objective is the mean squared error (MSE) between the predicted weights $\hat{t}$ and the target weights $t$:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{n} \sum_{i=1}^{n} \left( \hat{t}_i - t_i \right)^2$$

Note that tokens with negative predicted term weights are recognized as insignificant and are thus discarded in the following steps.
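A minimal PyTorch sketch of BERT-Scorer follows, assuming the Hugging Face transformers library; the model name, tokenization, and target alignment are illustrative rather than the paper's exact setup.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class BertScorer(nn.Module):
    """BERT encoder followed by a per-token linear regression head."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.regressor = nn.Linear(self.bert.config.hidden_size, 1)  # W, b

    def forward(self, input_ids, attention_mask):
        h = self.bert(input_ids=input_ids,
                      attention_mask=attention_mask).last_hidden_state
        return self.regressor(h).squeeze(-1)  # predicted weight per token

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
scorer = BertScorer()
batch = tokenizer(["What does the word 'the' mean"], return_tensors="pt")
pred = scorer(batch["input_ids"], batch["attention_mask"])

# One training step: MSE against TF-IDF targets (placeholder tensor here;
# real targets must be aligned from words to wordpiece tokens).
target = torch.zeros_like(pred)
loss = nn.functional.mse_loss(pred, target)
loss.backward()
```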

Adaptation For Chinese
BERT-Scorer estimates weights for word-level terms, while existing pre-trained BERT-style models for Chinese operate at the character level. To bridge the gap, we evenly distribute the weight of a word to each of its characters. Besides, we utilize the position of each character within its word by tagging characters via the widely used BMES (Begin, Middle, End, and Single) schema and incorporating a BMES embedding into BERT's input representation.
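A small sketch of the word-to-character weight distribution and BMES tagging (hypothetical helper; words are assumed to come from a Chinese word segmenter):

```python
def word_to_char_weights(words, word_weights):
    """Evenly split each word's weight over its characters and tag each
    character with its position in the word via the BMES schema."""
    char_weights, bmes_tags = [], []
    for word, weight in zip(words, word_weights):
        share = weight / len(word)
        for pos in range(len(word)):
            char_weights.append(share)
            if len(word) == 1:
                bmes_tags.append("S")        # single-character word
            elif pos == 0:
                bmes_tags.append("B")        # begin
            elif pos == len(word) - 1:
                bmes_tags.append("E")        # end
            else:
                bmes_tags.append("M")        # middle
    return char_weights, bmes_tags

# e.g. word_to_char_weights(["机器", "学习", "好"], [0.8, 0.6, 0.1])
# -> ([0.4, 0.4, 0.3, 0.3, 0.1], ["B", "E", "B", "E", "S"])
```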

Label Distribution Generation
After BERT-Scorer generates term weights for query $x_i$ and document $y_i$ respectively, we calculate the label distribution $d_i$ based on the term weights of their overlapping words.
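As a hedged sketch of this step (this particular form is an assumption, not the paper's exact equation), one natural instantiation is a soft coverage score, where $\hat{t}^{\,x_i}_{w}$ and $\hat{t}^{\,y_i}_{w}$ denote the BERT-Scorer weights of token $w$ in the query and the document:

$$d_i = \frac{\sum_{w \in x_i \cap y_i} \big( \hat{t}^{\,x_i}_{w} + \hat{t}^{\,y_i}_{w} \big)}{\sum_{w \in x_i} \hat{t}^{\,x_i}_{w} + \sum_{w \in y_i} \hat{t}^{\,y_i}_{w}}$$

Since negative weights are discarded, any such score lies in $[0, 1]$, equals 0 when the pair shares no terms, and grows with the total weight of the overlapping terms.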

Retriever Models Utilizing Label Enhancement
We exploit a two-tower BERT-style Retriever model in this paper, as Figure 3 (b) shows. Each tower exactly follows the architecture and hyper-parameters of the 12-layer BERT model (https://github.com/google-research/bert), except that the sequence length is set to 64. An average-pooling operation is applied to the output of BERT to produce the final representations of the query and the document ($u$ and $v$ respectively). Finally, the output score $f$ is calculated from the cosine between $u$ and $v$:

$$f(x_i, y_i) = \cos(u, v) = \frac{u \cdot v}{\|u\| \, \|v\|}$$

We incorporate the generated label distributions into the Retriever model as auxiliary supervision information. Given the training data with both hard labels and label distributions, $\{(x_i, y_i, l_i, d_i)\}_{i=1}^{N}$, the model parameters are estimated by minimizing the following loss function:

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \left[ (1 - \alpha) \left( f(x_i, y_i) - l_i \right)^2 + \alpha \left( f(x_i, y_i) - d_i \right)^2 \right]$$

where $\alpha \in [0, 1]$ denotes the loss weight of the label distribution, used as a trade-off to obtain a suitable fitting target.
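A condensed PyTorch sketch of the two-tower scoring and the combined loss is shown below; the model name and batching details are illustrative, and the MSE form of both loss terms follows the reconstruction above, which is an assumption consistent with the regression framing.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class TwoTowerRetriever(nn.Module):
    """Two BERT towers with average pooling and cosine scoring."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.query_tower = BertModel.from_pretrained(model_name)
        self.doc_tower = BertModel.from_pretrained(model_name)

    def encode(self, tower, input_ids, attention_mask):
        h = tower(input_ids=input_ids,
                  attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        return (h * mask).sum(1) / mask.sum(1)  # average pooling over real tokens

    def forward(self, q_ids, q_mask, d_ids, d_mask):
        u = self.encode(self.query_tower, q_ids, q_mask)
        v = self.encode(self.doc_tower, d_ids, d_mask)
        return nn.functional.cosine_similarity(u, v)  # score f per pair

def le_loss(scores, hard_labels, label_dist, alpha=0.2):
    """Convex combination of the hard-label and label-distribution objectives."""
    return ((1 - alpha) * (scores - hard_labels) ** 2
            + alpha * (scores - label_dist) ** 2).mean()
```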

Datasets
Following Chang et al. (2020), we consider the Retrieval Question-Answering (ReQA) benchmark proposed by Ahmad et al. (2019). We use SQuAD (Rajpurkar et al., 2016) and Natural Questions (Kwiatkowski et al., 2019) for English, and CMRC 2018 (Cui et al., 2019) and DRCD (Shao et al., 2018) for Chinese. Note that Ahmad et al. (2019) target the Ranker, while our goal is to improve the Retriever; therefore, our results are not directly comparable to those reported in their paper.
Each entry of the QA datasets is a tuple (q, a, p), where q is the question, a is the answer span, and p is the evidence passage containing a. Following Ahmad et al. (2019) and Liu et al. (2021), we split each passage into sentences p = s 1 s 2 ...s n . For a query q, we need to retrieve the correct sentence from a candidate set consisting of the sentences of all passages. A query-sentence pair (q, s) is labeled 1 if s is the sentence containing the corresponding answer span, and 0 otherwise. This problem is more challenging than retrieving the evidence passage alone due to the larger number of candidates.
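A sketch of the query-sentence pair construction (the sentence splitter and field layout are assumptions):

```python
def build_pairs(qa_entries, split_sentences):
    """Turn (question, answer_span, passage) tuples into labeled
    query-sentence pairs: 1 if the sentence contains the answer, else 0."""
    pairs = []
    for question, answer, passage in qa_entries:
        for sentence in split_sentences(passage):
            label = 1 if answer in sentence else 0
            pairs.append((question, sentence, label))
    return pairs
```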
For each dataset, the training/test split is 60%/20%, and 20% of the training set is held out as the validation set for hyper-parameter tuning. We apply four-fold cross-validation for significance tests.

Baselines
We compare our method against the following six baselines. The first four are existing widely used large-scale Retriever models, and the latter two are models equipped with alternative label enhancement methods.
• F-EBR is the most widely used word-embedding-based multi-tower Retriever model, proposed by Facebook Search (Huang and Sharma, 2020).
• LE-TFIDF is a variant of our method in which the label distribution is generated based on static TF-IDF weights.
• LE-Distill is another variant in which the label distribution is set to the prediction scores of SBERT. This method is similar to the self-distillation process in born-again networks (Furlanello et al., 2018).
For the convenience of comparison, we refer to our Label Enhancement method based on BERT Scorer as LE-BS.

Evaluation Metric
Since the goal of the retrieval phase is to capture the positives in the top-k results, we select Recall@k as the evaluation metric:

$$\mathrm{Recall@}k = \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \mathbb{1}\left[ y_i \in R_k(x_i) \right]$$

where $R_k(x_i)$ is the set of top-$k$ results recalled by our model for query $x_i$, $D$ is the dataset, and $x_i$ and $y_i$ are the $i$-th query and its relevant document respectively.
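Equivalently, in code (hypothetical helper; retrieve_top_k is assumed to return the model's top-k documents for a query):

```python
def recall_at_k(dataset, retrieve_top_k, k=100):
    """Fraction of queries whose gold document appears in the top-k results."""
    hits = 0
    for query, gold_doc in dataset:
        if gold_doc in retrieve_top_k(query, k):
            hits += 1
    return hits / len(dataset)
```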

Comparison with Retriever Models
The experimental results are shown in Table 1, from which we make three observations:

1. Term weighting methods perform exceptionally well on the SQuAD benchmark, as the data collection process and human annotations of this dataset are biased towards question-answer pairs with overlapping tokens. They perform poorly on the Natural Questions dataset, where there are fewer overlapping tokens and embedding-based models perform well. Our LE-BS combines the advantages of term weighting and embedding-based methods and performs well on all datasets.
2. As expected, LE-BS and SBERT outperform F-EBR by a large margin, since pre-trained language models yield much more robust representations than word embeddings.

3. LE-BS further achieves significant improvement over SBERT. LE-BS can be viewed as an enhanced SBERT variant that incorporates label enhancement. We observe the improvement of LE-BS over SBERT on both English and Chinese datasets, verifying that the label distributions generated by our BERT-Scorer provide helpful supervision signals for Retriever models in a language-independent manner.

Impact of Label Distribution
We further investigate why label distribution brings the recall improvement observed above. Taking the SQuAD dataset as an example, we collect the predicted distance scores of all testing pairs. We split the range (0, 1] into ten equal sub-ranges, (0, 0.1], (0.1, 0.2], ..., (0.9, 1], and count the proportion of pairs whose scores fall in each sub-range. The statistics for the three multi-tower models are shown in Figure 5. From Figure 5, we find that the distance scores of most testing pairs are close to 1, a natural result since most testing pairs are labeled irrelevant by the hard labels. Compared with F-EBR and SBERT, the curve of LE-BS is much smoother, meaning more pairs have a smaller query-document distance. We attribute this to the supplementary training objective of fitting the label distribution in addition to the 0-1 hard label. The shape of LE-BS's curve partly explains why LE-BS achieves much better recall scores. In other words, we can safely conclude that with label distribution, LE-BS identifies more relevant candidates without introducing too many false positives. Note that better recall is a fundamental goal of the Retriever, because we want to feed the Ranker with as many relevant candidates as possible.
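The binning described above amounts to a ten-bucket histogram over (0, 1]; a minimal numpy sketch:

```python
import numpy as np

def score_proportions(scores, n_bins=10):
    """Proportion of query-document pairs whose distance score falls in
    each sub-range (0, 0.1], (0.1, 0.2], ..., (0.9, 1] (boundary handling
    here follows numpy's half-open bins, close enough for this analysis)."""
    scores = np.asarray(scores)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    counts, _ = np.histogram(scores, bins=edges)
    return counts / len(scores)
```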

Analysis of Label Enhancement Method
The intuition behind our label enhancement method in retrieval scenarios is to incorporate prior knowledge from static term weighting methods into dynamic contextual embeddings. To verify its superiority, we compare against two alternative label enhancement techniques. The empirical results are shown in Table 2. For clarity of comparison, we also report the performance of SBERT, listed as LE-None to indicate that no LE method is employed.
To further analyze the effectiveness of label enhancement, we consider two settings for each dataset. The first is the standard setting, where the training/test split is 60%/20% and 20% of the training set is held out as the validation set. The second is the cold-start setting, which assumes there is not enough training data; the only difference from the standard setting is that the training/test split is 20%/60%. We make the following five observations:

1. All LE-based models outperform the LE-None model, which clearly verifies the effectiveness of label distributions for the retrieval task.
2. The improvement of LE-TFIDF over LE-None shows that static TF-IDF weights serve as beneficial prior knowledge to characterize label distribution.
3. LE-Distill also achieves notable improvements. This observation is consistent with other knowledge distillation work (Hinton et al., 2015; Furlanello et al., 2018): the self-distillation process brings valuable dark knowledge via the generated soft prediction scores, even without utilizing TF-IDF information.
4. The relative performance improvement brought by LE under the cold-start setting is more evident than under the standard setting. The likely reason is that relevance degree information plays a more important role when training data are scarce. This observation is also consistent with other data-scarce scenarios where label distributions are used (e.g., knowledge distillation (Hinton et al., 2015)).
5. Our LE-BS shows clear superiority over LE-Distill and LE-TFIDF across all datasets under both the standard and cold-start settings. Rather than predicting relevance scores directly as LE-Distill does, LE-BS predicts dynamic term weights with BERT-Scorer, incorporating useful TF-IDF information into contextual BERT representations. The final generated label distribution therefore integrates explicit prior TF-IDF knowledge, while some helpful "dark" knowledge (Furlanello et al., 2018) is produced during training. We believe this is the main reason behind the superiority of our method.

Collaboration between Label Distribution and Hard Label
As a critical hyper-parameter of our LE-BS method, α controls how the optimization objectives of hard labels and label distributions are weighted. This section investigates the collaboration between hard labels and label distributions under different α settings, providing more systematic guidance on how to incorporate label distributions.
We train LE-BS with α set to 0, 0.2, 0.5, 0.8, and 1, respectively, and report results on the SQuAD and Natural Questions datasets. Note that setting α to 0 means using only hard labels, and setting α to 1 means using only label distributions. The experimental results are shown in Table 3, from which we find that tuning α is essential: different values of α can cause recall variations of 5%-10%.
Under the standard setting, we find that a larger α makes LE-BS perform exceptionally well on the SQuAD benchmark. Note that the data collection process and human annotations of SQuAD are biased towards question-answer pairs with overlapping tokens (Rajpurkar et al., 2016); we can naturally expect the generated label distributions to better characterize query-document relevance degrees on SQuAD, thanks to BERT-Scorer's capability to identify overlapping, highly representative tokens. On the Natural Questions dataset, LE-BS performs best with α set to 0.2. This dataset is built from Google search logs, so the connections between queries and documents are more challenging to capture. In this scenario, relying too much on the supervision signal from the generated label distributions can introduce unreasonable noisy information and thereby hinder model performance.
Under the cold-start setting, models with a larger α consistently achieve better performance. In such data-scarce scenarios, models cannot obtain sufficient supervision from the training sets' hard labels; when α becomes larger, more auxiliary supervision from the label distribution is utilized. Though this is a rough explanation for the observation, it can serve as practical guidance for information retrieval researchers and engineers.

Conclusion
This paper introduced label distribution to characterize the relevance degree between queries and documents in large-scale retrieval problems for the first time. We designed a novel and effective label enhancement method that generates label distributions by fusing context-free TF-IDF information with contextual BERT representations. An improved Retriever model is then obtained simply by incorporating the generated label distributions as auxiliary supervision information. The superiority of our method is observed on four English and Chinese datasets.