Contextualized Query Embeddings for Conversational Search

This paper describes a compact and effective model for low-latency passage retrieval in conversational search based on learned dense representations. Prior to our work, the state-of-the-art approach used a multi-stage pipeline comprising conversational query reformulation and information retrieval modules. Despite its effectiveness, such a pipeline often includes multiple neural models that require long inference times. In addition, independently optimizing each module ignores dependencies among them. To address these shortcomings, we propose to integrate conversational query reformulation directly into a dense retrieval model. To aid in this goal, we create a dataset with pseudo-relevance labels for conversational search to overcome the lack of training data and to explore different training strategies. We demonstrate that our model effectively rewrites conversational queries as dense representations in conversational search and open-domain question answering datasets. Finally, after observing that our model learns to adjust the L2 norm of query token embeddings, we leverage this property for hybrid retrieval and to support error analysis.


Introduction
With the growing popularity of virtual assistants (e.g., Alexa and Siri), information seeking through dialogues has attracted many researchers' attention. To facilitate research on conversational search (ConvS), Dalton et al. (2019) organized the TREC Conversational Assistance Track (CAsT) and defined ConvS as the task of iteratively retrieving passages in response to user queries in a conversation session. An example conversation in the CAsT dataset is shown at the top of Figure 1(a).
There are two main challenges for the task of conversational search: (1) User utterances are often ambiguous when treated as stand-alone queries since omission, coreference, and other related linguistic phenomena are common in natural human dialogues. Hence, directly feeding the utterances into IR systems would lead to poor retrieval effectiveness. Understanding queries through conversational context is required. (2) There is limited data regarding conversational search for model training. To address the aforementioned challenges, existing papers (Lin et al., 2021c; Yu et al., 2020; Voskarides et al., 2020; Kumar and Callan, 2020) take a multi-stage pipeline approach. They train a conversational query reformulation (CQR) model using publicly available datasets (Elgohary et al., 2019; Quan et al., 2019) and feed the automatically decontextualized queries to an off-the-shelf IR pipeline (Nogueira and Cho, 2019). However, such ConvS pipelines can be slow (i.e., over 10s per query on GPUs). Furthermore, this design assumes that the reformulated queries are independent of the downstream IR pipeline, which may not be true.

[Figure 1: (a) A multi-stage pipeline with a CAsT example; (b) end-to-end conversational search, where query reformulation is directly incorporated into the IR pipeline, thus enabling end-to-end training.]
In this paper, we study a low-latency end-to-end approach to ConvS. Specifically, we adopt a bi-encoder model and incorporate CQR into the query encoder, illustrated in Figure 1(b). To overcome the challenge of limited training data, we create a dataset with pseudo-relevance labels to guide the query encoder to rewrite conversational queries in latent space directly. One may consider this approach as throwing conversational queries into a black box since the reformulated queries are represented as dense vectors. However, we find that the fine-tuned contextualized query embeddings (CQE) are easily interpretable. They can be transformed into text for failure analysis and can facilitate dense-sparse hybrid retrieval.
Our contributions are summarized as follows: (1) We integrate two tasks in ConvS, query reformulation and dense passage retrieval, into our dense representation learning framework. Due to the lack of human labeled data, we create a dataset with pseudo-relevance labels for model training. We empirically show that our model successfully learns to reformulate conversational queries in a latent representation space. (2) We uncover how CQE learns to reformulate conversational queries in a latent space. Based on this finding, we can easily transform CQE into a text (sparse) representation. We demonstrate that the CQE text representation also performs well on sparse retrieval and can further improve CQE retrieval effectiveness using a hybrid of sparse and dense retrieval. The CQE text also helps us understand why the technique fails or succeeds. (3) We show that the query latency of CQE (without re-ranking) is at least an order of magnitude lower than existing multi-stage ConvS pipelines while yielding competitive retrieval effectiveness. Hence, CQE is superior for integration with other models in downstream tasks. (4) We empirically demonstrate its effectiveness in open-domain conversational question answering in a zero-shot setting.

Preliminaries
Let us define a sequence of conversational queries $Q = (q_1, \cdots, q_{i-1}, q_i)$ for a topic-oriented session $s$, where $q_i$ stands for the $i$-th user query ($i \in \mathbb{N}^+$) in the session. The goal of conversational search is to find the set of relevant passages $P^+_i$ for the user query $q_i$ at each turn, given the conversational context $q_{<i}$. Thus, the task can be formulated as the objective

$$\max_{\theta} \; F^{\mathrm{ConvS}}_{\theta}\big((q_{<i}; q_i),\, p^+\big), \quad p^+ \in P^+_i, \tag{1}$$

where $F^{\mathrm{ConvS}}_{\theta}$ is the function (parameterized by $\theta$) that computes a relevance score between the conversational query $(q_{<i}; q_i)$ and passage $p$.
Since end-to-end training data for conversational search is extremely limited, a common approach is to factorize $F^{\mathrm{ConvS}}_{\theta}$ into a multi-stage pipeline. In a multi-stage pipeline, the components can be tuned with data collected at different stages:

$$F^{\mathrm{ConvS}}_{\theta}\big((q_{<i}; q_i),\, p\big) \approx F^{\mathrm{ir}}_{\phi}\big(q^*_i,\, p\big), \qquad q^*_i = F^{\mathrm{cqr}}_{\varphi}(q_{<i}; q_i), \tag{2}$$

where $q^*_i$ is the stand-alone oracle query that best represents the user's information need given the context $(q_{<i}; q_i)$; $F^{\mathrm{ir}}_{\phi}$ and $F^{\mathrm{cqr}}_{\varphi}$ denote the components of information retrieval (IR) and conversational query reformulation (CQR), respectively. Thus, Eq. (1) can be approximated by separately maximizing $F^{\mathrm{ir}}_{\phi}$ and $F^{\mathrm{cqr}}_{\varphi}$. For example, we can reuse the representative ad hoc retrieval pipeline comprising BM25 + BERT re-ranking for $F^{\mathrm{ir}}_{\phi}$ and then conduct the CQR task for $F^{\mathrm{cqr}}_{\varphi}$. Specifically, the most common current approach (Lin et al., 2021c; Voskarides et al., 2020; Vakulenko et al., 2020; Kumar and Callan, 2020; Yu et al., 2020) is to fine-tune a pretrained language model (LM) supervised by decontextualized queries manually rewritten by humans, and then use the fine-tuned LM to reformulate user queries for BM25 retrieval and BERT re-ranking, as illustrated in Figure 1(a). While effective, this approach has two limitations: (1) although mimicking the way humans rewrite queries is reasonable, it assumes that the optimal decontextualized queries are the manually rewritten ones, which may not be true; (2) the CQR and IR modules rely on computationally demanding pretrained LMs; thus, when combined, they are often too slow for practical applications.

Figure 2: Our contextualized query token embeddings can be used both for dense and sparse retrieval. The left side illustrates CQE for dense retrieval by average pooling of token embeddings. The right side shows that the token embeddings can be used to select tokens from the context to form a decontextualized bag-of-words query for sparse retrieval.

Bi-encoder Model
Recently, dense passage retrieval based on bi-encoders (Reimers and Gurevych, 2019; Karpukhin et al., 2020; Xiong et al., 2021; Lin et al., 2021b) has attracted the attention of researchers due to its good balance between efficiency and effectiveness. Bi-encoder models are trained to encode queries and passages in a shared latent space. At query time, only query texts are encoded to search for the nearest passage embeddings, which are precomputed by the passage encoder. Formally speaking, the relevance score $\phi$ of a given query $q_i$ (with its context $q_{<i}$) and a passage $p$ is computed as the dot product of their embeddings:

$$\phi\big((q_{<i}; q_i),\, p\big) = E^{(q_{<i}; q_i)}_{\theta} \cdot E^{p}_{\theta}, \tag{3}$$

where $E^{(\cdot)}_{\theta} \in \mathbb{R}^h$ is the BERT representation of the input text, which can be the average or maximum pooling over token embeddings or a specific token embedding (e.g., the [CLS] embedding in BERT), and $\theta$ represents the parameters of BERT.
In this study, we adopt average pooling over token embeddings, which lets us interpret CQE easily, as we will discuss later. Thus, we can formulate conversational search as maximizing the following log likelihood:

$$\log \frac{\exp\big(\phi((q_{<i}; q_i), p^+)/\tau\big)}{\sum_{p \in D} \exp\big(\phi((q_{<i}; q_i), p)/\tau\big)}, \tag{4}$$

where $\tau$ denotes the temperature parameter and $D$ is the set of passages comprising the corpus. In practice, $D$ is replaced by the subset $D_B$, consisting of the passages in a training batch, i.e., the positive and negative passages from all the queries in the same batch. With Eq. (4), the optimization problem of Eq. (1) can be approached by end-to-end representation learning, which can be interpreted as projecting a conversational query $E^{(q_{<i}; q_i)}_{\theta}$ into the latent space such that it has maximum dot product with its relevant passage $p^+$.
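To make Eq. (4) concrete, the following is a minimal NumPy sketch of the in-batch objective: each query's positive passage is the other rows' negative, and the loss is the negative log likelihood of the positives. The function name and array shapes are ours for illustration, not from the paper.

```python
import numpy as np

def in_batch_nll(q_emb, p_emb, tau=1.0):
    """In-batch contrastive loss in the spirit of Eq. (4).

    q_emb: (B, h) query embeddings; p_emb: (B, h) passage embeddings,
    where p_emb[i] is the positive for query i and every other row in
    the batch serves as an in-batch negative (the D_B approximation).
    """
    scores = q_emb @ p_emb.T / tau                       # (B, B) dot products
    scores -= scores.max(axis=1, keepdims=True)          # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.diag(log_probs).mean()                    # NLL of positives
```

With well-separated embeddings the loss approaches zero; with uninformative (all-equal) scores it equals log B, the entropy of a uniform guess over the batch.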

Contextualized Query Embeddings
Given the conversational context and query tokens $(q^1_{<i} \cdots q^j_{<i}, q^1_i \cdots q^k_i)$, we define contextualized query embeddings (CQE) formally as $E^{\mathrm{cqe}}_{\theta} \in \mathbb{R}^{(j+k) \times h}$ based on BERT's contextualized token embeddings:

$$E^{\mathrm{cqe}}_{\theta} = \mathrm{BERT}\big(q^1_{<i} \cdots q^j_{<i},\, q^1_i \cdots q^k_i\big). \tag{5}$$

Here, we take the last layer's hidden representations from BERT. From $E^{\mathrm{cqe}}_{\theta}$, a single-vector query embedding can be computed by average pooling the query token embeddings:

$$E^{(q_{<i}; q_i)}_{\theta} = \frac{1}{j+k} \sum_{l=1}^{j+k} E^{\mathrm{cqe}}_{\theta}(l). \tag{6}$$

We then use $E^{(q_{<i}; q_i)}_{\theta}$ to conduct nearest neighbor search for the top-k passages in the corpus using an off-the-shelf library (Facebook's Faiss, in our case), as shown in Figure 2 (left).
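The pooling of Eq. (6) and the subsequent search can be sketched as below. Faiss (e.g., an IndexFlatIP index) performs the inner-product search at scale; here a brute-force NumPy equivalent stands in for it, and the function names and toy shapes are our own.

```python
import numpy as np

def cqe_pool(token_embs):
    """Average-pool contextualized token embeddings (Eq. (6)) into a
    single query vector. token_embs: (j + k, h)."""
    return token_embs.mean(axis=0)

def top_k_passages(query_vec, passage_matrix, k=3):
    """Brute-force maximum-inner-product search over precomputed
    passage embeddings; Faiss does the same thing efficiently.
    passage_matrix: (N, h). Returns indices and scores, best first."""
    scores = passage_matrix @ query_vec
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]
```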

Interpreting CQE
While condensing a multi-stage pipeline into single-stage dense retrieval is attractive, it may be difficult to interpret (i.e., we cannot examine the reformulated queries). In this subsection, we explain how to interpret CQE. With Eq. (6), we rewrite Eq. (3) as the average dot product of each token embedding $E^{\mathrm{cqe}}_{\theta}(l)$ and a single-vector passage embedding $E^{p}_{\theta}$:

$$\phi\big((q_{<i}; q_i),\, p\big) = \frac{1}{j+k} \sum_{l=1}^{j+k} \big\|E^{\mathrm{cqe}}_{\theta}(l)\big\|_2 \cdot \hat{E}^{\mathrm{cqe}}_{\theta}(l) \cdot E^{p}_{\theta}, \tag{7}$$

where $\hat{E}^{\mathrm{cqe}}_{\theta}(l)$ is a unit vector. Intuitively, to maximize Eq. (4), CQE can learn to adjust the L2 norm of $E^{\mathrm{cqe}}_{\theta}(l)$ when we freeze the passage embeddings. To be more specific, it appears that CQE learns to increase the L2 norm for relevant query-passage pairs and decrease it otherwise. Thus, we can consider the L2 norm of each token embedding as its term importance for retrieving relevant passages.

[Figure 3: L2 norm distribution of context token embeddings before and after fine-tuning, normalized by the mean of L2 norms among the context tokens. After fine-tuning, the L2 norms of non-relevant context tokens (e.g., "start", "end") decrease, and those of relevant tokens (e.g., "neolithic") increase.]
For the example in Figure 2, we empirically analyze the query token embeddings of our CQE model. Figure 3 shows the normalized L2 norm for the context of the user query ("why did it start?"). We observe that after fine-tuning, the terms "neolithic" and "revolution" show greater L2 norms than the others. On the other hand, the L2 norms for the terms "start" and "end" decrease.
With this observation, we can use CQE to generate decontextualized queries. Specifically, inspired by term weighting ideas from prior work, we conduct query expansion by selecting terms from the context whose token embeddings satisfy $\|E^{q_{<i}}_{\theta}(\cdot)\|_2 \geq \gamma$, where $\gamma$ is a hyperparameter, and concatenating them to the user query $q_i$, as illustrated on the right side of Figure 2. Note that the decontextualized queries generated by CQE are bag-of-words sets rather than fluent natural language queries. However, in Section 5, we show that the decontextualized queries can be used for sparse retrieval and even for conducting failure analysis.
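The selection step above amounts to thresholding token-embedding norms. A minimal sketch, with hypothetical function and variable names of our own choosing:

```python
import numpy as np

def decontextualize(context_tokens, context_embs, user_query_tokens, gamma):
    """CQE-sparse style query expansion: keep context terms whose
    token-embedding L2 norm (a proxy for term importance) meets the
    threshold gamma, and append them to the current user query to
    form a bag-of-words query for sparse retrieval.

    context_embs: (j, h) token embeddings for the context tokens.
    """
    norms = np.linalg.norm(context_embs, axis=1)
    expansion = [t for t, n in zip(context_tokens, norms) if n >= gamma]
    return user_query_tokens + expansion
```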

Training Data and Strategies
In this section, we first introduce how we create weakly supervised training data for conversational search. Then, we discuss some possible strategies to fine-tune CQE.
Weakly supervised training data. Adopting the idea of pseudo-labeling, we create weakly supervised training data for end-to-end conversational search. Human-rewritten queries exist that can help models learn to decontextualize queries in conversation; however, only limited relevance labels are available for end-to-end conversational search, as shown in Table 2. Hence, we combine three existing resources to train our model with weak supervision: (1) the human-rewritten queries in the CANARD dataset, (2) the CAsT passage collection, and (3) a strong ad hoc ranking model (ColBERT). To combine the three resources, we make a simple assumption: decontextualized queries can be paired with relevant passages selected by "good enough" ad hoc retrieval models. Thus, for each human-reformulated query in the CANARD dataset, we retrieve 1000 candidate passages from the CAsT collection using BM25 and then re-rank them using ColBERT. We assume the top-3 passages are relevant for each query, while treating the rest as non-relevant.
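The labeling pipeline above can be sketched as follows. The scorer callables stand in for Pyserini's BM25 and a ColBERT re-ranker; the function name and signature are hypothetical, ours for illustration only.

```python
def pseudo_label(rewritten_query, corpus, bm25_score, colbert_score,
                 k_candidates=1000, k_positive=3):
    """Weak-supervision sketch: retrieve candidates with a first-stage
    scorer (BM25), re-rank with a stronger scorer (ColBERT), and treat
    the top k_positive passages as pseudo-relevant, the rest as not.

    bm25_score / colbert_score: callables (query, passage) -> float.
    """
    # First stage: keep the k_candidates best passages by BM25.
    candidates = sorted(corpus, key=lambda p: -bm25_score(rewritten_query, p))
    candidates = candidates[:k_candidates]
    # Second stage: re-rank candidates with the stronger model.
    reranked = sorted(candidates, key=lambda p: -colbert_score(rewritten_query, p))
    positives = reranked[:k_positive]
    negatives = reranked[k_positive:]
    return positives, negatives
```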
Bi-encoder warm-up. Training a bi-encoder model for dense retrieval requires a large amount of data, let alone for conversational search. Following previous work on conversational search (Yu et al., 2020; Lin et al., 2021c; Vakulenko et al., 2020), we adopt MS MARCO as our bi-encoder warm-up training dataset, with the training procedure adopted from Lin et al. (2020).
CQE fine-tuning. After bi-encoder warm-up, we fine-tune the query encoder to consume conversational queries and generate contextualized query embeddings. Specifically, for each query q i in our training data, we sample a triplet ([q <i ; q i ], p + , p − ) for fine-tuning, where p + and p − are sampled from positive passages (labeled by ColBERT) and top-200 BM25 passages (without replacement), respectively. Note that, at this stage, we freeze our passage encoder and only fine-tune the query encoder; thus, we can precompute all the passage embeddings in the CAsT corpus, and only encode queries for evaluation. In this work, we further explore different strategies to better train CQE using our weakly supervised training data.
Hard negative mining. Although sampling negatives from BM25 top-k candidates is effective for dense retrieval training, Xiong et al. (2021) demonstrate that hard negatives bring more useful information for training dense retrievers. In this work, we explore whether hard negatives benefit the fine-tuning of our CQE model. Instead of using asynchronous index refreshing, as in the work of Xiong et al. (2021), we sample hard negatives p − from the top-200 passages re-ranked by ColBERT.
Training with soft labels. Due to the strong assumptions we make for weak supervision, using cross entropy for one-hot pseudo-label training may be sub-optimal because our model could be overconfident about its predictions. To address this issue, we use the logits of ColBERT as soft labels to fine-tune CQE to have similar confidence predictions, i.e., knowledge distillation. It is worth noting that we only minimize the KL divergence of softmax normalized dot products with respect to in-batch query-passage pairs without using cross entropy for interpolation, as in the traditional (strongly) supervised setting.
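The soft-label objective above is a KL divergence between softmax-normalized teacher and student score distributions over in-batch query-passage pairs. A minimal sketch, with function names of our own:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    x = np.asarray(x, dtype=float)
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def distill_kl(student_scores, teacher_scores, tau=1.0):
    """KL(teacher || student) between softmax-normalized in-batch
    scores: the teacher's logits (e.g., ColBERT's) act as soft labels
    so the student (CQE) matches the teacher's confidence."""
    p = softmax(np.asarray(teacher_scores) / tau)   # teacher distribution
    q = softmax(np.asarray(student_scores) / tau)   # student distribution
    return float((p * (np.log(p) - np.log(q))).sum())
```

KL divergence is zero when the two distributions agree and positive otherwise, so minimizing it pushes the student's relative scores toward the teacher's.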

Experimental Setup
Datasets. We conduct experiments on the TREC CAsT datasets. TREC organized the Conversational Assistance Tracks (Dalton et al., 2019) to build reusable collections for conversational search. The organizers created relevance judgments (relevance grades 0-4) for each query using assessment pools from participants. In total, three datasets are available: the CAsT19 training and evaluation sets and the CAsT20 evaluation set. 1 The dataset statistics are listed in Table 2. All relevant passages come from the CAsT corpus (consisting of 38M passages). In addition, we demonstrate the generalization of CQE on ORConvQA, an open-domain conversational question answering dataset, in a zero-shot setting, detailed in Section 6.2.
Query reformulation baselines. A reasonable setting for comparing CQE with existing query reformulation models is to directly feed the reformulated queries into a bi-encoder model for dense retrieval. For a fair comparison, we encode the reformulated queries into query embeddings using our pretrained bi-encoder model, which is suitable for stand-alone queries. Note that the passage embeddings in the corpus are the same for CQE and the other models, since we freeze the passage encoder while fine-tuning CQE. We compare CQE with three state-of-the-art conversational query reformulation models and a human baseline, described below:
• Few-Shot Rewriter: Yu et al. (2020) fine-tune the pretrained sequence-to-sequence LM GPT2-medium on CAsT manually reformulated queries and synthetic queries created by a rule-based method. For the CAsT19 and CAsT20 eval sets, we directly use their publicly released queries. 2
• QuReTeC: Voskarides et al. (2020) conduct query expansion using BERT-large as a term classifier, which is fine-tuned on the CANARD dataset. We directly use the reformulated queries provided by the authors. 2
• NTR (T5): Lin et al. (2021c) fine-tune the pretrained sequence-to-sequence LM T5-base on the CANARD dataset. Following their work, we use the released model 2 with beam-search inference (beam width 10).
• Humans: We also conduct experiments on the manually reformulated queries provided by the TREC CAsT organizers as a reference.
Since CQE can be used to decontextualize conversational queries, as discussed in Section 3.3, we also apply CQE-reformulated queries (denoted CQE-sparse) to sparse retrieval (after conversion to text). The optimal L2 threshold γ (10.5) is tuned on the CAsT19 training set.

Model details. We fine-tune CQE using BERT-base for 10K steps with batch size 96 and learning rate 7 × 10−6 on all queries (training, dev, and test) in the CANARD dataset (see Table 1), and use the CAsT19 training set as our development set. In our main experiments, we use our best training strategy, combining hard negative mining and soft labeling (see the ablation study in Section 6.3). We perform dense retrieval using Faiss (Johnson et al., 2017) (Faiss-GPU, brute force) and sparse retrieval using Pyserini (Lin et al., 2021a) (BM25, k1 = 0.82, b = 0.68). In addition, we measure the latency of conversational query reformulation for each model (Latency). For CQE, we report the latency of generating the contextualized query embeddings. Note that for encoder-only models (BERT), we set the maximum input length to 150, while for decoder-only and encoder-decoder models (GPT and T5), we further set the maximum output length to 32 and use greedy-search decoding. All latency measurements are taken on Google Colab using a single GPU (12 GB NVIDIA Tesla K80). Finally, we report the size of each model (# Params).
Evaluation metrics. Following Dalton et al. (2020), for each approach we compare overall retrieval effectiveness using nDCG and recall (at cutoff 1000), and top-ranking accuracy using nDCG@3. For recall, we take relevance grade ≥ 2 as positive. The evaluation is conducted using the trec_eval tool. In addition, for each model, we report the number of queries that win (tie) against manual queries on nDCG@3. All significance tests are conducted with paired t-tests (p < 0.05).
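As a concrete reference, nDCG@k over graded relevance can be computed as below. This is a minimal sketch of one common variant (linear gain, log2 discount); trec_eval's exact settings may differ, and the function names are ours.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2 position discount;
    position 1 gets discount log2(2) = 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(ranked_gains, all_gains, k=3):
    """nDCG@k: DCG of the system ranking's top k, divided by the DCG
    of the ideal (descending) ordering of all judged gains."""
    ideal = sorted(all_gains, reverse=True)[:k]
    denom = dcg(ideal)
    return dcg(ranked_gains[:k]) / denom if denom > 0 else 0.0
```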

Results on CAsT
First-stage retrieval comparisons. Table 3 reports the sparse and dense retrieval effectiveness of various methods. Overall, dense retrieval yields better effectiveness than BM25 retrieval. Observing the first block in Table 3, CQE-sparse yields reasonable effectiveness compared to the other CQR models, indicating that CQE can be well represented with text. As for dense retrieval, CQE is able to beat the other CQR models. Although NTR (T5) and CQE yield comparable top-ranking accuracy, it is worth mentioning that unlike CQE, the other CQR modules are built independently. Thus, when incorporated with dense retrieval, the overall memory and latency required increase, i.e., # params of NTR (T5) increases from 220M to 330M and is much slower. Finally, we also conduct CQE dense-sparse hybrid retrieval using their linear score combination (denoted by CQE-hybrid); see Appendix A for detailed settings. CQE-hybrid retrieval effectiveness shows significant gains over CQE dense only. The gains from the dense-sparse hybrid suggest that the textual interpretation of CQE not only helps us understand the query reformulation mechanism in dense retrieval but also improves effectiveness, all using a single, unified model.
A comparison of win (tie) entries shows that CQE has more wins against human queries than all the other CQR models. On the other hand, the other CQR models have relatively more ties against human queries than CQE. The difference between the queries is probably because CQE learns to reformulate conversational queries through the guidance of pseudo-relevant passages, meaning that CQE approaches the task in a different way from the other CQR models, which are trained to mimic the way humans reformulate queries. This observation indicates that CQE provides different "views" from the other CQR models and could further benefit from fusion with state-of-the-art CQR models, which we demonstrate in Appendix B.

[Figure 4: Case studies. We choose cases based on nDCG dense retrieval scores; the CQE text shown is for sparse retrieval. Underlining denotes terms not appearing in human queries.]

Table 4: Top-ranking effectiveness (nDCG@3) of multi-stage pipelines.

CQR + BM25 + BERT-base (latency = 5,350 ms)
  QuReTeC (Voskarides et al., 2020): .476
  Few-Shot Rewriter (Yu et al., 2020): .492
3×CQR + BM25 + BERT-base (latency = 8,025 ms, est.)
  MVR (Kumar and Callan, 2020): .565
CQR + BM25 + BERT-large (latency = 16,450 ms)
  Transformer++ (Vakulenko et al., 2020): .529
  NTR (T5) (Lin et al., 2021c): .556
  HQE + NTR (T5) (Lin et al., 2021c): .565
Multi-stage pipeline comparisons. We compare our CQE method with other multi-stage pipelines in terms of top-ranking effectiveness, reported in Table 4. All of these pipelines consist of a conversational query reformulator (CQR), BM25 retrieval, and BERT re-ranking. Here, we also list systems that use a BERT-large re-ranker for reference. As for retrieval latency, since the CAsT corpus requires 55 GiB for the dense vector index, we measure the latency of CQE on two V100 GPUs. For the other BERT re-ranking pipelines, we divide the numbers reported by Khattab and Zaharia (2020), which are measured on a single V100 GPU, by two for a fair comparison. We observe that single-stage CQE (with much lower latency) can compete with all the multi-stage pipelines that use a BERT-base re-ranker, except for MVR, which fuses three re-ranked lists from three different neural CQR models. As expected, re-ranking with BERT-large can yield higher effectiveness, but is also much slower. Of course, we can also take CQE results and re-rank them further.
Case studies. We demonstrate how CQE reformulates queries by comparing CQE and human-reformulated queries on the CAsT19 eval set. Figure 4(a) shows cases where CQE beats humans in terms of nDCG (in the dense retrieval setting). The first example shows that humans mistakenly rewrite the query by omitting "sea peoples" from the context. The second example shows that humans reformulate the query correctly; however, CQE further adds the key term "Stanford" to the original query and obtains better ranking accuracy. These cases tell us that manually reformulated queries may not be optimal for the downstream IR pipeline, and CQE can actually do better. On the other hand, Figure 4(b) illustrates cases where CQE performs worse than humans. In both cases, we observe that CQE adds related terms (i.e., "avengers" and "batman"), but these terms degrade retrieval effectiveness. This suggests that a better negative sampling strategy may be required to guide CQE to select key terms and generate more accurate embeddings in such challenging contexts.

Results on ORConvQA

Qu et al. (2020) share an extensive corpus based on the English Wikipedia dump from Oct. 20th, 2019. They split 5.9 million Wikipedia articles into passage chunks of at most 384 BERT WordPiece tokens, resulting in a corpus of 11M passages. The task is to first retrieve passages from the corpus using conversational queries and then extract answer spans from the retrieved passages. Since the task shares the same conversational queries as our created dataset (both are built on CANARD), we fine-tune CQE only on the training set listed in Table 1. For a fair comparison between the retrievers, we directly use the reader provided by Qu et al. (2020), 3 which extracts the answer span from the top-5 retrieved passages. We first compare our CQE retrieval effectiveness to baselines, where the numbers are from Qu et al. (2020). To fairly compare with the dense retriever of Qu et al. (2020) (with 128 dimensions), we first conduct unsupervised dimensionality reduction using Faiss (OPQ128, IVF1, PQ128) from 768 to 128 dimensions. As shown in Table 5, CQE beats the other models in terms of retrieval effectiveness. It is worth noting that the baselines are fine-tuned on ORConvQA, with the passages containing answer spans as positives. In contrast, CQE is only fine-tuned on our weakly supervised training data. This difference suggests that CQE has a degree of generalization capability. More importantly, we observe that the retrieval effectiveness gain from CQE directly benefits F1 scores. Finally, with hybrid retrieval, a unique feature of CQE, we further improve both retrieval and the downstream task.

Fine-Tuning Ablation
We explore different training strategies with our weakly supervised training data, using the CAsT19 training set for evaluation, and report the results.

Related Work

Prior work on conversational question answering (Qu et al., 2019a,b) focuses on improving answer span extraction using dialogue context information. Qu et al. (2020) first create an open-domain ConvQA dataset on top of QuAC (Choi et al., 2018) and then tackle it with a pipeline consisting of a retriever and a reader. In this work, we demonstrate that weakly supervised CQE can directly serve as a strong retriever without further fine-tuning, improving the accuracy of answer span extraction. Furthermore, unlike Qu et al. (2020), CQE provides a single model that supports dense-sparse hybrid retrieval for conversational search, which further improves retrieval effectiveness.

Conclusions
In this paper, we study how to simplify the multi-stage pipeline for conversational search and propose to integrate modules for conversational query reformulation (CQR) and dense passage retrieval into our dense representation learning framework.
To address the lack of training data for conversational search, we create a dataset with pseudo-relevance labels and explore different training strategies on this dataset. Experiments demonstrate that our model learns to reformulate conversational queries in a latent space and generates contextualized query embeddings (CQE) for conversational search. In addition, our analyses provide insight into how CQE learns to rewrite conversational queries in this latent space. Finally, we show that there are two main advantages of CQE: First, the effectiveness of CQE is on par with state-of-the-art multi-stage pipelines for conversational search, but with much lower query latency. Second, CQE serves as a strong dense retriever for open-domain conversational question answering.
Limitations and future work. Our work shows the feasibility of integrating conversational query reformulation and ad hoc retrieval into a bi-encoder dense representation learning framework. However, it is unclear whether the same strategy can be applied to a cross-encoder re-ranker, which, although much slower, still achieves the highest levels of effectiveness. Another limitation of our work is that only historical queries are considered as context; in a real conversational scenario, however, other types of context should also be considered, e.g., system responses and conversations among multiple speakers (if present). There is still much to explore around dense representations in these scenarios, which we leave to future work. Finally, as shown in prior work, incorporating sparse retrieval signals into the training of dense retrieval improves dense-sparse fusion effectiveness. We suspect that there is more to be gained from better fusion of dense and sparse results for conversational search.
A Hybrid Retrieval

Eq. (8) is an approximation of a linear combination of sparse and dense relevance scores. If p ∉ D_sp (or D_ds), we directly use the minimum score of φ_sp(q, p ∈ D_sp) or φ_ds(q, p ∈ D_ds) as a substitute. For the sparse and dense retrieval combination, we select the best hyperparameters α (0.1) and γ (12) by optimizing nDCG@3 on the CAsT19 training set.
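A minimal sketch of this combination over the union of the two retrieved sets, with the minimum-score fallback for passages missing from one list. We assume here that α weights the sparse score (the original equation does not survive in this text, so that placement is our assumption), and the function name is ours.

```python
def hybrid_scores(sparse, dense, alpha):
    """Linear dense-sparse fusion over the union of retrieved passages.

    sparse, dense: dicts mapping passage id -> retrieval score from
    each list. A passage absent from one list falls back to that
    list's minimum observed score, approximating the full combination.
    """
    min_sp = min(sparse.values())
    min_ds = min(dense.values())
    fused = {}
    for pid in set(sparse) | set(dense):
        fused[pid] = alpha * sparse.get(pid, min_sp) + dense.get(pid, min_ds)
    return fused
```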

B Model Fusion Study
We conduct experiments on model fusion to see whether CQE can complement other CQR models in terms of retrieval effectiveness. Specifically, we use reciprocal rank fusion (RRF) over the ranked lists produced by different queries. Figure 5 shows the effectiveness (nDCG@3) of different fusion combinations on the CAsT19 eval set. We observe that CQE fuses well with all the other CQR models, even in sparse retrieval, where CQE alone does not perform as well. In addition, CQE shows even better fusion results than human queries in dense retrieval.
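The RRF step above follows the standard formulation (Cormack et al., 2009): each passage's fused score sums 1/(k + rank) over the lists that retrieved it. A minimal sketch, with a function name of our own:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Standard reciprocal rank fusion: each passage accumulates
    1 / (k + rank) over all input rankings (ranks start at 1), and
    passages are returned sorted by fused score, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, pid in enumerate(ranking, start=1):
            scores[pid] = scores.get(pid, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

The constant k (60 in the original RRF paper) damps the influence of top ranks so that no single list dominates the fusion.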