Search-Oriented Conversational Query Editing



Introduction
With the rise of intelligent assistants (e.g., Siri and Alexa), conversational search is becoming a new search paradigm of the future (Gao et al., 2022). As users can interact with the search engine in the form of natural dialogue, one of the main challenges of conversational search is to accurately understand users' search intents within the dialogue context (Yu et al., 2020; Wu et al., 2022).
Inspired by the success of dense retrieval in ad-hoc search (Karpukhin et al., 2020), recent studies (Lin et al., 2021a; Mao et al., 2022b; Kim and Kim, 2022) show that using a similar contrastive learning approach to train a conversational dense retriever with a dual-encoder architecture can effectively resolve such a complex context understanding problem. However, since ad-hoc search systems have been built, deployed, and optimized for a long time in industry, replacing the well-established ad-hoc retriever with a newly trained conversational dense retriever would be too expensive and is hardly realistic in the current early days of conversational search.
As such, another type of method, i.e., Conversational Query Rewriting (CQR) (Vakulenko et al., 2021b; Tredici et al., 2021), shows greater practical value: it explicitly reformulates the whole search dialogue into a context-independent query rewrite and thus can be seamlessly incorporated into any existing ad-hoc search pipeline to realize conversational search. Currently, a typical CQR model is built by fine-tuning an autoregressive pre-trained language model (PLM) on (search dialogue, manual rewrite) text pairs. Despite the promising results obtained, we argue that this approach has two significant limitations, as shown in Figure 1.
First, the single training objective of simply fitting the manual rewrite is not aligned with our ultimate goal, i.e., achieving better performance on the downstream conversational search task. On one hand, the quality of manual rewrites is probably not the best from the view of retrieval, since humans are only instructed to rewrite queries to be self-contained outside the dialogue context, but have no knowledge of the downstream retrieval process. On the other hand, no ranking signals from the retrieval side are taken into account when training the model (Wu et al., 2022). Second, for conversational query rewriting, many expected rewrite tokens can be found in the search dialogue in most cases. However, the autoregressive rewriting model still generates the rewrite completely from scratch, which introduces an over-large search space for token generation and can be unnecessary.

Figure 1: Illustration of the two major limitations of existing autoregressive CQR models. First, the learning of the rewriter does not consider the downstream retrieval process, which would affect the final conversational search performance. Second, most of the rewrite tokens can often be found in the current query (q_2) and dialogue context (q_1 and r_1) while they are still generated from scratch, which is inefficient.
To overcome these limitations, we propose a text Editing-based (Malmi et al., 2022) conversational query Rewriting model tailored for Conversational Search, called EDIRCS. Instead of autoregressively generating the rewrite from scratch, EDIRCS selects most of the rewrite tokens from the search dialogue in a non-autoregressive fashion and generates only a few new informative tokens to supplement the final rewrite, which makes it highly efficient. More importantly, EDIRCS is augmented with two conversational search-oriented learning objectives. Specifically, we add a contrastive ranking loss, calculated between the dialogue embedding and passage embeddings, to steer model learning toward downstream retrieval performance. Considering that manual rewrites are not ideal from the view of retrieval, we leverage a SPLADE-based (Lassance and Clinchant, 2022) conversational dense retriever, which is fully trained toward conversational search and shows superior context understanding ability, to identify the key tokens that contribute significantly to retrieval performance, and transfer this knowledge to enhance our rewriting model for both existing token selection and new token generation.
We conduct extensive experiments on three public conversational search datasets. The results show that EDIRCS outperforms state-of-the-art conversational query rewriting models when evaluated with both BM25 and the dense retriever ANCE (Xiong et al., 2021), while having low query rewriting latency, and that it is more robust to out-of-domain search dialogues and long dialogue contexts.

Related Work
Conversational Search. Currently, there are mainly two types of methods to solve the difficult context understanding problem in conversational search: conversational dense retrieval and conversational query rewriting. Specifically, conversational dense retrieval methods (Yu et al., 2021; Lin et al., 2021a; Mao et al., 2022a,b; Kim and Kim, 2022) encode both the whole search dialogue and the passages into embeddings to perform dense retrieval in an end-to-end way; they generally achieve stronger performance but are unfriendly for real deployment. In contrast, conversational query rewriting converts the conversational search problem into an ad-hoc search problem by reformulating the search dialogue into a standalone query rewrite, which is the focus of this work. Existing conversational query rewriting methods include selecting relevant tokens from the dialogue context (Voskarides et al., 2020; Lin et al., 2021b) and using (dialogue, manual rewrite) text pairs to fine-tune PLMs as rewriters (Yu et al., 2020; Lin et al., 2020; Vakulenko et al., 2021a). A significant drawback of these methods is that they focus solely on fitting the manual rewrites but are not trained toward search performance. To tackle this, recent studies (Wu et al., 2022; Chen et al., 2022) have investigated leveraging reinforcement learning for model optimization with retrieval-related rewards. Unlike existing work, we internalize retrieval information into the learning of our rewriting model through two search-oriented objectives, helping it achieve better conversational search performance.
Text Editing. Text editing (Malmi et al., 2022) is an effective technique for text generation tasks (Reid and Zhong, 2021; Mallinson et al., 2022) where the source and target texts overlap substantially: the model predicts efficient edit operations applied to the source sequence, often yielding lower inference latency and better control over the outputs. In particular, it has been successfully applied to utterance rewriting for dialogue systems (Huang et al., 2021; Hao et al., 2021; Jin et al., 2022), which is very similar to conversational query rewriting. Their major distinction is that the former's utterances usually do not carry specific search intents, while the latter's user utterances are queries and the focus is on improving downstream search performance. Different from previous work on utterance rewriting, our EDIRCS is specifically improved toward conversational search with simple editing operations and search-oriented learning objectives.

Task Definition
In this work, we focus on the first-stage passage retrieval task of conversational search. Given a search dialogue s_k = (q_k, r_{k-1}, q_{k-1}, r_{k-2}, ..., q_1) (we call it a session), our target is to retrieve the relevant passage p for this session, where q_i and r_i denote the query and the system response of the i-th turn, respectively, q_k is the current query, and the other turns are the dialogue context. For simplicity, we omit the subscript k in the rest of the paper if not specified.

Existing Two Types of Methods
Conversational Query Rewriting (CQR) transforms the session s into a de-contextualized query rewrite q. Then we can feed q into any off-the-shelf ad-hoc retriever to realize conversational search.
In contrast, Conversational Dense Retrieval (CDR) uses dual encoders f_S and f_P to map the session and passages to latent vectors and performs dense retrieval to achieve conversational search. Training usually adopts a ranking loss based on contrastive learning with N negative samples:

L_rank = -log [ exp(f_S(s) · f_P(p+)) / (exp(f_S(s) · f_P(p+)) + Σ_{j=1}^{N} exp(f_S(s) · f_P(p_j^-))) ],   (1)

where p+ and p_j^- are the relevant and irrelevant passages for the current turn. It is worth noting that, as the passage side does not change in conversational search compared with ad-hoc search, it is common to start with a well-trained ad-hoc dense retriever and fine-tune only the session encoder while freezing the passage encoder (Yu et al., 2021; Lin et al., 2021a).
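As a concrete illustration, the ranking loss of Eq. 1 can be viewed as a softmax cross-entropy over one positive and N negative passage similarities. Below is a minimal numpy sketch (a real implementation would use differentiable tensors and the learned encoders f_S and f_P; `ranking_loss` is a hypothetical helper name):

```python
import numpy as np

def ranking_loss(session_emb, pos_emb, neg_embs):
    """Contrastive ranking loss (Eq. 1 sketch): softmax cross-entropy
    where the relevant passage p+ is the target class among N negatives."""
    pos_score = session_emb @ pos_emb
    neg_scores = np.array([session_emb @ n for n in neg_embs])
    logits = np.concatenate([[pos_score], neg_scores])
    logits = logits - logits.max()              # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                    # negative log-likelihood of p+

# Toy check: the loss is lower when p+ is closer to the session embedding.
s = np.array([1.0, 0.0])
loss_good = ranking_loss(s, np.array([1.0, 0.0]), [np.array([0.0, 1.0])])
loss_bad = ranking_loss(s, np.array([0.0, 1.0]), [np.array([1.0, 0.0])])
assert loss_good < loss_bad
```

In practice, the negatives typically come from in-batch sampling plus hard negatives, as described in the implementation details.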

Our Model: EDIRCS
We present EDIRCS, a CQR model that aims to achieve more effective and efficient conversational search based on text editing and augmentation from search-oriented objectives. On the whole, as shown in Figure 2, EDIRCS follows a selecting-then-generating text editing architecture, which first selects the necessary tokens from the session and then generates additional useful new tokens to supplement the rewrite. In particular, both the session token selection and new token generation processes benefit from the proposed search-oriented learning, which helps the generated rewrite achieve better conversational search performance. We finally summarize the whole training and inference process.

Conversational Query Editing
In this section, we elaborate on our efficient conversational query editing architecture, including session token selection and new token generation.
Session Token Selection. Since most expected rewrite tokens can be found in the session (Dalton et al., 2020; Voskarides et al., 2020; Anantha et al., 2021), we employ a non-autoregressive selector to directly select those necessary tokens from the session instead of autoregressively generating them from scratch. Specifically, given the input session s, we concatenate all of its tokens and feed them into a 12-layer transformer encoder to obtain the contextualized token embeddings (h_1, ..., h_n), where n is the number of tokens in the session. Then, we feed the embeddings into a classification layer to predict the probability ŷ_i^rt of retaining each token:

r_i = W h_i + b,   ŷ_i^rt = σ(r_i),   (2)

where W ∈ R^{1×d} and b are trainable parameters of the classification layer, d is the embedding size, σ is the sigmoid function, and r_i is the retaining logit. The selector is trained using the binary cross-entropy loss:

L_sel = -(1/n) Σ_{i=1}^{n} [ y_i^rt log ŷ_i^rt + (1 - y_i^rt) log(1 - ŷ_i^rt) ],   (3)

where y^rt are the binary gold labels annotated based on the alignment between the session and the manual rewrite. For the detailed label annotation process, we refer the readers to Appendix A.
New Token Generation. Although the session often contains most of the rewrite tokens, some important tokens are not in the session and can only be generated from scratch. Mao et al. (2021) also showed that combining informative generated text snippets with the original query can improve retrieval performance. Therefore, we incorporate a generator that attends to all the token embeddings to generate additional useful tokens for supplementing the rewrite. Since most of the rewrite tokens have already been selected and only a few new tokens need to be generated, the generation difficulty is alleviated in our task; we thus employ a lightweight four-layer transformer decoder as the generator, which we empirically find already performs well (see § 5.3). The generator is trained with the standard cross-entropy loss:

L_gen = -Σ_j log P(y_j^gen | y_{<j}^gen, h_1, ..., h_n),   (4)

where j is the index of token generation and y^gen denotes the gold labels of all the generated tokens. The label annotation is also introduced in Appendix A.
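The two editing losses above can be sketched on toy data as follows (a minimal numpy sketch; `selection_bce` and `generation_ce` are hypothetical helper names, and the real model would backpropagate through these quantities):

```python
import numpy as np

def selection_bce(retain_probs, gold_labels):
    """Binary cross-entropy for the non-autoregressive selector (Eq. 3 sketch)."""
    p, y = np.asarray(retain_probs), np.asarray(gold_labels)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).mean()

def generation_ce(step_probs, gold_ids):
    """Cross-entropy for the generator (Eq. 4 sketch): step_probs[j] is the
    predicted vocabulary distribution at decoding step j."""
    return -np.mean([np.log(step_probs[j][g]) for j, g in enumerate(gold_ids)])

# Toy check: confident, correct predictions give a lower loss.
good = selection_bce([0.9, 0.1, 0.8], [1, 0, 1])
bad = selection_bce([0.1, 0.9, 0.2], [1, 0, 1])
assert good < bad
```

The selector's loss is computed over all n session tokens in parallel, while the generator's loss runs over the short gold sequence of new tokens only, which is why the decoder can stay lightweight.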

Search-Oriented Learning
Considering that simply fitting the manual rewrites is not aligned with our real goal (i.e., passage retrieval), we exploit two search-oriented objectives, contrastive ranking augmentation and contextualization knowledge transfer, to steer the training of our text editing-based rewriter toward better retrieval performance.
Contrastive Ranking Augmentation. Borrowing from conversational dense retrieval, we incorporate a similar contrastive ranking loss (i.e., Eq. 1) upon the selector to reduce the distance between the session and its relevant passage and increase the distances between the session and the irrelevant passages. Specifically, we first update the token embeddings of the session with their predicted tags (Retain if ŷ_i^rt > 0.5, otherwise Delete):

h_i' = h_i + TE(tag_i),   (5)

where TE is a trainable tag embedding layer containing two tag embeddings. Then, the session embedding is obtained by performing mean pooling over all the token embeddings. Similarly, to obtain a passage embedding, we feed the passage into the selector to get its token embeddings, uniformly update them with Retain tags, and perform mean pooling. Finally, we use the session and passage embeddings to calculate the ranking loss (Eq. 1) for model optimization.
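The embedding update and pooling step can be sketched as follows (a minimal numpy sketch; the additive combination of token and tag embeddings is an assumption, and all values below are random placeholders rather than trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
token_embs = rng.normal(size=(n, d))                  # contextualized (h_1, ..., h_n)
retain_probs = np.array([0.9, 0.2, 0.7, 0.1, 0.8])    # predicted retain probs

# TE: a trainable tag embedding layer with two entries (0 = Delete, 1 = Retain).
tag_embs = rng.normal(size=(2, d))
tags = (retain_probs > 0.5).astype(int)

# Update each token embedding with its predicted tag embedding (assumed additive),
# then mean-pool to obtain the session embedding used in the ranking loss.
session_emb = (token_embs + tag_embs[tags]).mean(axis=0)

# Passage tokens are uniformly tagged Retain before pooling.
passage_embs = rng.normal(size=(7, d))
passage_emb = (passage_embs + tag_embs[1]).mean(axis=0)

assert session_emb.shape == passage_emb.shape == (d,)
```

Because the tag embeddings are trainable, the ranking signal flows back through both the selection decisions and the underlying token representations.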
Intuitively, using these diverse ranking signals (i.e., session-passage pairs) to implicitly enhance the token embeddings can not only help the selector find tokens in the context that are important for search but also encourage the subsequent generator to generate more useful new tokens.
Contextualization Knowledge Transfer. As existing studies (Lin et al., 2021a; Kim and Kim, 2022) show that CDR models, which are trained directly toward retrieval performance, generally perform much better, it is desirable to transfer their strong context understanding abilities into our rewriting model. To this end, in addition to implicitly enhancing the token embeddings with ranking signals, we also help our rewriting model retain and generate tokens that are helpful to retrieval in an explicit knowledge transfer manner. Specifically, we leverage a high-performing teacher CDR model to explicitly identify the key tokens that contribute most to retrieval performance and train our rewriting model to learn this contextualization knowledge. The teacher CDR model is obtained by fine-tuning the lexical ad-hoc retriever SPLADE (Lassance and Clinchant, 2022) with the common ranking loss (Eq. 1). SPLADE encodes a text sequence of length l into a sparse lexical embedding v ∈ R^{|V|}, where |V| is the vocabulary size, by predicting token importance over the whole vocabulary space based on the latent token embeddings (z_1, ..., z_l) generated by its underlying contextualized encoder:

v[j] = max_{i ∈ {1,...,l}} log(1 + ReLU(E_j^T Q z_i + b_j)),   (6)

where Q ∈ R^{d×d} and b ∈ R^{|V|} are trainable parameters, d is the embedding size, and E ∈ R^{|V|×d} is the input embedding matrix. Note that the output lexical embedding v is trained to be sparse, i.e., only a few important tokens are activated with non-zero weights. We feed the session and its gold relevant passage into the SPLADE-based teacher CDR model to get their lexical embeddings v_s and v_p. The retrieval score is computed as

score(s, p) = v_s^T v_p = Σ_{i=1}^{|V|} v_s[i] · v_p[i],   (7)

where v[i] is the predicted weight of the i-th vocabulary token. The product term c_i = v_s[i] · v_p[i] therefore represents the contribution of the i-th token to the retrieval score. We leverage this knowledge of token retrieval contributions to enhance both our selector and our generator toward better search effectiveness.
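The teacher's token-contribution computation can be sketched as follows (a minimal numpy sketch; the max pooling and log(1 + ReLU(·)) activation follow SPLADE-style models, but the exact formulation here is our assumption, with random weights standing in for the trained encoder):

```python
import numpy as np

def splade_embed(z, E, Q, b):
    """Sparse lexical embedding (Eq. 6 sketch): project latent token embeddings
    into the vocabulary space and aggregate with a saturating activation."""
    scores = z @ Q @ E.T + b                                 # (l, |V|) logits
    return np.log1p(np.maximum(scores, 0.0)).max(axis=0)     # (|V|,), non-negative

rng = np.random.default_rng(0)
l, d, V = 4, 6, 20
E, Q, b = rng.normal(size=(V, d)), rng.normal(size=(d, d)), rng.normal(size=V)
v_s = splade_embed(rng.normal(size=(l, d)), E, Q, b)   # session embedding
v_p = splade_embed(rng.normal(size=(l, d)), E, Q, b)   # gold passage embedding

contribs = v_s * v_p                       # c_i: per-token contribution (Eq. 7)
score = contribs.sum()                     # retrieval score v_s^T v_p
key_tokens = np.nonzero(contribs > 0)[0]   # vocabulary tokens with c_i > 0
assert np.isclose(score, v_s @ v_p)
```

Because the activation is clamped at zero, every contribution c_i is non-negative, so a token either helps the session-passage match (c_i > 0) or is irrelevant to it (c_i = 0).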
For the selector, we first pick out a subset S of the vocabulary containing the session tokens that contribute to retrieving the gold passage (i.e., c_i > 0). We obtain the teacher token importance distribution p = softmax([c_1, ..., c_m]) for these tokens based on their retrieval contributions, where m is the number of tokens in S. Then, we obtain the student token importance distribution q = softmax([r_1, ..., r_m]) based on the retaining logits (Eq. 2) predicted by the selector (when a token occurs multiple times in the session, we sum the logits of all its positions). To make the selector aware that these m important session tokens should be properly selected from the retrieval perspective, we incorporate the following transfer loss:

L_ckt = KL(p || q) - (1/m) Σ_{i=1}^{m} log σ(r_i),   (8)

where the first term minimizes the KL divergence between the teacher and student importance distributions, the second term encourages each token i to be selected, and σ is the sigmoid function.
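The transfer loss can be sketched as follows (a minimal numpy sketch of the loss over the m key tokens; `transfer_loss` is a hypothetical helper name):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def transfer_loss(contribs, retain_logits):
    """Eq. 8 sketch: KL(teacher || student) over the m key session tokens,
    plus a term pushing each key token's retain probability sigmoid(r_i) to 1."""
    r = np.asarray(retain_logits, dtype=float)
    p = softmax(np.asarray(contribs, dtype=float))   # teacher importance
    q = softmax(r)                                   # student importance
    kl = np.sum(p * np.log(p / q))
    select = -np.mean(np.log(1.0 / (1.0 + np.exp(-r))))
    return kl + select

# When the student's logits mirror the teacher's contributions, the KL term
# vanishes and only the selection term remains.
aligned = transfer_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shuffled = transfer_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
assert aligned < shuffled
```

Note that the KL term shapes the relative importance among the key tokens, while the second term independently raises each token's absolute retain probability.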
For the generator, we pick out the tokens that have the Top-K largest contributions to retrieval while not appearing in the session, and simply label these tokens as needing to be generated.

Training and Inference
EDIRCS is trained in a multi-task learning manner:

L = L_sel + L_gen + λ L_rank + β L_ckt,   (9)

where λ and β are hyper-parameters balancing the two search-oriented losses. At inference, we retain the session tokens selected by the selector in their original order and append the new tokens produced by the generator to form the final rewrite.
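Inference-time rewrite assembly can be sketched as follows (a minimal sketch; the token strings and probabilities are illustrative, and `assemble_rewrite` is a hypothetical helper name):

```python
def assemble_rewrite(session_tokens, retain_probs, generated_tokens):
    """Keep the session tokens the selector retains (prob > 0.5) in their
    original order, then append the generator's new tokens."""
    kept = [t for t, p in zip(session_tokens, retain_probs) if p > 0.5]
    return " ".join(kept + generated_tokens)

tokens = ["how", "many", "rings", "does", "he", "have"]
probs = [0.9, 0.8, 0.9, 0.9, 0.1, 0.9]   # "he" is dropped by the selector
rewrite = assemble_rewrite(tokens, probs, ["michael", "jordan"])
assert rewrite == "how many rings does have michael jordan"
```

Note that the generated tokens are appended rather than inserted at the grammatically natural position, which reflects the coherence limitation discussed in the Limitations section.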
Compared Systems. We compare our model with three groups of conversational query rewriting baselines.

The first group consists of purely autoregressive rewriters, including GPT2+WS and T5QR. The second group is text editing-based, including QuReTeC and CQE-sparse (Lin et al., 2021a). The third group is directly trained toward retrieval performance using reinforcement learning, including CONQRR (Wu et al., 2022). Besides, we report the performance of Human (i.e., using the manual rewrites) and of two conversational dense retrievers, Conv-SPLADE (i.e., the teacher model) and Conv-ANCE, which are fine-tuned from SPLADE (Lassance and Clinchant, 2022) and ANCE (Xiong et al., 2021), respectively, with the standard ranking loss (Eq. 1), for reference. After obtaining the rewrites, we feed them into an ad-hoc retriever (either BM25 or ANCE) to evaluate the effectiveness of the rewriting models for conversational search.
Implementation Details. Experiments are conducted on four NVIDIA GeForce RTX 3080 GPUs. We use the whole encoder and the first four layers of the decoder of t5-base to initialize EDIRCS. We train EDIRCS for 5K iterations using the Adam optimizer with a 1e-5 learning rate and a batch size of 64. The parameters λ, β, and K are tuned on the development set and finally set to 0.5, 0.2, and 4, respectively. For the ranking loss, we adopt the widely used in-batch negative sampling plus one hard negative randomly selected from the Top-50 passages retrieved by BM25. The maximum generated sequence length of EDIRCS is set to 10. For the BM25 retriever, we set k_1 = 0.82 and b = 0.68. For the ANCE retriever, we use its checkpoint pre-trained on the MS MARCO dataset at the 600th step. We perform dense retrieval using Faiss (Johnson et al., 2021) with brute force. Code is to be released at https://github.com/kyriemao/EdiRCS.

Main Results and Analysis
In-Domain Evaluation Results. Evaluation results on the QReCC test set are shown in Table 1, from which we make the following observations: (1) EDIRCS outperforms all the other QR baselines when evaluated with both BM25 and ANCE. Compared with the second-best model (i.e., CONQRR), the average relative gain of EDIRCS over the three metrics is +4.5% with BM25 and +1.0% with ANCE, demonstrating the effectiveness of EDIRCS for conversational search. We find that the improvement with ANCE is smaller than that with BM25. This may be due to the lower coherence of the rewrites generated by EDIRCS compared with those of CONQRR, since the former are the concatenation of selected session tokens and new tokens while the latter are autoregressively generated by a T5-based language model. The reduced coherence may hinder the ANCE dense retriever's semantic understanding of the rewrites and thus its retrieval performance.
(2) EDIRCS and CONQRR, which learn from downstream retrieval information, can even significantly outperform the manual rewrites (i.e., Human) when evaluated with ANCE. This demonstrates that manual rewrites are not optimal from the view of retrieval and that downstream retrieval information is very valuable for improving query rewriting toward conversational search. Compared with CONQRR, which is optimized toward ranking metrics through reinforcement learning, our EDIRCS can not only benefit from ranking signals but also enjoy guidance from a high-performing CDR teacher (i.e., Conv-SPLADE) to achieve better search effectiveness.
(3) Compared with the other two text editing-based QR models (i.e., QuReTeC and CQE-sparse), which can only select tokens from the dialogue context, EDIRCS supports generating new informative tokens and is augmented with search-oriented learning, thus leading to substantial improvements.
Out-Of-Domain Evaluation Results. We perform out-of-domain evaluations on the two CAsT datasets using the models trained on QReCC. Results are reported in Table 2. We find that EDIRCS still outperforms the other compared QR models in this zero-shot evaluation, with at least +3.5% and +4.4% average relative gains on NDCG@3 and R@100, respectively. This demonstrates the better robustness of EDIRCS to out-of-domain search dialogues. However, we also notice that the performance of all QR models still lags behind that of the manual rewrites in most cases, indicating that there is still considerable room for improving the zero-shot capabilities of QR models.

Efficiency Comparisons
Table 1 also shows the number of parameters and the query rewriting latency. We find that QuReTeC, CQE-sparse, and our EDIRCS achieve more than 10× speedup over the purely autoregressive QR models (i.e., GPT2+WS, T5QR, and CONQRR). Compared with QuReTeC and CQE-sparse, which have only non-autoregressive operations, EDIRCS merely adds one lightweight autoregressive generator and is trained to generate only a few tokens (≤ 10), so it is only slightly slower than these two models and remains quite efficient. Moreover, EDIRCS achieves the best retrieval performance with the second-fewest parameters, which verifies the superiority of our proposed learning method. Besides, we present the impact of the number of generator layers on retrieval performance in Figure 3. It shows that the marginal benefit of adding more layers diminishes quickly, so we use four layers to benefit efficiency.

Ablation Studies
In this section, we investigate the effects of our proposed search-oriented learning. Specifically, we test four variants of EDIRCS: (1) w/o CRA: EDIRCS without contrastive ranking augmentation. (2) w/o CKT-S: EDIRCS without using contextualization knowledge transfer to enhance the selector (i.e., L_ckt). (3) w/o CKT-G: EDIRCS without learning to generate the key tokens that are not in the original session. (4) w/o SOL: EDIRCS without the proposed search-oriented learning. Results are shown in Figure 4. We find that removing any of the search-oriented objectives results in performance degradation, with contextualization knowledge transfer proving slightly more effective than contrastive ranking augmentation. However, a model enhanced with either of the two search-oriented objectives substantially outperforms one using neither (i.e., w/o SOL), suggesting that the proposed search-oriented learning helps EDIRCS achieve better conversational search.

Multi-turn Analysis
To investigate the long context understanding ability of EDIRCS, we show the fine-grained turn-level model performance in Figure 5. As the dialogue goes on, the context becomes longer and the context understanding problem generally becomes more difficult. We observe that EDIRCS maintains its performance superiority across different turns. Overall, the performance of EDIRCS fluctuates less in deep turns (e.g., from turn No. 7 to turn No. 11) compared with T5QR and QuReTeC. These observations demonstrate the decent robustness of our EDIRCS to difficult long contexts.

Qualitative Analysis
To gain further qualitative insights, we show and analyze some concrete rewriting examples in Table 3, where EDIRCS achieves better MRR with BM25 than T5QR, QuReTeC, and Human. We find that the rewrites generated by T5QR are coherent but may miss some important information (e.g., "this → normal" in the left example). By contrast, EDIRCS accurately selects the important tokens from the session to complement the missing semantics of the current query. Moreover, compared with T5QR, QuReTeC, and Human, a notable advantage of EDIRCS is that some new tokens that are not in the session but are helpful for retrieving the gold passages can be incorporated into the rewrite (e.g., "Hyperglycemia" and "Philadelphia"), thanks to the proposed search-oriented learning.

Conclusion
In this paper, we present EDIRCS, a CQR model based on the text editing paradigm for efficient and effective conversational search. EDIRCS is augmented with two novel search-oriented objectives that leverage downstream retrieval information to improve the learning of query rewriting toward conversational search. Experiments on three conversational search datasets demonstrate the superior effectiveness and efficiency of EDIRCS over existing CQR models. We also show that EDIRCS has decent robustness to out-of-domain search dialogues and difficult long contexts. Future directions include exploring more search-oriented objectives and simultaneously improving the coherence and retrieval performance of the rewrites.

Limitations
As illustrated in § 5.2 and shown in Table 3, the coherence of the rewrites generated by EDIRCS is not as good as that of rewrites generated by purely autoregressive rewriters (e.g., T5QR). This may affect the performance of EDIRCS when using dense retrievers. Possible solutions include using an additional token reordering model (Chowdhury et al., 2021) to improve rewrite coherence, or injecting coherence signals (Hao et al., 2021) or token position information (Mallinson et al., 2022) into the learning of EDIRCS in an end-to-end way. Another concern is that the effect of our text editing-based model may be limited for a few long-tail cases where many expected rewrite tokens are not in the input session. How to better deal with search dialogues whose search intents are too implicit or vague to be accurately expressed by inferring from the dialogue context alone is a valuable direction for further improvement of our model.

A Gold Label Annotation
In this section, we introduce the gold label annotation for our conversational query editing.
For the session token selection, we wish to annotate the session tokens that also appear in the manual rewrite as Retain and the other tokens as Delete. Considering that a token may occur multiple times in the session, we iteratively apply a Greedy Longest Common Substring (GLCS) algorithm for label annotation. Specifically, given the input session s and its manual rewrite t, we first find their longest common substring (lcs) s[a:b] == t[c:d], where a, b, c, d are the start and end positions. If there are multiple lcs, we greedily choose the one with the smallest a, which is closest to the current query since the current query is at the beginning of s. We annotate all the tokens of s[a:b] as Retain. Then, we remove this lcs, updating s and t to s = s[:a] + s[b:] and t = t[:c] + t[d:], and iteratively repeat the above annotation process until the length of the lcs is zero. Finally, we annotate all the remaining tokens in s as Delete.
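The steps above can be sketched in pure Python as follows (a minimal sketch; the function name and the dynamic-programming search per pass are our own choices, with ties broken by the smallest start position in s as described):

```python
def glcs_annotate(session, rewrite):
    """Greedy Longest Common Substring labeling (Appendix A sketch).
    Returns one label per original session token: 1 = Retain, 0 = Delete."""
    labels = [0] * len(session)
    s_idx = list(range(len(session)))   # working position -> original position
    s, t = list(session), list(rewrite)
    while True:
        best_len, best_a, best_c = 0, -1, -1
        # DP over common substrings: dp[i][j] = length of the common
        # substring ending at s[i-1] and t[j-1].
        dp = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
        for i in range(1, len(s) + 1):
            for j in range(1, len(t) + 1):
                if s[i - 1] == t[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                    length, a = dp[i][j], i - dp[i][j]
                    # Prefer the longest match; break ties by the smallest
                    # start position a (closest to the current query).
                    if length > best_len or (length == best_len and a < best_a):
                        best_len, best_a, best_c = length, a, j - length
        if best_len == 0:
            break
        for k in range(best_a, best_a + best_len):
            labels[s_idx[k]] = 1            # Retain
        # Remove the matched substring from both sequences and iterate.
        del s[best_a:best_a + best_len]
        del s_idx[best_a:best_a + best_len]
        del t[best_c:best_c + best_len]
    return labels

labels = glcs_annotate(
    ["how", "many", "rings", "does", "he", "have"],
    ["how", "many", "nba", "championship", "rings", "does", "michael", "jordan", "have"],
)
assert labels == [1, 1, 1, 1, 0, 1]   # "he" is the only Delete
```

The `s_idx` bookkeeping maps positions in the shrinking working copy of s back to the original session, so labels always land on the right tokens even after substrings are removed.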
For the new token generation, we sequentially extract from the manual rewrite the unique tokens that do not appear in the input session, forming the gold sequence for generation.
The pseudo-code of the whole annotation process is shown in Algorithm 1.

B Dataset Details
In this section, we provide a more detailed description of the three conversational search datasets used.
QReCC: It is a large-scale dataset for conversational question answering, which contains 14K information-seeking conversations with 80K query-answer pairs originating from the training set of CAsT-19 (Dalton et al., 2020), QuAC (Choi et al., 2018), and NQ (Kwiatkowski et al., 2019), with manually generated follow-up queries. Each query has a response answer and a corresponding human rewrite. The entire text corpus for retrieval includes 54M passages, and the query-passage relevance is labeled through a heuristic span matching method based on the answer.
CAsT-19 and CAsT-20: They are two widely used conversational search evaluation datasets released by the TREC Conversational Assistance Track (CAsT). There are only 50 and 25 human-written information-seeking conversations in CAsT-19 and CAsT-20, respectively, so they can hardly support training and are better suited as evaluation datasets. The query turns in CAsT-19 can only depend on previous query turns, while in CAsT-20, query turns may also depend on the previous system responses. Each query turn in both CAsT-19 and CAsT-20 has a corresponding human rewrite, and CAsT-20 additionally provides a canonical response passage for each query turn. The text corpus consists of 38M passages from MS MARCO (Nguyen et al., 2016) and TREC Complex Answer Retrieval (Dietz et al., 2017). More fine-grained query-passage relevance labels are generated by TREC experts.


Figure 2: Overview of EDIRCS. It consists of a selector, which selects existing tokens from the input session, and a lightweight generator, which generates new informative tokens. The final rewrite is obtained by merging these two sets of tokens in the original order. Both the selector and the generator of EDIRCS are trained with our proposed search-oriented learning to retain and generate tokens that are important for conversational search.

Figure 3: Impact of the number of generator layers, evaluated on QReCC.

Figure 5: Turn-level performance comparisons using ANCE as the retriever.
Annotation for the new token generation (excerpt of Algorithm 1):

    seq = ""
    for i in range(0, len(t)):
        if t[i] not in s and t[i] not in seq:
            seq = seq + t[i]
    return seq  # the gold sequence of generation

Table 1: In-domain results on the QReCC test set. The two conversational dense retrieval models are not applicable to BM25. The QR latency is the average rewriting time per session, measured on one RTX 3080 GPU with batch size 1. ‡ denotes results replicated from the original papers. † denotes significant improvements of EDIRCS over the other QR baselines except CONQRR, using a paired t-test with p < 0.05.

Table 2: Out-of-domain evaluation results on the two CAsT datasets. † denotes significant improvements of EDIRCS over the other QR baselines using a paired t-test with p < 0.05.
Left example — Q: What is a normal blood sugar level? R: Normal blood sugar levels are less than 100 mg/dL after ... And they are less than 140 mg/dL two hours ... Q: What does it mean if it's higher than this? Gold passage: Hyperglycemia is ... diabetes when the blood glucose level is too high because the body isn't ... hormone insulin. Eating too many ... cause your blood sugar to rise.
Right example — Context: Tariq "Black Thought" Trotter and Ahmir "Questlove" Thompson started The Roots. Q: Where did these two meet? Gold passage: Tariq "Black Thought" Trotter and Ahmir ... were both attending the Philadelphia High School for the Creative and Performing Arts ...

Table 3: Examples of rewrites generated by Human, T5QR, QuReTeC, and EDIRCS. The current queries are shown in italics. Some important tokens for retrieval in the context and in the gold passages are in bold.