ConvGQR: Generative Query Reformulation for Conversational Search

In conversational search, the user's real search intent at the current turn depends on the previous conversation history. It is challenging to determine a good search query from the whole conversation context. To avoid the expensive re-training of the query encoder, most existing methods try to learn a rewriting model to de-contextualize the current query by mimicking manual query rewriting. However, manually rewritten queries are not always the best search queries. Training a rewriting model on them would therefore lead to sub-optimal queries. Another useful source of information for enhancing the search query is the potential answer to the question. In this paper, we propose ConvGQR, a new framework that reformulates conversational queries with two generative pre-trained language models (PLMs), one for query rewriting and another for generating potential answers. By combining both, ConvGQR can produce better search queries. In addition, to relate query reformulation to the retrieval task, we propose a knowledge infusion mechanism that optimizes both query reformulation and retrieval. Extensive experiments on four conversational search datasets demonstrate the effectiveness of ConvGQR.


Introduction
Conversational search (Gao et al., 2022) is a rapidly developing branch of information retrieval, which aims to satisfy complex information needs through multi-turn interactions. The main challenge is to recover users' real search intents based on the interaction context. Existing methods can be roughly categorized into two groups. The first group directly uses the whole context as a query and trains a model to determine the relevance between the long context and passages (Qu et al., 2020; Hashemi et al., 2020; Yu et al., 2021; Lin et al., 2021b; Mao et al., 2022a,b; Kim and Kim, 2022; Li et al., 2022). This approach requires additional training to take the long context as input, which is not always feasible (Wu et al., 2021). What is available in practice is a general retriever (e.g., an ad-hoc search retriever) that uses a stand-alone query. The second group aims at producing a de-contextualized query using query reformulation techniques (Elgohary et al., 2019). Such a query can be submitted to any off-the-shelf retriever. We focus on this second approach.
Two types of query reformulation techniques have been widely studied in the literature, i.e., query rewriting and query expansion. The former trains a generative model to rewrite the current query to mimic the human-rewritten one (Yu et al., 2020; Vakulenko et al., 2021a), while the latter focuses on expanding the current query with relevant terms selected from the context (Kumar and Callan, 2020; Voskarides et al., 2020). Although both approaches achieve promising results, they have been studied separately, and two important limitations are observed: (1) Query rewriting and query expansion produce different effects. Query rewriting tends to deal with ambiguous queries and add missing tokens, while query expansion aims to add supplementary information to the query. Both effects are important for query reformulation, so it is beneficial to use both. (2) Previous query rewriting models have been optimized to produce human-rewritten queries, independently from the passage ranking task. Even though human-rewritten queries usually perform better than the original queries, existing studies have shown that they may not be the best search queries on their own (Lin et al., 2021b; Wu et al., 2021). Therefore, it is useful to incorporate additional criteria directly related to ranking performance when reformulating a query. As shown in Fig. 1 (left), although the human-rewritten query recovers the crucial missing information (i.e., "goat") from the context, it is still possible to further improve the search query.
To tackle these problems, we propose ConvGQR, a new Generative Query Reformulation framework for Conversational search, which combines query rewriting with query expansion.

Figure 1: An example of a conversational search session (with the relevant passage "Here are some notable ... Boer goats were bred in South Africa for meat ... Before Boer goats ... in the late 1980s, Spanish goats were the standard meat goat breed ...") and the high-level comparison between the original method and our ConvGQR. The dashed box illustrates the potential connection (underlined) between the relevant passage and the expansion terms.

The right side of Fig. 1 illustrates the differences between ConvGQR and the existing query rewriting method. In addition to query rewriting based on human-rewritten queries, ConvGQR learns to generate the potential answer of the query (e.g., the answer in the downstream question-answering task) and uses it for query expansion. This strategy is motivated by the fact that a passage containing the generated potential answer is more likely to be relevant: either the generated answer is the right answer, or it may co-occur with the right answer in the same passage. The final query reformulation model is trained by combining both query rewriting and query expansion criteria in the loss function. Moreover, the learning of both query rewriting and query expansion is uniformly augmented with relevant-passage information through our knowledge infusion mechanism, which guides query generation toward better search performance. We carry out extensive experiments on four conversational search datasets using both dense and sparse retrievers, and the results show that our method outperforms most existing query reformulation methods. Our further analysis confirms the complementary contributions of query rewriting and query expansion.

Our contributions are summarized as follows: (1) We propose ConvGQR to integrate query rewriting and query expansion. In particular, query expansion is performed by appending a potential answer generated by a generative PLM, which exploits the capability of PLMs to capture rich world knowledge. (2) We further design a knowledge infusion mechanism to optimize query reformulation with the guidance of passage retrieval. (3) We demonstrate the effectiveness of ConvGQR with two off-the-shelf retrievers (sparse and dense) on four datasets. Our analysis confirms the complementary effects of both components in conversational search.

Related Work
Conversational Query Reformulation The intuitive idea is that a well-formulated search query derived from the conversation context can be submitted to an off-the-shelf retriever without modifying the retriever. Query rewriting and query expansion are two typical query reformulation methods. Query rewriting trains a rewriting model to mimic human-rewritten queries. This approach has been shown to resolve ambiguity and recover missing elements (e.g., anaphora) from the context (Yu et al., 2020; Lin et al., 2020; Vakulenko et al., 2021a; Mao et al., 2023a). However, Wu et al. (2021) and Lin et al. (2021b) argue that human-rewritten queries are not necessarily the optimal queries. Wu et al. (2021) enhance the rewriting model with reinforcement learning, which, however, requires a long training time. To be more efficient, Lin et al. (2021b) propose a query expansion method that selects terms via the normalization score of their embeddings, but it still needs to re-train a retriever. Some earlier query expansion methods (Kumar and Callan, 2020; Voskarides et al., 2020) select terms from the context to expand the search query and produce better retrieval results. However, these approaches have been used separately. Our ConvGQR model thus integrates both query rewriting and query expansion to reformulate a better conversational query. A new knowledge infusion mechanism is used to connect query reformulation with retrieval.

Query Expansion via Potential Answers
Earlier studies on question answering (Ravichandran and Hovy, 2002;Derczynski et al., 2008) demonstrate that an effective way to expand a query is to extract answer patterns or select terms that could be possible answers as expansion terms.
Recently, some generation-augmented retrieval methods (Mao et al., 2021; Chen et al., 2022) exploit the knowledge captured in PLMs (Brown et al., 2020) to generate potential answers as expansion terms. We draw inspiration from these studies and apply the idea to the conversational scenario.

Task Formulation
We formulate the conversational search task in this paper as retrieving the relevant passage p from a large passage collection C for the current user query q_i, given the conversational historical context H = {(q_k, r_k)}_{k=1}^{i-1}, where q_k and r_k denote the query and the system answer of the k-th previous turn, respectively. In this paper, we aim to design a query reformulation model that transforms the current query q_i together with the conversational historical context H into a de-contextualized query for conversational search.
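For concreteness, the task inputs can be sketched as a simple data structure; the names below are illustrative and not taken from the paper's released code:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    query: str   # q_k: the user query of turn k
    answer: str  # r_k: the system answer of turn k

@dataclass
class Session:
    history: List[Turn] = field(default_factory=list)  # H = {(q_k, r_k)} for k < i
    current_query: str = ""                            # q_i

# The reformulation task maps (history, current_query) to one stand-alone
# query that can be submitted to any off-the-shelf retriever.
session = Session(
    history=[Turn("What breeds of goat are raised for meat?",
                  "Boer and Spanish goats, among others.")],
    current_query="Where was it bred?",
)
```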

Our Approach: ConvGQR
A first desired behavior of query reformulation is to produce a rewritten query similar to what a human expert would produce. This resolves some ambiguities arising in the current query (e.g., omission and coreference), so query rewriting is an integral part of our approach. Query rewriting can be cast as a text generation problem: given the query in the current turn and its historical context, we aim to generate a rewritten query. Inspired by the strong capabilities of PLMs, we rely on a PLM to mimic the human query rewriting process.
However, as the human-rewritten query might not be optimal (Yu et al., 2020; Anantha et al., 2021) and standard query rewriting models are agnostic to the retriever (Lin et al., 2021b; Wu et al., 2021), a query rewriting model alone cannot produce the best search query. Therefore, we also incorporate a component that expands the query with additional terms likely to appear in relevant passages. Several query expansion methods could be used; in this paper, we choose one that has proven effective in question answering (Mao et al., 2021; Chen et al., 2022): we use the current query and its context to generate a potential answer to the question, and the generated answer is used as expansion terms. This approach leverages the large amount of world knowledge implicitly captured in a large PLM.
The generated potential answer can be useful for passage retrieval in two situations: (1) the generated answer is correct, so a passage containing the same answer could be favored; (2) the generated answer is not a correct answer, but it co-occurs with a correct answer in a passage. This can also help determine the correct passage, and this is indeed the very assumption behind many query expansion approaches used in IR. Motivated by this, we use another PLM to generate the potential answer to expand the current query.
The overview of our proposed ConvGQR is depicted in Fig. 2. It contains three main components: query rewriting, query expansion, and knowledge infusion mechanism. The last component connects query reformulation and retrieval.

Query Reformulation by Combining Rewriting and Expansion
Both query rewriting and expansion use the historical context H = {(q_k, r_k)}_{k=1}^{i-1} concatenated with the current query q_i as input. Similar to the input used in Wu et al. (2021), a separation token "[SEP]" is added between turns, and the turns are concatenated in reversed order:

Input(H, q_i) = q_i [SEP] r_{i-1} [SEP] q_{i-1} [SEP] ... [SEP] r_1 [SEP] q_1    (1)
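As a minimal sketch, this input construction might look as follows; the exact ordering of q_k and r_k within each turn is our assumption based on the description above, not a detail confirmed by the text:

```python
def build_input(history, current_query, sep="[SEP]"):
    """Concatenate the current query with the context turns in reversed
    order, separated by [SEP], following the input format described above.
    `history` is a list of (q_k, r_k) pairs in chronological order."""
    parts = [current_query]
    # Most recent turn first: r_{i-1}, q_{i-1}, r_{i-2}, q_{i-2}, ...
    for q_k, r_k in reversed(history):
        parts.append(r_k)
        parts.append(q_k)
    return f" {sep} ".join(parts)

history = [("What breeds of goat are raised for meat?",
            "Boer and Spanish goats.")]
print(build_input(history, "Where was it bred?"))
# Where was it bred? [SEP] Boer and Spanish goats. [SEP] What breeds of goat are raised for meat?
```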
Query Rewriting The objective of the query rewriting model is to learn a function M(H, q_i) = q* based on a generative PLM, where q* is the sequence used as the supervision signal (a human-rewritten query in the training data). The information contained in H but missing from q_i can thus be added to approach q*. The overall objective is to optimize the parameters θ_M of the function M by maximum likelihood estimation:

θ_M = argmax_{θ_M} Σ_t log P(q*_t | q*_{<t}, H, q_i; θ_M)    (2)

Query Expansion Recent research demonstrates that current PLMs are able to directly respond to a question, as in closed-book question answering (Adlakha et al., 2022), through their captured knowledge. Although the correctness of the generated answer is not guaranteed, the potential answer can still act as useful expansion terms (Mao et al., 2021), which can guide the search toward a passage containing the potential answer or a similar one.
To train the generation process, we leverage the gold answer r* for each query turn as the training objective. r* could be a short entity, a consecutive segment of text, or even non-consecutive text segments, depending on the dataset. At inference time for a new query, the potential answer is generated by the query expansion model and appended to the rewritten query.
The final form of the reformulated query is the concatenation of the rewritten query and the generated potential answer. The two generative PLMs for rewriting and expansion are fine-tuned, with different training data, using the negative log-likelihood loss to predict the corresponding target sequence {w_t}_{t=1}^{T}:

L_gen = − Σ_{t=1}^{T} log P(w_t | w_{<t}, Input; θ)    (3)
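The negative log-likelihood objective can be illustrated with a toy computation; the probabilities here are made up, while a real implementation would take them from the PLM's softmax outputs:

```python
import math

def nll_loss(token_probs):
    """Negative log-likelihood of a target sequence (Eq. 3).
    token_probs[t] is the model's probability P(w_t | w_<t, Input)
    for the gold token at position t."""
    return -sum(math.log(p) for p in token_probs)

# Toy example: a 3-token target. Higher gold-token probabilities
# mean a lower loss.
loss = nll_loss([0.9, 0.5, 0.8])
```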

Knowledge Infusion Mechanism
An important limitation of the existing generative conversational query reformulation methods is that they ignore the dependency between generation and retrieval. They are trained independently. To address this issue, we propose a knowledge infusion mechanism to optimize both query reformulation and search tasks during model training. The intuition is to require the generative model to generate a query representation that is similar to that of a relevant passage. If the hidden states of the generative model contain the information of the relevant passage, the queries generated by these representations would be able to improve the search results because of the increased semantic similarity.
To achieve this goal, an effective way is to inject the knowledge contained in the relevant-passage representation into the query representation when fine-tuning the generative PLMs. Concretely, we first deploy an off-the-shelf retriever acting as an encoder to produce a representation h_{p+} for the relevant passage. To maintain consistency, this retriever is the same as the one we use for search, so the representation space for passages is kept the same for both the query reformulation and retrieval stages. Once the session query representation h_S is encoded by the generative model, we distill the knowledge of h_{p+} and infuse it into h_S by minimizing the Mean Squared Error (MSE):

L_ret = MSE(h_S, h_{p+})    (4)

Both h_S and h_{p+} are sequence-level representations based on the first special token "[CLS]". Finally, the overall training objective L_ConvGQR consists of the query generation loss L_gen and the retrieval loss L_ret, with a weight factor α balancing the influence of query generation and retrieval:

L_ConvGQR = L_gen + α · L_ret    (5)
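A minimal sketch of the knowledge infusion objective, assuming h_S and h_{p+} are plain float vectors of the same dimension (a real implementation would operate on the "[CLS]" hidden states of the generative model and the frozen retriever):

```python
def mse_loss(h_s, h_p):
    """Eq. 4: mean squared error between the session-query representation
    h_S and the frozen retriever's relevant-passage representation h_{p+}."""
    assert len(h_s) == len(h_p)
    return sum((a - b) ** 2 for a, b in zip(h_s, h_p)) / len(h_s)

def convgqr_loss(l_gen, h_s, h_p, alpha=0.5):
    """Overall objective: the generation loss plus the alpha-weighted
    retrieval (knowledge infusion) loss."""
    return l_gen + alpha * mse_loss(h_s, h_p)
```

Minimizing the MSE term pulls the query representation toward the relevant passage in the retriever's own representation space, which is the intuition behind the mechanism.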

Training and Inference
Two generative models with different targets for query rewriting and expansion are trained separately. The final output of ConvGQR is the concatenation of the rewritten query and the generated potential answer. The knowledge infusion mechanism is applied only during training, where it guides optimization toward both generation and retrieval. The dense retriever is frozen and only encodes passages for training the generative PLMs.
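At inference time, assembling the final query is a simple concatenation; a sketch:

```python
def reformulate(rewritten_query, generated_answer):
    """ConvGQR's final search query: the rewritten query expanded with
    the potential answer produced by the second PLM."""
    return f"{rewritten_query} {generated_answer}".strip()

query = reformulate("Where was the Boer goat bred?", "South Africa")
```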

Retrieval Models
We apply ConvGQR to both dense and sparse retrieval models. As the dense retriever, we use ANCE (Xiong et al., 2021) fine-tuned on MS MARCO (Bajaj et al., 2016), which achieves state-of-the-art performance on several retrieval benchmarks. The sparse retriever is the traditional BM25.

Experiments
Datasets Following previous studies (Wu et al., 2021; Kim and Kim, 2022), four conversational search datasets are used for our experiments. The TopiOCQA (Adlakha et al., 2022) and QReCC (Anantha et al., 2021) datasets are used for standard query reformulation training. Two other widely used TREC CAsT datasets (Dalton et al., 2020, 2021) are used only for zero-shot evaluation, as no training data is provided. The statistics and more details are given in Appendix A. Evaluation Metrics To evaluate the retrieval results, we use four standard evaluation metrics: MRR, NDCG@3, Recall@10, and Recall@100, following previous studies (Anantha et al., 2021; Adlakha et al., 2022; Mao et al., 2022a). We adopt the pytrec_eval tool (Van Gysel and de Rijke, 2018) for metric computation.
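For reference, the rank-based metrics can be sketched as follows; these are simplified binary-relevance versions, whereas the experiments compute them with pytrec_eval:

```python
import math

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank for one query: 1/rank of the first relevant
    passage, 0 if none is retrieved; averaged over queries to get MRR."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant passages found in the top-k results."""
    hits = sum(1 for pid in ranked_ids[:k] if pid in relevant_ids)
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=3):
    """NDCG@k with binary gains: discounted gain of relevant hits,
    normalized by the best achievable ranking."""
    dcg = sum(1.0 / math.log2(r + 1)
              for r, pid in enumerate(ranked_ids[:k], start=1)
              if pid in relevant_ids)
    ideal = sum(1.0 / math.log2(r + 1)
                for r in range(1, min(k, len(relevant_ids)) + 1))
    return dcg / ideal if ideal else 0.0
```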
Baselines We mainly compare ConvGQR with the following query reformulation (QR) baselines for both dense and sparse retrieval: (1) Raw: The query of current turn without reformulation.

Implementation Details
We implement the generative PLMs for ConvGQR based on T5-base (Raffel et al., 2020) models. When fine-tuning the generative PLMs, the dense retriever is frozen and acts as a passage encoder. For the zero-shot scenario, we use the generative models trained on QReCC to produce the reformulated queries and retrieve relevant passages. The dense retrieval and sparse retrieval (BM25) are performed using Faiss (Johnson et al., 2019) and Pyserini (Lin et al., 2021a), respectively. More details are provided in Appendix A and our released code 2 .

Main Results
Main evaluation results on QReCC and TopiOCQA are reported in Table 1. We find that ConvGQR achieves significantly better performance on both datasets in terms of MRR and NDCG@3 and outperforms the other methods on most metrics, with either dense or sparse retrieval. For example, on QReCC with sparse retrieval, it improves MRR by 15.1% and NDCG@3 by 33.9% over the second-best results. This indicates the strong capability of ConvGQR to retrieve relevant passages at top positions and demonstrates the effectiveness of our method. Besides, we notice that CONQRR, which also leverages the downstream retrieval information but with reinforcement learning, may achieve better performance on some recall metrics, indicating that downstream retrieval information is helpful for conversational search and should be carefully exploited.
Moreover, we find that ConvGQR can even perform better than human-rewritten queries on QReCC. This confirms our earlier assumption that the human-rewritten (oracle) query is not the silver bullet for conversational search, a finding consistent with some recent studies (Lin et al., 2021b; Wu et al., 2021; Mao et al., 2023b). The improvements of ConvGQR over human-rewritten queries are mainly attributable to our query expansion and knowledge infusion, which introduce retrieval signals into the learning of query reformulation.

Ablation Study
Compared to a standard query rewriting method, our proposed ConvGQR has two additional components, i.e., a query expansion component based on generated potential answers and a knowledge infusion mechanism. We investigate the impact of each component through an ablation study on both QReCC and TopiOCQA. The results are shown in Table 2. We observe that removing either component leads to performance degradation, and removing both causes the largest drop. In fact, when both components are removed, ConvGQR degenerates to the T5QR model, so the improvement of ConvGQR over T5QR directly reflects the gains brought by query expansion and knowledge infusion. This analysis confirms the effectiveness of the added components.

Zero-Shot Analysis
The zero-shot evaluation is conducted on the CAsT datasets to test the transfer ability of ConvGQR. Comparing with the other strongest QR methods on CAsT-19 and CAsT-20 in Table 3, we have the following main findings.
ConvGQR outperforms all the other methods on the more difficult CAsT-20 dataset and matches the best results on CAsT-19, which demonstrates its strong transfer ability. The human-rewritten queries in the CAsT datasets achieve the highest retrieval scores, indicating their high quality; this observation differs from the results on QReCC in Table 1. However, it should not lead to the conclusion that human-rewritten queries should be used as the gold standard for training query rewriting, because it is difficult to obtain a large number of high-quality human-rewritten queries as in the CAsT datasets. As shown in Table 7, these datasets contain only a very limited number of queries. Therefore, generating expansion terms based on the knowledge captured in PLMs remains a valuable means of obtaining superior performance on new queries.
In addition, combining Table 1 and Table 3, we notice that the effectiveness of ConvGQR for dense retrieval varies with datasets. A potential reason is the different degrees of co-occurrence of generated expansion terms within their relevant passages. This will be further analyzed in Section 4.4.

Impact of Generated Answer for Retrieval
The aforementioned hypothesis behind ConvGQR's query expansion is that the PLM-generated potential answers may contain useful expansion terms that co-occur with the right answer in the relevant passages. To understand how expansion terms relate to retrieval performance, we use three metrics and analyze their correlation with the retrieval score. Correlation Analysis Specifically, for each rewritten query with expansion terms, we first calculate the token overlap between the generated answer and the relevant passage, which measures their co-occurrence. Since generated answers and relevant passages are of variable lengths, we further normalize the overlap by the length of the corresponding relevant passage. Besides, we compute the F1 score between the generated answer and the gold answer to explore whether generation quality has an impact on retrieval effectiveness. Finally, we calculate the Pearson Correlation Coefficient (PCC) between each of these three generative evaluation metrics and the MRR score of each reformulated query.
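The three measures used in this analysis can be sketched as follows; token handling is simplified (whitespace tokenization, set membership), and the actual analysis may tokenize and count differently:

```python
import math

def token_overlap(answer, passage):
    """Co-occurrence of generated-answer tokens in the relevant passage,
    normalized by passage length (a simple proxy)."""
    a, p = answer.lower().split(), passage.lower().split()
    overlap = sum(1 for t in a if t in set(p))
    return overlap / len(p) if p else 0.0

def f1(pred, gold):
    """Token-level F1 between the generated and gold answers."""
    pred_t, gold_t = pred.lower().split(), gold.lower().split()
    common = sum(1 for t in pred_t if t in gold_t)
    if common == 0:
        return 0.0
    prec, rec = common / len(pred_t), common / len(gold_t)
    return 2 * prec * rec / (prec + rec)

def pearson(xs, ys):
    """Pearson Correlation Coefficient between two score lists,
    e.g. a generative metric vs. per-query MRR."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0
```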
The results are shown in Fig. 3. The relative PCC values reflect, to some extent, the helpfulness of the generated answers on different datasets. For example, the PCCs on QReCC and CAsT-20 are higher than those on TopiOCQA and CAsT-19, suggesting that the potential answers are more useful on the former datasets. This is consistent with our previous experimental observation that ConvGQR brings larger improvements on QReCC and CAsT-20 than on TopiOCQA and CAsT-19. Thus, the co-occurrence between the generated answer and the relevant passage is crucial for the retrieval effectiveness of ConvGQR.
The PCC of the generative score F1 is the highest among the three metrics, which indicates its strong correlation with retrieval effectiveness. However, utilizing generated answers as search queries could produce false positive results, as we will demonstrate in the subsequent analysis. As a result, it may not reflect the genuine correlation strength in comparison to the co-occurrence metric.
Effects of Different Generated Forms Table 4 shows the retrieval performance of three different forms of generated queries, i.e., the rewritten query, the generated answer, and their concatenation. Using the concatenation significantly outperforms either form alone, indicating that the two forms complement each other to achieve better retrieval performance, which again confirms our initial hypothesis. Besides, using the rewritten query alone performs better than using the generated answer, especially on TopiOCQA. The potential reason is the different forms of answers in the datasets: QReCC is more oriented toward factoid questions than TopiOCQA, and correct answers made of non-consecutive segments are very difficult for a PLM to generate directly, so the generated answers may be of less utility.

Impact of Knowledge Infusion Loss
We analyze the impact of two knowledge infusion loss functions that push the query representation toward that of the relevant passage: the contrastive learning (CL) loss and the mean squared error (MSE) loss, corresponding to Eq. 6 and Eq. 4, respectively. The difference between them is that the MSE loss only considers positive passages h_{p+}, while the CL loss also considers negative passages h_{p−} during training:

L_CL = − log [ exp(sim(h_S, h_{p+})) / (exp(sim(h_S, h_{p+})) + Σ_{p−} exp(sim(h_S, h_{p−}))) ]    (6)

We compare the conversational search results of the reformulated queries trained with these two loss functions on QReCC and report the results in Table 5. The reformulated queries trained with the CL loss are slightly worse than those trained with the MSE loss. In most previous literature (Karpukhin et al., 2020), the CL loss usually performs better for dense retrieval training, so this result was unexpected. The reason might be as follows: since ConvGQR performs mainly a generation task rather than a retrieval task, a positive passage provides a clear signal indicating the right direction for target generation, while the additional negative passages used in the CL loss only indicate wrong directions to avoid. Intuitively, the generation objective has only one correct optimization direction but many wrong ones in the high-dimensional latent space, which may make it difficult for the knowledge infusion mechanism to determine the correct direction to follow, resulting in sub-optimal queries. Note that, despite the above observation, ConvGQR trained with the CL loss still outperforms most existing baselines.
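The contrastive alternative can be sketched in a few lines; the `sim_*` values here stand for similarities (e.g., dot products) between h_S and passage representations:

```python
import math

def cl_loss(sim_pos, sim_negs):
    """Contrastive (InfoNCE-style) loss as in Eq. 6: the similarity to the
    positive passage is contrasted against similarities to negatives.
    With no negatives, only the positive term constrains the query."""
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)
```

Adding negatives strictly increases the loss for a fixed positive similarity, which is the extra "wrong directions" signal discussed above.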

Case Study
We finally show a case in Table 6 to help understand more intuitively the impact of expansion terms in ConvGQR. The model is expected to rewrite the query and generate the potential answer toward the human-rewritten query and the gold answer. Although the model produces the same rewritten query as the human, which resolves the anaphora of "goat" with the context, the expansion generated by ConvGQR with the knowledge of "Boer goat" still improves the performance of both dense and sparse retrieval. In this case, even though the generated answer is not a correct answer to the question, a strongly similar description (underlined) co-occurs with the right answer in the relevant passage. This example shows a typical case where the generated answer provides highly useful expansion terms. More cases are provided in Appendix B.

Conclusion
In this paper, we present a new conversational query reformulation framework, ConvGQR, which integrates query rewriting and query expansion toward generating more effective search queries through a new knowledge infusion mechanism. Extensive experimental results on four public datasets demonstrate the superior effectiveness of our model for conversational search. We also carried out detailed analyses to understand the effects of each component of ConvGQR on the performance improvements.

Limitations
Our work demonstrates the feasibility of combining query rewriting and query expansion to reformulate a conversational query for passage retrieval. Within our proposed ConvGQR, rewriting and expansion are based on two PLMs trained with different data, which introduces additional training cost and model parameters to store. Thus, designing an integrated model that simultaneously generates the query rewrite and the expansion terms would be a promising improvement to our method. Another limitation is that the potential answer acting as expansion terms could be generated from more resources (e.g., pseudo-relevance feedback and knowledge graphs) rather than relying only on generative PLMs. Besides, more alternative methods for knowledge infusion could be tested to connect query reformulation with the search task.

A More Detailed Experimental Setup
A.1 Datasets The statistics of each dataset are presented in Table 7 and the details are as follows: QReCC focuses on the query rewriting problem in conversational scenarios by approaching the human-rewritten query. It therefore provides an oracle query for each conversation turn, which we argue might not be the optimal one.
TopiOCQA focuses on the challenge of topic switches in conversational settings; its sessions are longer than those in QReCC and thus present more difficulties for query reformulation. Unlike QReCC, it does not provide human-rewritten queries.
CAsT-19 and CAsT-20 are two standard conversational search benchmarks provided in the TREC Conversational Assistance Track (CAsT). The gold answers to each query are the same as their relevant passages. The newer one (CAsT-20) is known to be more challenging.
ConvGQR The experiments are conducted on one Nvidia A100 40G GPU. For training the generative PLMs, we use the Adam optimizer with a 1e-5 learning rate and a batch size of 8. The loss balance weight α is set to 0.5, which performed best in our hyper-parameter selection experiments. For training ConvGQR on QReCC, we use the provided human-rewritten query q* and gold answer r* as the generation ground truth for the two PLMs. We discard samples without positive passages for both training and inference.