SSP: Self-Supervised Post-training for Conversational Search

Conversational search has been regarded as the next-generation search paradigm. Constrained by data scarcity, most existing methods distill a well-trained ad-hoc retriever into the conversational retriever. However, these methods, which usually initialize parameters via query reformulation to discover contextualized dependencies, have trouble understanding the dialogue structure information and struggle with contextual semantic vanishing. In this paper, we propose \fullmodel (\model), a new post-training paradigm with three self-supervised tasks that efficiently initializes the conversational search model to enhance dialogue structure and contextual semantic understanding. Furthermore, \model can be plugged into most existing conversational models to boost their performance. To verify the effectiveness of our proposed method, we apply the conversational encoder post-trained by \model to the conversational search task on two benchmark datasets: CAsT-19 and CAsT-20. Extensive experiments show that \model can boost the performance of several existing conversational search methods. Our source code is available at \url{https://github.com/morecry/SSP}.


Introduction
The past years have witnessed the fast progress of ad-hoc search (Dai and Callan, 2020; Dai et al., 2018; Fujiwara et al., 2013; Gao et al., 2019). However, when confronted with more complicated information needs, traditional ad-hoc search is less competent. Recently, researchers have proposed conversational search, which combines the search engine and the conversational assistant (Radlinski and Craswell, 2017; Zhang et al., 2018; Kiesel et al., 2021; Trippas et al., 2020; Tu et al., 2022). Different from the keyword-based query in ad-hoc search, the multi-turn natural language utterance is the main interactive form in conversational search. This yields a challenge for developing conversational search systems: existing ad-hoc retrievers and datasets cannot be directly used to derive the conversational query understanding module.
In the beginning, researchers reformulated a conversational query into a de-contextualized query, which is then used to perform ad-hoc retrieval (Lin et al., 2020b; Mele et al., 2021; Lin et al., 2021b). Recently, conversational dense retrieval models (Lin et al., 2021a; Mao et al., 2022) have been presented to directly encode the whole multi-turn conversational context as a vector representation and match it against candidate document representations. Since a real-world conversational search corpus is hard to collect, a warm-up step is additionally employed to initialize the conversational representation ability (Yu et al., 2021; Dai et al., 2022). These conversational dense retrieval methods have achieved significantly better performance than query reformulation methods and have been widely adopted in conversational search research (Yu et al., 2021; Dai et al., 2022). However, these warm-up methods just reuse the same training objective on a large dataset from other domains to initialize the parameters of the conversational encoder, which can hardly capture the structure information of the conversation that is essential for accurately understanding the user's search intent.
In this paper, we propose Self-Supervised Post-training (SSP) for the conversational search task, as shown in Figure 1. In SSP, we replace the commonly used warm-up step with a new post-training paradigm which contains three novel self-supervised tasks to learn how to capture the structure information and keep the contextual semantics. To be more specific, the first self-supervised task is topic segmentation, which learns to decompose the dialogue structure into several segments based on the topic. To tackle the coreference problem, which is ubiquitous in multi-turn conversation modeling, we propose the coreference identification task, which helps the model identify the most probable referred terms in the context and simplifies the intricate dialogue structure. Since understanding and remembering the semantic information in the conversational context is vital for conversational context modeling, we propose the word reconstruction task, which prevents contextual semantic vanishing. To demonstrate the effectiveness of SSP, we equip several existing conversational search methods with SSP and conduct experiments on two benchmark datasets: CAsT-19 (Dalton et al., 2020) and CAsT-20 (Dalton et al., 2021). Experimental results demonstrate that SSP outperforms all the strong baselines on both datasets. To sum up, our contributions can be summarized as follows:
• We propose a general and extensible post-training framework to better initialize the conversational context encoder in existing conversational search models.
• We propose three specific self-supervised tasks which help the model to capture the conversational structure information and prevent the contextual semantics from vanishing.
• Experiments show that our SSP can boost the performance of strong conversational search methods on two benchmark datasets and achieves state-of-the-art performance.

Related Work
Conversational search has become a hot research topic in recent years. The TREC Conversational Assistant Track (CAsT) competition (Dietz et al., 2017), which holds the benchmark, has largely promoted the progress of conversational search. In the beginning, researchers simply viewed conversational search as a query reformulation problem: if a context-dependent query can be rewritten into a de-contextualized query based on the historical queries, then a well-trained ad-hoc retriever can be used directly to obtain retrieval results. Transformer++ (Vakulenko et al., 2021) fine-tunes GPT-2 on the query reformulation dataset CANARD (Elgohary et al., 2019) to rewrite queries. QueryRewriter (Yu et al., 2020) exploits large amounts of ad-hoc search sessions to build a weakly supervised query reformulation data generator, and the automatically generated data is then used to fine-tune the language model. However, these methods underestimate the value of the context, which contains various latent search intentions and topic information.
After that, the conversational dense retriever was proposed. It directly encodes the full conversation, whose last query denotes the user's real search intention, into a dense representation. ConvDR (Yu et al., 2021) forces the contextual representation to mimic the reformulated query representation based on a teacher-student framework, which partially alleviates the conversational search data scarcity problem. Further, COTED (Mao et al., 2022) points out that not all queries in the context are useful and devises a curriculum denoising method to inhibit the influence of unnecessary contextual queries. These dense methods additionally perform a warm-up on another domain's dataset to initialize the parameters based on their own objectives. However, their warm-up ignores the conversation structure information, which is crucial for capturing the relationship between utterances and understanding the search intention of the user. In this respect, we devise a novel Self-Supervised Post-training (SSP) to replace the warm-up, as shown in Figure 2.

(Figure 2 schematic: warm-up simply initializes parameters and hardly captures the dialogue structure, while SSP explicitly models the dialogue structure.)

Problem formulation
We assume that there is a multi-turn search conversation Q = {q_1, q_2, ..., q_n}, where q_i = {x_{i,1}, x_{i,2}, ..., x_{i,l_i}} represents the i-th question in the conversation and x_{i,j} is the j-th token in q_i. The last query q_n expresses the user's real search intention. We insert the special token [CLS] at the beginning of the conversation and separate each query with [SEP] before feeding the sequence into the conversational encoder.

Self-Supervised Post-training
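As a minimal sketch of this input format (the helper name and whitespace tokenization are illustrative; the actual implementation uses a subword tokenizer):

```python
def build_input(queries):
    """Concatenate a multi-turn conversation into one encoder input.

    [CLS] starts the sequence; [SEP] closes each query so that
    per-query representations can later be read off the [SEP] positions.
    """
    tokens = ["[CLS]"]
    for q in queries:
        tokens.extend(q.split())  # illustrative whitespace tokenization
        tokens.append("[SEP]")
    return tokens

conv = ["What is throat cancer?", "Is it treatable?"]
print(build_input(conv))
```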

Overview
In this section, we propose our Self-Supervised Post-training, abbreviated as SSP. An overview of SSP is shown in Figure 3; it consists of three self-supervised tasks:
• Topic Segmentation Task aims to find the topic-shifting points in the utterances. It helps the model capture the topic structure in the conversational context.
• Coreference Identification Task aims to identify the coreference structure between two referring utterances, which helps the conversational encoder understand the coreference relationship and produce a better query representation.
• Word Reconstruction Task aims to reconstruct the bag-of-words (BOW) vector of the conversational context from the conversational vector representation. It helps the model avoid contextual semantic vanishing during conversation encoding.
After jointly training the conversational encoder with these three self-supervised tasks, we fine-tune the encoder on the conversational search downstream task using the existing conversational search methods.

Topic Segmentation Task
When a user interacts with a conversational search system, the focused topic may vary from time to time. Taking the example in Figure 1, the search intention of the user changes according to the retrieval results of previous turns, which causes the topic of the conversation to shift. Since the conversation topic may shift at every utterance, to fully understand a user query, the conversational system should know the current topic of this query and view the utterances of the current topic as the more salient context. If the conversational encoder cannot identify the boundary of the current topic, it may focus on unrelated utterances and incorporate noisy information into the query representation.
Thus we propose the topic segmentation task to identify the topic boundary of the conversation, which helps the model focus on the more related context when encoding the query. We first randomly sample a noise conversational session with several utterances from the training corpus and then concatenate this sampled noise session at the beginning of the raw conversational context. Given the raw search conversation Q = {q_1, q_2, ..., q_n} and the noisy conversation Q' = {q'_1, q'_2, ..., q'_k}, where k is sampled from a reciprocal probability distribution p (which avoids distorting the raw context with overly long noisy sessions), we concatenate the sampled noise session before the raw context and separate each query by [SEP]. This yields the perturbed conversation Q̃ = {q'_1, ..., q'_k, q_1, ..., q_n} and the ground-truth topic labels y^t = {1, ..., 1, 0, ..., 0}, where the queries from the external conversation are labelled 1 and the ones from the raw conversation are labelled 0.
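The construction of the perturbed conversation can be sketched as follows (a simplified rendering; the function names are illustrative, not taken from the released code):

```python
import random

def reciprocal_length_probs(max_len):
    # p(k) proportional to 1/k: short noise sessions are favoured so the
    # injected prefix does not drown out the raw context.
    weights = [1.0 / k for k in range(1, max_len + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def perturb_conversation(raw_queries, noise_queries, rng=random):
    """Prefix k noise queries (k sampled from the reciprocal distribution)
    to the raw conversation and emit per-query labels: 1 = noise, 0 = raw."""
    probs = reciprocal_length_probs(len(noise_queries))
    k = rng.choices(range(1, len(noise_queries) + 1), weights=probs)[0]
    perturbed = noise_queries[:k] + raw_queries
    labels = [1] * k + [0] * len(raw_queries)
    return perturbed, labels
```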
Next, we use the perturbed conversation Q̃ as input to the conversational encoder and obtain a vector representation E_[SEP]_i for each query, which is sent to the topic predictor (a linear layer) to decide whether the utterance comes from the sampled noise conversation Q' or not. The binary cross-entropy is used to compute the topic segmentation loss L_TS:

ŷ^t_i = σ(E_[SEP]_i W_t + b_t),   L_TS = −Σ_i [ y^t_i log ŷ^t_i + (1 − y^t_i) log(1 − ŷ^t_i) ],

where W_t ∈ R^{h×1} and b_t ∈ R are trainable parameters and h is the hidden size of the model.
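Numerically, the binary cross-entropy over the per-query predictions can be sketched in plain Python (the linear layer producing the logits is omitted; this is an illustration, not the actual implementation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def topic_segmentation_loss(sep_logits, labels):
    """Binary cross-entropy over per-query logits.

    sep_logits[i] is the scalar output of the topic predictor applied to
    the i-th [SEP] representation; labels[i] is 1 if query i comes from
    the sampled noise session, else 0.
    """
    loss = 0.0
    for z, y in zip(sep_logits, labels):
        p = sigmoid(z)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return loss / len(labels)
```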

Coreference Identification Task
In conversational search, a common problem is coreference: a pronoun in a query usually refers to a term in its previous queries. Most existing methods do not explicitly train the model to tackle this problem. Here, we devise an auxiliary self-supervised task that trains the model to predict which utterance the last utterance refers to through the coreference relationship. To determine which utterance in the conversational context has a coreference relationship with the last utterance, we use the query reformulation corpus. We compare the last query q_n in Q with the reformulated query q*_n by set operations to find the reformulation terms r that have been omitted in Q:

r = S(q*_n) \ S(q_n),

where S is a set operation that converts a sentence into a non-repeating word set; r is obtained as the difference between the two sets. Then r is used to locate the referred query from back to front until the first query containing r is found. We mark the position of the referred query with the label y^c = {0, 0, ..., 1, ..., 0}, whose i-th value is 1 only if the i-th query is the referred query. Similar to the topic segmentation task (introduced in § 4.2), we send E_[SEP] into a coreference predictor to predict the referred query and use the binary cross-entropy as the loss function of this task:

ŷ^c_i = σ(E_[SEP]_i W_r + b_r),   L_CI = −Σ_i [ y^c_i log ŷ^c_i + (1 − y^c_i) log(1 − ŷ^c_i) ],

where W_r ∈ R^{h×1} and b_r ∈ R are trainable parameters. With the coreference identification task, the conversational encoder pays more attention to the most probable referred query in the context when understanding the last query.
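The term extraction and back-to-front search can be sketched as follows (a hypothetical rendering: word-level set operations and matching on any reformulation term are simplifying assumptions):

```python
def reformulation_terms(raw_query, rewritten_query):
    # r = S(q*_n) \ S(q_n): words the human rewrite adds to resolve coreference.
    S = lambda s: set(s.lower().replace("?", "").split())
    return S(rewritten_query) - S(raw_query)

def locate_referred_query(context_queries, terms):
    """Scan the context back to front for the first query containing a
    reformulation term; return a one-hot label over context positions."""
    label = [0] * len(context_queries)
    for i in range(len(context_queries) - 1, -1, -1):
        words = set(context_queries[i].lower().replace("?", "").split())
        if terms & words:  # simplifying assumption: any shared term counts
            label[i] = 1
            break
    return label

ctx = ["What is throat cancer?", "What is the first sign of it?"]
r = reformulation_terms("Is it treatable?", "Is throat cancer treatable?")
print(locate_referred_query(ctx, r))  # → [1, 0]
```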

Word Reconstruction Task
A one-stage conversational retriever encodes the whole conversation into a single dense vector. In the previous sections, we used self-supervised tasks to focus on the utterances of the current topic and the highly related utterance with coreference. However, other utterances may also provide useful information for understanding the current search intent. Thus, the conversational encoder should not only gather information from the related utterances but also keep the information from the whole conversational context.
To avoid information vanishing in the final conversational vector representation, we propose a simple but efficient reconstruction task to help the conversational encoder keep the overall semantic information. In this task, we train the model to reconstruct the bag-of-words (BOW) vector of the whole conversation from the representation of [CLS] produced by the conversational encoder. Specifically, all of the words appearing in the context are converted to a BOW vector y^w, whose length is the vocabulary size, with y^w_i = 1 only if the i-th word in the vocabulary appears in the context and y^w_i = 0 otherwise. We use a linear layer after the last layer of the model to process E_[CLS] and optimize the word reconstruction loss L_WR based on the mean squared error between the predicted vector and y^w.
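The BOW target and the mean-squared-error objective can be sketched as follows (the linear layer on E_[CLS] that produces the prediction is omitted; names are illustrative):

```python
def bow_vector(context_words, vocab):
    # y^w_i = 1 iff the i-th vocab word appears anywhere in the context.
    present = set(context_words)
    return [1.0 if w in present else 0.0 for w in vocab]

def word_reconstruction_loss(predicted, target):
    # Mean squared error between the predicted BOW vector (a linear layer
    # applied to E_[CLS] in the real model) and the target BOW vector.
    assert len(predicted) == len(target)
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
```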

Optimization
Inspired by previous studies (Yu et al., 2021; Mao et al., 2022), we also employ a knowledge distillation objective in SSP to accelerate the learning process. Specifically, a pre-trained ad-hoc search encoder TEnc takes the de-contextualized query as input and produces its vector representation. We use TEnc as the teacher model and employ a knowledge distillation loss to train our conversational encoder to mimic the vector representation produced by the teacher encoder TEnc. We formulate the knowledge distillation loss L_KD as follows:

L_KD = MSE(E_[CLS], E*_[CLS]),   E*_[CLS] = TEnc(q*_n)_[CLS],

where q*_n is the manually rewritten query of q_n and (·)_[CLS] means taking only the [CLS] representation of TEnc's last-layer output. We make the representation of the conversation E_[CLS] approximate the representation E*_[CLS] of the reformulated query processed by TEnc to distill its powerful retrieval ability.
Finally, we combine the training objectives of all self-supervised tasks and optimize all the parameters in the conversational encoder:

L_final = L_KD + α L_TS + β L_CI + γ L_WR,

where L_final is the final training objective of SSP, and α, β, and γ are hyper-parameters that trade off the self-supervised tasks. For fine-tuning the conversational encoder on the conversational search task, we choose two few-shot datasets and evaluate our proposed model based on K-fold cross-validation.
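As a trivial sketch of the combined objective (the default weights below are the CAsT-19 settings reported in the implementation details; the exact assignment of α, β, γ to the three tasks is our assumption based on the order they are introduced):

```python
def final_loss(l_kd, l_ts, l_ci, l_wr, alpha=1e-2, beta=1e-3, gamma=1e-2):
    """L_final = L_KD + alpha * L_TS + beta * L_CI + gamma * L_WR."""
    return l_kd + alpha * l_ts + beta * l_ci + gamma * l_wr
```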

Datasets
CAsT-19 (Dalton et al., 2020) is the TREC Conversational Assistance Track (CAsT) 2019 benchmark dataset. It is built by human annotators who are required to mimic real dialogues under specified topics, and it contains frequent coreferences, abbreviations, and omissions. In this work we focus on query de-contextualization, but only the test set provides manual oracle de-contextualized queries.
CAsT-20 (Dalton et al., 2021) refers to the TREC CAsT 2020 benchmark. Its most obvious modification compared with CAsT-19, where a query only refers to its previous queries, is that a coreference can also point to the response (a summarized answer of the gold passage). Both manual responses and automatic responses (generated by a neural rewriter (Yu et al., 2020)) are provided in CAsT-20. It contains 216 queries in 25 dialogues with de-contextualized queries, and most queries have relevance judgments. Additionally, CAsT-20 shares its corpus with CAsT-19. Detailed statistics are shown in Table 1.

Baselines
Following Mao et al. (2022), we split the baselines into two categories: sparse retrieval methods and dense retrieval methods. Sparse retrieval methods rewrite the contextualized query into a context-independent query and use an ad-hoc sparse retriever to obtain the results. Dense retrieval methods use an ad-hoc dense retriever or directly encode the conversational queries via a conversational dense retriever.
• Raw denotes simply using the last query as-is in the dense or sparse retriever to retrieve the documents.
• QueryRewriter (Yu et al., 2020) is a data augmentation method that first generates query reformulation data from large amounts of ad-hoc search sessions based on rules and self-supervised learning. The automatically generated data is then used to train the query rewriter.
• QuReTeC (Voskarides et al., 2020) treats the query reformulation task as a binary term classification problem. It decides whether or not to add terms appearing in the dialogue history to the current-turn query.
• ContQE (Lin et al., 2021a) employs the well-trained ad-hoc search encoder TCT-ColBERT (Lin et al., 2020a). It uses mean pooling to get the contextual embedding and fine-tunes on pseudo-relevance labels.
• ConvDR (Yu et al., 2021) develops a few-shot learning method to train the conversational dense retriever. It takes ANCE (Xiong et al., 2020) as the teacher model to teach the conversational student model. Integrating the distillation loss and the ranking loss, it achieves strong performance on few-shot datasets.
• COTED (Mao et al., 2022) further introduces curriculum denoising to inhibit the unhelpful turns in the context. An additional two-step multi-task learning scheme improves on the performance of ConvDR.
• T5 (WikiD+WebD) (Dai et al., 2022) trains on two large automatically generated conversational search datasets, WikiDialog (11.4M dialogues) and WebDialog (8.4M dialogues), from a T5-large encoder checkpoint. It is further warmed up on the QReCC dataset. Though it is not fine-tuned on CAsT-19 and CAsT-20, its extremely time-consuming training procedure brings its performance up to a competitive level.

Evaluation Metrics
Following previous work on conversational search, we evaluate all models based on Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain @3 (NDCG@3). MRR takes the reciprocal rank of the first positive sample as a query's score and averages over all queries. It is a simple yet effective metric for ranking tasks. NDCG@3 weighs positive samples by their graded relevance and normalizes the discounted gains of the top 3 retrieved samples against the ideal ordering. The statistical significance of two runs is tested using a two-tailed paired t-test and is denoted by † and ‡ for significance (p ≤ 0.05) and strong significance (p ≤ 0.01), respectively.
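The two metrics can be sketched as follows (one common NDCG formulation with exponential gain; the official TREC evaluation scripts are authoritative and may differ in details):

```python
import math

def mrr(ranked_relevances):
    # Reciprocal rank of the first positive document, averaged over queries.
    total = 0.0
    for rels in ranked_relevances:
        for rank, rel in enumerate(rels, start=1):
            if rel > 0:
                total += 1.0 / rank
                break
    return total / len(ranked_relevances)

def ndcg_at_k(rels, ideal_rels, k=3):
    # Graded gains (2^rel - 1) discounted by log2(rank + 1),
    # normalized by the DCG of the ideal ordering.
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rs[:k]))
    ideal = dcg(sorted(ideal_rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0
```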

Implementation Details
Most settings in this work are similar to ConvDR (Yu et al., 2021). We employ the ad-hoc retriever ANCE (Xiong et al., 2020) as the teacher module to calculate the knowledge distillation loss. Following previous conversational search work, for CAsT-19 we concatenate the historical queries and the current query as the model input, and for CAsT-20 we additionally take the historical responses into account. The leading words in the conversational context are truncated if the concatenated length exceeds the maximum length, which is 256 for CAsT-19 and 512 for CAsT-20. We implement the experiments using the PyTorch and Transformers libraries on an NVIDIA A40 GPU. The Adam optimizer is employed with a learning rate of 2e-5 and a batch size of 64 for CAsT-19 and 32 for CAsT-20. Our model is post-trained for 2 epochs and then fine-tuned on the conversational search corpus. The self-supervised task weights α, β, and γ are set to 1e-2, 1e-3, 1e-2 for CAsT-19 and 1e-1, 2e-3, 2e-2 for CAsT-20. We use faiss (Johnson et al., 2019) to index the passages, whose representations are generated by ANCE and kept fixed. Following the official TREC Conversational Assistance evaluation setting, we treat relevance scale ≥ 2 as positive for CAsT-19 and relevance scale ≥ 1 as positive for CAsT-20, and we obtain our results with the official evaluation scripts.

Evaluation Results

We compare our model with all baselines in Table 2. We find that the sparse methods generally achieve less satisfying performance than the dense conversational methods, which demonstrates that the dense methods can better understand the search intent of users. Our model consistently outperforms the other sparse and dense conversational search models on both datasets, with improvements over COTED of 1.4% and 0.4% on CAsT-19 and 7.1% and 6.7% on CAsT-20 in terms of MRR and NDCG@3 respectively. This demonstrates that our proposed self-supervised tasks provide a more useful training signal for the conversational encoder module than the simple parameter warm-up used in previous methods.
In Table 2, we also find that ContQE outperforms ConvDR-SSP on CAsT-19 in terms of NDCG@3. A possible reason, as illustrated by Mao et al. (2022), is that ContQE introduces a stronger query encoder, TCT-ColBERT (Lin et al., 2020a), and takes a multi-stage method to train its conversational encoder. In contrast to the complexity of the multi-stage method, our SSP can boost the performance of existing conversational search models in an end-to-end manner, which is easier to train and deploy in real-world applications. We leave adapting the stronger encoder TCT-ColBERT to our post-training paradigm for future work.
To verify the generalization ability of SSP, we equip two strong conversational search methods, ConvDR and COTED, with our proposed Self-Supervised Post-training.

Ablation Study
We remove each self-supervised task to analyze its effectiveness; TS is the acronym for topic segmentation, CI denotes coreference identification, and WR denotes word reconstruction. The performance of the ablation models is shown in Table 3. We find that all of the ablation models perform worse than the full model ConvDR-SSP, which demonstrates the contribution of each self-supervised task in SSP.
We ablate the topic segmentation task in ConvDR-SSP w/o TS and observe a decline in search performance. The topic segmentation task helps the model identify the topic boundaries in a long session and pay more attention to the utterances of related topics, which raises the retrieval performance by 3.6% and 2.5% in terms of MRR on the CAsT-19 and CAsT-20 datasets respectively. In ConvDR-SSP w/o CI, we remove the coreference identification self-supervised task and the performance drops dramatically, which demonstrates that it plays the most important role in SSP: ConvDR-SSP achieves 4.1% and 1.7% increments over ConvDR-SSP w/o CI in terms of MRR on CAsT-19 and CAsT-20. We also remove the word reconstruction task, yielding ConvDR-SSP w/o WR, and the dropped score shows that keeping the contextual semantics in the context representation is effective. All of our self-supervised tasks, which provide extra supervision signals to understand the dialogue structure and prevent semantic vanishing, help ConvDR-SSP achieve the best performance according to the experimental results.

Robustness of Topic Segmentation
To verify the effectiveness of the topic segmentation in our method, we conduct an experiment that concatenates different numbers of randomly sampled utterances to the beginning of the current conversation session. In this experiment, we use ConvDR as our baseline. Figure 4 shows the search performance of our SSP and ConvDR with different numbers of randomly sampled noise utterances as input.
From Figure 4, we find that our SSP is more robust to the concatenation of randomly sampled utterances. As more random utterances are concatenated, the performance of ConvDR drops dramatically, while ConvDR-SSP only drops slightly at the beginning and then stays stable. The reason for this phenomenon is that our model can identify the topic segmentation boundary and reduce the impact of unrelated utterances when encoding the current conversational query. This demonstrates that topic segmentation helps the model focus on the utterances of relevant topics.

Case Study
We show three cases in Table 4 to intuitively illustrate how the self-supervised tasks of SSP improve the performance of existing conversational search methods.
In the first case, ConvDR, which treats every historical query equally, struggles with the long dialogue history and retrieves an irrelevant passage. After incorporating SSP, topic segmentation lets ConvDR-SSP split out the most related utterances in the conversational history. With the help of modeling the topic boundary, it easily discovers that "throat cancer" is the referred term for the current query.
In the second case, due to the complex historical queries, ConvDR is confused about whether the "ones" in the last query means "database" or "real-time database" and retrieves an unrelated passage. Our proposed coreference identification task lets ConvDR-SSP bypass these obstructions and directly point out the referred query, and ConvDR-SSP successfully finds the accurate result.
Contextual semantic vanishing harms performance because incomplete contextual semantics cannot accurately represent the search intent. In the last case, it makes ConvDR misunderstand the meaning of "avoid" in the current query as "recover", so its retrieved passage mainly illustrates "how to recover from sports injuries". The word reconstruction task demonstrates its effectiveness by keeping the semantic information of "avoid", which is indispensable during representation learning. The complete contextual semantics lead ConvDR-SSP to a more accurate retrieval.

Hyper-parameter Analysis

In this section, we analyze how much the hyper-parameters α, β, and γ influence the retrieval performance and explore their best setting. We design five groups of experiments for each parameter and each dataset; the performance comparison is shown in Figure 5. We find that the performance of ConvDR-SSP only slightly drops when the parameters change, which demonstrates the hyper-parameter robustness of SSP. Finally, we determine the best setting of α, β, and γ to be 1e-2, 1e-3, 1e-2 for CAsT-19 and 1e-1, 2e-3, 2e-2 for CAsT-20.

Conclusion
In this work, we propose a novel Self-Supervised Post-training framework, SSP, for conversational search, which can easily be applied to existing methods to boost their performance. Different from the conventional warm-up method, our proposed SSP introduces three self-supervised tasks to capture the dialogue structure information and prevent the contextual semantics from vanishing. Experiments on two benchmark datasets, CAsT-19 and CAsT-20, demonstrate the effectiveness of SSP.

A Post-training Dataset
Following the existing conversational dense retrieval methods, we also use a query reformulation dataset for our proposed SSP model. QReCC (Anantha et al., 2021) is a query rewriting dataset which contains 14K conversations. The queries in QReCC are collected from three sources: TREC CAsT (Dalton et al., 2020), QuAC (Choi et al., 2018), and NQ (Kwiatkowski et al., 2019); the queries in NQ were used as prompts to create conversational queries.
We notice that the queries in the TREC CAsT datasets are used in the conversational search fine-tuning phase, which would cause a data leaking problem. For a fair comparison, we filter the TREC CAsT queries out of QReCC. The statistics of the filtered QReCC dataset are shown in Table 5.

Figure 1 :
Figure 1: Example of modeling the conversational structure in conversational search. The model should capture that the topic has shifted at the 3rd utterance and that the last utterance has a coreference with a previous utterance. This information can help the model accurately understand the search intent of users.

Figure 2 :
Figure 2: The comparison between the training procedure of conversational search with warm-up and the SSP paradigm.

Figure 3 :
Figure 3: Overview of SSP. It consists of three self-supervised tasks for post-training the conversational encoder: (1) Topic Segmentation predicts which utterances are randomly sampled perturbation utterances from other conversation sessions; (2) Coreference Identification predicts which utterance in the conversational context is related to the last utterance; (3) Word Reconstruction uses the conversational context vector representation to reconstruct the bag-of-words vector of the conversational context.

Figure 4 :
Figure 4: Robustness evaluation by adding different numbers of off-topic utterances. We randomly sample irrelevant utterances from other search sessions and evaluate ConvDR and ConvDR-SSP.

Table 1 :
The statistics of the test datasets for fine-tuning.

Table 3 :
Comparison between ablation models.