Logic Unveils Truth, While Disguise Obscures It: Transition Logic Augmented Response Selection for Multi-Turn Dialogue



Introduction
Recently, retrieval-based dialogue has drawn rising interest from the NLP community (Liu et al., 2022; Lee et al., 2022; Tao et al., 2023; Feng et al., 2023), since it is a promising way towards intelligent human-machine dialogue. Moreover, the technologies behind it show great potential in various applications such as task-oriented dialogue assistants (Shu et al., 2022), conversational recommendation (Li et al., 2018), and the recently released interactive large language models (OpenAI, 2022). In this study, we focus on the core of a retrieval dialogue system, i.e., the multi-turn response selection task, which aims to retrieve the best response (the golden response) from a pre-defined candidate pool given a dialogue context.
To improve the discriminating power of a retrieval dialogue system, the key is the construction of the candidate pool, or equivalently the selection of negative examples. Since the trivial solution, i.e., randomly sampling utterances from the entire training set, results in overly simple and uninformative negatives (Li et al., 2019), a large body of previous work (Lin et al., 2020; Su et al., 2020; Penha and Hauff, 2020) discusses how to excavate hard negatives that are lexically or semantically similar to the golden response and therefore hard to differentiate. However, due to the inherent one-to-many property of open-domain dialogue, hard negatives are sometimes in fact positives in disguise, or false negatives (Gupta et al., 2021; Lee et al., 2022), and are detrimental to the convergence of the retrieval model (Xiong et al., 2020; Zhou et al., 2022).
To verify this point, we perform a pilot study (Section 2) and find that previous negative sampling methods do yield a portion of false negatives, and that the ratio is higher than that of random sampling. This is easy to understand, since previous methods tend to overlook and thus have little control over the false negative issue, except that Gupta et al. (2021) try to filter out false negatives in a heuristic manner. Nevertheless, heuristically filtering out negatives that are similar to the golden response is far from enough, and mitigating the false negative issue is a non-trivial problem: a major challenge is that a randomly sampled utterance could also be more or less appropriate, although it may differ from the golden response in many aspects. Owing to the one-to-many nature of dialogue (Zhao and Kawahara, 2021; Towle and Zhou, 2022), there usually exists more than one possible dialogue flow, with each flow reflecting different transitions in dialogue topic, user emotion and many other characteristics. To recognize and mitigate false negatives, we need to capture the diverse potential transition logic of multiple characteristics in open-domain dialogue, which is difficult to acquire (Xu et al., 2021).
Another challenge lies in the balance between excavating hard negatives and removing false negatives (Cai et al., 2022; Yang et al., 2022). Specifically, we may largely avoid false negatives by always selecting naive negative examples that are obviously inappropriate, which is however uninformative and useless. In other words, the dividing line between false negatives and hard negatives is bound to change dynamically during training according to the retrieval model's capacity. Though some works employ curriculum learning (Su et al., 2020; Penha and Hauff, 2020), their adjustment of negative sampling is performed in an empirical way, independent of the model capacity.
In this research, to cope with the first challenge, we propose to decompose the characteristics of a multi-turn conversation into multiple dimensions and represent each with a latent label. To achieve this, we design a sequential variational ladder auto-encoder (SVLAE) to model the transition logic of multiple characteristics and disentangle them from each other. For the second challenge, we update the negative sampling dynamically during training, in pace with the optimization of the retrieval model. Specifically, we propose the TRIGGER (TRansItion loGic auGmentEd Retrieval) framework, which consists of a T-step and an R-step and optimizes the negative sampling process and the retrieval model iteratively.
To summarize, our contributions are three-fold: (1) We devise a sequential variational ladder auto-encoder to model the multiple orthogonal characteristics in a compositional and disentangled way.
(2) We propose a TRIGGER framework that combines the updating of negative sampling with the optimization of a retrieval model, such that the criterion for negative sampling changes dynamically in pace with the capacity of the retrieval model.
(3) Extensive experiments on two benchmarks verify that, when combined with existing retrieval models, our method improves them by a large margin and achieves a new state of the art.

Pilot Study On False Negatives
In this pilot study, we conduct a human evaluation to investigate false negatives hidden in the candidate pool, where the negative samples in the pool are constructed either by random sampling or by previous hard negative mining methods.
The experiment results are shown in Table 1. We can see that compared with random sampling, HCL (Su et al., 2020), CIR (Penha and Hauff, 2020), Gray (Lin et al., 2020) and Semi (Li et al., 2019) yield substantially more false negatives. Although Mask-and-fill (Gupta et al., 2021) partially solves the problem thanks to its semantic limitation mechanism, it still has a higher mislabel ratio than random sampling.

Related Work
The task of multi-turn response selection aims at selecting a response that matches the human input from a large candidate pool (Lowe et al., 2015; Yan et al., 2016; Zhou et al., 2016; Wu et al., 2017; Zhou et al., 2018; Tao et al., 2019a; Jia et al., 2020). Closely related to conversational recommendation (Li et al., 2018) and interactive large language models (OpenAI, 2022), it has extensive applications in the commercial area. With the recent huge success of PLMs (Devlin et al., 2018; Liu et al., 2019), post-training PLMs with diverse self-supervised tasks has become a popular trend and achieves impressive performance (Xu et al., 2020; Gu et al., 2020; Whang et al., 2021; Han et al., 2021; Fu et al., 2023).
Apart from designing new architectures or new self-supervision tasks, another branch of work puts emphasis on negative sampling. Since the quality of negative candidates has a great influence on the retrieval model, a large body of work curates hard negative candidates by searching within the corpus (Su et al., 2020), synthesizing them with a language model (Gupta et al., 2021), or a combination of both (Lin et al., 2020). However, most previous methods pay little attention to preventing false negatives or updating the negative sampling to adapt to the optimization of the retrieval model, despite the semantic limitation mechanism in Gupta et al. (2021) and the curriculum learning in Penha and Hauff (2020) and Su et al. (2020). A comparison of our method against previous ones is shown in Table 2.

Methodology
Problem Formulation. Given a dialogue context c = (u_1, u_2, ..., u_N), with u_i denoting the i-th utterance and N the number of turns, the objective of a retrieval model D(⋅ | c, R) is to find the golden response r+ from a candidate pool R = {r+, r−_1, r−_2, ..., r−_n}, where r−_1, ..., r−_n are n negative samples. To enable a retrieval model to differentiate among multiple candidates and pick out r+, the core is the meticulous construction of the candidate pool R.
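Concretely, the selection step can be sketched as follows; the lexical-overlap scorer here is a toy stand-in for a trained retrieval model D (a hypothetical placeholder, not the paper's actual PLM-based scorer):

```python
# Toy sketch of the response selection task: a stand-in model D scores each
# candidate in the pool R given the context c, and we pick the argmax.

def score(context_turns, candidate):
    """D(candidate | c): fraction of candidate tokens appearing in the context."""
    context_vocab = set(" ".join(context_turns).lower().split())
    tokens = candidate.lower().split()
    if not tokens:
        return 0.0
    return sum(tok in context_vocab for tok in tokens) / len(tokens)

def select_response(context_turns, pool):
    """Return the candidate with the highest score under D."""
    return max(pool, key=lambda r: score(context_turns, r))
```

A trained model replaces `score`, but the argmax-over-pool structure is the same.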
Overview. The proposed approach has two stages: (1) transition logic estimation and (2) dynamic negative updating. In the first stage, we train a transition model to capture the transition logic of multi-level characteristics in a conversation. The transition model detects characteristics in multiple facets to decide whether a candidate utterance is a potential false negative. However, as the model capacity grows, the criteria for potential false negatives should change accordingly. So in the second stage, we introduce a policy network to determine the negative sampling criteria regarding multi-facet characteristics according to feedback from the retrieval model. In this way, the negative sampling process paces with the evolution of the retrieval model. At test time, the transition model and the policy network are discarded, so our method incurs no extra latency.

Transition Logic Estimation
To represent and learn the transition logic, we describe each utterance in a dialogue as generated by both a discrete latent label y and a continuous latent feature z following Kingma and Welling (2013).
Because the transition logic is influenced by multiple (orthogonal) factors, including but not limited to the dialogue topic, dialogue acts, or the speaker's emotion, it would be unwieldy to enumerate every combination of these factors with a single y, which would cause the hypothesis space of y to scale exponentially with the number of factors considered.

Generation. Specifically, we use L latent labels y^{1:L} and latent features z^{1:L} to describe the characteristics of an utterance in L different facets (we use the superscript to denote the facet and the subscript to denote the utterance turn in the rest of the paper). In our probabilistic framework, to generate a multi-turn dialogue with N utterances, we first sample N latent labels for every facet l ∈ {1, 2, ..., L}, and then sample the corresponding z^l from a mixture of Gaussians:

y^l_i ∼ p_θ(y^l_i | y^l_{<i}),   z^l_i ∼ N(μ_θ(y^l_i), Σ_θ(y^l_i)).

With all L latent features z^{1:L}_i, the corresponding i-th utterance u_i is generated by:

u_i ∼ ∏_t d_θ(w_t | w_{<t}, z^{1:L}_i).

More details about the neural parameterization can be found in the appendix.
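The generative process above can be sketched as follows; the transition matrices, means and variances are random stand-ins for the learned p_θ, μ_θ and Σ_θ, and z is a scalar here for brevity (the paper uses vectors):

```python
import random

def sample_dialogue_latents(num_turns, num_facets, K, transition, mu, sigma):
    """Sample, for each facet l, a Markov chain of labels y^l_{1..N} and a
    Gaussian feature z^l_i conditioned on each label."""
    labels, features = [], []
    for l in range(num_facets):
        y_seq, z_seq = [], []
        y = random.randrange(K)  # p(y^l_1): uniform over K values
        for _ in range(num_turns):
            y_seq.append(y)
            # z^l_i ~ N(mu_theta(y^l_i), Sigma_theta(y^l_i))
            z_seq.append(random.gauss(mu[l][y], sigma[l][y]))
            # y^l_{i+1} ~ p_theta(. | y^l_i)
            y = random.choices(range(K), weights=transition[l][y])[0]
        labels.append(y_seq)
        features.append(z_seq)
    return labels, features
```

Each utterance u_i would then be decoded from the stacked features z^{1:L}_i.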
Inference. To recognize the latent labels y^{1:L} and latent continuous variables z^{1:L} for an utterance u (we omit the turn subscript in this paragraph to avoid notation clutter), we propose to factorize the variational posterior as

q(z^{1:L}, y^{1:L} | u) = ∏_{l=1}^{L} q_φ(y^l | u) q_φ(z^l | u).    (3)

In this way, the inference for each facet is conducted independently, encouraging the capture of multi-facet disentangled characteristics.
In implementation, q_φ(y^l | u) is parameterized by a 1-layer GRU and a multi-layer perceptron. Regarding the inference of z^{1:L}, we draw inspiration from Tenney et al. (2019) and Niu et al. (2022). These studies find that the representations after lower BERT layers usually encode basic syntactic information, while higher layers usually encode high-level semantic information. In light of this, we prepend a special [CLS] token to each utterance u and employ the encoding of the [CLS] token after different layers to discover multiple diverse features about u, with the feature z^l of the l-th facet derived from the [CLS] representation of the corresponding layer.

Optimization. For optimizing the SVLAE, we exploit the evidence lower bound objective (ELBO) to jointly optimize the generation parameters θ and inference parameters ϕ:

ELBO = E_q[log p_θ(u | z^{1:L})] − Σ_l KL(q_φ(z^l | u) ∥ p_θ(z^l | y^l)) − Σ_l KL(q_φ(y^l | u) ∥ p_θ(y^l | y^l_{<i})),    (5)

where KL denotes the Kullback-Leibler divergence. After training, the transition model is fixed and used to infer the latent labels of all utterances in the training corpus.

Dynamic Negative Updating
We assume that utterances sharing more latent labels with r+ are more likely to be false negatives and thus should be excluded from R at the beginning. But as training proceeds, the retrieval model gradually acquires the ability to discern subtle differences in multiple facets, so the exclusion criterion should be updated as well. To achieve this, we develop the TRIGGER framework, which updates the negative sampling in pace with the retrieval model in two iterative steps:

T-step. At the T-step, we introduce a policy network to predict the characteristics of the utterances that are most suitable as negative samples given the current model capacity. The policy network then receives a reward from the retrieval model to update its parameters. Specifically, the policy network P takes the predicted latent labels of the golden response, p_θ(y^{1:L}_{N+1}), as input and predicts the latent label distribution of suitably difficult negatives, denoted as π(y^{1:L}). Then we sample latent labels ỹ^{1:L} from π(y^{1:L}). Intuitively, the reward for the policy network, D(r− | c, R), is the probability that the retrieval model regards the sampled r− as a better candidate than r+. In implementation, the policy network is a lightweight bi-directional transformer.
More parameterization details can be found in the appendix.
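Under the assumption that the policy is trained with a REINFORCE-style update (the paper does not spell out the exact estimator), a one-step sketch for a single facet looks like this; the softmax policy and learning rate are illustrative:

```python
import math, random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reinforce_step(logits, reward, lr=0.5):
    """One REINFORCE update for a categorical policy pi over K latent labels.
    Sample a label ỹ, observe reward D(r^- | c, R) from the retrieval model,
    and move log pi(ỹ) up in proportion to the reward."""
    probs = softmax(logits)
    action = random.choices(range(len(logits)), weights=probs)[0]
    # gradient of log pi(action) w.r.t. logits: one-hot(action) - probs
    new_logits = [
        logit + lr * reward * ((1.0 if k == action else 0.0) - probs[k])
        for k, logit in enumerate(logits)
    ]
    return action, new_logits
```

With a positive reward, the probability of the sampled label increases, steering the policy toward labels whose negatives currently fool the retrieval model.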
R-step. At the R-step, the retrieval model D is trained to discriminate the golden response r+ from the negatives excavated by the policy network. In detail, after the T-step finishes, the parameters of the policy network are fixed and we reconstruct the set S(ỹ^1, ỹ^2, ..., ỹ^L) with the updated policy network. Note that the safe set may be empty; if this is the case, we re-sample ỹ^{1:L} from the sampling policy π(y^{1:L}). The objective of the retrieval model is thus the cross-entropy over the restricted pool:

L_R = − log [ exp(D(r+ | c)) / ( exp(D(r+ | c)) + Σ_{r− ∈ S(ỹ^1, ..., ỹ^L)} exp(D(r− | c)) ) ].    (7)

Compared with the original training objective, the new one in Eq. 7 restricts the scope of negatives to the set S(ỹ^1, ỹ^2, ..., ỹ^L), thus mitigating false negatives and making the selection of negatives pace with the optimization of the retrieval model. A high-level algorithm for the proposed framework is shown in Algorithm 1.
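A minimal sketch of the R-step follows, with an illustrative rule for building the safe set (the paper's actual criterion evolves with training; the exact-match filter here is our simplification):

```python
import math

def safe_set(candidates, sampled_labels, golden_labels):
    """Keep candidates whose latent labels equal the sampled target ỹ^{1:L},
    excluding any whose labels fully coincide with the golden response's
    (a crude false-negative guard).
    candidates: list of (utterance, label_list) pairs."""
    return [u for u, labels in candidates
            if labels == sampled_labels and labels != golden_labels]

def retrieval_loss(score_pos, scores_neg):
    """Cross-entropy over {r+} ∪ S: negative log-softmax score of the golden
    response, as in Eq. 7."""
    denom = math.exp(score_pos) + sum(math.exp(s) for s in scores_neg)
    return -math.log(math.exp(score_pos) / denom)
```

If `safe_set` returns an empty list, the labels are re-sampled from π before training continues.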

Datasets
We conduct experiments on two benchmarks: the Ubuntu Corpus V1 and the Douban Corpus. The statistics of the two datasets are shown in Table 3.

Ubuntu Corpus V1 (Lowe et al., 2015) is a multi-turn response selection dataset in English collected from chat logs, mainly about seeking technical support for problems in using the Ubuntu system. We use the copy shared by Xu et al. (2016), which replaces all numbers, URLs and paths with special placeholders.
Douban Corpus (Wu et al., 2016) is an open-domain Chinese dialogue dataset from the Douban website, a popular social networking service. Note that in the test set of the Douban corpus, one context can have more than one correct response, as the golden responses are manually labeled.

Evaluation Metrics
Following previous works (Tao et al., 2019b; Xu et al., 2020), we use recall as our evaluation metric. The recall metric R_10@k means the correct response is among the top-k candidates scored by the retrieval model out of 10 candidates in total.
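The metric can be computed as follows; `scored_pools` is an assumed input layout (one score list plus the golden index per test case):

```python
def recall_at_k(scored_pools, k):
    """Compute R_10@k: the fraction of test cases whose golden response is
    ranked within the top-k of its 10-candidate pool.
    scored_pools: list of (scores, golden_index) pairs, one per test case."""
    hits = 0
    for scores, golden_index in scored_pools:
        ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if golden_index in ranking[:k]:
            hits += 1
    return hits / len(scored_pools)
```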

Implementation Details
Our method is implemented in PyTorch and run on 2×24 GiB GeForce RTX 3090 GPUs. The code is built with the Hugging Face library and is available at https://github.com/TingchenFu/EMNLP23-LogicRetrieval.
For hyper-parameter selection in the transition model, retrieval model, and policy network, we sweep the learning rate over [5e-6, 1e-5, 2e-5, 4e-5, 5e-5] and the batch size over [4, 8, 16, 32] for each dataset. We only keep the last 15 turns of a dialogue context, and the maximum length of a context-candidate pair is 256. The gradient norm is clipped to 2.0 to avoid gradient explosion. All models are trained with the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9 and β2 = 0.999. Early stopping on the validation set is adopted as a regularization strategy. We report the average performance of our method over three repeated runs.
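The gradient clipping mentioned above corresponds to global-norm clipping; the following is a plain-Python sketch of the behavior of `torch.nn.utils.clip_grad_norm_` on a flat gradient list:

```python
import math

def clip_by_global_norm(grads, max_norm=2.0):
    """Scale the whole gradient list so its global L2 norm is at most max_norm,
    preserving the gradient direction."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return grads
    scale = max_norm / total_norm
    return [g * scale for g in grads]
```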
For the transition model, p_θ(y^l_i | y^l_{<i}) is implemented as a transformer. Similar to word embeddings, we obtain the embeddings of y^l_1, y^l_2, ..., y^l_N by looking up a randomly initialized and learnable embedding matrix, and the embeddings are then modeled by a uni-directional 3-layer transformer. Similarly, the utterance generator d_θ(w_t | w_{<t}, z^{1:L}) is implemented as a 6-layer uni-directional transformer. The features z^{1:L} are mapped to the same dimension as the hidden representation of the transformer with a learnable matrix before being prepended as special tokens to the word embeddings of u.
To infer the latent labels given an utterance, q_φ(y^l | u) is composed of a 1-layer GRU and a multi-layer perceptron. The former encodes an utterance u into a dense vector, while the latter maps the dense vector to a K_l-way categorical distribution.
The policy network is a bi-directional transformer together with L dense matrices W^p_{1:L}. The matrices map p_θ(y^{1:L}_{N+1}) into L vectors of the same dimension, which are concatenated as an embedding sequence and fed into the transformer. The hidden representation after the last layer of the transformer is then mapped back to dimensions K_1, K_2, ..., K_L to produce the sampling policy π(y^{1:L}).

Retrieval Models
As a negative sampling approach, our proposed TRIGGER framework is orthogonal to the retrieval models and theoretically our approach can be combined with any dialogue retrieval model seamlessly.
To validate the universality of our approach, we perform experiments on the following base retrieval models: BERT (Devlin et al., 2018) is the vanilla BERT model fine-tuned on the response selection task with no post-training. SA-BERT (Gu et al., 2020) makes the BERT model aware of different speakers in the concatenated context sequence by applying speaker position embeddings. BERT-FP (Han et al., 2021) uses fine-grained post-training to help the BERT model distinguish the golden response from negatives drawn from the same dialogue session.

Baselines
To verify the effectiveness of our approach, we compare against the following baseline negative sampling methods: Semi (Li et al., 2019) re-scores negative samples at different epochs to construct new negatives for training. CIR (Penha and Hauff, 2020) exploits curriculum learning and transits from easy instances to difficult ones gradually. Gray (Lin et al., 2020) trains an additional generation model to synthesize new negatives. HCL (Su et al., 2020) proposes an instance-level curriculum and a corpus-level curriculum with a pacing function to progressively strengthen the capacity of the model. Mask-and-fill (Gupta et al., 2021) synthesizes negative examples by simultaneously conditioning on the original dialogue context as well as a randomly picked one.

Experiment Results
The quantitative results are shown in Table 4. We can observe that our method significantly improves the base retrieval models on most metrics, showing its universality and robustness. When combined with BERT-FP, our method achieves a new state of the art on most metrics, with absolute improvements of 0.6% and 4.3% on R_10@1 for the Ubuntu and Douban benchmarks, respectively.

Analysis
Apart from the overall performance, we conduct further analysis to answer the following questions: Q1: How does each component and mechanism contribute to the overall performance? Q2: How does the lexical similarity between the context and the response influence the performance of our method? Q3: How robust is our method under adversarial attack? Q4: What is learned by the transition model?

Answer to Q1. The experiment results are shown in Table 5. From the table, we observe that: (1) There is an evident degradation when the meticulously excavated negatives are replaced with randomly sampled ones. This suggests that our negative sampling method is crucial to the retrieval performance, and that simply increasing the number of negative examples may not be sufficient. (2) The updating of the sampling policy in the T-step plays an important role, as its removal causes an obvious drop. This result justifies the necessity of updating hard negatives dynamically. (3) Both the latent labels y^{1:L} and the latent features z^{1:L} contribute to the model performance, likely because they provide complementary information for learning the transition logic.

The Impact of Lexical Similarity
Answer to Q2. To better understand the impact of lexical similarity, we partition the test sets of Ubuntu and Douban into bins according to the lexical similarity between the context and the golden response. The improvement over BERT-FP (Han et al., 2021) is shown in Figure 2 and Figure 3. We can see that our method is helpful and substantially improves retrieval performance on most bins. The improvement is especially obvious in the harder scenario where the context and the golden response are less similar. We attribute the gains to the transition logic, which exempts the retrieval model from the interference of false negatives.
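The binning procedure can be sketched as follows; Jaccard overlap and equal-width bins are illustrative assumptions, as the paper does not specify the exact similarity measure:

```python
def jaccard(a_tokens, b_tokens):
    """Lexical similarity as Jaccard overlap between token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def bin_test_set(cases, num_bins=4):
    """Assign each (context_tokens, response_tokens) pair to an equal-width
    similarity bin over [0, 1]."""
    bins = [[] for _ in range(num_bins)]
    for context, response in cases:
        sim = jaccard(context, response)
        index = min(int(sim * num_bins), num_bins - 1)
        bins[index].append((context, response))
    return bins
```

Per-bin retrieval metrics then reveal where the method helps most.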

Performance under Adversarial Attack
Answer to Q3. Reducing false negatives can effectively mitigate the noise in the supervision signal and therefore stabilize the training process (Zhou et al., 2022). Following Whang et al. (2021), we alter the candidate pool of each case by substituting all the negative candidates with utterances from the dialogue context, which is a more challenging setting than previous scenarios (Whang et al., 2021). The experiment results on Ubuntu and Douban are shown in Figure 2 and Figure 3, respectively. According to the results, all three base models deteriorate severely under our attack. This reveals that relying on superficial cues is a common phenomenon in PLM-based retrieval models, not limited to BERT-FP (Han et al., 2021). In comparison, the three models are much more robust when combined with our proposed TRIGGER strategy.
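The attack construction can be sketched as follows (recycling context turns when the context is shorter than the pool is our assumption):

```python
def adversarial_pool(context_turns, golden, pool_size=10):
    """Build the attack candidate pool: keep the golden response and replace
    all negatives with utterances taken from the dialogue context itself."""
    negatives = []
    i = 0
    while len(negatives) < pool_size - 1:
        negatives.append(context_turns[i % len(context_turns)])
        i += 1
    return [golden] + negatives
```

Context utterances are highly lexically similar to the context, so a model relying on superficial overlap cues is easily fooled by this pool.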

Case Study
Answer to Q4. In this section, we interpret what is learned by the transition model in an intuitive way. As a simplification, we only consider the "2-gram" transitions of latent labels at the L-th facet (the most abstract one). Specifically, we first recognize the latent label at facet L for all utterances in the corpus with our transition model. Next, we collect statistics over pairs of adjacent latent labels (y^L_i, y^L_{i+1}). The most frequent "2-gram" transitions in Ubuntu, together with their transition probability (the percentage of cases in which the first label is succeeded by the second one, rather than the raw frequency of the 2-gram), are shown in Table 6. Besides, we review the 5 most frequent latent labels at facet L. By inspecting 100 randomly sampled utterances carrying each latent label from the Ubuntu corpus, we manually induce their implications, as shown in Table 7.
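The 2-gram statistics can be computed as follows, where the transition probability is normalized by how often the first label occurs as the left element of a bigram, matching the definition above:

```python
from collections import Counter

def transition_probs(label_sequences):
    """Estimate 2-gram transition probabilities P(y_{i+1} | y_i) from
    per-dialogue sequences of facet-L labels."""
    bigram = Counter()
    left = Counter()
    for seq in label_sequences:
        for a, b in zip(seq, seq[1:]):
            bigram[(a, b)] += 1
            left[a] += 1
    return {pair: count / left[pair[0]] for pair, count in bigram.items()}
```

Sorting the result by probability recovers tables like Table 6.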

Limitations
All technologies built upon large-scale PLMs more or less inherit their potential harms (Bender et al., 2021). Besides, we acknowledge some specific limitations of our method: we only verify its effectiveness on several recent PLM-based methods, but not on earlier methods without PLMs, such as SMN (Wu et al., 2017) or ESIM (Chen and Wang, 2019). But since our approach is orthogonal to the base retrieval model, we believe our proposal can be easily adapted to these methods.

Ethical Considerations
This paper does not pose any ethical problems. First, multi-turn response selection is a well-established task in natural language processing, and several papers on this task have been published at EMNLP conferences. Second, all the datasets used in this paper have been used in previous papers. Our method should only be used to improve retrieval dialogue systems or for other research purposes, not for any malicious purpose.
Instead, inspired by Zhao et al. (2022) and Falck et al. (2021), we propose a Sequential Variational Ladder Auto-Encoder (SVLAE) to model the multiple characteristics in a disentangled and compositional way. The generation and inference in our SVLAE architecture are shown in Figure 1 and elaborated below.
In implementation, p(y^l_1) is a uniform distribution over K_l possible values. μ_θ(y^l) and Σ_θ(y^l) are multi-layer perceptrons. The sequence model p_θ(y^l_i | y^l_{<i}) and the utterance generator d_θ(w_t | w_{<t}, z^{1:L}) are both implemented as lightweight transformers.

Algorithm 1
The proposed TRIGGER framework.
1: Input: a retrieval model D, a training corpus, and maximum training steps M_1 and M_2 for the transition model and the retrieval model.
2: for m ← 1 to M_1 do
3:   Sample a mini-batch (c, r) from the training corpus.
4:   Recognize the latent labels and latent features with the inference model ϕ.
5:   Generate the conversation with the generation model θ.
6:   Optimize the SVLAE with the objective in Eq. 5.
7: end for
8: for m ← 1 to M_2 do
9:   {A new episode begins.}
10:  Sample a mini-batch of (c, r) from the training corpus.
11:  Compute the latent label distribution of the suitably difficult negatives π(y^{1:L}).

Table 1 :
The false negative ratio (%) of selected negative candidates on the Douban dataset.

Table 3 :
Statistics of the two datasets used in our experiments.

Table 4 :
Evaluation results on the test sets of Ubuntu and Douban. Numbers in bold are the best results. † denotes that the improvement over the most competitive baseline is statistically significant (t-test, p-value < 0.05).

Table 5 :
Results of ablation study on two benchmarks.

Table 6 :
The top-5 2-gram transition pairs in Ubuntu and their transition probabilities.

Table 7 :
The implications of the top-5 latent labels at facet L in Ubuntu.