Contextual Rephrase Detection for Reducing Friction in Dialogue Systems

For voice assistants like Alexa, Google Assistant, and Siri, correctly interpreting users' intentions is of utmost importance. However, users sometimes experience friction with these assistants, caused by errors from different system components or by user errors such as slips of the tongue. Users tend to rephrase their queries until they get a satisfactory response. Rephrase detection is used to identify these rephrases, and it has long been treated as a task with pairwise input, which does not fully utilize the contextual information (e.g., users' implicit feedback). To this end, we propose a contextual rephrase detection model, ContReph, to automatically identify rephrases from multi-turn dialogues. We showcase how to leverage the dialogue context and user-agent interaction signals, including the user's implicit feedback and the time gap between turns, which help ContReph significantly outperform pairwise rephrase detection models.


Introduction
Large-scale conversational AI dialogue systems like Alexa, Siri, and Google Assistant are becoming increasingly prevalent in real-world applications, helping users across the globe. Natural Language Understanding (NLU) technology is an established component that produces semantic interpretations of a user request. Improving the accuracy of the NLU component is a key consideration for a satisfactory end-to-end user experience, especially when the NLU component misinterprets the semantics due to ambiguity or errors that come from an upstream component (e.g., Automatic Speech Recognition). For instance, the ASR system may incorrectly recognize "play jacking the ball" as "play jack in the fall". These errors accumulate and introduce friction in the dialogue conversation. Fixing these frictions helps users have a better experience and engage more with the AI agents.

* Equal contribution.
† Work done while Zhuoyi Wang was interning at Amazon Alexa AI.

Figure 1: Difference between contextual rephrase detection and the pairwise approach. A pairwise rephrase detection model computes a similarity score for each pair of turns in the multi-turn dialogue and selects the maximum score for the rephrase prediction. In this case, the pairwise model, which does not consider context information, incorrectly predicts "play tyler hero explicit by jack harlow" as the rephrase of the user's defective request "play tyler hero explicit", since that pair has the highest similarity score.
Previous works (Yuan et al., 2021; Chen et al., 2020; Park et al., 2020) focus on friction reduction in the ASR and NLU components using Query Rewriting (QR) (Grbovic et al., 2015). These approaches reformulate the ASR transcription of the user's query so that it conveys the same meaning/intent, to minimize user dissatisfaction. An important aspect of the QR approaches is detecting a user rephrase of a previous query that leads to a satisfactory response. However, these approaches focus only on the pairwise semantic similarity of queries, and do not consider the corresponding user feedback in its proper dialogue context. As shown in Fig. 1, dissatisfied users may provide implicit feedback, i.e., they rephrase the previous query (e.g., the first user request "play tyler hero explicit" on the left of Fig. 1) multiple times until the agent fulfills the request. If we only consider the semantic similarity between different queries, as pairwise models do, the system may choose an unsuitable query, "play tyler hero explicit by jack harlow", to correct the problematic request. The dialogue context includes additional information such as previous turns, the responses of the dialogue agent, and time differences between user queries. By leveraging this context, we can detect the correct rephrase, "play tyler hero by jack harlow", with a much higher probability.
In this paper, we propose an automatic user rephrase detection approach, ContReph, which leverages implicit user feedback and dialogue context in a multi-turn dialogue setting. ContReph detects whether any of the user queries in the dialogue session is rephrased, and then extracts the most probable rephrase span that led to a satisfactory response from the dialogue agent. Specifically, we input the full dialogue session to the model, including the agent's responses, and capture time gaps between user queries using a novel time-difference encoding scheme. We evaluate the performance of our proposed framework through an extensive set of experiments on production data of a large-scale dialogue agent and showcase the effectiveness of our approach against existing methods. Although this work focuses on rephrase detection only, the rephrases identified by our approach can be used directly as query rewrites for reducing friction in dialogue systems.

Query Rewriting in Dialogue Systems
Query Rewriting (QR) in dialogue systems aims to correct the ASR interpretation of the user's queries, dealing with errors across the entire dialogue system pipeline in a single generalized framework. Existing QR approaches apply neural embedding and retrieval based methods (Yuan et al., 2021; Chen et al., 2020), generation-based methods (He et al., 2016; Dehghani et al., 2017), and Absorbing Markov Chains (AMC) (Ponnusamy et al., 2020). Chen et al. (2020) apply language-model pre-training to query embeddings on historical user conversation data, and Yuan et al. (2021) leverage Graph Neural Networks (Kipf and Welling, 2017) for the same purpose; both then fine-tune on a QR training set that consists of (source query, rephrase) pairs. To generate such a training set without human annotations, they rely on pairwise rephrase detection models to identify rephrase pairs in historic dialogue sessions. Ponnusamy et al. (2020) propose AMC to identify rephrases within multi-turn dialogues and treat the rephrases directly as rewrites instead of training a neural model. However, that approach is purely statistical and ignores the semantic relevance between the source query and the rephrase, even though modeling semantic relevance has proven effective across different datasets and tasks (Conneau and Kiela, 2018; Gao et al., 2021). Our work alleviates this problem by using a BERT model that incorporates dialogue context information.

Rephrase Detection
Given a pair of sentences P and Q, existing rephrase/paraphrase detection approaches estimate the probability distribution Pr(y | P, Q), where y = 1 if P and Q are rephrases, and y = 0 otherwise. Typically, these approaches use encoders to embed P and Q, followed by a semantic or syntactic similarity measurement. For example, BiMPM (Kim et al., 2019) uses BiLSTM layers to encode the sentences and performs bilateral matching to compute Pr(y | P, Q). Gao et al. (2021) propose SimCSE, which leverages a contrastive learning framework and is shown to produce superior sentence embeddings from either unlabeled or labeled data. However, the existing approaches are limited by the information they can exploit, especially for dialogue sessions, where a lot of contextual information is available. Hence, we extend these approaches from the pairwise to the dialogue-context level, as described in the next section.
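For concreteness, the pairwise formulation can be sketched as follows; the toy bag-of-words encoder and the 0.5 threshold are illustrative stand-ins for the neural encoders and tuned thresholds used in practice.

```python
# Minimal sketch of pairwise rephrase detection: embed P and Q, then
# threshold their similarity. The bag-of-words "encoder" is a stand-in
# for a BERT/SimCSE encoder (an assumption for illustration only).
import math
from collections import Counter

def embed(sentence: str) -> Counter:
    """Toy bag-of-words embedding standing in for a neural encoder."""
    return Counter(sentence.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def pairwise_rephrase(p: str, q: str, threshold: float = 0.5):
    """Approximate Pr(y = 1 | P, Q) as a thresholded similarity score."""
    score = cosine(embed(p), embed(q))
    return score, score >= threshold

score, is_rephrase = pairwise_rephrase(
    "play tyler hero explicit", "play tyler hero by jack harlow")
```

As the session in Fig. 1 illustrates, such a purely pairwise score has no way to prefer the contextually correct rephrase over a semantically closer but unsuitable candidate.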

Notations and Problem Definition
We consider a dataset D of M multi-turn dialogue sessions, D = {S_i}_{i=1}^M, where every session S is an ordered set of N turns S = {(Q_i, R_i)}_{i=1}^N. Here, i is the turn index, Q_i is the user's query, and R_i is the agent's response to Q_i. Any two successive turns have a time gap of less than a minute. Given a dialogue session S and a source turn, i.e., an input pair of query and response (Q_i, R_i), the goal of our model is to predict whether Q_i is rephrased in any of the following turns (Q_j, R_j), i < j ≤ N. If so, the model should predict the span of Q_j, and return null otherwise.

Model Architecture
Encoding dialogue sessions: We flatten the dialogue session into one sequence and feed it to a pre-trained BERT to compute the dialogue session embedding. We introduce two special tokens, '[USER]' and '[AGENT]', which prefix the user query and the agent response in the contextual input, respectively. We cast rephrase detection as a span prediction problem: we predict the probability of the span's start and end at each token position, using the embedding output of the final BERT layer. We introduce a start vector W_S ∈ R^H and an end vector W_E ∈ R^H. Denoting the final hidden vector of the i-th input token as T_i ∈ R^H, the probability of token i being the start (resp. end) of the rephrase span is computed by applying a softmax over all tokens to the dot product between T_i and W_S (resp. W_E):

P_i^S = exp(W_S · T_i) / Σ_j exp(W_S · T_j),    P_i^E = exp(W_E · T_i) / Σ_j exp(W_E · T_j).

The score of a candidate span from position i to position j is defined as s_ij = W_S · T_i + W_E · T_j, and we use the score of the '[CLS]' position, s_none = W_S · T_[CLS] + W_E · T_[CLS], to represent the score of the no-rephrase span. We set a threshold τ to decide whether to predict no-rephrase: if max_{j>i} s_ij > s_none + τ, we take the maximum-scoring span as the rephrase span, and predict null otherwise.
Time difference encoding: In addition to capturing the full dialogue context when making a rephrase prediction, ContReph also considers the time difference between turns. This is an important factor: users are more likely to interrupt the agent and rephrase their query sooner rather than later if they do not get the right response. We capture the time differences using time-bin token embeddings. Consider a source turn (a request and a response) t_src = (Q_src, R_src), for which we want to detect a rephrase in the session, and let ω_src be its timestamp. For every turn t_i in the session, we compute the time difference ∆_i = ω_i − ω_src, where ω_i is the timestamp of t_i. Each ∆_i, i ∈ [1, N], is then mapped to its time-bin token; the time-bin tokens represent equal-sized intervals over ∆'s range of [-60, 60] seconds. We then map these tokens to their embeddings. As shown in Fig. 2, the corresponding time-bin token embedding is added to each token of a turn at the input layer of the model, depending on the turn's bin.
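The span-scoring and time-binning steps above can be sketched in plain Python, standing in for the batched tensor computation; the toy hidden size, the number of bins, and placing the no-rephrase score at the '[CLS]' position (index 0) are illustrative assumptions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def span_probs(T, w_s, w_e):
    """P_i^S and P_i^E: softmax of W_S.T_i (resp. W_E.T_i) over tokens."""
    return (softmax([dot(w_s, t) for t in T]),
            softmax([dot(w_e, t) for t in T]))

def best_span(T, w_s, w_e, tau):
    """Pick argmax s_ij = W_S.T_i + W_E.T_j, or None when it does not
    beat the no-rephrase score s_none by more than tau."""
    s_none = dot(w_s, T[0]) + dot(w_e, T[0])  # assumed [CLS] at index 0
    best, best_score = None, float("-inf")
    for i in range(1, len(T)):           # skip the [CLS] position
        for j in range(i, len(T)):       # candidate spans i..j
            s = dot(w_s, T[i]) + dot(w_e, T[j])
            if s > best_score:
                best, best_score = (i, j), s
    return best if best_score > s_none + tau else None

def time_bin(delta_seconds, lo=-60.0, hi=60.0, n_bins=8):
    """Map a turn's time difference to one of n equal-sized bins over
    [-60, 60] s; the bin index selects a learned token embedding."""
    d = max(lo, min(hi, delta_seconds))
    return min(int((d - lo) / (hi - lo) * n_bins), n_bins - 1)

# Toy 2-d "hidden states": token 1 looks like a start, token 2 like an end.
T = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
span = best_span(T, w_s=[1.0, 0.0], w_e=[0.0, 1.0], tau=0.0)  # -> (1, 2)
```

In the full model the bin index is looked up in an embedding table and added to every token embedding of that turn, alongside the usual position and segment embeddings.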

Data
Machine-Annotated set: We sample multi-turn dialogue sessions between users and a large-scale conversational AI agent from anonymized historic interactions. We use an existing model based on Absorbing Markov Chains (AMC) (Ponnusamy et al., 2020) to discover rephrase turns in these sessions, and only keep the instances where the AMC model is highly confident in its prediction of a rephrase (if the session has one) or a no-rephrase. Based on this, we divide the dataset into two types: Has-Rephrase and No-Rephrase, respectively. We split this dataset into train, validation, and test sets, with the statistics shown in Table 2. Since this dataset is labeled by a model, we refer to it as the Machine-Annotated set. We use the training split to fine-tune our model ContReph and the other baselines (Section 4.3). An example of this dataset with labels is shown in Table 1.
Human-Annotated set: For a more comprehensive evaluation of ContReph against the other baselines, we construct another test set in which the rephrases are identified by human annotators. We sample historic sessions and keep only those where the AMC model predicted a no-rephrase but human annotators labeled rephrases with high confidence. We refer to this as the Human test set. This makes it a more challenging test set, with a domain distribution different from that of the Machine-Annotated set.

Evaluation Metrics
We use the following evaluation metrics.
Exact Match (EM): For a Has-Rephrase instance, this score is 1 if the predicted span exactly matches the labeled rephrase, and 0 otherwise. For a No-Rephrase instance, the score is 1 if the model predicts a null span, and 0 otherwise.
Trigger Rate (TR): Trigger Rate is the fraction of instances on which the model makes a non-null prediction.
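The two metrics can be stated compactly in code; here spans are (start, end) token indices, None denotes a null (no-rephrase) prediction, and the toy data is illustrative.

```python
# Sketch of the Exact Match and Trigger Rate metrics described above.
def exact_match(pred, gold):
    """1 if the predicted span (or null) equals the label, else 0."""
    return 1 if pred == gold else 0

def evaluate(preds, golds):
    em = sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
    trigger_rate = sum(p is not None for p in preds) / len(preds)
    return em, trigger_rate

# Toy predictions/labels: two exact matches, one wrong span,
# one false trigger; three non-null predictions out of four.
preds = [(3, 5), None, (1, 2), (0, 4)]
golds = [(3, 5), None, (2, 2), None]
em, tr = evaluate(preds, golds)  # -> (0.5, 0.75)
```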

Baselines and Experimental Setup
To evaluate rephrase detection performance, we compare our method with several pairwise baselines. For ContReph, we choose the official pre-trained BERT-base model (https://github.com/google-research/bert) and fine-tune it. Models are selected by early stopping on the validation set. More implementation details and hyper-parameters can be found in the Appendix.

Results
In Table 3, we show the evaluation of our approach against the other baselines. ContReph consistently achieves better performance on both the machine- and human-annotated test sets. It beats the state-of-the-art pairwise BERT-NSP method on the human test set by 27.89% in EM score, and also improves the overall EM score by almost 6% on the machine-annotated set. This clearly shows the benefits of capturing dialogue context. Moreover, removing the time difference encoding from ContReph leads to drops of 1.55% and 1.37% in EM score on the machine- and human-annotated test sets, respectively, confirming that capturing the time difference between turns further improves rephrase detection. We notice that the human test set is more challenging due to its different domain distribution, and hence its EM scores are much lower than those on the machine-annotated set. BERT-NSP achieves the best results among the baselines, which highlights the benefit of utilizing the transformer's self-attention mechanism across the queries: it encodes the two queries as a single sequence with a separator, while the other baselines encode the queries independently with BERT and then apply a similarity function. Note that ContReph utilizes the self-attention mechanism across all turns of the dialogue.

Conclusion and Future Work
In this paper, we presented a novel approach for detecting user rephrases in multi-turn dialogue systems. Users tend to rephrase their queries until they get the desired response from AI agents. Our system detects these rephrases with high accuracy using the dialogue context and significantly outperforms approaches that consider queries in a pairwise manner only. The output of our model is a crucial step towards building self-learning mechanisms in dialogue agents that fix issues with minimal human intervention.
For future work, we plan to leverage contrastive learning strategies as a post-training step, which could help us obtain better query representations before fine-tuning for rephrase detection. We also want to deploy the detected rephrases as query rewrites to gauge how much we can improve the user experience of a real-world dialogue system.

References
Zheng Chen, Xing Fan, and Yuan Ling. 2020. Pretraining for query rewriting in a spoken language understanding system. In ICASSP.

A.1 Implementation Details
Baselines: For the BERT-NSP baseline, similar to the BERT next-sentence-prediction task (Devlin et al., 2019), we fine-tune the BERT model with a binary classification objective. We break down all the Has-Rephrase and No-Rephrase sessions into (query, rephrase) pairs with a 0-1 label (whether the rephrase is true or not). For the DPR model, we follow the DPR (Karpukhin et al., 2020) training scheme, which compares all pairs of questions and passages in a batch. We only use the positive rephrase pairs extracted from Has-Rephrase sessions, and use the cosine similarity scores in a cross-entropy loss. The most recent baseline is SimCSE (Gao et al., 2021), a simple contrastive learning framework that substantially advances sentence embeddings. For the unsupervised setting (SimCSE-unsup), we extract all the queries from both Has-Rephrase and No-Rephrase sessions as unlabeled training data, and follow Gao et al. (2021) in taking an input sentence and predicting itself with a contrastive objective, with only standard dropout used as noise. For the supervised setting (SimCSE-sup), we extract the positive rephrase pairs from Has-Rephrase sessions and use the other queries from the same session as hard negatives. Moreover, to fully use the training data and make a fair comparison, we also pair the source query with itself as a positive pair from each No-Rephrase session, with only standard dropout used as noise; other queries from the same session are used as hard negatives. For the other model configurations and related hyper-parameters, we are consistent with the original works (Devlin et al., 2019; Karpukhin et al., 2020; Gao et al., 2021). We set the threshold to 0.70 for BERT-NSP, 0.75 for DPR, and 0.85 for the SimCSE models.
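The DPR-style in-batch objective described above can be sketched as follows; this is a pure-Python stand-in for the batched tensor computation, and the toy embeddings are illustrative.

```python
# Sketch of the in-batch cross-entropy used for the DPR baseline:
# for each query i in a batch, rephrase i is the positive and every
# other rephrase in the batch serves as a negative.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def in_batch_loss(q_embs, r_embs):
    """Mean of -log softmax(sim(q_i, r_*)) at the matching rephrase."""
    total = 0.0
    for i, q in enumerate(q_embs):
        sims = [cosine(q, r) for r in r_embs]
        m = max(sims)
        log_z = m + math.log(sum(math.exp(s - m) for s in sims))
        total += log_z - sims[i]   # -log of the positive's probability
    return total / len(q_embs)

# Perfectly aligned toy pairs: loss falls below the chance level log(2).
loss = in_batch_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

The supervised SimCSE variant adds explicit hard negatives to the denominator, but the softmax-over-similarities structure is the same.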
Our models: We use a mini-batch size of 64 and the Adam optimizer, fine-tuning for 10 epochs with an initial learning rate of 4 · 10^-5. We select the threshold τ for no-rephrase span prediction on the validation set, following the same approach as Devlin et al. (2019).

A.2 Analysis
Effect of τ: This is the major parameter we tune to determine whether the prediction should be null (as described in Sec. 3.2). We vary τ in the range 0.0 to 1.0 and evaluate model performance on the validation set. Fig. 3 shows the effect of changing τ on the Exact Match score for the Has-Rephrase sessions, the No-Rephrase sessions, and all dialogue sessions together. As we increase τ, the model predicts a null rephrase span more often and hence performs better on the No-Rephrase set, and vice versa. To balance this trade-off, we choose the value of τ that maximizes the EM score on the "All" validation set, i.e., Has-Rephrase and No-Rephrase together. We also ensure that our data splits are balanced, i.e., they contain almost equal fractions of Has-Rephrase and No-Rephrase cases (see Table 2).
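The τ selection amounts to a one-dimensional grid search; the sketch below uses a toy EM curve with a known peak as a stand-in, since in practice the scoring function runs the model over the validation split.

```python
# Grid-search sketch for the no-rephrase threshold tau: evaluate the
# EM score at each candidate and keep the maximizer.
def pick_tau(score_fn, grid):
    """Return the tau in `grid` maximizing the validation score."""
    return max(grid, key=score_fn)

grid = [i / 10 for i in range(11)]           # candidates 0.0, 0.1, ..., 1.0
toy_em = lambda tau: 1.0 - abs(tau - 0.3)    # assumed toy EM curve, peak 0.3
best_tau = pick_tau(toy_em, grid)            # -> 0.3
```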

Number of turns in Dialogue Sessions
We show how performance varies with the number of turns in the dialogue sessions in Fig. 4. The number of turns has a significant effect on the EM score, even though we balance the length distribution of dialogues during training. This result shows that, for sessions with more turns, the context can be unrelated to the current request, and this unrelated context can hurt accuracy. Interestingly, capturing the time difference between the turns helps here, especially for longer sessions: with the temporal information, the model can automatically decide which context is irrelevant and ignore it.

A.3 Case Study
We show four scenarios in Table 4, where the first is a correct prediction and the other three are failure cases: 1) False-trigger, where the model predicts a rephrase for the current request, but the dialogue does not actually contain one; 2) No-trigger, where the model judges that the request was not rephrased, but the dialogue actually has a rephrase; and 3) Wrong match, where the model predicts a wrong span. False triggering usually happens when the user issues similar back-to-back queries with very small time gaps in between. Wrong match mostly happens when there are multiple successful rephrases in the session.
We also compare the predictions of ContReph w/o Time and ContReph in Table 5. In the first scenario, the user says "turn off alarm" 20 seconds after "turn off apartment". The model without time tends to pick the last successful query as the rephrase, whereas ContReph is aware that "turn off alarm" happened long after "turn off apartment", and hence picks the latter as the rephrase.
The second scenario is similar: the user listens to Kendrick Lamar 45 seconds after listening to indie music. Hence, the request "Play pride by kendrick lamar" is not a rephrase, but just another song the user listened to. ContReph, being aware of the temporal information, again picked the right rephrase, "Play indie music".