Reinforced Question Rewriting for Conversational Question Answering

Conversational Question Answering (CQA) aims to answer questions contained within dialogues, which are not easily interpretable without context. Developing a model to rewrite conversational questions into self-contained ones is an emerging solution in industry settings, as it allows reusing existing single-turn QA systems and avoids training a CQA model from scratch. Previous work trains rewriting models using human rewrites as supervision. However, such objectives are disconnected from QA models, so more human-like rewrites do not guarantee better QA performance. In this paper we propose using QA feedback to supervise the rewriting model with reinforcement learning. Experiments show that our approach effectively improves QA performance over baselines for both extractive and retrieval QA. Furthermore, human evaluation shows that our method can generate more accurate and detailed rewrites than human annotations.


Introduction
Interacting through conversations is a natural information-seeking procedure for humans, so it is important for AI assistants like Apple Siri and Amazon Alexa to enable and improve such experiences. In recent years Conversational Question Answering (CQA) has gained more attention, where a user can ask a series of related questions and ideally obtain answers that leverage the conversational context. Different from widely studied question answering (QA) tasks that happen in a single turn (Rajpurkar et al., 2016, 2018; Tay et al., 2018; Tang et al., 2019), the interpretation of conversational questions in CQA depends on questions and answers from previous turns.
Previous approaches to CQA usually train new models from scratch, which can achieve promising results but is expensive in terms of obtaining domain-specific training data. In industry settings, many single-turn QA models are already deployed.
Training new CQA models with additional annotations to replace each existing single-turn QA model is expensive and generally not feasible. Moreover, discarding existing single-turn models and datasets is impractical, so studying how to reuse these existing resources to tackle CQA merits attention.
Existing approaches to this task, called Conversational Question Rewriting (CQR), often train sequence-to-sequence models supervised by human rewrites to generate self-contained questions (Ren et al., 2018; Vakulenko et al., 2021). Such methods have several limitations. First, the CQR training objective is disconnected from CQA performance. The annotation process of existing rewriting datasets has no knowledge of the QA systems, and more human-like rewrites do not guarantee better CQA performance. Second, the rewriting model does not take into account feedback from downstream QA systems. In industry settings, multiple single-turn QA systems trained on different datasets serve in the backend. It is impractical to replace them with new CQA models, and we argue that their output can still be used as a signal to help train rewriting models.
To overcome these limitations, we propose an effective CQR approach building on the recent success of Reinforcement Learning (RL) techniques for text generation (Rennie et al., 2017). RL enables flexible ways to incorporate training objectives in the form of reward functions. We systematically analyze different rewards and their effectiveness in terms of final QA performance, as well as the quality of the question rewrites (i.e., the question still has to be understandable and interpretable by humans). To optimize QA performance, we propose various QA rewards that measure the likelihood of a question yielding a better answer. For comparison, we also apply the same RL approach with question rewriting (QR) rewards, which reflect the similarity between a model-generated question and the human ground truth.
We summarize our contributions as follows:
• To the best of our knowledge, we are the first to study how to incorporate QA signals to improve CQR using RL.

Problem Definition
In CQA, each conversation contains a sequence of (question, answer) pairs D = {q_1, a_1, ..., q_n, a_n}, where a_i is the answer to question q_i. A conversational question q_i can be ambiguous, and its interpretation depends on the conversational context c_i = {q_1, a_1, ..., q_{i-1}, a_{i-1}}. The goal of CQR for QA is to learn a model R_θ, parameterized by θ, that can translate q_i associated with c_i into q'_i, such that the semantic meaning of q'_i is equivalent to q_i. A pretrained single-turn QA model is expected to answer q'_i better than q_i. Note that the QA model can be trained on a single-turn dataset different from D and is fixed when training the rewriter. The motivation is to explore whether already-deployed single-turn QA models can be exploited to train a rewriter and reused without further training by accepting the rewritten questions.
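The setup above can be sketched in a few lines. Here `rewriter` and `qa_model` are hypothetical callables standing in for the trained CQR model R and the frozen single-turn QA model; this is an illustrative sketch, not the paper's implementation:

```python
def answer_conversational_question(question, history, rewriter, qa_model, document):
    """Sketch of the CQR-for-QA pipeline: rewrite q_i using the
    conversational context c_i, then answer with a fixed single-turn
    QA model. `rewriter` and `qa_model` are hypothetical callables."""
    # Flatten the context c_i = {q_1, a_1, ..., q_{i-1}, a_{i-1}}.
    context = " ".join(turn for pair in history for turn in pair)
    # R translates the in-context question into a self-contained one.
    self_contained = rewriter(question, context)
    # The frozen single-turn QA model answers the rewrite.
    return qa_model(self_contained, document)
```

The key design point is that only the rewriter is trainable; the QA model is treated as a black box that simply receives the rewritten question.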

Model Overview
We show our CQR approach with a modularized design in Figure 2. There are two major components: a CQR model R_θ, as introduced in Section 3, and a reward function F that evaluates the rewrite q'_i generated by R_θ by producing a reward score.
Then CQR training can be formulated as a reinforcement learning problem, where the objective is to maximize an expected reward, or equivalently to minimize the following loss function:

L_rl(θ) = -E_{q_i ∼ T} [F(R_θ(q_i, c_i))],   (2)

where q_i comes from the data distribution T. During training, we push R_θ to generate q'_i that achieves a higher reward by minimizing Equation 2. Hereinafter, we omit θ from R_θ for simplicity.

Figure 2: Overview of our CQR approach. h_i is the human rewrite of q_i and a_i is the ground-truth answer of q_i.
We define two types of rewards: QR rewards evaluate how similar a question rewrite is to the ground-truth one produced by human annotators; QA rewards evaluate how well a QA model can answer a question rewrite. We summarize the characteristics of the different rewards in Table 1. By maximizing one of the QR or QA rewards, we can explicitly optimize the model toward the QR or QA target. Next, we describe the two types of rewards.
Table 1: Characteristics of different rewards.

QR Rewards
The rationale for maximizing QR rewards is similar to the aims of prior work: a good question rewrite should be similar to a human rewrite. We use the ROUGE-L score (Lin, 2004) between the question rewrite q'_i and the ground truth h_i as the QR reward:

r_qr = ROUGE-L(q'_i, h_i).   (3)

This reward has been widely used by RL methods for language generation tasks. Note that Eq. 3 does not depend on the QA model, and prior work can be considered as maximizing QR rewards.
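ROUGE-L is based on the longest common subsequence (LCS) between candidate and reference. A minimal sketch follows, using the balanced F-measure (beta = 1) and whitespace tokenization; the exact variant and tokenization used in the paper may differ:

```python
def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_reward(rewrite, reference):
    """QR reward sketch: ROUGE-L F-measure between a model rewrite
    q'_i and the human reference h_i (beta = 1, whitespace tokens)."""
    cand, ref = rewrite.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

An identical rewrite scores 1.0, and a rewrite sharing no tokens with the reference scores 0.0, so the reward is bounded in [0, 1] as RL rewards conveniently are.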

QA Rewards
We define QA rewards that reflect how well question rewrites help a QA model obtain better answers. Since QA rewards are task- and model-dependent, we introduce QA rewards for the following two sub-types.

Extractive CQA
Extractive CQA is a machine reading comprehension (MRC) task, where an extractive QA model M extracts the most likely answer span given a question q and an evidence document p:

a_s = M(q, p).   (4)

We assume that M is trained on regular single-turn QA data and expects the input question q to be self-contained. Therefore, CQA questions should be rewritten by R before being sent to M. Next, we introduce supervised and unsupervised QA rewards.

Supervised QA rewards. A straightforward way to measure the quality of a question rewrite q'_i in terms of QA is to calculate the similarity between the answer predicted by M with q'_i as input and the ground-truth answer a_i. We denote by a'_s the answer span extracted by M using the rewritten question q'_i as input. We measure the overlap between a'_s and a_i by F1 score:

r_f1 = F1(a'_s, a_i).   (5)

Intuitively, the rewrite q'_i is better if a'_s is closer to the ground-truth answer. Compared with Equation 3, Equation 5 depends on the ground-truth answers instead of human rewrites.

Unsupervised QA rewards. For a predicted span a'_s, M assigns a probability that reflects the model's confidence in the answer. We assume that a higher confidence score indicates that the QA model understands the question better. Therefore, we directly use the probability of the most likely answer as the confidence reward for a question rewrite:

r_c = P_M(a'_s | q'_i, p).   (6)

F1 rewards can be considered as human judgment scores on predicted answers, since the ground-truth answers are used, while confidence rewards represent the model's self-judgments.
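The token-level F1 of Equation 5 can be sketched as follows. This is a simplified version (whitespace tokenization, no lowercasing or article stripping); the standard SQuAD evaluation script applies additional answer normalization:

```python
from collections import Counter

def f1_reward(predicted_answer, gold_answer):
    """Supervised QA reward sketch (Eq. 5): token-level F1 overlap
    between the span a'_s predicted from the rewrite and the
    ground-truth answer a_i. Whitespace tokenization only."""
    pred, gold = predicted_answer.split(), gold_answer.split()
    # Multiset intersection counts each shared token at most
    # min(count_pred, count_gold) times.
    common = Counter(pred) & Counter(gold)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Note how a small edit to the question that leaves the extracted span unchanged leaves this reward unchanged as well, which is exactly the insensitivity discussed later when comparing RL-F1 with RL-C.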

Retrieval CQA
We also evaluate our method's generalization on a different retrieval CQA task, where the goal is to return a list of documents in descending order of relevance scores produced by a retrieval CQA model:

s(q, p) = M_r(q, p),   (7)

where p is a document. A retrieval CQA model usually consists of two stages. In the first stage, a lightweight ranking algorithm such as BM25 is used to retrieve the top-k candidate documents. In the second stage, a more complex model such as BERT (Devlin et al., 2019) is used to rerank the candidates. Here, we use the BM25 score between a question and a document, a type of QA reward that does not use annotated answers:

r_bm25 = BM25(q'_i, p).   (8)

We expect the rewrite q'_i to retrieve documents with higher BM25 scores in the first stage than q_i, so that performance in the re-ranking stage can also be improved.
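For reference, a self-contained sketch of the BM25 scoring function underlying Equation 8. The `df` map (term to document frequency) and the default parameters k1 = 0.9, b = 0.4 (Anserini's defaults) are illustrative assumptions; real systems compute these statistics over the indexed collection:

```python
import math
from collections import Counter

def bm25_score(query_tokens, doc_tokens, df, n_docs, avg_doc_len, k1=0.9, b=0.4):
    """BM25 relevance score of one document for one query (sketch).
    `df` is a hypothetical term -> document-frequency map and
    `n_docs`/`avg_doc_len` are collection statistics."""
    score = 0.0
    dl = len(doc_tokens)
    tf = Counter(doc_tokens)
    for term in query_tokens:
        if term not in tf:
            continue  # terms absent from the document contribute nothing
        n_t = df.get(term, 0)
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        # Saturating term-frequency component with length normalization.
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * dl / avg_doc_len)
        )
    return score
```

Because the score only rises when rewrite terms actually occur in relevant documents, maximizing it pushes the rewriter to surface the explicit entities that the original conversational question left implicit.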

Training
There are two steps in our training framework. The first step is pre-training, which has the same supervised objective as prior work: minimize the cross-entropy loss between the model's prediction q' and the human ground-truth rewrite h:

L_ce = -Σ_t y_h^(t) log y_{q'}^(t),   (9)

where y_h is the one-hot vector of h and y_{q'} is the distribution over tokens of q' predicted by the model. Supervised pre-training ensures the model has the basic ability to rewrite the original question given the conversational context. The second step continues training R with RL to maximize the different rewards. In this work, we use Self-Critical Sequence Training (SCST) (Rennie et al., 2017). Given a question q, we generate two question rewrites, q_s and q'. q_s is generated by sampling from the word distribution of R at each step, and q' is generated by R using greedy decoding. Then we minimize the following loss function:

L_rl = -(r_s - r') Σ_t log P_R(q_{s,t} | q_{s,<t}, q, c),   (10)

where P_R(·), defined by R, is the probability of generating the t-th word conditioned on the previously generated tokens of q_s, the original question q and the conversational history c. Intuitively, minimizing L_rl increases the likelihood of q_s if it obtains a higher reward than q' (i.e., r_s > r'), and thus maximizes the expected total reward. Given a reward function F (one of Equations 3, 5, 6, 8), we obtain r' = F(q') and r_s = F(q_s).
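The per-example SCST loss of Equation 10 reduces to a few lines. A minimal numeric sketch, assuming token log-probabilities of the sampled rewrite are already available from the model:

```python
def scst_loss(sampled_logprobs, reward_sampled, reward_greedy):
    """Self-critical sequence training loss for one example (sketch).

    sampled_logprobs: log P_R(q_{s,t} | q_{s,<t}, q, c) per token of
    the sampled rewrite q_s. The greedy rewrite's reward r' acts as
    the baseline, so no separate value network is needed."""
    advantage = reward_sampled - reward_greedy  # r_s - r'
    # Minimizing this loss raises the sampled sequence's likelihood
    # exactly when it beats the greedy baseline (advantage > 0),
    # and lowers it when it does worse.
    return -advantage * sum(sampled_logprobs)
```

Using the model's own greedy output as the baseline is the core SCST trick: sequences are rewarded only for improving over what the model would have produced at test time.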
We only choose one of the reward functions to obtain the reward for a question.We leave the combination of different rewards as future work.
Additional training procedure details are described in Appendix A.

Following prior work (Vakulenko et al., 2021), we experiment with CANARD (Elgohary et al., 2019) for extractive CQA and CAsT-19 (Dalton et al., 2020) for retrieval CQA. As CAsT-19 is small compared to CANARD, prior work (Vakulenko et al., 2021) uses the same model trained on CANARD to evaluate rewriting performance on the CAsT-19 test set. Similarly, we start with the model trained on CANARD and continue RL training with the BM25 reward on the CAsT-19 training set, without using any of its human annotations.

Evaluation Metrics
We use BLEU-1, BLEU-4, ROUGE-1 and ROUGE-L for automatic evaluation. We also evaluate the performance of rewrites on downstream QA tasks. For CANARD, we use F1 and Exact Match (EM). For CAsT-19, we report MAP, MRR and NDCG@3, as in Vakulenko et al. (2021).
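Of these metrics, Exact Match is the simplest: a binary score over normalized strings. A minimal sketch (lowercasing and whitespace collapse only; the standard SQuAD evaluation additionally strips articles and punctuation):

```python
def exact_match(prediction, gold):
    """Exact Match (EM) sketch: 1.0 iff the normalized strings agree.
    Simplified normalization compared with the official SQuAD script."""
    normalize = lambda s: " ".join(s.lower().split())
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0
```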

Baselines
We consider the following baselines:
Origin uses the original conversational question as the input to QA.
BART CQR: we fine-tune BART (Lewis et al., 2020) as a supervised baseline, with the same training procedure as the pre-training step of our method.
Co-reference (Vakulenko et al., 2021) is a rule-based method: we replace anaphoric expressions in original questions with their antecedents from previous conversation turns. A public neural co-reference model (Lee et al., 2018) is used.
Human uses the human rewrites and can be considered an upper bound. However, we later show that the human baseline is an upper bound for the QR target but not the QA target.

Implementation Details
For all QA models, we simulate the scenario where they are trained on single-turn QA data and cannot be updated when interacting with the rewriting component. The goal is to improve single-turn QA models for CQA, which means the input to the QA models does not include any previous context.
Single-turn Extractive QA Model. To simulate a single-turn extractive QA model, we fine-tune ALBERT-XXLarge-v2 (Lan et al., 2020) on the CANARD training set.
Single-turn Retrieval QA Model. As in Vakulenko et al. (2021), we use Anserini's implementation of BM25 (Robertson et al., 2009) for first-stage retrieval to obtain the top 1000 passages. In the second stage, we use BERT-large for passage re-ranking. Both components are fine-tuned on the MS MARCO dataset so that the two-stage pipeline resembles a single-turn retrieval QA model.
Rewriting Models. Our RL-based methods and the supervised BART baseline (BART CQR) use the BART-base model (Lewis et al., 2020). We use the official CANARD validation set for early stopping. RL-QR denotes the model trained with QR rewards. RL-F1, RL-C and RL-BM25 denote models trained with the F1, confidence and BM25 rewards, respectively.

Results
Here, we study the following research questions:
RQ1: Can our proposed QR and QA rewards improve overall CQA performance? In particular, how effective are unsupervised rewards?
RQ2: Does achieving the best QR target mean achieving the best QA target?
RQ3: What is the quality, as judged by humans, of the reward-guided question rewrites?

Evaluation on Extractive CQA
We list the results on CANARD in Table 2. First, our RL-based methods improve performance on both the QA target and the QR target. Specifically, RL-C outperforms BART CQR by 1.88% and 1.58% in terms of F1 and EM, respectively. RL-QR achieves marginally better scores on BLEU-1, BLEU-4 and ROUGE-L than BART CQR. RL-F1 achieves better F1 and EM scores than RL-QR and BART CQR but does not outperform RL-C. We notice that the F1 reward is less sensitive to question rewrites than the confidence reward: a small change in a question can lead to the same answer and F1 score, whereas the confidence score can still differ. In this respect, RL-C seems to differentiate fluctuations in rewrites better than RL-F1. In answer to RQ1, the confidence reward is the most effective for CQA performance. As an unsupervised reward that requires neither human rewrites nor gold answers, the confidence reward is even more effective than the F1 reward. However, we do not claim or target state-of-the-art performance in this work. The goal is to verify whether our RL framework for CQR with different rewards can further improve the performance of a single-turn QA system for CQA.
Second, using QR rewards (RL-QR) leads to limited performance improvement on both QA and QR metrics compared with BART CQR. Maximizing the ROUGE reward (Eq. 3) and minimizing the cross-entropy loss (Eq. 9) share a similar intuition: a good reformulation from the model should be similar to human-reformulated questions. The two objectives are very close and therefore lead to similar results. It is important to note that the best scores on QR metrics and QA metrics are not achieved by the same method. Moreover, using QA rewards even leads to a large decrease in QR metrics. Therefore, in response to RQ2, achieving the best QR target does not mean achieving the best QA target, and vice versa.
Third, RL-C achieves higher F1 scores than the human baseline. Previous work (e.g., Vakulenko et al., 2021) treats human annotations as an upper bound. However, we argue that more human-like rewrites do not guarantee better QA performance. The results verify our hypothesis that the QA target does not necessarily align with the QR target. In §6.4, we qualitatively analyze whether rewrites generated by RL-C are better than the ground truth.

Training with Fewer Samples
For a real-world CQA system, we can obtain a large number of user questions with no corresponding ground-truth rewrites or answers. Since the confidence reward can be obtained easily from the downstream QA models without requiring human annotations, we can use RL-C to continue training the rewriting model. We first train a baseline using 50% of the training data from CANARD (denoted BART CQR (50%)). Then we continue RL training with the confidence reward using either the same 50% of the data used in pre-training (denoted RL-C (50%)) or all questions in the CANARD training set (denoted RL-C (50%+100%)). The results are summarized in Table 3. We can see that RL-C (50%+100%) benefits from the large number of questions during RL training and achieves better F1 and EM scores than RL-C (50%). Interestingly, RL-C (50%+100%) outperforms the human baseline in Table 2 by 0.31% in terms of F1. We also experimented with other ratios of data for supervised pre-training and continued RL training. In these experiments, we made similar observations: continued RL training with confidence rewards can further improve downstream CQA performance.

Evaluation on Retrieval CQA
For RL-BM25, we use RL-C trained on CANARD as the pretrained model, then train it to maximize the BM25 reward, which can be readily obtained from the retrieval model. Results on CAsT-19 are shown in Table 4. As with extractive CQA, RL-BM25 achieves lower scores on QR metrics than the baselines.

Human Evaluation
In addition to CQA performance, generating user-friendly rewrites is also important for real-world applications, since the rewrites are sometimes displayed to users. To answer RQ3, we perform a user study to evaluate the quality of model-generated rewrites. Specifically, two groups are compared: (1) the first group contains rewrites generated by RL-C and human rewrites; (2) the second group contains rewrites from RL-C and BART CQR. For each group, we randomly choose 200 questions from the CANARD test set. For each pair, we collect human judgments on which rewrite contains more accurate context and details from the conversation history.
The results are shown in Table 5. The study suggests that RL-C performs significantly better than Human and BART CQR (p-value < 0.001, see details in Appendix B.2). Remarkably, annotators prefer the rewrites from RL-C over the human ones in more than 50% of cases. We show two examples in Table 6. In the first example, both RL-C and BART CQR correctly replace the pronoun with the referred person's name. However, the rewrite generated by RL-C includes more accurate details that appear in the conversation history. In the second example, both RL-C (the same holds for RL-F1 and RL-QR) and BART CQR fail to generate the correct person's name. This error might be due to the prior knowledge of BART. To answer RQ3, we find that our reward-guided model can generate rewrites preferred by humans. However, all rewriting models can suffer from coreference resolution errors.

Conclusion
We proposed a conversational question rewriting (CQR) approach using reinforcement learning. Such rewriting approaches are an emerging solution in real-world settings, where QA systems with many existing answering backends trained on standalone questions must be adapted to work in conversational settings.
After assessing various QA and QR rewards, we showed that optimizing QR rewards is of limited use in improving CQA performance. In contrast, QA rewards that do not require ground-truth annotations consistently achieve the best CQA performance over baselines. For extractive CQA, using confidence rewards improved F1 by 2% over the BART-based baseline on CANARD; for retrieval CQA, using BM25 rewards improved the NDCG@3 of the baseline by 3.4% on CAsT-19. A human evaluation also demonstrated that our approach can generate higher-quality rewrites with more accurate and detailed context information.

B Human Study Design
For each annotation, an annotator is presented with the evidence document, the conversation history, the original question and two rewrites. The annotator is required to select one of the four options listed in Table 5. The source of each rewrite is anonymized. For each pair of rewrites, we randomly assign them to the two options so that judgments are not biased by the position of the choices. We collect two judgments per rewrite pair. If there is a tie, we collect additional judgments. The final judgments are based on majority vote.

B.2 Significance Tests
Here we describe how we conduct the Wilcoxon signed-rank test on the annotation results. When comparing RL-C with Human, for each sample, if annotators think RL-C is better, RL-C obtains score 1 and Human obtains score -1. Similarly, if annotators think Human is better, then Human obtains score 1 and RL-C obtains score -1. In other cases (i.e., both are good or both are bad), each obtains score 0. Then we use the method "scipy.stats.wilcoxon" in the scipy library to run the test. Regarding annotator agreement, 48% of samples have 100% agreement and the overall agreement rate is around 80%.
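The scoring scheme above maps each pairwise judgment to paired integer scores. A minimal sketch, with a hypothetical label scheme ("A" / "B" / anything else meaning a tie) for illustration:

```python
def preference_scores(judgments):
    """Convert pairwise preference judgments into paired scores.

    'A' -> system A gets +1 and system B gets -1; 'B' -> the reverse;
    any other label (both good / both bad) -> 0 for each side.
    The two resulting lists form the paired samples for a
    Wilcoxon signed-rank test."""
    a_scores, b_scores = [], []
    for judgment in judgments:
        if judgment == "A":
            a_scores.append(1)
            b_scores.append(-1)
        elif judgment == "B":
            a_scores.append(-1)
            b_scores.append(1)
        else:
            a_scores.append(0)
            b_scores.append(0)
    return a_scores, b_scores
```

The paired lists can then be passed to `scipy.stats.wilcoxon(a_scores, b_scores)` to obtain the p-value reported in the paper.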

C Rewriting Examples
In Table 7, we show examples where the rewrites generated by RL-C are preferred by human annotators over the baseline method and the ground truth. Compared with ground-truth rewrites, RL-C tends to generate rewrites with more factual details, which can help the user, and also downstream QA systems, understand the question without the conversation history. To some degree, this explains why CQA performance improves with RL-C while the corresponding scores on QR metrics (i.e., BLEU-1, BLEU-4, ROUGE-1 and ROUGE-L) are very low. It also indicates that the human ground truth in existing CQR datasets is not perfect, and evaluating CQR models only with QR metrics can be biased.
The cases where both RL-C and the baseline generate incorrect rewrites are shown in Table 8. We can see that both methods make mistakes in coreference resolution. However, RL-C still tends to include more conversational context in its rewrites.
Figure 1: A conversational question rewriting example.

Figure 3 shows the interface for annotators. Figure 4 contains the instructions, which are visible to each annotator. In the instructions, we show several annotation examples (Figure 5).

Figure 5: An annotation example in the instruction.

Table 2: Overall QR and QA performance (%) on CANARD. Bold indicates the best results except "Human". We denote BLEU-n as B-n and ROUGE-n as R-n. † denotes a statistically significant difference from BART CQR (p < 0.05, t-test).