Robustifying Multi-hop QA through Pseudo-Evidentiality Training

This paper studies the bias problem of multi-hop question answering models, of answering correctly without correct reasoning. One way to robustify these models is by supervising to not only answer right, but also with right reasoning chains. An existing direction is to annotate reasoning chains to train models, requiring expensive additional annotations. In contrast, we propose a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidences, without such annotations. Instead, we compare counterfactual changes in answer confidence with and without evidence sentences, to generate “pseudo-evidentiality” annotations. We validate our proposed model on an original set and challenge set in HotpotQA, showing that our method is accurate and robust in multi-hop reasoning.


Introduction
Multi-hop Question Answering (QA) is the task of answering complex questions by connecting information from several texts. Since the information is spread over multiple facts, this task requires capturing multiple relevant facts (which we refer to as evidences) and inferring an answer based on all of these evidences.
However, previous works (Min et al., 2019; Chen and Durrett, 2019; Trivedi et al., 2020) observe "disconnected reasoning" in some correct answers. It happens when models exploit specific types of artifacts (e.g., entity type) as reasoning shortcuts to guess the correct answer. For example, assume that a given question is: "which country got independence when World War II ended?" and a passage is: "Korea got independence in 1945". Although this information is insufficient, because the fact "World War II ended in 1945" is missing, QA models predict "Korea" simply because its answer type is country (that is, by using a shortcut).
* Correspondence to: seungwonh@snu.ac.kr
To address the problem of reasoning shortcuts, we propose to supervise "evidentiality": deciding whether a model answer is supported by correct evidences (see Figure 1). This is related to the problem that most early reader models for QA failed to predict whether a question is answerable. Lacking answerability training, models provided a wrong answer with high confidence when they should have answered "unanswerable". Similarly, we aim to train models to also recognize whether their answer is "unsupported" by evidences. In our work, along with answerability, we train the QA model to identify the existence of evidences by using passages of two types: (1) an evidence-positive set and (2) an evidence-negative set. While the former has both answer and evidence, the latter does not have evidence supporting the answer, so that we can detect models taking shortcuts.
Our first research question is: how do we acquire evidence-positive and evidence-negative examples for training without annotations? For the evidence-positive set, the closest existing approach (Niu et al., 2020) is to consider attention scores, which can be regarded as pseudo-annotation for the evidence-positive set. In other words, a sentence S with high attention scores, often used as an "interpretation" of whether S is causal for the model prediction, can be selected to build the evidence-positive set. However, follow-up works (Serrano and Smith, 2019; Jain and Wallace, 2019) argued that attention is limited as an explanation, because causality cannot be measured without observing model behavior in the counterfactual case of the same passage without S. In addition, sentence causality should be aggregated to measure the group causality of multiple evidences for multi-hop reasoning. To annotate group causality as "pseudo-evidentiality", we propose an Interpreter module, which removes and aggregates evidences into a group, to compare predictions in observational and counterfactual cases.
As a second research question, we ask how to learn from the evidence-positive and evidence-negative sets. To this end, we identify two objectives: (O1) the QA model should not be overconfident on the evidence-negative set, while (O2) it should be confident on the evidence-positive set. A naive approach to pursue the former is to lower the model confidence on the evidence-negative set via regularization. However, such regularization can violate (O2), due to the correlation between the confidence distributions for the evidence-positive and evidence-negative sets. Our solution is to regularize selectively, by purposely training a biased model that violates (O1) and decorrelating the target model from the biased model.
For experiments, we demonstrate the impact of our approach on the HotpotQA dataset. Our empirical results show that our model improves QA performance through pseudo-evidentiality, outperforming the baselines. In addition, our proposed approach can be combined orthogonally with another state-of-the-art (SOTA) model for additional performance gains.

Related Work
Since multi-hop reasoning tasks such as HotpotQA were released, many approaches for the task have been proposed. These approaches can be categorized by the strategies used, such as graph-based networks (Qiu et al., 2019; Fang et al., 2020), external knowledge retrieval (Asai et al., 2019), and supporting fact selection (Nie et al., 2019; Groeneveld et al., 2020).
Our focus is to identify and alleviate reasoning shortcuts in multi-hop QA, without evidence annotations. Models taking shortcuts have been widely observed in various tasks, such as object detection (Singh et al., 2020), NLI (Tu et al., 2020), and our target task of multi-hop QA (Min et al., 2019; Chen and Durrett, 2019; Trivedi et al., 2020), where models learn simple heuristic rules, answering correctly but without proper reasoning.
To mitigate the effect of shortcuts, adversarial examples (Jiang and Bansal, 2019) can be generated, or alternatively, models can be robustified (Trivedi et al., 2020) with additional supervision for paragraph-level "sufficiency", that is, identifying whether a pair of two paragraphs is sufficient for right reasoning, which reduces shortcuts on a single paragraph. While the binary classification for paragraph-sufficiency is relatively easy (96.7 F1 in Trivedi et al. (2020)), our target of capturing finer-grained sentence-evidentiality is more challenging. Existing QA models (Nie et al., 2019; Groeneveld et al., 2020) treat this as a supervised task, based on sentence-level human annotation. In contrast, ours requires no annotation and focuses on avoiding reasoning shortcuts using evidentiality, which was not the purpose of evidence selection in the existing models.

Proposed Approach
In this section, to prevent reasoning shortcuts, we introduce a new approach for data acquisition and learning. We describe the task (Section 3.1) and address two research questions: generating labels for supervision (Section 3.2) and learning from them (Section 3.3).

Task Description
Our task definition follows the distractor setting, one of the two settings (distractor and full-wiki) of the HotpotQA dataset (Yang et al., 2018), which consists of 112k questions requiring the understanding of corresponding passages to answer correctly. Each question has a candidate set of 10 paragraphs (of which two are positive paragraphs $P^+$ and eight are negative paragraphs $P^-$), where the supporting facts for reasoning are scattered across the two positive paragraphs. Then, given a question $Q$, the objective of this task is to aggregate relevant facts from the candidate set and estimate a consecutive answer span $A$. For task evaluation, the estimated answer span is compared with the ground-truth answer span in terms of word-level F1 score.
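The word-level F1 used for evaluation can be illustrated with a minimal sketch. It assumes whitespace tokenization and lowercasing only; evaluation scripts for this kind of task typically apply further answer normalization, which is omitted here. The function name `word_f1` is hypothetical.

```python
from collections import Counter


def word_f1(prediction: str, ground_truth: str) -> float:
    """Word-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; otherwise there is no overlap.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared word at most as often
    # as it appears in both spans.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction that contains the gold span plus extra words keeps full recall but loses precision, so partially correct spans receive partial credit.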

Generating Examples for Training Answerability and Evidentiality

Answerability for Multi-hop Reasoning
For answerability training in single-hop QA, datasets such as SQuAD 2.0 (Rajpurkar et al., 2018) provide labels of answerability, so that models can be trained not to be overconfident on unanswerable text. Similarly, we build triples of question $Q$, answer $A$, and passage $D$, to be labeled for answerability. The HotpotQA dataset pairs $Q$ with 10 paragraphs, where evidences can be scattered across two paragraphs. Based on this characteristic, concatenating the two positive paragraphs is guaranteed to be answerable/evidential, and concatenating two negative paragraphs (with neither evidence nor answer) is guaranteed to be unanswerable. We define the set of answerable triples $(Q, A, D)$ as the answer-positive set $A^+$, and the unanswerable set as the answer-negative set $A^-$. From these labels, we train a transformer-based model to classify answerability (the details are discussed in the next section).
However, answerability cannot supervise whether the given passage has all of these relevant evidences for reasoning. This causes a lack of generalization ability, especially on examples with an answer but no evidence.
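The construction of answer-positive and answer-negative triples described above can be sketched as follows. This is only an illustration: the function name, the use of `None` as the "no answer" label, and the choice to enumerate every pair of negative paragraphs are assumptions, not details given in the text.

```python
from itertools import combinations


def build_answerability_sets(question, answer, paragraphs, positive_ids):
    """Return (answer_positive, answer_negative) lists of (Q, A, D) triples.

    paragraphs: list of paragraph strings (10 per question in HotpotQA).
    positive_ids: set with the indices of the two positive paragraphs.
    """
    pos = [p for i, p in enumerate(paragraphs) if i in positive_ids]
    neg = [p for i, p in enumerate(paragraphs) if i not in positive_ids]
    # The two positive paragraphs together are answerable and evidential.
    answer_positive = [(question, answer, " ".join(pos))]
    # Two negative paragraphs contain neither the answer nor evidence.
    # Enumerating all pairs is an assumption of this sketch.
    answer_negative = [
        (question, None, " ".join(pair)) for pair in combinations(neg, 2)
    ]
    return answer_positive, answer_negative
```

With two positive and eight negative paragraphs, this yields one answer-positive passage and 28 answer-negative passages per question; a real pipeline might subsample the negatives.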

Evidentiality for Multi-hop Reasoning
While learning answerability, we aim to capture the existence of reasoning chains in the given passage. To supervise the existence of evidences, we construct two sets of examples, an evidence-positive set and an evidence-negative set, as shown in Figure 1.
Specifically, let $E^*$ be the ground truth of evidences to infer $A$, and $S^*$ be a sentence containing an answer $A$, corresponding to $Q$. Given $Q$ and $A$, the expected labels $V_E$ of evidentiality, indicating whether the evidences for answering are sufficient in the passage, are as follows:

$$V_E(Q, A, D) \models \mathrm{True} \iff E^* \subset D, \qquad V_E(Q, A, D) \models \mathrm{False} \iff E^* \not\subset D \ \wedge\ S^* \in D \tag{1}$$

We define the set of passages satisfying $V_E \models \mathrm{True}$ as the evidence-positive set $E^+$, and the set satisfying $V_E \models \mathrm{False}$ as the evidence-negative set $E^-$. Since we do not use human annotations, we aim to generate "pseudo-evidentiality" annotations.

First, for the evidence-negative set, we modify the answer sentence $S^*$ and unanswerable passages, and generate examples of the following three types:
• 1) Answer Sentence Only: we remove all sentences in the answerable passage except $S^*$, such that the input passage $D$ becomes $S^*$, which contains a correct answer but no other evidences. That is, $V_E(Q, A, S^*) \models \mathrm{False}$.
• 2) Answer Sentence + Irrelevant Facts: we use irrelevant facts together with the answer as context, by concatenating $S^*$ and an unanswerable $D$. That is, $V_E(Q, A, [S^*; D]) \models \mathrm{False}$.
• 3) Partial Evidence + Irrelevant Facts: we use partially relevant and irrelevant facts as context, by concatenating $D_1 \in P^+$ and $D_2 \in P^-$. That is, $V_E(Q, A, [D_1; D_2]) \models \mathrm{False}$.

Second, building an evidence-positive set is more challenging, because it is difficult to capture multiple relevant facts with neither annotations $E^*$ nor supervision. Our distinction is obtaining the above annotation from the model itself, by interpreting the internal mechanism of the model. On a trained model, we aim to find the sentences that are influential in predicting the correct answer $A$, among the sentences in an answerable passage. Then, we consider them as a pseudo evidence-positive set. Since such pseudo labels rely on a trained model that is not perfect, Eq. (1) is not guaranteed to hold, though we observe 87% empirical recall (Table 1).
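The three evidence-negative types listed above amount to simple string manipulations on sentences and paragraphs. A minimal sketch, assuming passages are plain strings joined by a space; the function and key names are hypothetical.

```python
def build_evidence_negatives(answer_sentence, unanswerable_passage,
                             positive_paragraph, negative_paragraph):
    """Return the three evidence-negative passage types as a dict."""
    return {
        # 1) Answer sentence only: the answer is present, no other evidence.
        "answer_only": answer_sentence,
        # 2) Answer sentence followed by irrelevant facts.
        "answer_plus_irrelevant": f"{answer_sentence} {unanswerable_passage}",
        # 3) One positive paragraph (partial evidence) plus one negative one.
        "partial_plus_irrelevant": f"{positive_paragraph} {negative_paragraph}",
    }
```

All three passages are labeled as evidence-negative, because none of them contains the complete reasoning chain.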
Section 1 discusses how interpretation, such as attention scores (Niu et al., 2020), can serve as pseudo-evidentiality. For QA tasks, an existing approach (Perez et al., 2019) uses answer confidence for finding pseudo-evidences, as we discuss below.

(A) Accumulative interpreter: to consider multiple sentences as evidences, the existing approach (Perez et al., 2019) iteratively inserts the sentence $S_i$ with the highest probability gain into the set $E^{t-1}$ at the $t$-th iteration, as follows:

$$\Delta P_{S_i} = P(A \mid Q, S_i \cup E^{t-1}) - P(A \mid Q, E^{t-1}), \qquad E^t = E^{t-1} \cup \{\arg\max_{S_i} \Delta P_{S_i}\} \tag{2}$$

where $E^0$ starts with the sentence $S^*$ containing the answer $A$, which is the minimal context for our task. This method can consider multiple sentences as evidence by inserting them iteratively into a set, but cannot consider the effect of erasing sentences from the reasoning chain.

(B) Our proposed Interpreter: to enhance interpretability, we consider both erasing and inserting each sentence, in contrast to the accumulative interpreter, which considers only the latter. Intuitively, erasing evidence would change the prediction significantly if that evidence is causally salient, which we compute as follows:

$$\Delta P_{S_i} = P(A \mid Q, D) - P(A \mid Q, D \setminus (S_i \cup E^{t-1})) \tag{3}$$

where $D \setminus S$ is the passage $D$ without the sentences in $S$ (e.g., $D \setminus S_i$ is the passage without sentence $S_i$). We hypothesize that breaking the reasoning chain by erasing $S_i$ should significantly decrease $P(A \mid \cdot)$. In other words, an $S_i$ with higher $\Delta P_{S_i}$ is salient. Combining the two saliency scores in Eqs. (2) and (3), our final saliency is as follows:

$$\Delta P_{S_i} = P(A \mid Q, S_i \cup E^{t-1}) - P(A \mid Q, E^{t-1}) + P(A \mid Q, D) - P(A \mid Q, D \setminus (S_i \cup E^{t-1})) \tag{4}$$

where the constant values, $P(A \mid Q, E^{t-1})$ and $P(A \mid Q, D)$, can be omitted in the argmax. At each iteration, the sentence that maximizes $\Delta P_{S_i}$ is selected, as done in Eq. (2). This promotes selections that increase the confidence $P(A \mid \cdot)$ on important sentences and decrease the confidence on unimportant sentences. We stop the iterations if $\Delta P_{S_i} < 0$ or $t = T$; the sentences in the final set $E^t$ then form the pseudo evidence-positive set $E^+$. To reduce the search space, we empirically set $T = 5$.
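The greedy insert-and-erase procedure of the Interpreter can be sketched as below. This is a rough illustration under stated assumptions, not the authors' implementation: `confidence` is a hypothetical callable standing in for $P(A \mid Q, \cdot)$ from a trained QA model, the erasure term removes the candidate together with the already selected sentences, and `max_steps` plays the role of $T$.

```python
from typing import Callable, List, Sequence


def select_pseudo_evidence(
    sentences: Sequence[str],
    answer_idx: int,
    confidence: Callable[[List[str]], float],
    max_steps: int = 5,
) -> List[int]:
    """Greedy selection of pseudo-evidence sentence indices."""
    selected = [answer_idx]  # E^0 holds only the answer sentence S*
    p_full = confidence(list(sentences))  # confidence on the full passage D
    for _ in range(max_steps):
        candidates = [i for i in range(len(sentences)) if i not in selected]
        if not candidates:
            break
        p_prev = confidence([sentences[j] for j in sorted(selected)])
        best_i, best_gain = None, float("-inf")
        for i in candidates:
            group = set(selected) | {i}
            kept = [sentences[j] for j in sorted(group)]
            erased = [s for j, s in enumerate(sentences) if j not in group]
            # Insertion gain plus the confidence drop caused by erasure.
            gain = (confidence(kept) - p_prev) + (p_full - confidence(erased))
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_gain < 0:
            break  # no remaining sentence is salient
        selected.append(best_i)
    return sorted(selected)
```

Each step costs two forward passes per candidate sentence, which is one reason to cap the number of iterations.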
Briefly, we obtain the labels of answerability and evidentiality as follows:
• Answer-positive $A^+$ and answer-negative $A^-$ sets: the former has both answer and evidences, and the latter has neither.
• Evidence-positive $E^+$ and evidence-negative $E^-$ sets: the former is expected to have all the evidences, and the latter has an answer with no evidence.

Learning Answerability & Evidentiality
In this section, our goal is to learn the above labels of answerability and evidentiality.

Supervising Answers and Answerability (Base)
As optimizing the QA model is not our focus, we adopt the existing model of Min et al. (2019). As the architecture of the QA model, we use a powerful transformer-based model, RoBERTa (Liu et al., 2019), where the input is [CLS] question [SEP] passage [EOS]. The output of the model is as follows:

$$h = \mathrm{RoBERTa}(\mathrm{input}) \in \mathbb{R}^{n \times d}, \qquad P^s = \mathrm{softmax}(f_1(h)), \qquad P^e = \mathrm{softmax}(f_2(h)) \tag{5}$$

where $f_1$ and $f_2$ are fully connected layers with trainable parameters in $\mathbb{R}^d$, $P^s$ and $P^e$ are the probabilities of the start and end positions, $d$ is the output dimension of the encoder, and $n$ is the size of the input sequence.
For answerability, they build a classifier on the hidden state $h_{[0,:]}$ of the [CLS] token, which represents both $Q$ and $D$. As the HotpotQA dataset covers both yes-or-no and span-extraction questions, we follow the convention of Asai et al. (2019) and support both as a multi-class classification problem of predicting four probabilities:

$$P^{cls} = \mathrm{softmax}(W_1 h_{[0,:]}) = [p_{span}, p_{yes}, p_{no}, p_{none}] \tag{6}$$

where $p_{span}$, $p_{yes}$, $p_{no}$, and $p_{none}$ denote the probabilities of the answer type being span, yes, no, and no answer, respectively, and $W_1 \in \mathbb{R}^{4 \times d}$ is the trainable parameter matrix. For training the answer span and its class, the loss function of example $i$ is the sum of cross-entropy losses ($D_{CE}$), as follows:

$$L_A(i) = D_{CE}(P^s_i, s_i) + D_{CE}(P^e_i, e_i) + D_{CE}(P^{cls}_i, c_i) \tag{7}$$

where $s_i$ and $e_i$ are the start and end positions of answer $A$, respectively, and $c_i$ is the index of the actual class $C_i$ in example $i$.
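The sum of cross-entropy losses over start position, end position, and answer type can be made concrete with a small dependency-free sketch. It works on plain lists of logits rather than encoder outputs, and the function names are hypothetical; a real implementation would use a deep learning framework.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-probability assigned to the target index."""
    return -math.log(probs[target])


def base_loss(start_logits, end_logits, class_logits,
              start, end, answer_class) -> float:
    """Sum of cross-entropy losses for start, end, and answer type."""
    return (cross_entropy(softmax(start_logits), start)
            + cross_entropy(softmax(end_logits), end)
            + cross_entropy(softmax(class_logits), answer_class))
```

Here `class_logits` has four entries, one per answer type (span, yes, no, no answer), while the start and end logits have one entry per input token.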

Supervising Evidentiality
As overviewed in Section 1, the Base model is reported to take a shortcut, or a direct path between answer $A$ and question $Q$, neglecting the implicit intermediate evidences. For (O1), as a naive approach, one may consider a regularization term to avoid overconfidence on the evidence-negative set $E^-$. An overconfident answer distribution diverges from the uniform distribution, such that the Kullback-Leibler (KL) divergence $KL(p \| q)$, where $p$ and $q$ are the answer probabilities and the uniform distribution, respectively, is high when the model is overconfident:

$$R(i) = D_{KL}(P^s_i \,\|\, P_{uniform}) + D_{KL}(P^e_i \,\|\, P_{uniform}) \tag{8}$$

where $P_{uniform}$ indicates the uniform distribution. This regularization term $R$ forces the answer probabilities on $E^-$ to be closer to the uniform one.
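To see why the KL divergence to the uniform distribution measures overconfidence, a minimal sketch follows. It assumes a single probability vector over $n$ positions; the function name is hypothetical.

```python
import math


def kl_to_uniform(probs):
    """KL(p || uniform) for a probability vector over n positions."""
    n = len(probs)
    # Each term is p * log(p / (1/n)); zero-probability entries contribute 0.
    return sum(p * math.log(p * n) for p in probs if p > 0.0)
```

The value is 0 for a uniform prediction and reaches its maximum, $\log n$, for a one-hot prediction, so minimizing it on evidence-negative passages pushes the answer distribution toward low confidence.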
However, one reported risk (Utama et al., 2020; Grand and Belinkov, 2019) is that suppressing data with biases has a side-effect of lowering confidence on unbiased data (especially on in-distribution). Similarly, in our case, regularizing to keep the confidence low for E − , can cause lowering that for E + , due to their correlation. In other words, pursuing (O1) violates (O2), which we observe later in Figure 3. Our next goal is thus to decorrelate two distributions on E + and E − to satisfy both (O1) and (O2).
Figure 2(b) shows how we feed the hidden states $h$ into two predictors. Predictor $f$ is for learning the target distribution, and predictor $g$ is purposely trained to be overconfident on the evidence-negative set $E^-$, where this biased answer distribution is denoted as $\hat{P}$. We regularize the target distribution $P$ to diverge from the biased distribution $\hat{P}$.
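Formally, the biased answer distributions $\hat{P}$ ($\hat{P}^s$ and $\hat{P}^e$) are as follows:

$$\hat{P}^s = \mathrm{softmax}(g_1(h)), \qquad \hat{P}^e = \mathrm{softmax}(g_2(h)) \tag{9}$$

where $g_1$ and $g_2$ are fully connected layers with trainable parameters in $\mathbb{R}^d$. Then, we optimize $\hat{P}$ to predict answer $A$ on the evidence-negative set $E^-$, which makes layer $g$ biased (taking shortcuts), and regularize $f$ by maximizing the KL divergence between $P$ and the fixed $\hat{P}$. The regularization term of example $i \in E^-$ is as follows:

$$\hat{R}(i) = D_{CE}(\hat{P}^s_i, s_i) + D_{CE}(\hat{P}^e_i, e_i) - \lambda \left[ D_{KL}(\hat{P}^s_i \,\|\, P^s_i) + D_{KL}(\hat{P}^e_i \,\|\, P^e_i) \right] \tag{10}$$

where $\lambda$ is a hyper-parameter. This loss $\hat{R}$ is optimized only on the evidence-negative set $E^-$.

Lastly, to pursue (O2), we train on $E^+$, as done on $A^+$. However, in the initial steps of training, our Interpreter is not reliable, since the QA model is not yet trained enough. We thus train without $E^+$ for the first $K$ epochs, then extract $E^+$ at epoch $K$ and continue to train on all sets, as shown in Figure 2(a). In the final loss function, we apply different losses to the sets $E$ and $A$:

$$L_{total} = \sum_{i \in A^+ \cup A^-} L_A(i) + \sum_{i \in E^-} \hat{R}(i) + u(t - K) \sum_{i \in E^+} L_A(i) \tag{11}$$

where $L_A(i)$ is the Base loss of example $i$ (the sum of cross-entropy losses for the answer span and its class), and the function $u$ is a delayed step function (1 when epoch $t$ is greater than $K$, 0 otherwise).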
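The interplay of the biased predictor, the decorrelating regularizer, and the delayed use of $E^+$ can be sketched on plain probability vectors. Several details here are assumptions of the sketch rather than facts from the text: the direction of the KL term, the way the per-set losses are summed, and all function names. In a real implementation the biased distribution would also be excluded from the gradient of the KL term.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two probability vectors of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def delayed_step(epoch: int, k: int) -> int:
    """u(t - K): 1 once the epoch exceeds K, else 0."""
    return 1 if epoch > k else 0


def biased_regularizer(biased_start, biased_end, target_start, target_end,
                       start, end, lam=0.01):
    """Loss on an evidence-negative example.

    The cross-entropy terms train the biased predictor g to answer anyway;
    the negative KL term rewards the target distribution for moving away
    from the biased one.
    """
    ce = -math.log(biased_start[start]) - math.log(biased_end[end])
    kl = (kl_divergence(biased_start, target_start)
          + kl_divergence(biased_end, target_end))
    return ce - lam * kl


def total_loss(loss_answerability, loss_evidence_negative,
               loss_evidence_positive, epoch, k=3):
    """Combine per-set losses; E+ only contributes after epoch K."""
    return (loss_answerability + loss_evidence_negative
            + delayed_step(epoch, k) * loss_evidence_positive)
```

The delayed step keeps unreliable pseudo-labels out of the objective until the QA model has been trained for $K$ epochs.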

Passage Selection at Inference Time
Our multi-hop QA task requires finding answerable passages, with both answer and evidence, among the candidate passages. While we can access the ground truth of answerability in the training set, we need to identify the answerability of $(Q, D)$ at inference time. For this, we consider two directions: (1) Paragraph Pair Selection, which is specific to HotpotQA, and (2) a Supervised Evidence Selector trained on pseudo-labels.
For (1), we consider the data characteristic mentioned in Section 3.1: we know that one pair of paragraphs is answerable/evidential (when both paragraphs are positive, i.e., in $P^+$). Thus, the goal is to identify the answerable pair of paragraphs among all possible pairs $P_{ij} = \{(p_i, p_j) : p_i \in P, p_j \in P\}$ (denoted as paired-paragraph). We can let the model select the pair with the highest estimated answerability, $1 - p_{none}$ in Eq. (6), and predict answers on the paired passage, which is likely to be evidential.
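Paragraph pair selection reduces to an argmax over candidate pairs. A minimal sketch, assuming a hypothetical callable `p_none` that returns the model's "no answer" probability for a concatenated passage, and assuming unordered pairs of distinct paragraphs:

```python
from itertools import combinations
from typing import Callable, Sequence, Tuple


def select_paragraph_pair(
    paragraphs: Sequence[str],
    p_none: Callable[[str], float],
) -> Tuple[int, int]:
    """Return the index pair with the highest answerability 1 - p_none."""
    return max(
        combinations(range(len(paragraphs)), 2),
        key=lambda ij: 1.0 - p_none(paragraphs[ij[0]] + " " + paragraphs[ij[1]]),
    )
```

With 10 candidate paragraphs this scores 45 concatenated passages per question, and the answer is then predicted on the selected pair.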
For (2), some pipelined approaches (Nie et al., 2019; Groeneveld et al., 2020) design an evidence selector that extracts the top $k$ sentences from all candidate paragraphs. While they supervise the model using the ground truth of evidences, we assume there is no such annotation, and thus train on the pseudo-labels $E^+$. We denote this setting as selected-evidences. For the evidence selector, we follow the extraction method of Beltagy et al. (2020), where a special token [S] is added at the end of each sentence, and $h_{[S_i]}$ from BERT indicates the $i$-th sentence embedding. Then, a binary classifier $f_{evi}(h_{[S_i]})$ is trained on the pseudo-labels, where $f_{evi}$ is a fully connected layer. During training, the classifier identifies whether each sentence is evidence-positive (1) or evidence-negative (0). At inference time, we first select the top 5 sentences from the paragraph candidates (Table 1 shows the precision and recall of the top 5 sentences), and then feed the selected evidences into the QA model for testing.

While we discuss how to get the answerable passage above, we can use the passage setting for evaluation. To show the robustness of our model, we construct a challenge test set by excluding easy examples (i.e., examples on which it is easy to take shortcuts). To detect such easy examples, we build a set of single paragraphs $P_i$, none of which is evidential in HotpotQA, as the dataset avoids having all evidences in a single paragraph to discourage single-hop reasoning. If the QA model predicts the correct answer on an (unevidential) single paragraph, we remove that example from HotpotQA, and define the remaining set as the challenge set.
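The challenge-set construction is a filter over examples. The sketch below assumes each example is a dict with `question`, `answer`, and `paragraphs` keys, and that `predict` and `is_correct` are hypothetical callables wrapping the QA model and the answer-matching criterion. Treating an example as easy when any single paragraph suffices is this sketch's reading of the description.

```python
def build_challenge_set(examples, predict, is_correct):
    """Keep examples the model cannot answer from any single paragraph."""
    challenge = []
    for ex in examples:
        # An example counts as easy if some unevidential single paragraph
        # already yields the correct answer (a reasoning shortcut).
        solved_by_shortcut = any(
            is_correct(predict(ex["question"], paragraph), ex["answer"])
            for paragraph in ex["paragraphs"]
        )
        if not solved_by_shortcut:
            challenge.append(ex)
    return challenge
```

The resulting subset depends on which model is used for `predict`, so the same detection model should be used whenever challenge-set results are compared.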

Experiment
In this section, we formulate our research questions to guide our experiments and describe evaluation results corresponding to each question.

Research Questions
To evaluate the effectiveness of our method, we address the following research questions:
• RQ1: How effective is our proposed method for a multi-hop QA task?
• RQ2: How accurate is the pseudo-evidentiality annotation generated by our Interpreter?
• RQ3: Does our method avoid reasoning shortcuts in unseen data?

Implementation Our implementation settings for the QA model follow RoBERTa (Base version with 12 layers) (Liu et al., 2019). We use the Adam optimizer with a learning rate of 0.00005 and a batch size of 8 on an RTX Titan GPU. We extract the evidence-positive set after 3 epochs ($K = 3$ in Eq. (11)) and retrain for 3 epochs. As a hyper-parameter, we search $\lambda$ among {1, 0.1, 0.01}, and find the best value ($\lambda = 0.01$) based on a 5% hold-out set sampled from the training set.

Metrics We report the standard F1 score for HotpotQA, to evaluate the overall QA accuracy in finding the correct answers. For evidence selection, we also report F1 score, Precision, and Recall to evaluate the sentence-level evidence retrieval accuracy.

RQ1: QA Effectiveness
Evaluation Set
• Original Set: We evaluate our proposed approach on the multi-hop reasoning dataset HotpotQA (Yang et al., 2018).

Baselines, Our Models, and Competitors As a baseline, we follow the previous QA model (Min et al., 2019) trained on single paragraphs. We test our model on the single-paragraph, paired-paragraph, and selected-evidences settings discussed in Section 3.4. As a strong competitor, among the released models for HotpotQA, we implement a state-of-the-art model (Asai et al., 2019), which uses external knowledge and a graph-based retriever.
Main Results This section presents the results of our model for multi-hop reasoning. As shown in Table 2, our full model outperforms the baselines on both the original and challenge sets. We can further observe that i) when tested on single paragraphs, where models are forced to take shortcuts, our model (O-I) is worse than the baseline (B-I), which indicates that B-I learned the shortcuts. In contrast, O-II outperforms B-II on paired paragraphs, where at least one passage candidate has all the evidences.
ii) When tested on evidences selected by our method (O-III), we can improve F1 scores on both original set and challenge set. This noise filtering effect of evidence selection, by eliminating irrelevant sentences, was consistently observed in a supervised setting (Nie et al., 2019;Groeneveld et al., 2020;Beltagy et al., 2020), which we could reproduce without annotation.
iii) Combining our method with the SOTA model (C-I) (Asai et al., 2019) leads to accuracy gains on both sets. C-I has the distinction of using external knowledge of reasoning paths, which lets it outperform models without such advantages, but our method contributes complementary gains.

Ablation Study As shown in Table 3, we conduct an ablation study of O-III in Table 2. In (A), we remove the $E^+$ generated by the Interpreter at training time. Without $E^+$, the performance of the QA model decreases significantly, suggesting the importance of the evidence-positive set. In (B), we remove the evidentiality labels of both $E^+$ and $E^-$, and observe that the performance drop is larger than for the other variants. Through (A) and (B), we show that training on our evidentiality labels can increase QA performance. In (C), we replace $\hat{R}$ with $R$, removing the layer $g$ that is trained for biased features. With the replaced regularization, the performance also decreases, suggesting that training with $\hat{R}$ is effective for a multi-hop QA task.

RQ2: Evaluation of Pseudo-Evidentiality Annotation
In this section, we evaluate the effectiveness of our Interpreter, which generates evidences on the training set without supervision. We compare the pseudo-evidences with human annotations at the sentence level. For evaluation, we measure sentence-level F1 score, Precision, and Recall, following the evidence selection evaluation of Yang et al. (2018).
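Sentence-level precision, recall, and F1 compare the set of selected sentence indices with the annotated evidence set. A minimal sketch for a single example, with a hypothetical function name:

```python
def evidence_prf(predicted: set, gold: set):
    """Sentence-level (precision, recall, F1) for one example."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Selecting five sentences when only two are annotated as evidence caps precision at 0.4 even with perfect recall, which helps explain why a recall-oriented selector can show modest precision.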
As a baseline, we implement a retrieval-based model, AIR (Yadav et al., 2020), which is an unsupervised method like ours. As shown in Table 4, our Interpreter on our QA model outperforms the retrieval-based method in terms of F1 and Recall, while the baseline (AIR) achieves the highest precision (63.06%). We argue that recall, which aims at identifying all evidences, is more critical for multi-hop reasoning and for our goal of avoiding disconnected reasoning, as long as precision remains higher than the precision of the answerable set $A^+$ (36.94%) in Table 1.
As variants of our method, we test our Interpreter on various models. First, when comparing (a) and (c), our full model (c) outperforms the baseline (a) on all metrics. The baseline (a), trained on single paragraphs, is biased, and thus the evidences generated by the biased model are less accurate. Second, the variant (b), trained with $R$, outperforms our full model (c). In Eq. (8), the loss term $R$ does not train layer $g$ for biased features, unlike $\hat{R}$ in Eq. (10). This shows that learning $g$ results in performance degradation for evidence selection, despite the performance gain in QA.

RQ3: Generalization
In this section, to show that our model avoids reasoning shortcuts on unseen data, we analyze the confidence distributions of models on the evidence-positive and evidence-negative sets. In the dev set, we treat the ground truth of evidences as $E^+$, and a single sentence containing the answer as $E^-$ (each has 7K Q-D pairs). On these sets, Figure 3 shows the confidence $P(A \mid Q, D)$ of three models, (a), (b), and (c), mentioned in Section 4.2. We sort the confidence scores in ascending order, where the y-axis indicates the confidence and the x-axis refers to the sorted index. Thus, the colored area indicates the dominance of the confidence distribution. Ideally, for a debiased model, the area on the evidence-positive set should be large, while that on the evidence-negative set should be small.
Desirably, in Figure 3(a), the area under the curve for $E^-$ should decrease in pursuit of (O1), moving along the blue arrow, while that of $E^+$ should increase for (O2), as the red arrow shows. In Figure 3(b), our model with $R$ follows the blue arrow, with a smaller area under the curve for $E^-$, while keeping that of $E^+$ comparable to Figure 3(a). For comparison, Figure 3(d) shows all curves on $E^+$. In Figure 3(c), our full model follows the directions of both the blue and red arrows, which indicates that ours satisfies both (O1) and (O2).
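The sorted-confidence curves used in this analysis can be summarized numerically. The sketch below assumes a list of per-example confidences and normalizes the area by the number of examples; the function name is hypothetical.

```python
def sorted_confidence_area(confidences):
    """Return the ascending confidence curve and its normalized area."""
    curve = sorted(confidences)
    # With unit spacing on the x-axis, the area divided by the number of
    # points equals the mean confidence.
    area = sum(curve) / len(curve)
    return curve, area
```

Under this normalization, comparing areas between evidence-positive and evidence-negative sets is equivalent to comparing mean confidences, while the curve itself additionally shows how the confidence is distributed across examples.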

Conclusion
In this paper, we propose a new approach to train multi-hop QA models not to take reasoning shortcuts, i.e., guessing right answers without sufficient evidences. We do not require annotations and generate pseudo-evidentiality instead, regularizing the QA model against overconfidence when evidences are insufficient. Our experimental results show that our method outperforms baselines on HotpotQA and effectively distinguishes between the evidence-positive and evidence-negative sets.