HOP, UNION, GENERATE: Explainable Multi-hop Reasoning without Rationale Supervision

Explainable multi-hop question answering (QA) not only predicts answers but also identifies rationales, i.e., subsets of input sentences used to derive the answers. This problem has been extensively studied under the supervised setting, where both answer and rationale annotations are given. Because rationale annotations are expensive to collect and not always available, recent efforts have been devoted to developing methods that do not rely on supervision for rationales. However, such methods have limited capacities in modeling interactions between sentences, let alone reasoning across multiple documents. This work proposes a principled, probabilistic approach for training explainable multi-hop QA systems without rationale supervision. Our approach performs multi-hop reasoning by explicitly modeling rationales as sets, enabling the model to capture interactions between documents and sentences within a document. Experimental results show that our approach is more accurate at selecting rationales than the previous methods, while maintaining similar accuracy in predicting answers.


Introduction
Multi-hop reasoning is an important capability for any intelligent machine comprehension system. Question answering (QA) is a common application for evaluating a system's ability to reason across multiple steps (Geva et al., 2021; Yang et al., 2018; Welbl et al., 2018). Large language models have achieved tremendous success on challenging QA tasks, even in the few-shot setting (Wei et al., 2022). However, Min et al. (2019) and Chen and Durrett (2019) demonstrate that these models, in reality, often bypass multi-hop reasoning by performing shallow pattern matching, resulting in poor generalization ability (Tang et al., 2021). To avoid predictions made from such reasoning shortcuts, it is important to understand the series of steps the systems follow to derive the answers. This work explores the challenge of building explainable multi-hop QA systems, which, in addition to predicting an answer, also identify a rationale: the set of sentences that lead to the answer. Depending on task specifications, the rationale can be within a single document or span multiple documents. Explainable multi-hop QA has been extensively studied in the supervised setting, where both rationale annotations and answer annotations are given. These approaches either apply multi-task loss functions (Joshi et al., 2020; Groeneveld et al., 2020; DeYoung et al., 2020) or design specialized network architectures (Tu et al., 2019; Fang et al., 2020). However, having access to rationale annotations is a strong assumption. In practice, they are expensive to collect (Geva et al., 2021), less available than answer annotations (Welbl et al., 2018), and can suffer from low agreement rates between annotators (Zhang et al., 2020).
Researchers have thus explored approaches that do not require rationale annotations (Lewis et al., 2020b; Glockner et al., 2020; Atanasova et al., 2022). However, these previous approaches limit their reasoning to information from one or two sentences, and so they cannot be applied in multi-hop scenarios, i.e., QA tasks that require making connections between several pieces of information across sentences and across documents. Additionally, these methods are either restricted to multiple-choice QA, or restricted to producing rationales at the document level but not at the sentence level.
We propose HOP, UNION, GENERATE (HUG), a principled, probabilistic approach for training explainable multi-hop QA systems without rationale supervision. HUG overcomes the two-sentence limitation of previous methods by directly reasoning about rationales as sets of sentences, while also extending rationale prediction to the multi-document setting. We show an overview of HUG in Figure 1. HUG leverages the naturally hierarchical structure of text and proceeds in three stages: it first selects the relevant set of documents given the question (Hop); then, it selects a subset of sentences within each of the relevant documents and collects them together (Union); finally, it generates an answer via a seq2seq model with all the collected sentences (Generate). The key to multi-hop reasoning in HUG is modeling each selection as an explicit distribution over sets. A probabilistic set distribution compares non-contiguous and variable-size rationales, affording HUG flexibility for rationale selection.
Training a set-prediction model quickly becomes intractable as the set size increases. We make two algorithmic choices that lead to tractable training for HUG. First, treating rationales as a latent variable requires HUG to marginalize over all possible rationales, leading to an intractable learning objective. HUG overcomes this issue by performing sampling hierarchically: it first identifies the most promising documents and then the most promising sentences within those documents. Second, multi-hop QA often involves reasoning over long documents, which is challenging due to the computational complexity of encoding long documents with neural models such as transformers. To make this encoding efficient, HUG performs computation in the embedding space.
We empirically evaluate HUG on four different multi-hop QA datasets: HotpotQA (Yang et al., 2018), MuSiQue (Trivedi et al., 2022), FEVER (DeYoung et al., 2020), and MultiRC (DeYoung et al., 2020). Results show that, for both selecting rationales and predicting answers, HUG is better than a number of state-of-the-art semi-supervised and unsupervised methods (Chen et al., 2019; Lewis et al., 2020b; Glockner et al., 2020; Atanasova et al., 2022) on all of the datasets. We also demonstrate that HUG combined with larger language models consistently produces better performance. We then analyze performance according to different types of multi-hop reasoning and show that, by explicitly modeling multi-hop reasoning, HUG achieves a large improvement on the reasoning type that requires bridging entities.

Related Work
Explainable methods for multi-hop QA. Active research has been devoted to collecting human rationales for a wide range of QA tasks; a recent survey has identified 65 datasets that provide explanation annotations (Wiegreffe and Marasovic, 2021). The appearance of such datasets has enabled rapid progress in supervised methods for extracting reasoning chains; we refer readers to Thayaparan et al. (2020) for a comprehensive survey. While supervised methods such as Qi et al. (2019) have achieved tremendous success on retrieving rationales, even in the open-domain setting, they can only be applied when rationale annotations are available. However, such annotations do not always exist: Welbl et al. (2018) and Yu et al. (2022) propose two complex reasoning datasets that do not have rationale annotations. In these cases, we need unsupervised rationale selection methods to still build explainable QA systems.
Other works have explored multi-hop QA with only answer supervision but not rationale supervision. As in our work, Retrieve and Generate (RAG) (Lewis et al., 2020b) treats the rationale as a latent variable; however, in RAG the open retrieval stage finds a single document, ignoring the connections between different documents. RAG also does not produce sentence-level rationales. Other works consider sentence rationales: Glockner et al. (2020) compute a score for every sentence pair and pick the sentence pair that has the highest score as the rationale for answer prediction, and Atanasova et al. (2022) perform binary classification on whether individual sentences are included in the rationale with added constraints such as consistency and faithfulness. The shared limitation of these methods is that they do not capture the dependency between more than two pieces of information. HUG overcomes this limitation by performing multi-hop reasoning as document set prediction and sentence set prediction.
Outside of unsupervised methods, Chen et al. (2019) propose a semi-supervised method which collects silver rationale annotations. However, their method is limited to bridge-based questions, which is only one form of multi-hop reasoning; the other types can be found in Trivedi et al. (2022).
Rationales as latent variables. A focus for rationale methods in NLP outside of multi-hop QA has been identifying subsets of input tokens to justify decisions. For text classification, Lei et al. (2016), Bastings et al. (2019), and Chen and Ji (2020) frame rationales as minimal subsets of input tokens. For multi-hop QA, where input tokens are too granular a representation for rationales, treating sentences as rationales within long documents leads to the challenges of hierarchical selection and the representation of long documents, both of which we address with HUG.
Outside of using input tokens for rationales, Zhou et al. (2020) assume rationales take the form of unconstrained text. While flexible, this approach leads to computationally expensive training methods. Therefore, we constrain rationales to be a set of sentences from given documents, which both accommodates the production of useful intermediate reasoning steps and keeps training tractable.
Unsupervised retrieval. A task closely related to our setting (i.e., no access to rationale supervision) is unsupervised retrieval, which searches for sentences relevant to the questions but does not predict answers. For example, one could apply Yadav et al. (2019), Yadav et al. (2020), Zhao et al. (2021) and Xu et al. (2021) to first identify the rationale for a multi-hop QA example and predict an answer based only on the rationale rather than on the entire document, so that the answer prediction is more constrained. However, HUG may be preferred over these unsupervised retrieval methods because 1) they assume specific types of QA formats (Xu et al., 2021), whereas HUG works on any type of QA problem, and 2) while other works also propose to model rationales as a latent variable, we additionally introduce the hierarchical structure in our probabilistic model, enabling efficient inference.

x: Emily Beecham is best known for her role in a television series whose second season premiered on what date?
d1: [1] Emily Beecham is an English-American actress. [2] She is best known for her role in the AMC television series "Into the Badlands". [3] In 2011, she received the Best Actress award at the London Independent Film Festival.
d2: [4] Into the Badlands is an American television series that premiered on AMC November 15, 2015. [5] The series features a story about a warrior and a young boy who journey through a dangerous feudal land together seeking enlightenment. [6] AMC renewed the show for a 10-episode second season, which premiered on March 19, 2017. [7] On April 25, 2017, AMC renewed the series for a 16-episode third season.
Figure 2: A QA example. The rationale z used to derive the answer is highlighted in blue italics, the document-level interaction is highlighted in red boldface, and the sentence-level interaction (i.e., coreference resolution) is underlined. HUG models dependencies both between documents and between sentences within a document, thus being equipped with the capacity to perform multi-hop reasoning.
Generative Multi-Hop QA
In the standard multi-hop QA setting, an example consists of a question x, a set of documents D, and an answer y. Within D, some documents are relevant to the question, while the others are distractors. Explainable multi-hop QA models predict a rationale z, a minimal set of sentences across the relevant documents, in addition to predicting the answer y. We show a multi-hop QA example (with distracting documents omitted) in Figure 2.

Model
We propose the following generative model for multi-hop QA. Given the question $x$, we first select a subset of documents $d = \{d_1, d_2, \ldots\} \subseteq D$. Next, within each document $d_i$, we select a subset of sentences $z_i$. Finally, conditioned on the union of sentence sets from each document, $z = \cup_i z_i$, we generate an answer $y$. The only assumption we make in the model is that sentence sets are selected independently among documents. Formally, we write the model as
$$\begin{aligned} p(y, z, d \mid x) ={}& p(d \mid x) && (1)\\ &\cdot \prod_{d_i \in d} p(z_i \mid d_i, x) && (2)\\ &\cdot\, p(y \mid z, x). && (3) \end{aligned}$$
We refer to Eq. 1 as the document set selection model, Eq. 2 as the sentence set selection model, and Eq. 3 as the answer generation model.
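To make the three-stage factorization concrete, the following is a minimal sketch, not the authors' implementation. It assumes the joint probability is the product of the document set probability, one sentence set probability per selected document, and the answer probability, as the paragraph above describes. The function names, sentence ids, and all probability values are hypothetical.

```python
import math


def union_of_sentence_sets(per_document_sets):
    """Union step: z is the union of the sentence sets z_i chosen per document."""
    z = set()
    for z_i in per_document_sets:
        z |= set(z_i)
    return z


def joint_log_prob(log_p_docs, log_p_sentence_sets, log_p_answer):
    """log p(y, z, d | x) as a sum of three kinds of terms.

    log_p_docs: log-probability of the selected document set.
    log_p_sentence_sets: one log-probability per selected document; summing
        them encodes the independence assumption across documents.
    log_p_answer: log-probability of the answer given the rationale.
    """
    return log_p_docs + sum(log_p_sentence_sets) + log_p_answer


# Made-up numbers for an example with two selected documents.
z = union_of_sentence_sets([{1, 2}, {4, 6}])
log_joint = joint_log_prob(
    math.log(0.5),                     # document set selection
    [math.log(0.4), math.log(0.25)],   # sentence set selection, one per document
    math.log(0.8),                     # answer generation
)
joint = math.exp(log_joint)  # 0.5 * 0.4 * 0.25 * 0.8
```

Working in log space is a design choice of this sketch; it keeps the product numerically stable when many documents are selected.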

Document Set Selection
We select a set of documents $d$ by directly parameterizing a distribution over all valid document sets. We rely on a document set scoring function $f(d, x)$, which captures both the relevance of the document set $d$ to the question $x$, as well as the dependencies among the documents in the set. The document set selection model is given by
$$p(d \mid x) = \frac{\exp f(d, x)}{\sum_{d'} \exp f(d', x)},$$
where the sum ranges over all valid document sets $d' \subseteq D$. This distribution is globally normalized over all valid subsets of the documents $D$, requiring the evaluation of the document scoring function $f$ on all valid document subsets. Document set validity is dataset specific, and is discussed in Section 4. For efficiency, the document set scoring function $f$ first computes embeddings of each document in the set $d$ independently, then combines them with a neural network (MLP). Formally, let $\mathrm{emb} : V^* \to \mathbb{R}^n$ be an embedding function that maps a sequence of text to an $n$-dimensional vector, where $V$ is the vocabulary. The document set scoring function is given by
$$f(d, x) = \mathrm{MLP}\big(\mathrm{emb}(d_1, x), \ldots, \mathrm{emb}(d_{|d|}, x)\big).$$
We provide the details of the MLP in Appendix A and the details of the embedding function below, as part of the sentence selection model description.
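As an illustration of global normalization over document sets, the sketch below enumerates every subset of a fixed size and applies a softmax to the set scores. It assumes that valid sets are exactly the subsets of size k, and it replaces the learned scoring function with a hypothetical hand-written one (`toy_score`, with made-up relevance and link values); neither is taken from the paper.

```python
import itertools
import math


def document_set_distribution(doc_ids, k, score_fn):
    """Softmax over all size-k document subsets, scored jointly by score_fn."""
    subsets = list(itertools.combinations(doc_ids, k))
    scores = [score_fn(s) for s in subsets]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return {s: e / total for s, e in zip(subsets, exps)}


# Hypothetical stand-in for f(d, x): per-document relevance plus a bonus
# when two documents are linked (e.g. by a bridging entity).
RELEVANCE = {"A": 1.0, "B": 0.2, "C": 0.9, "D": 0.0}
LINK_BONUS = {("A", "B"): 2.0}


def toy_score(subset):
    return sum(RELEVANCE[doc] for doc in subset) + LINK_BONUS.get(subset, 0.0)


dist = document_set_distribution(["A", "B", "C", "D"], 2, toy_score)
best = max(dist, key=dist.get)  # ("A", "B"), although B alone scores low
```

In this toy setting the pair ("A", "B") wins because the score is assigned to the set as a whole; ranking documents one at a time would have picked A and C.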
Sentence Set Selection
Within each document $d_i$, we select $z_i \in \mathcal{P}(d_i)$, where $\mathcal{P}(d_i)$ is the power set of the sentences in $d_i$. We rely on a sentence set scoring function $g(z_i, x)$, similar to the document set scoring function, which captures all relationships between selected sentences and the question. The sentence set selection model is given by
$$p(z_i \mid d_i, x) = \frac{\exp g(z_i, x)}{\sum_{z' \in \mathcal{P}(d_i)} \exp g(z', x)},$$
which is globally normalized over all valid subsets of sentences in the document $d_i$.
Computing $p(z_i \mid d_i, x)$ requires enumerating all sentence subsets, which is intractable. We instead extend the approach of Li et al. (2022), which obtains document and contextual sentence representations in a single encoding step. We insert a special [SPC] token at the beginning of each sentence, at positions $k_1, k_2, \ldots$ of the input sequence. We then obtain the sentence subset embedding $\mathrm{emb}(z_i, x)$ by feeding this sequence to an encoder-only model such as BERT (Devlin et al., 2019) and taking the average of the contextual embeddings of the special tokens corresponding to the sentences in $z_i$:
$$\mathrm{emb}(z_i, x) = \frac{1}{|z_i|} \sum_{s_j \in z_i} h_{k_j},$$
where $h_{k_j}$ denotes the encoder output at position $k_j$. Finally, let $v$ be a learnable vector; the sentence set score is $g(z_i, x) = v^\top \mathrm{emb}(z_i, x)$. In practice, we note that encoder methods have a maximum input length, which can prevent full document encodings. We provide the details of long document encoding in Appendix B. We also only consider subsets up to a fixed maximum size.
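The subset embedding and scoring steps can be sketched as follows, assuming the encoder has already been run once and returned one vector per input token. Plain Python lists stand in for tensors, and the token embeddings, marker positions, and the vector `v` are made-up values; excluding the empty subset is also an assumption of this sketch.

```python
from itertools import combinations


def subset_embedding(token_embeddings, spc_positions, subset):
    """Average the encoder outputs at the [SPC] markers of the chosen sentences.

    token_embeddings: one vector per input token, from a single encoder pass.
    spc_positions: spc_positions[j] is the input position of sentence j's marker.
    subset: indices of the sentences in the candidate rationale.
    """
    vectors = [token_embeddings[spc_positions[j]] for j in subset]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def score_subset(v, embedding):
    """Dot product between the learnable vector v and the subset embedding."""
    return sum(a * b for a, b in zip(v, embedding))


def enumerate_subsets(num_sentences, max_size):
    """All non-empty sentence subsets up to a fixed maximum size."""
    for size in range(1, max_size + 1):
        yield from combinations(range(num_sentences), size)


# Toy encoder output: 6 tokens, markers at positions 0, 2 and 4.
TOKEN_EMBEDDINGS = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0],
                    [9.0, 9.0], [3.0, 2.0], [9.0, 9.0]]
SPC_POSITIONS = [0, 2, 4]
V = [1.0, -1.0]

emb = subset_embedding(TOKEN_EMBEDDINGS, SPC_POSITIONS, (0, 2))  # [2.0, 1.0]
score = score_subset(V, emb)                                     # 1.0
candidates = list(enumerate_subsets(3, 2))                       # 3 singletons + 3 pairs
```

Because every subset embedding is a cheap average of vectors that are computed once, scoring many candidate subsets does not require re-running the encoder.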
Answer Generation
We parameterize the answer generation model, $p(y \mid z, x)$, with a sequence-to-sequence model: the question and rationale are fed to an encoder, and the answer is generated. This process is complicated by the fact that answers can take different forms depending on the specific QA task, such as Boolean QA, multiple-choice QA, extractive QA, and abstractive QA. We therefore use a sequence-to-sequence model such as BART (Lewis et al., 2020a) or T5 (Raffel et al., 2020), and provide prompt templates for the different variants of the task, given in the next section.

Training and Inference
To learn an explainable multi-hop QA system, HUG optimizes an approximation of the marginal likelihood. The marginal likelihood,
$$p(y \mid x) = \sum_{d} \sum_{z} p(d \mid x)\, p(z \mid d, x)\, p(y \mid z, x), \qquad p(z \mid d, x) = \prod_{d_i \in d} p(z_i \mid d_i, x),$$
is intractable, as it requires computing $p(y \mid z, x)$ under the answer generation model for every valid set of sentences across documents. We instead optimize a top-K Viterbi approximation of the marginal likelihood. Given the $K$ most likely document sets $S^K$ under $p(d \mid x)$ and, for each $d \in S^K$, the $K'$ most likely rationales $S^{K'}_d$ under $p(z \mid d, x)$, we use the following approximation of the marginal likelihood as our training objective:
$$\log \sum_{d \in S^K} \sum_{z \in S^{K'}_d} p(d \mid x)\, p(z \mid d, x)\, p(y \mid z, x).$$
At test time, we must choose the best documents and rationales. Similar to training, we first choose the most likely document set from $S^1$, then the most likely rationale from $S^1_d$. Finally, for span-based QA, we generate an answer by performing greedy search on the answer generation model $p(y \mid z, x)$; for Boolean QA or multiple-choice QA, we normalize the answers between different choices and take $\arg\max_y p(y \mid z, x)$.
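The hierarchical top-K idea described above can be illustrated with a small sketch. It assumes the approximate objective is the log of the sum, over the most likely document sets and then the most likely rationales within each, of the product of the three model probabilities; this reading, the function names, and all toy probabilities are assumptions rather than the authors' code.

```python
import heapq
import math


def top_k(dist, k):
    """The k most probable (item, probability) pairs of a distribution."""
    return heapq.nlargest(k, dist.items(), key=lambda kv: kv[1])


def approx_log_marginal(p_docs, p_rationales, p_answer, k_docs, k_rationales):
    """Log of the marginal likelihood restricted to top-K documents and rationales."""
    total = 0.0
    for d, p_d in top_k(p_docs, k_docs):
        for z, p_z in top_k(p_rationales[d], k_rationales):
            total += p_d * p_z * p_answer[z]  # one answer-model call per kept z
    return math.log(total)


def predict_rationale(p_docs, p_rationales):
    """Inference: most likely document set, then most likely rationale within it."""
    d, _ = top_k(p_docs, 1)[0]
    z, _ = top_k(p_rationales[d], 1)[0]
    return d, z


# Made-up distributions; rationales are tuples of sentence ids.
P_DOCS = {("A", "B"): 0.7, ("A", "C"): 0.2, ("B", "C"): 0.1}
P_RATIONALES = {
    ("A", "B"): {(1, 4): 0.6, (1, 5): 0.3, (2, 4): 0.1},
    ("A", "C"): {(1, 7): 0.9, (2, 7): 0.1},
    ("B", "C"): {(4, 7): 1.0},
}
P_ANSWER = {(1, 4): 0.9, (1, 5): 0.1, (2, 4): 0.2,
            (1, 7): 0.05, (2, 7): 0.05, (4, 7): 0.3}

approx = math.exp(approx_log_marginal(P_DOCS, P_RATIONALES, P_ANSWER, 2, 2))
exact = math.exp(approx_log_marginal(P_DOCS, P_RATIONALES, P_ANSWER, 3, 3))
prediction = predict_rationale(P_DOCS, P_RATIONALES)
```

With these toy numbers the restricted sum is 0.409 against an exact marginal of 0.453, so the truncation is a lower bound that keeps most of the probability mass while calling the answer model on only four rationales instead of six.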

Experimental Setup
Datasets and Their Representations. We evaluate HUG on four multi-hop QA datasets: HotpotQA in the distractor setting (Yang et al., 2018), MuSiQue with answerable questions (Trivedi et al., 2022), FEVER (Thorne et al., 2018), and MultiRC (Khashabi et al., 2018). HotpotQA and MuSiQue are extractive QA datasets that require reasoning over multiple Wikipedia documents and identifying a span of text as the answer. In HotpotQA, each example contains ten candidate documents, and we must identify exactly two documents ($|d| = 2$) that are relevant. FEVER is a fact-checking dataset that requires verifying claims made based on Wikipedia articles ($|d| = 1$). MultiRC is a multiple-choice QA dataset collected from diverse sources of documents including narrative stories and news articles, and its questions can only be answered by reasoning over multiple sentences ($|d| = 1$). Unlike conventional multiple-choice tasks (Lai et al., 2017; Richardson et al., 2013), MultiRC does not pre-specify the number of correct answer choices, resulting in a more challenging setting. For FEVER and MultiRC, we consider the ERASER version (DeYoung et al., 2020), where rationale annotations are made cleaner and evaluation metrics are provided. In Table 1, we demonstrate how to convert QA examples of these datasets to a natural language prompt format (Brown et al., 2020). In FEVER, a claim x needs to be classified as supported (1) or refuted (0) given the accompanying documents D. For MultiRC, there is a varying number of choices per example, and an unknown number of the choices are correct; we stack all answer choices together with their truth values as outputs to supervise the model. Finally, extractive QA can naturally be formulated as a text-to-text problem, where the input is x [SEP] z, and the output is y.
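The conversion to a text-to-text format can be sketched as below for the two formats whose wording is recoverable from the text: the Boolean template shown in Table 1 and the x [SEP] z input for extractive QA. The function names are hypothetical, and joining rationale sentences with a single space is an assumption.

```python
def format_boolean_example(claim, rationale_sentences, label=None):
    """FEVER-style Boolean QA: returns (source text, target text or None)."""
    facts = " ".join(rationale_sentences)
    source = f"A claim to be verified is that {claim} We have following facts: {facts}"
    if label is None:  # inference time: no target available
        return source, None
    verdict = "supported" if label == 1 else "refuted"
    return source, f"The claim is thus {verdict}."


def format_extractive_example(question, rationale_sentences, answer=None):
    """Extractive QA: the input is 'x [SEP] z' and the output is the answer span."""
    return f"{question} [SEP] {' '.join(rationale_sentences)}", answer


source, target = format_boolean_example(
    "Steve Wozniak designed homes.",
    ["Steve Wozniak primarily designed the 1977 Apple II."],
    label=0,
)
```

Keeping every task in the same source/target string format is what allows a single seq2seq answer model to be reused across Boolean, multiple-choice, and extractive QA.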
Metrics and Comparison Systems. We compute F1 scores for rationale selection, document selection, and answer prediction. F1 scores for rationales are computed at the sentence level. Because the QA datasets are in different formats, F1 scores for answers are computed differently. For extractive QA, F1 scores are computed at the token level for the answer spans. For Boolean QA and multiple-choice QA, F1 scores measure categorical answers.
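For reference, the two F1 variants can be sketched as follows. This is a simplified illustration: it tokenizes by lowercasing and splitting on whitespace, whereas official evaluation scripts typically apply further answer normalization, and the function names are hypothetical.

```python
from collections import Counter


def set_f1(predicted, gold):
    """F1 between a predicted and a gold set (e.g. rationale sentence ids)."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def token_f1(prediction, gold):
    """Token-level F1 between a predicted and a gold answer span."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


sentence_f1 = set_f1({2, 6, 7}, {2, 6})                     # 0.8
answer_f1 = token_f1("on March 19, 2017", "March 19, 2017")  # 6/7
```

The same `set_f1` function applies to document selection by passing document ids instead of sentence ids.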
For each dataset, we compare to (1) state-of-the-art approaches that require no rationale supervision and (2) at least one fully supervised method (i.e., answers and rationales available for training). The latter provides an upper bound on performance.
- On HotpotQA and MuSiQue, we compare to a rule-based approach, BM25, and to RAG (Lewis et al., 2020b) as unsupervised baselines. We note that RAG only performs document-level retrieval, and therefore its current form cannot be directly applied to identifying sentence-level rationales. We modify RAG to treat a sentence as a document, and at inference we take the top-3 sentences to be the rationale, as this results in the highest sentence F1 scores. For fair comparison, we parameterize RAG in the same way as we parameterize HUG.
We also consider a semi-supervised approach, CHAIN (Chen et al., 2019), which assumes no access to gold rationale annotations and supervises its model on silver rationales produced with external entity taggers. The fully supervised system we consider is SAE (Tu et al., 2020). Both CHAIN and SAE use RoBERTa-large as sentence encoders.
- On FEVER and MultiRC, we also compare to RAG as an unsupervised baseline (predicting the top-2 sentences for MultiRC, and the top-1 sentence for FEVER). Additionally, we consider diagnostics-guided explanation generation (DIAGNOSTICS) (Atanasova et al., 2022) and faithful rationales (FAITHFUL) (Glockner et al., 2020). Both of these methods have two variants: one trained with rationale supervision (denoted by RS-*) and the other trained without rationale annotations (denoted by RU-*). On MultiRC, we also compare to WT5 (Narang et al., 2020).
Implementation and Hyperparameters. We test HUG with language models of both small and large sizes. For the small version (HUG-Small), we use distilBERT (Sanh et al., 2019) as the encoder and BART-base as the seq2seq model. For the large version (HUG), we use RoBERTa-large as the encoder and BART-large as the seq2seq model. We only test RAG in the small version. We implement HUG with Hugging Face Transformers (Wolf et al., 2020). We perform grid search with learning rates {5e-6, 1e-5, 2e-5} and batch sizes {2, 4, 8, 16} for both HUG and RAG. We train our system for 3 epochs for HotpotQA and 5 epochs for the other three datasets. We warm up the learning rate over the first 10% of training examples. We choose the checkpoint that has the highest answer F1 score on the validation set. We consider rationales of up to four sentences for HotpotQA and rationales of up to three sentences for the other datasets. Finally, we take $S^{10}$ and $S^{9}_d$ for HotpotQA and $S^{80}_d$ for MultiRC. Because we remove rationales whose sentences are not contiguous in FEVER, we are able to compute the exact likelihood without top-K sampling.

Results
HotpotQA. We summarize the results on HotpotQA in Table 2. HUG-Small outperforms the best unsupervised approach RAG-Small by 18 sentence F1 points, demonstrating superior multi-hop reasoning abilities. HUG-Small, despite having fewer parameters, outperforms the semi-supervised CHAIN on predicting rationales and is comparable to CHAIN on predicting answers. While CHAIN explicitly exploits the heuristics used in the data collection process for HotpotQA, HUG-Small is able to learn such heuristics in a fully automatic way. Finally, the gap between HUG and SAE, a fully supervised method that is given both rationale and answer annotations, remains large.

MuSiQue. Table 2 shows that HUG-Small is better at both identifying rationales and predicting answers than RAG-Small, the best-performing unsupervised baseline. Due to the difficulty of MuSiQue, it is harder for an unsupervised method to learn to select rationales: the gap between the supervised method and the best unsupervised method on this dataset is greater than the gap on HotpotQA.
FEVER. We summarize the results on FEVER in Table 3. In terms of selecting rationales in an unsupervised manner, HUG-Small outperforms RU-DIAGNOSTICS, performs similarly to RAG-Small, and underperforms RU-FAITHFUL by a small margin. Because FEVER mostly requires only single-hop reasoning, HUG-Small does not improve over the previous methods. On predicting answers, HUG-Small outperforms RU-FAITHFUL but underperforms RAG-Small. Compared to the supervised RS-DIAGNOSTICS and RS-FAITHFUL, which have access to both the answers and rationales during training, the gap between HUG-Small's and their rationale scores (sentence F1) understandably remains large.
MultiRC. Table 3 shows that HUG-Small outperforms all comparison models, including RAG-Small, the best competing approach without rationale supervision, by 3.5 sentence F1 points and 4.3 answer F1 points.
Scaling HUG to Larger Models. On all datasets, increasing the number of model parameters consistently improves HUG's performance. Additionally, as the number of reasoning hops increases, HUG benefits more from the larger language models: compared to HUG-Small, HUG has the least improvement on FEVER and the most improvement on MuSiQue.

Analysis
Document Dependencies. HUG explicitly models the dependencies between documents for multi-hop reasoning. We consider independent document selection (HUG-Ind) to see whether these dependencies are necessary on the HotpotQA dataset.
To understand how document modeling impacts rationale selection performance, we break the performance down by the reasoning types proposed in Yang et al. (2018): comparison-based reasoning and bridge-based reasoning. In comparison-based reasoning, relevant documents independently contribute to the answer, whereas for bridge-based reasoning, relevant documents require connections to previously selected documents. Table 4 summarizes answer F1 scores, document F1 scores, and sentence F1 scores. While HUG-Ind is slightly better at comparison-based reasoning than the joint model, it fails at bridge-based reasoning; this result thus confirms the necessity of modeling the dependency between documents. We also note that HUG-Ind and HUG have similar performance in predicting answers, but the gap in how accurately they select rationales is large, suggesting that HUG-Ind often derives answers with wrong reasoning. Overall, HUG is better than HUG-Ind at both predicting answers and selecting rationales.

Q: When Copsi was made earl of Northumbria he went to reside in a town at the confluence of which two rivers?
Document A, Copsi: Copsi survived Tostig's defeat at Stamford Bridge, and when William the Conqueror prevailed at Hastings he travelled, in March 1067, to pay William homage at Barking (where William was staying while his tower was being constructed in London). In return, William made Copsi earl of Northumbria and sent him back to York.
Document B, York: York is a historic walled city at the confluence of the rivers Ouse and Foss in North Yorkshire, England. The municipality is the traditional county town of the historic county of Yorkshire to which it gives its name.
Document C, Two Rivers Press: Two Rivers Press is an independent publishing house, based in the English town of Reading. Two Rivers Press was founded in 1994 by Peter Hay (1951-2003).
A: Ouse and Foss
Figure 3: A HotpotQA example where there is a dependency between two supporting documents, and thus selecting the second document independently of the first one results in insufficient information. The correct rationale is highlighted in blue italics. Entity overlaps between questions and documents are in red boldface. HUG-Ind predicted Documents B and C, whose reasoning remains at the surface level as they share the most entities with the question. HUG predicted Documents A and B, which demonstrates its ability to understand dependencies between documents.

In addition to the quantitative analysis, we also qualitatively compare the two models in Figure 3. When considering paragraphs independently, documents A and C share the most entities with the question (i.e., Copsi, earl of Northumbria, and Two Rivers), so they are more likely to lead to the answer. However, the correct documents are A and B. Deriving B not only depends on the question but also further requires knowing the information from A. Therefore, while having the independent document selection model can improve efficiency because it only performs one-step reasoning, the joint document selection model is necessary when reasoning steps depend on one another.
Role of Answer Generation. HUG uses a generative model (BART) to parameterize p(y | z, x). An alternative approach would be to use a classification model such as RoBERTa (Liu et al., 2019) to predict answers for FEVER and MultiRC (HotpotQA requires a generative model). Interestingly, Table 5 shows the choice of answer model significantly impacts the ability of HUG to learn a rationale model. On FEVER, where claims cannot be verified without the corresponding rationales, BART and RoBERTa perform similarly. However, on MultiRC, where questions can often be answered without information in accompanying documents, the best generative model outperforms the best classification model by over 32 sentence F1 points. Figure 4 shows an example of such a question where the answers can be guessed by the classification model using commonsense knowledge to reason about law. Generative models need to assign a high probability to every token in the answer, and we hypothesize that they make better use of the answer supervision.

Figure 4 example.
Q: What did the judge tell Mr. Thorndike about the law?
A1: Cannot be swayed by wealth or political influences.
A2: The law is not vindictive.
A3: It was not vindictive.
A4: It was unjust.
A5: It was vindictive.
A6: The judge told Mr. Thorndike that the law is not vindictive. He said the law only wishes to be just. Judge said the law cannot be swayed by wealth or political influences.
Speed evaluation. While HUG obtains strong sentence F1 scores, training is more expensive because the model must consider a set of rationales for every example. In particular, the answer model p(y | z, x) must be run for every sampled z for each training example. At inference, the answer model requires only a single evaluation of p(y | z, x) for arg max z p(z | x). We empirically measure the runtime overhead of HUG compared to FAITHFUL on MultiRC, using 80 samples of z at training time. We report the total training time and inference time in Table 6. Compared to FAITHFUL, HUG takes longer to train and to predict.

Conclusion
We present HUG, a probabilistic, principled approach for explainable multi-hop reasoning without rationale supervision. HUG explicitly models multi-hop reasoning by considering the dependency between documents and between sentences within a document. Experimental results demonstrate that HUG outperforms other state-of-the-art methods that do not rely on rationale labels.

Ethics Statement
The goal of explainable methods is to improve the trustworthiness of systems. HUG presents a method for fine-tuning language models for selecting rationales, without rationale annotations, that exploits the knowledge already present in pretrained language models. While this has the potential of improving the trustworthiness of the model, it may also reinforce existing harmful biases in the language model.

For extending this parameterization to large document sets, we could use a similar parameterization to the sentence set scoring function:

B Encoding long documents
Transformer-based text encoders can only accept inputs shorter than a fixed length (e.g., 512 tokens).
To address this limitation, we partition documents into slices of m sentences and compute the embedding for each slice individually. We denote a slice of a document d as $d_{i:j}$, which starts at the i-th sentence and ends before the j-th sentence. We set the slice length m purely based on whether the longest slice is under 512 tokens. m is set to 3 for HotpotQA, 5 for FEVER, and 9 for MultiRC.
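The slicing step can be sketched as below, assuming non-overlapping slices of m consecutive sentences. The whitespace-based token counter is a hypothetical stand-in for the encoder's real tokenizer, and the helper names are not from the paper.

```python
def slice_document(sentences, m):
    """Partition a document into consecutive slices of at most m sentences."""
    if m < 1:
        raise ValueError("slice length m must be at least 1")
    return [sentences[i:i + m] for i in range(0, len(sentences), m)]


def whitespace_token_count(sentence):
    """Stand-in token counter; a real system would use the encoder's tokenizer."""
    return len(sentence.split())


def longest_slice_tokens(documents, m, count_tokens):
    """Token length of the longest slice across all documents for a given m."""
    return max(
        sum(count_tokens(sentence) for sentence in piece)
        for document in documents
        for piece in slice_document(document, m)
    )


doc = [f"Sentence number {i} ." for i in range(7)]  # 7 sentences, 4 tokens each
slices = slice_document(doc, 3)                     # slice sizes 3, 3, 1
longest = longest_slice_tokens([doc], 3, whitespace_token_count)  # 12 tokens
```

A candidate m would then be accepted only if `longest` stays below the encoder's input limit (512 tokens in the setting described above).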

C Model selection
In the unsupervised sentence selection setting, we cannot perform model selection by choosing the model with the highest validation sentence F1 score. Instead, we must rely on answer evaluation measures: validation answer F1, answer EM, or likelihood. We train HUG for three epochs, checkpoint every 2500 steps, and evaluate sentence F1 for the checkpoint with the best validation performance measure. The results of these selection methods are presented in Table 7. We find that performing model selection via both answer EM and answer F1 results in the best sentence F1, but the differences between different metrics are minor.

where k is the number of documents pre-specified by the task.

E Revealing dataset shortcuts with HUG
We show that HUG is able to discover examples in which answers can be derived with reasoning shortcuts. Yang et al. (2018) claim that all HotpotQA examples require reasoning over two documents, but we identify a number of examples that violate this property with the following steps. First, we look for the examples for which HUG correctly predicts the answer but incorrectly predicts the rationale. Of those examples, we look for the subset where only one document is correctly selected: if conditioning on that one document already leads to the answer, the other document is redundant. Finally, we manually go through the filtered examples and find that many of the questions can be answered with one document. Figure 5 shows two types of reasoning shortcuts that we found. The first question implies both airports are in the same country, and thus looking up one of the airports is sufficient. In the second question, Document B alone contains the correct answer.
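The automatic part of this filtering procedure can be sketched as follows; the final manual inspection is not modeled. The record fields (`pred_answer`, `gold_docs`, and so on) are a hypothetical layout for model predictions, and treating a correct answer as an exact string match is a simplifying assumption.

```python
def find_shortcut_candidates(examples):
    """Ids of examples with a correct answer, a wrong rationale, and exactly
    one correctly selected document."""
    candidates = []
    for ex in examples:
        answer_correct = ex["pred_answer"] == ex["gold_answer"]
        rationale_correct = set(ex["pred_sents"]) == set(ex["gold_sents"])
        docs_correct = len(set(ex["pred_docs"]) & set(ex["gold_docs"]))
        if answer_correct and not rationale_correct and docs_correct == 1:
            candidates.append(ex["id"])
    return candidates


# Toy predictions; sentences are (document, index) pairs.
EXAMPLES = [
    {"id": "q1", "pred_answer": "United States", "gold_answer": "United States",
     "pred_docs": ["B", "C"], "gold_docs": ["A", "B"],
     "pred_sents": [("B", 0)], "gold_sents": [("A", 0), ("B", 0)]},
    {"id": "q2", "pred_answer": "Ouse and Foss", "gold_answer": "Ouse and Foss",
     "pred_docs": ["A", "B"], "gold_docs": ["A", "B"],
     "pred_sents": [("A", 1), ("B", 0)], "gold_sents": [("A", 1), ("B", 0)]},
    {"id": "q3", "pred_answer": "Reading", "gold_answer": "Ouse and Foss",
     "pred_docs": ["B", "C"], "gold_docs": ["A", "B"],
     "pred_sents": [("C", 0)], "gold_sents": [("A", 1), ("B", 0)]},
]

shortcut_ids = find_shortcut_candidates(EXAMPLES)  # only "q1" passes all filters
```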

Figure 1: An overview of HUG, which proceeds in three stages. Hop explicitly considers all possible document sets and selects the most likely document set, Union explicitly considers all sentence subsets and chooses the most likely sentence subset within each selected document, and Generate combines the chosen sentence subsets and generates an answer.

Figure 4: A test example from MultiRC that can be answered with commonsense reasoning and thus requires no accompanying documents. Correct answers are highlighted in blue italics.

For independent document selection, we train a different document selection model that factors as $p(d \mid x) = \prod_{d_i \in d} p(d_i \mid x)$.

Figure 5: Dataset shortcuts. Two HotpotQA examples that do not need both documents to derive the answers.

Table 1: Seq-to-Seq prompt templates for QA examples for FEVER as Boolean QA (BQA), MultiRC as multiple-choice QA (MCQ), and HotpotQA as extractive QA (EQA). Templates are in purple cells, followed by specific examples in green cells. Template keywords are highlighted in red italics.
BQA template
  In: A claim to be verified is that {x} We have following facts: {z}
  Out: The claim is thus {supported/refuted}.
BQA example
  In: A claim to be verified is that Steve Wozniak designed homes. We have following facts: Steve Wozniak primarily designed the 1977 Apple II, known as one of the first highly successful mass-produced microcomputers.
  Out: The claim is thus refuted.

Table 2: Performance comparison on predicting rationales and answers on HotpotQA and MuSiQue.

Table 3: Performance comparison on predicting rationales and answers on Eraser-FEVER and Eraser-MultiRC. HUG has more parameters due to its use of a seq2seq model for answer generation.

Table 4: Rationale selection performance broken down by different types of reasoning.

Table 5: Comparison of sentence F1 scores between different parameterization choices of p(y | z, x).

Table 6: Runtime comparison (in seconds). HUG uses 80 rationale samples at training time, and the argmax rationale at inference.

Table 7: Sentence F1 scores from checkpoints that are chosen based on different criteria.
Q: Watertown International Airport and Blue Grass Airport, are in which country?
Document A, Blue Grass Airport: Blue Grass Airport is a public airport in Fayette County, Kentucky, 4 miles west of downtown Lexington.
Document B, Watertown International Airport: Watertown International Airport is a county owned, public use airport located in Jefferson County, New York, United States.
A: United States

Q: Who is also an actor, Luis Llosa or Ron Howard?
Document A, Luis Llosa: Luis Llosa (born 1951) is a Peruvian film director.
Document B, Ronald William Howard: Ronald William Howard (born March 1, 1954) is an American actor and filmmaker.
A: Ronald William Howard