Context Quality Matters in Training Fusion-in-Decoder for Extractive Open-Domain Question Answering

Retrieval-augmented generation models augment the knowledge encoded in a language model by providing additional relevant external knowledge (context) during generation. Although it has been shown that the quantity and quality of context impact the performance of retrieval-augmented generation models during inference, limited research explores how these characteristics affect model training. This paper explores how context quantity and quality during model training affect the performance of Fusion-in-Decoder (FiD), the state-of-the-art retrieval-augmented generation model, on extractive open-domain question answering tasks. Experimental results suggest that FiD models overfit to the context quality seen during training and show suboptimal performance when evaluated on contexts of different quality. Through the experimental results, we also reveal that FiD models trained with different context quality have different cross-attention distribution patterns. Specifically, as context quality during training increases, FiD models tend to attend more uniformly to each passage in the context. Finally, based on these observations, we propose a method to mitigate overfitting to specific context quality by introducing a bias to the cross-attention distribution, which we demonstrate to be effective in improving the performance of FiD models on contexts of different quality.


Introduction
Recently, large-scale pre-trained language models have achieved impressive performance in the field of Natural Language Generation, which includes tasks that require real-world knowledge, e.g., closed-book question answering and common sense reasoning (Brown et al., 2020). However, these models are still prone to generating factually incorrect outputs known as hallucinations (Ji et al., 2023), particularly when dealing with rare entities (Mallen et al., 2023). Also, they cannot handle new information that arises after their training phase (Kasai et al., 2022).
In order to address these challenges, retrieval-augmented generation models have been recently proposed (Izacard and Grave, 2021b; Lewis et al., 2020). These models draw inspiration from retrieval-based extractive open-domain question answering methods (Chen et al., 2017) and utilize additional relevant external knowledge (e.g., a Wikipedia article about an entity in a given question) during generation to augment the knowledge encoded in a language model. Retrieval-augmented generation models have demonstrated effectiveness in knowledge-intensive tasks (Petroni et al., 2021) such as question answering and fact checking (Hofstätter et al., 2022), and have been reported to reduce hallucinations in dialogue tasks (Shuster et al., 2021).
The external knowledge given to the models is called a context, and it is usually obtained through information retrieval systems (Lin et al., 2022). Multiple passages, typically up to 100, are often used collectively as a single context to ensure high recall of relevant information. This strategy addresses the limitations of retrieval systems, which may return irrelevant passages and fail to capture relevant information in the top results. When dealing with contexts composed of multiple passages, we can define their quantity (the number of passages in the context) and quality (the proportion of relevant passages in the context). Since context quantity and quality vary depending on the model configuration or application, e.g., the performance of the retrieval system and the computational resources available, understanding how these characteristics impact model performance becomes an important research question.
Indeed, during the inference phase, it has been shown that the quantity and quality of contexts impact the performance of retrieval-augmented generation models. For example, Izacard and Grave (2021b) showed that increasing the number of top-ranked retrieved passages used as a context during inference improves the performance of their model on the question answering task, and Weller et al. (2022) found that model predictions are distracted more strongly as the proportion of conflicting misinformation in the context increases.
However, regarding the training phase, it is not yet fully understood how these context characteristics impact the performance of the trained models. Limited research suggests that increasing the number of retrieved passages used as a context during training improves question answering performance (Izacard and Grave, 2021b) and reduces memorization (Chen et al., 2022). Still, the effects of context quantity and quality are conflated in these studies: relevant passages are typically biased towards higher ranks in the retrieval result, so simply increasing the number of top-ranked passages changes both the quantity and quality of the context.
In this paper, we focus on extractive open-domain question answering tasks and investigate the impact of context quantity and quality on the training of Fusion-in-Decoder (FiD) (Izacard and Grave, 2021b), a state-of-the-art retrieval-augmented generation model. We demonstrate that context quality during training affects the performance of the trained model. To the best of our knowledge, this work is the first attempt to explicitly control context quality and investigate its effect on the training of retrieval-augmented generation models.
Key insights obtained through our experiments are as follows:
• FiD models overfit to context quality during training, resulting in deteriorated performance when evaluated on contexts of a different quality.
• FiD models overfit less to context quantity compared to context quality.
• FiD models trained with different context qualities show different patterns of cross-attention probability. As context quality during training increases, the trained models tend to attend more uniformly to each passage in the context, and vice versa.
Based on these observations, we propose a method to mitigate the overfitting of a trained FiD model to specific context quality, without additional training, by controlling the selectivity of its cross-attention distribution. We present an empirical analysis demonstrating the proposed method's effectiveness in improving the performance of a trained FiD model when deployed in environments whose context quality differs from that used during training.

Experimental Setup
In this section, we describe the task (§2.1) and model architecture (§2.2) used in our experiments, and we define the notions of passage relevance and context quality and quantity used in this paper (§2.3, §2.4).

Task and Dataset
This study focuses on the extractive open-domain question answering task, where models have to extract answers from retrieved documents. We conducted experiments on two standard benchmark datasets for the task:
• Natural Questions (Kwiatkowski et al., 2019) contains questions submitted to the Google Search engine. We use the open-domain version of this dataset presented by Lee et al. (2019).
• TriviaQA (Joshi et al., 2017) contains questions authored by trivia enthusiasts. Following Lee et al. (2019), we use the unfiltered set of the dataset.
For each dataset, following Izacard and Grave (2021b), we used the top-100 passages retrieved by DPR (Karpukhin et al., 2020). As an evaluation metric, we computed the exact match (EM) between a ground-truth answer and the predicted answer generated by greedy decoding. We evaluated performance on the development set of each dataset.
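As a concrete reference, EM is typically computed after answer normalization (lowercasing, removing punctuation and articles). The following is a minimal sketch of a SQuAD-style EM; the exact normalization used in our evaluation scripts may differ, and the function names are illustrative:

```python
import re
import string

def normalize_answer(s: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation,
    remove articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction: str, ground_truths: list) -> int:
    """Return 1 if the normalized prediction equals any normalized ground truth."""
    pred = normalize_answer(prediction)
    return int(any(pred == normalize_answer(gt) for gt in ground_truths))
```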

Model
In our experiments, we focused on Fusion-in-Decoder (FiD) (Izacard and Grave, 2021b), a state-of-the-art architecture for retrieval-augmented generation models. FiD is extended from sequence-to-sequence models, such as T5 (Raffel et al., 2020), and consists of a Transformer encoder E and decoder D (Vaswani et al., 2017).
Given a question q and its context $c = \{p_i\}_{i=1}^{N}$, where p_i is the i-th passage of the context, a FiD model converts each passage p_i to $\tilde{p}_i$ using the template "question: {q} title: {t} context: {c}". Here, {q}, {t}, and {c} are respectively replaced by q, the title of p_i, and the main text of p_i. Then, a FiD model independently encodes each converted passage $\tilde{p}_i$ with the encoder E and feeds the concatenation of the encoder outputs to the decoder D to obtain the predicted answer $\hat{a}$ as follows:

$\hat{a} = D([E(\tilde{p}_1); E(\tilde{p}_2); \ldots; E(\tilde{p}_N)]). \quad (1)$

We followed standard practice and trained FiD models by minimizing the cross-entropy loss of a ground-truth answer. As the position of a passage is not considered during encoding, the prediction of a FiD model is insensitive to the order of the passages. Thus, we did not perform any special shuffling or ordering of the passages during training and evaluation. We used t5-base (Raffel et al., 2020) to initialize the model. See Appendix A for other implementation details of training and inference of FiD models.
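The passage-wise input construction described above can be sketched as follows. This is a minimal illustration of the input template only (the function names are ours), not the actual model code:

```python
def format_passage(question: str, title: str, text: str) -> str:
    """Render one passage with the FiD template:
    'question: {q} title: {t} context: {c}'."""
    return f"question: {question} title: {title} context: {text}"

def build_fid_inputs(question: str, passages: list) -> list:
    """Each (title, text) passage is rendered and later encoded
    independently by the encoder E; the decoder D attends over the
    concatenation of all encoder outputs."""
    return [format_passage(question, title, text) for title, text in passages]
```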

Relevant and Irrelevant Passage
In this paper, we adopt the same definition of relevant and irrelevant passages as in Li et al. (2022). More specifically, a passage is relevant to a question if it logically entails an answer to the question, and it is irrelevant if it does not.
However, in our open-domain setting, no ground-truth annotation of passage relevance exists for Natural Questions and TriviaQA. As discussed by Li et al. (2022), a simple rule that deems a passage relevant if it contains a ground-truth answer is insufficient, as it fails to filter out irrelevant passages that contain the answer but do not entail it. Since accurately determining whether a passage is relevant is crucial for estimating context quality, we applied an additional rule to extract relevant passages. We fed a pair of a question and a passage to a pre-trained question answering model and deemed the passage relevant if the predicted answer matched a ground-truth answer to the question. Following Li et al. (2022), we considered a passage that did not contain any ground-truth answer to the question as irrelevant.
We denote the sets of relevant and irrelevant passages for question q by R(q) and $\bar{R}(q)$, respectively, and we omit (q) when it is not necessary.

Context Quality and Quantity
For a question q and a context $c = \{p_i\}_{i=1}^{N}$ of N passages, we define context quality and quantity as follows:
• Context quality is the proportion of passages in c that are relevant to q, i.e., |R(q)|/N.
• Context quantity is the number of passages in c, i.e., N.
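The two definitions can be stated directly in code (a trivial sketch using hypothetical passage identifiers):

```python
def context_quality(context: list, relevant: set) -> float:
    """Proportion of passages in the context that are relevant, |R(q)| / N."""
    return sum(p in relevant for p in context) / len(context)

def context_quantity(context: list) -> int:
    """Number of passages in the context, N."""
    return len(context)
```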

Case Studies
In this section, we describe our experiments investigating how context quantity and quality during training affect the performance of FiD models. Throughout the experiments in this section, we created various training and evaluation environments with controlled context quantity or quality by sampling n+ relevant passages from R(q) and n− irrelevant passages from $\bar{R}(q)$ for each question q, without replacement. In the rest of this paper, we define n = n+ + n− as the total number of passages and k = n−/n+ as the ratio of irrelevant passages to relevant ones. We use the subscripts ·_train and ·_eval to denote the values of training and evaluation environments, respectively, when required.
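A controlled environment of this kind can be sketched as follows (a simplified illustration; the passage identifiers and function name are ours):

```python
import random

def sample_environment(relevant, irrelevant, n_pos, n_neg, rng=None):
    """Build a context of n_pos passages sampled from the relevant set R(q)
    and n_neg passages from the irrelevant set, without replacement.
    The resulting context has quantity n = n_pos + n_neg and
    quality n_pos / n. Passage order does not affect FiD predictions,
    so no shuffling is applied."""
    rng = rng or random.Random()
    return rng.sample(sorted(relevant), n_pos) + rng.sample(sorted(irrelevant), n_neg)
```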

Effect of Context Quality during Training
Setting: To investigate the effect of context quality during model training, we created training and evaluation environments with the same context quantity but different context qualities. More specifically, for each total number of passages (i.e., context quantity) n in {10, 25, 60}, we varied the value of n+ among {1, 2, 3} (Natural Questions) or {1, 2, 3, 5, 10} (TriviaQA) to obtain environments with different context qualities.
Result: Figure 1 shows the performance of models with different training context qualities. We can make the following observations from the figure: (i) For a given evaluation context quality, models trained with a similar context quality showed the highest performance. (ii) Performance decreased monotonically as the training context quality deviated further from the evaluation context quality. (iii) Differences in context quantity had a negligible impact on the above trends.
Insight: FiD models overfit to context quality during training, and suboptimal performance is obtained when they are evaluated on different context qualities.

Effect of Context Quantity during Training
Setting: To investigate the effect of context quantity during model training, we created training and evaluation environments with the same context quality but different context quantities. More specifically, for each ratio k among {1, 5, 20}, we varied the value of n+ among {1, 2, 3} (Natural Questions) or {1, 2, 3, 5, 8, 10} (TriviaQA) to change the context quantity.
Result: Figure 2 shows the performance of models with different training context quantities. As can be seen in the figure, the influence of context quantity on model training was generally less significant than that of context quality (§3.1). However, we observed a more significant influence for smaller k_train (higher context quality) than for larger k_train (lower context quality), especially when the training context quantity was small. One possible explanation for this behavior is that noise in the annotation of relevant (or irrelevant) passages disturbs the actual context quality, and this effect is magnified in such cases due to the limited context quantities. Nevertheless, our experiments did not reveal a consistent trend in performance changes due to varying context quantity. Hence, the results indicate that the influence of context quantity is relatively insignificant compared to that of context quality.
Insight: Training context quantity has less influence on model performance compared to context quality.

Effect of Mixed Context Quality during Training
Generally, in practical uncontrolled settings, a training dataset for FiD models may consist of questions with different context qualities. Thus, we conducted experiments with mixed context qualities to investigate whether a similar overfitting phenomenon occurs for a training dataset with multiple context qualities.
Setting: We created three environments for each dataset. Specifically, for Natural Questions, the context quantity was set to n = 10 and n+ was varied among {1, 2, 3}; for TriviaQA, the context quantity was set to n = 25 and n+ was varied among {2, 5, 10}. Then, we uniformly mixed each subset of the three environments and trained FiD models in each of them. Performance in a mixed environment was computed by averaging the performance in each constituent environment.
Result: Table 1 shows model performance for each pair of training and evaluation environments on Natural Questions. High scores at the diagonal elements of the table show that the models performed best, or as well as the best model, when they were evaluated in the same mixture of environments as the one used in training. For example, the models trained on the uniform mixture of all environments performed best only when evaluated on that same mixture. This suggests that merely covering all context qualities during training is insufficient for optimal performance; the distribution of the context qualities also matters.

Table 1: Performance of FiD models trained in each mixture of environments on Natural Questions. The color intensity indicates the relative performance within the same evaluation mixture of environments (column). Checkmarks indicate which environments are included in each mixture.
Insight: FiD models overfit to the distribution of context qualities during training.

Effect of Context Quality during Training on Model's Cross-attention
As discussed in §3.1, FiD models trained on different context qualities may overfit to each quality, and they perform differently in the same evaluation environment. We hypothesize that overfitting to different context qualities occurs due to changes in how a model selects relevant passages, since lower context quality may force the model to concentrate more on selecting passages, and vice versa.

Investigation on Patterns of Cross-attention Probability

Setting: We denote the cross-attention probability from the l-th decoder token to the j-th token of the i-th input passage $\tilde{p}_i$ at the k-th decoder layer by $c^{(k)}_{ijl}$. Following Izacard and Grave (2021a), we computed the cross-attention probability from the first decoder token, $c^{(k)}_{ij1}$, and aggregated it over tokens to obtain the per-passage probability $\bar{c}^{(k)}_i = \sum_j c^{(k)}_{ij1}$ for each passage $\tilde{p}_i$.
We conducted the following two analyses: (i) We analyzed how much cross-attention probability was allocated to relevant passages at each layer, i.e., $\sum_{i : p_i \in R} \bar{c}^{(k)}_i$.
(ii) We analyzed the difference between the distribution of cross-attention probability to relevant passages, i.e., $\{\bar{c}^{(k)}_i \mid p_i \in R\}$, and that to irrelevant passages, i.e., $\{\bar{c}^{(k)}_i \mid p_i \in \bar{R}\}$. We focused our analyses on FiD models trained for Natural Questions in §3.1 with the following settings: (n, n+) ∈ {(10, 1), (10, 2), (10, 3)}. Note that these models were trained with the same context quantity but different qualities. We analyzed these models in two evaluation environments of Natural Questions with the following settings: (n+, n−) ∈ {(3, 7), (3, 57)}.
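The per-passage aggregation used in these analyses can be sketched with NumPy (our own minimal simplification; the array layout and function name are assumptions, not the actual analysis code):

```python
import numpy as np

def aggregate_attention(attn, relevant_idx):
    """attn: (N_passages, passage_len) cross-attention probabilities from
    the first decoder token at a single layer (entries sum to 1 overall).
    Returns the per-passage totals and the total mass on relevant passages."""
    per_passage = attn.sum(axis=1)  # aggregate over the tokens of each passage
    relevant_mass = per_passage[relevant_idx].sum()
    return per_passage, relevant_mass
```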
Result: Table 2 shows the cross-attention probability allocated to relevant passages at each layer (Analysis (i)). As shown in the table, in both evaluation environments, FiD models trained with lower context quality attended more strongly to relevant passages, especially at higher layers closer to the output layer. A similar trend is observed in Figure 3, which shows the distribution of cross-attention probability to a relevant or irrelevant passage for each model (Analysis (ii)). Models trained with lower context quality showed a more long-tailed distribution for relevant passages, and there was a more significant difference between the distributions for relevant and irrelevant passages, which suggests they are trained to attend more selectively to relevant passages. On the contrary, the distribution is relatively closer to uniform for models trained with higher context quality.
We conjecture that this excessive selectivity of models trained in a low-quality environment may explain their relatively lower performance in a high-quality environment (§3.2), because such excessive selectivity makes the model overlook necessary information in ignored relevant passages and, as a result, fail to correctly answer the questions. It may be the case that, when evaluated in a high-quality environment (i.e., where the majority of passages are relevant to the question), it is more optimal for the model to examine all passages more uniformly without being overly selective. This claim is empirically supported by the results of our experiments in §4.2.
Insight: FiD models trained with different context quality show different levels of selectivity w.r.t. the allocation of cross-attention probability. Models trained with lower context quality attend more selectively to relevant passages.

Intervention Experiment
The results in §3.4.1 suggest that the overfitting of FiD models to different context qualities is due to the different levels of selectivity of their cross-attention. We validate this claim by intervening on the cross-attention probability of these models during inference.
Setting: We intervened on the cross-attention probability of FiD models so that the ratio of cross-attention probability allocated to a relevant passage p_i ∈ R and an irrelevant passage p_j ∈ $\bar{R}$ becomes r at all layers. Intuitively, the model completely ignores irrelevant passages when r = 0, whereas the model attends uniformly to all passages when r = 1. More specifically, for each decoder layer k, we converted the original cross-attention probability $c^{(k)}_{ijl}$ as follows:

$\tilde{c}^{(k)}_{ijl} = \frac{w_i}{\sum_{i'} w_{i'}} \cdot \frac{c^{(k)}_{ijl}}{\sum_{j'} c^{(k)}_{ij'l}}, \quad w_i = \begin{cases} 1 & \text{if } p_i \in R \\ r & \text{if } p_i \in \bar{R}, \end{cases} \quad (2)$

where the within-passage distribution is preserved and the total probability of each passage is set by its weight w_i. We conducted experiments on the same models and evaluation environments as in §3.4.1.
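A sketch of this intervention for a single layer and decoder position follows. This is our own minimal NumPy version of the reweighting described above (the real implementation operates inside the decoder's attention computation):

```python
import numpy as np

def intervene(attn, relevant_mask, r):
    """attn: (N_passages, passage_len) cross-attention probabilities that
    sum to 1 over the whole array. Keep the within-passage shape, but set
    each relevant passage's total mass proportional to 1 and each
    irrelevant passage's proportional to r. r=0 ignores irrelevant
    passages; r=1 gives every passage equal total mass."""
    per_passage = attn.sum(axis=1, keepdims=True)
    within = attn / np.where(per_passage > 0, per_passage, 1.0)
    w = np.where(relevant_mask[:, None], 1.0, float(r))
    return within * (w / w.sum())
```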
Result: Figure 4 shows model performance with and without intervention on the cross-attention probability. In both evaluation environments, with lower and higher context quality, the difference in the performance of the models decreased, and the intervention mitigated the effect of overfitting to context quality.
Insight: The result suggests that the difference in cross-attention probability described in §3.4.1 is one element that explains the overfitting of FiD models to context quality.

Adapting Models to Different Context Quality
While FiD models overfit to context quality during training, as shown in §3, it is not desirable to train a dedicated model for each target environment with a potentially different context quality. Thus, in this section, we propose a method to mitigate the effect of overfitting and adapt an already trained FiD model to an environment with a different context quality.

Proposed Method
Based on the insight in §3.4 that overfitting to context quality occurs due to different levels of selectivity to relevant passages, we propose to change the sharpness of the distribution of cross-attention probability during inference. More specifically, we introduce a temperature parameter T (T > 0) and compute the total cross-attention probability from the l-th decoder token to the i-th passage at the k-th layer as follows:

$\bar{c}^{(k)}_{il} = \frac{\bigl(\sum_j c^{(k)}_{ijl}\bigr)^{1/T}}{\sum_{i'=1}^{N} \bigl(\sum_j c^{(k)}_{i'jl}\bigr)^{1/T}}. \quad (3)$

Then, we use Equation (2) to convert the cross-attention probability as in §3.4.2. Intuitively, the model attends more uniformly as T becomes larger, which simulates the overfitting effect of a FiD model trained with higher context quality, and vice versa.
Note that our proposed temperature parameter does not change the set of input passages and can be tuned complementarily with other existing hyperparameters that change the set of input passages, e.g., the number of input passages.
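The temperature scaling above, applied to already-aggregated per-passage probabilities, can be sketched as follows (the function name is ours, and this is an illustration rather than the actual decoder-side implementation):

```python
import numpy as np

def temper_attention(per_passage, T):
    """Flatten (T > 1) or sharpen (T < 1) a per-passage cross-attention
    distribution by raising each probability to 1/T and renormalizing.
    As T grows, the distribution approaches uniform."""
    scaled = per_passage ** (1.0 / T)
    return scaled / scaled.sum()
```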

Experiment
To validate the effectiveness of our proposed method, we adapted the models trained in §3.1 with the proposed method and evaluated their performance in evaluation environments with different context qualities, where n+_eval = 3 and n_eval ∈ {10, 25, 60}. Since the temperature parameter T has to be tuned, we conducted 2-fold cross-validation. Specifically, we split the evaluation data into two folds and searched for the optimal temperature parameter T* ∈ {0.125, 0.25, 0.5, 1, 2, 4, 8} based on the EM score on one fold; then, we used the model adapted with T* to evaluate performance on the other fold.
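The tuning procedure can be sketched as follows, where score_fn stands for evaluating the EM of a model adapted with temperature T on a fold (all names here are ours):

```python
CANDIDATES = (0.125, 0.25, 0.5, 1, 2, 4, 8)

def select_temperature(fold, score_fn, candidates=CANDIDATES):
    """Return the temperature with the highest score (e.g., EM) on a fold."""
    return max(candidates, key=lambda T: score_fn(fold, T))

def two_fold_cv(fold_a, fold_b, score_fn):
    """Tune T* on one fold, evaluate on the other, and average both directions."""
    t_a = select_temperature(fold_a, score_fn)
    t_b = select_temperature(fold_b, score_fn)
    return (score_fn(fold_b, t_a) + score_fn(fold_a, t_b)) / 2
```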
Figure 5 shows the performance of FiD models with and without adaptation by the proposed method. As shown in the figure, the proposed method improved the performance of the models in environments with context qualities different from those during training, and it reduced the effect of overfitting to context quality. Also, T* increased for lower context qualities during training, and vice versa, which corroborates our finding that more uniform cross-attention, corresponding to a higher T*, is effective when the context quality in evaluation is higher than that in training.

Related Work
Retrieval-augmented Generation Models: Lewis et al. (2020) introduced the retrieval augmentation approach, which originated in the field of extractive open-domain question answering (Chen et al., 2017), to sequence-to-sequence models and validated its effectiveness in knowledge-intensive tasks. Contemporary work by Min et al. (2020) applied the retrieval augmentation approach to the task of ambiguous question answering. Izacard and Grave (2021b) proposed Fusion-in-Decoder (FiD), in which each retrieved passage is independently encoded and then jointly input to the decoder. FiD achieves high scalability in the number of passages and effectively aggregates information by jointly using all passages in the decoder.
Effect of Context Characteristics on Retrieval-augmented Models: Several studies have investigated how context characteristics affect the inference of retrieval-augmented generation models. For example, increasing the number of top-ranking passages used as the context has been found to improve performance in question answering (Izacard and Grave, 2021b) and response/prose generation (Zhang et al., 2022b), while a higher proportion of false information in the context degrades performance in question answering (Weller et al., 2022). Liu et al. (2023) found that the performance of language models on multi-document question answering is influenced by the position of a relevant document.
However, limited knowledge is available regarding the impact of context characteristics on the training of retrieval-augmented generation models. Notably, a few existing studies suggest that model performance in question answering improves by providing more top-ranking passages during training (Izacard and Grave, 2021b) or by randomly masking top-ranking passages during training (Zhang et al., 2022a), and that the model's memorization behavior is reduced by increasing the recall of relevant information in the context during training (Longpre et al., 2021; Chen et al., 2022).

Conclusion
In this paper, we investigate how context quality and quantity affect the training of FiD models in extractive open-domain question answering tasks. We show that FiD models tend to overfit to the context quality during training, resulting in degraded performance when evaluated in environments with different context qualities. Additionally, our research reveals that the overfitting to context quality is partially explained by different patterns in the model's cross-attention probability. Based on these observations, we propose changing the selectivity of the cross-attention probability to mitigate the effect of overfitting to context quality. The results of this paper suggest a broad spectrum of future work, including more sophisticated adaptation methods and investigations of the effect of other context characteristics on the training of retrieval-augmented generation models.

Limitations
In this study, we investigated how the quality and quantity of context affect the training of FiD models in extractive open-domain question answering tasks. Our experiments revealed for the first time that context quality significantly impacts the training of FiD models and that FiD models tend to overfit to the context quality of the training data. The implications of our findings suggest that various context characteristics similarly affect the training of retrieval-augmented generation models, potentially leading to issues such as overfitting.
However, our experiments have several limitations that reduce the generalizability of our findings:
Task: Firstly, in this paper, we only focused on the extractive open-domain question answering task, and it is unclear whether similar results can be obtained in other tasks such as dialogue generation, fact verification, code generation, and summarization.
Model Architecture: Secondly, our analysis only targeted FiD models, and it is unclear whether different architectures such as RAG (Lewis et al., 2020) and Internet-augmented language models (Lazaridou et al., 2022) produce similar results. Also, it is an interesting direction for future work to conduct similar investigations on non-generative retrieval-augmented models such as FiE (Kedia et al., 2022).
Model Size: Thirdly, our experiments focused only on t5-base, and it is unclear how scaling the model size changes the behavior of overfitting to context quality.
Characteristics of Context: Lastly, the coverage of our analysis is limited to quality and quantity, and further research is required to investigate the effect of other context characteristics. For example, in the field of extractive question answering, it has been shown that models may overfit to answer positions in the contexts (Ko et al., 2020), be misled by adversarially inserted sentences (Jia and Liang, 2017; Jiang and Bansal, 2019), and be susceptible to whether an answer is in the most similar sentence in the context (Sugawara et al., 2018). These findings suggest that those context characteristics may also affect retrieval-augmented generation models.
Other than that, our experiments involved the automatic annotation of relevant and irrelevant passages, which may limit the accuracy of our analysis. Future studies should incorporate human annotation to ensure high annotation quality. Also, passages with similar relevant information can impact models differently due to qualitative factors such as readability and writing style. Nevertheless, as quantitatively evaluating these factors poses challenges, our study did not conduct a fine-grained analysis regarding these aspects.

A Implementation Details

A.1 Training and Evaluation Data
We trained FiD models with the original train set of each dataset, and we further split the original train set into D_train for training and D_dev for evaluating performance during training. To train models in a strictly extractive task environment, we excluded questions for which no retrieved passage contained any of their ground-truth answers. For evaluation, we used the original development set of each dataset as evaluation data D_eval. For a fair comparison, we used the same set of questions with at least 3 (Natural Questions) or 10 (TriviaQA) relevant passages and at least 64 irrelevant passages during training or evaluation. Statistics of the datasets are shown in Table 3.

A.2 Details of FiD Training and Inference
Our implementation of FiD, including the loss function for training, is based on the official implementation by Izacard and Grave (2021b). For both training and inference, we used transformers (ver. 4.23.1) (Wolf et al., 2020). We trained the models with the Seq2SeqTrainer provided in transformers. The hyperparameters for Seq2SeqTrainer used in our experiments are listed in Table 4; other hyperparameters were set to default values. Since we trained models with 32 A100 GPUs, the effective batch size is 64. We used the model checkpoint with the highest EM on D_dev for downstream evaluations. Since most questions in Natural Questions are annotated with only one ground-truth answer, we used the first ground-truth answer for each question as the target output for model training. On the other hand, since questions in TriviaQA are more exhaustively annotated with paraphrases of ground-truth answers, we randomly sampled one ground-truth answer that appeared in any of the input passages as the target output at every training step. We tokenized each input passage $\tilde{p}_i$ described in §2.2 and the target outputs with the tokenizer of t5-base. For both training and inference, to fix the sequence length of each tokenized passage to 256, we truncated longer passages and padded shorter ones. We did not truncate target outputs during training and set the maximum length of a predicted answer to 50 during inference.
In experiments where we subsampled passages to control context quality and quantity, to reduce the effect of bias in the sampled passages, we sampled different passages at every training step instead of repeatedly using a fixed set of passages sampled before training.

B Details of Experimental Design
We trained three FiD models with different random seeds for each training environment and evaluated these models in each evaluation environment. We sampled five different sets of passages for each evaluation environment and computed the average model performance on these sets of passages. We independently sampled relevant and irrelevant passages; thus, the same set of relevant (or irrelevant) passages was sampled regardless of the number of irrelevant (or relevant) passages, as long as the number of relevant (or irrelevant) passages was the same.

C Details of Passage Relevance Annotation by Question Answering Model
We used FiD models as the pre-trained question answering models for the passage relevance annotation described in §2.3, and we trained those models as described in Appendix A, except for the training and development data, which we describe below.
For each dataset with original train set D_train and development set D_dev, we split D_train into four sets: D_0,train, D_0,dev, D_1,train, and D_1,dev. Then, we respectively trained a FiD model M_0 or M_1 with a pair of train and development data (D_0,train, D_0,dev) or (D_1,train, D_1,dev). Finally, we annotated D_0,train and D_0,dev with M_1, and D_1,train, D_1,dev, and D_dev with M_0. See Table 6 for statistics of the datasets used to train the FiD models for the relevant passage annotation.
Our preliminary experiments showed that the behavior of FiD models differs when trained with all passages in the original dataset (All) compared to when trained with only those passages containing a ground-truth answer (Pos). Thus, we chose a stricter criterion to extract relevant passages. Specifically, we trained M_0 and M_1 and annotated passages under each of the All and Pos settings, and we extracted only those passages annotated as relevant in both settings.
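The strict criterion amounts to intersecting the two annotation sets; a minimal sketch, assuming the inputs are sets of passage identifiers judged relevant under each setting:

```python
def strict_relevant(relevant_all, relevant_pos):
    """Keep only passages judged relevant under BOTH the All setting
    (model trained with all passages) and the Pos setting (model
    trained with answer-containing passages only)."""
    return set(relevant_all) & set(relevant_pos)
```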

D Full Experimental Results
Full results of the experiments in §3.1 are shown in Figure 6 and Table 7 for TriviaQA and Figure 7 and Table 8 for Natural Questions.
Full results of the experiments in §3.2 are shown in Figure 8 and Table 9 for TriviaQA and Figure 9 and Table 10 for Natural Questions.
Results of the experiments in §3.3 are shown in Table 5 for TriviaQA.
Results of the experiments in §4.2 are shown in Figure 10.

Figure 1: Performance of FiD models on Natural Questions with varying training context quality. Panels represent different evaluation environments with different (n^+_eval, n_eval) pairs, and a red dashed line shows the context quality of the corresponding evaluation environment. Red stars represent the best-performing models in the corresponding evaluation environments. Dotted lines show models trained with the same context quantity n_train.

Figure 2: Performance of FiD models on TriviaQA with varying training context quantity. Panels represent different evaluation environments with different (n^+_eval, k_eval) pairs, and a red dashed line shows the context quantity of the corresponding evaluation environment. Dotted lines show models trained with the same context quality 1/(1+k_train).

Figure 3: Distribution of cross-attention probability to each relevant or irrelevant passage at Layer 9. A similar trend can be seen in the other higher layers. Red vertical dashed lines represent the uniform cross-attention probability, i.e., 1/N when the context quantity is N.

Figure 4: Model performance under intervention on cross-attention probability. "No" represents the setting without intervention.

Figure 5: Top panels: Performance of FiD models on Natural Questions with adaptation by the proposed method (solid lines) and without adaptation (dotted lines). Bottom panels: Optimal temperature parameter T* selected for each model. Multiple T* appear for some context qualities (i.e., training environments) because we selected T* separately for each of the three models trained with different random seeds in each training environment. Panels represent different evaluation environments with different (n^+_eval, n_eval) pairs, and a red dashed line shows the context quality of the corresponding evaluation environment.
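Our adaptation amounts to rescaling the cross-attention distribution with a temperature: T > 1 flattens attention toward uniform over passages, T < 1 sharpens it, and T* is the value selected on development data. A minimal sketch of such temperature scaling, not the exact FiD implementation:

```python
import math

def temperature_softmax(scores, T=1.0):
    """Softmax over cross-attention scores with temperature T."""
    m = max(s / T for s in scores)  # subtract max for numerical stability
    exps = [math.exp(s / T - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```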

Figure 6: Performance of FiD models on TriviaQA with varying training context quality. Panels represent different evaluation environments with different (n^+_eval, n_eval) pairs, and a red dashed line shows the corresponding context quality. Red stars represent the best-performing models in the corresponding evaluation environments. Dotted lines show models trained with the same context quantity n_train.

Figure 7: Performance of FiD models on Natural Questions with varying training context quality. Panels represent different evaluation environments with different (n^+_eval, n_eval) pairs, and a red dashed line shows the corresponding context quality. Red stars represent the best-performing models in the corresponding evaluation environments. Dotted lines show models trained with the same context quantity n_train.

Figure 8: Performance of FiD models on TriviaQA with varying training context quantity. Panels represent different evaluation environments with different (n^+_eval, k_eval) pairs, and a red dashed line shows the corresponding context quantity. Red stars represent the best-performing models in the corresponding evaluation environments. Dotted lines show models trained with the same context quality 1/(1+k_train).

Figure 9: Performance of FiD models on Natural Questions with varying training context quantity. Panels represent different evaluation environments with different (n^+_eval, k_eval) pairs, and a red dashed line shows the corresponding context quantity. Red stars represent the best-performing models in the corresponding evaluation environments. Dotted lines show models trained with the same context quality 1/(1+k_train).

Figure 10: Top panels: Performance of FiD models on TriviaQA with adaptation by the proposed method (solid lines) and without adaptation (dotted lines). Bottom panels: Optimal temperature parameter T* selected for each model. Multiple T* appear for some context qualities (i.e., training environments) because we selected T* separately for each of the three models trained with different random seeds in each training environment. Panels represent different evaluation environments with different (n^+_eval, n_eval) pairs, and a red dashed line shows the corresponding context quality.

Table 2: Cross-attention probability allocated to relevant passages at each layer. "High" and "Low" respectively denote high and low context quality.

Table 3: Size of datasets used for training and evaluation.

                    D_train   D_dev   D_eval
Natural Questions     20728    3048     2589
TriviaQA              11414    1695     1434

Table 5: Performance of FiD models trained in each mixture of environments on TriviaQA. The color intensity indicates the relative performance within the same evaluation mixture of environments (column). Checkmarks indicate which environments are included in each mixture.