Can Retriever-Augmented Language Models Reason? The Blame Game Between the Retriever and the Language Model

Augmenting pretrained language models with retrievers that select supporting documents has shown promise in effectively solving common NLP problems, including language modeling and question answering, in an interpretable way. In this paper, we first study the strengths and weaknesses of different retriever-augmented language models (REALM, $k$NN-LM, FiD coupled with DPR, ATLAS coupled with Contriever, and Flan-T5 coupled with DPR) in reasoning over the retrieved statements in different tasks. We show how the retrieve-then-read models' limitations in reasoning are rooted both in the retriever module and in the language model. Our experimental results demonstrate that the similarity metric used by the retrievers is generally insufficient for reasoning tasks. Additionally, we show that the language models in retriever-augmented models do not take the complicated relations between the statements into account, which leads to poor reasoning performance even with larger models. Moreover, we analyze the reasoning performance of large language models using multihop retrieval, but we observe only minor improvements. Overall, this shows great room for further research in this area.


Introduction
Pretrained language models, such as decoder-only transformers, BERT, and T5, are being used in a wide range of tasks with outstanding results (Vaswani et al., 2017; Devlin et al., 2019; Raffel et al., 2020). Additionally, many methods have been proposed to improve language models, including augmenting the models with knowledge retrievers (Guu et al., 2020; Izacard and Grave, 2021; Izacard et al., 2022b) or memory components (Zhong et al., 2022; Khandelwal et al., 2020; Verga et al., 2021). The primary goal of using a latent knowledge retriever is to allow the model to capture information from external knowledge rather than relying only on the implicit knowledge hidden in the model's parameters. In other words, models perform inference using the extra knowledge provided by the retrieved documents without having to retrain larger models, thereby circumventing the limits that model size places on memorized knowledge. Although retriever-augmented models perform well on knowledge-intensive NLP problems such as question answering, in principle they should also outperform parametric, non-retrieval models at reasoning, given that they have access to explicit, tangible knowledge. Reasoning in language modeling and question answering means completing a (masked or incomplete) sentence query or answering a question correctly by reasoning over the retrieved statements. As shown in Figure 1, models sometimes fail to solve the task when some reasoning is needed over the given statements.

Figure 1: Examples of language model failures in question answering tasks when reasoning is needed (e.g., given the statements "Mars is a kind of planet.", "Phobos orbits mars.", and "Moons orbit planets.", models are asked "Because Phobos orbits mars, Phobos should be classified as which type of body?"). The correct answer is in bold, with the majority of retriever-augmented language models failing to answer correctly.
The reasoning abilities of large language models have been well studied in the literature, but many of these studies target large language models on arithmetic, commonsense, and symbolic reasoning problems (Wei et al., 2022; Zelikman et al., 2022), while others focus on specialized models that reason over facts and rules in order to check whether a hypothesis can be entailed (Dalvi et al., 2021; Tafjord et al., 2021; Bostrom et al., 2022; Creswell and Shanahan, 2022). Reasoning over supporting information with retriever-augmented language models, however, has not been extensively studied.

Figure 2: A visualization of the architecture of retriever-augmented language models (e.g., for the query "Phobos is a kind of [MASK]."). The model is given a set of statements S for each query. The retriever retrieves some of the statements as related and necessary to solve the task, and the language model predicts the answer using the query and the retrieved documents.
Intuitively, one might assume that retriever-augmented pretrained language models should also possess the ability to reason in scenarios where sentence completion or question answering requires a few reasoning steps over the retrieved supporting statements. The results presented in this work, however, suggest that retriever-based language models perform poorly in such settings and that the shortcomings of retriever-augmented language models in reasoning are rooted in two distinct parts of their design. On one hand, the knowledge retriever is not trained to retrieve the statements required for completing a task through reasoning. In most cases, the retriever selects the most similar documents based on a similarity metric between each candidate statement and the input query (Guu et al., 2020; Karpukhin et al., 2020; Khandelwal et al., 2020; Izacard et al., 2022a). For reasoning, however, it would likely be more fruitful to consider other heuristics, such as the relations among the documents themselves and between the documents and the input query, thereby enabling reasoning over combinations of retrieved statements that together yield informative pieces of knowledge.
On the other hand, current language model architectures are not intentionally designed to enhance reasoning performance. As a general rule, language models integrate different retrieved statements by concatenating the statements, combining their encoded representations, or interpolating between the language model's distribution and a distribution over the retrieved statements (Guu et al., 2020; Izacard and Grave, 2021; Khandelwal et al., 2020; Izacard et al., 2022b). These models show promising performance when a similar form of the query's answer exists among the retrieved statements, but perform worse when some reasoning is required over them.
In this study, we demonstrate how retriever-augmented language models fail at reasoning from different perspectives by evaluating them on language modeling and question answering tasks, using different variations of the EntailmentBank (Dalvi et al., 2021) and StrategyQA (Geva et al., 2021) datasets. In order to perform well on these datasets, the models need to: 1) retrieve the statements that best serve as supporting evidence for a correct result; and 2) aggregate the knowledge from the retrieved statements through reasoning to arrive at a correct answer. Concretely, we analyze the performance of pretrained REALM, FiD, ATLAS, and kNN-LM as retriever-based language models, and the instruction-finetuned Flan-T5 as a recent model with strong reasoning abilities. In this paper, we address the following questions:

Q1 Can retriever-augmented language models generally reason? We investigate the reasoning ability of retriever-augmented language models by analyzing their overall performance on both question answering (QA) and language modeling (LM) tasks. The experimental results show that these models fail to reason in many QA and LM examples.
Q2 How much do the retrievers contribute to the reasoning performance? We investigate the role of retrievers in the models' failures by assessing their shortcomings in retrieving the statements required for reasoning. We conclude that the retrievers' heuristic of selecting statements based on their similarity to the query is not sufficient for reasoning tasks.
Q3 How much do the language models contribute to the reasoning performance? We investigate the role of language models in the failures of retriever-augmented models by measuring the language models' inability to reason over the retrieved statements. The experimental results demonstrate that language models do not consider the relations between the ground-truth statements, which is necessary for reasoning. Interestingly, we also observe that language models perform much better when a conclusion combining all the relevant facts from those statements is given as the input.

Related Work
Augmenting language models with external corpora or memory has been well studied in the literature (Guu et al., 2020; Khandelwal et al., 2020; Izacard and Grave, 2021; Izacard et al., 2022b). Additionally, researchers have recently been interested in eliciting the reasoning abilities of large language models to perform better on tasks where answering the question involves multiple computational steps (Wei et al., 2022; Chung et al., 2022). Next, we discuss some of the current research in these areas.

Retriever-augmented Language Models
Retriever-based language modeling and question answering have been well studied in the literature, and different mechanisms have been proposed for integrating language models with retrieved statements. For instance, REALM (Retrieval-Augmented Language Model) is a masked language model that is augmented with a latent knowledge retriever, which allows the model to retrieve and attend over statements (Guu et al., 2020). The knowledge retriever employs a BERT encoder to produce a dense representation of both the query and the statements, and selects the most similar statements based on a similarity metric (e.g., dense inner product) between the transformed representations of the query and the statements. In the language modeling task, the retrieved statements are appended to the query and passed to a trained BERT-based language model to solve the task. In the question answering task, a trained BERT-based reader instead extracts the most promising span from the statements as the answer.
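The similarity-based selection at the core of such retrievers can be summarized in a few lines. Below is a minimal sketch (not the authors' released code) of this top-k inner-product scoring, assuming the query and statement embeddings have already been produced by the corresponding encoders; retrieve_top_k is a hypothetical helper name.

```python
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, statement_embs: np.ndarray, k: int):
    """Pick the k statements whose dense representations have the highest
    inner product with the query representation (sketch of REALM/DPR-style scoring).

    query_emb:      (d,) dense query embedding from the query encoder
    statement_embs: (m, d) dense embeddings of the m candidate statements
    """
    scores = statement_embs @ query_emb   # (m,) inner-product relevance scores
    top_k = np.argsort(-scores)[:k]       # indices of the k highest-scoring statements
    return top_k, scores[top_k]
```

Note that each statement is scored only by its own similarity to the query; nothing in this selection considers the other candidates, which is exactly the limitation discussed later in Section 4.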
kNN-LM is another model, based on a decoder-only Transformer integrated with a k-nearest-neighbor module (Khandelwal et al., 2020). In this model, the most related token sequences from the statements are selected based on an L2 distance metric between the representation of the query and those of all token sequences. The distribution over the next token in this autoregressive model is then computed as an interpolation between the Transformer's output distribution and the distribution of next tokens over the retrieved statements.
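To make the interpolation concrete, here is a rough sketch under stated assumptions: the datastore lookup has already returned the neighbors' stored next tokens and their L2 distances, and the interpolation weight lam is an illustrative value rather than the paper's tuned one.

```python
import numpy as np

def knn_lm_distribution(p_lm: np.ndarray, neighbor_next_tokens: list,
                        neighbor_dists: np.ndarray, lam: float = 0.25) -> np.ndarray:
    """Interpolate the base LM's next-token distribution with a kNN distribution.

    p_lm:                 (V,) softmax distribution from the Transformer LM
    neighbor_next_tokens: next-token ids stored with each retrieved context
    neighbor_dists:       (n,) L2 distances from the query context to the neighbors
    lam:                  weight on the kNN distribution (illustrative)
    """
    weights = np.exp(-neighbor_dists)            # softmax over negative distances:
    weights /= weights.sum()                     # closer neighbors receive more mass
    p_knn = np.zeros_like(p_lm)
    for tok, w in zip(neighbor_next_tokens, weights):
        p_knn[tok] += w                          # aggregate mass per candidate token
    return lam * p_knn + (1.0 - lam) * p_lm      # final next-token distribution
```

Because the neighbors are weighted purely by distance, relations among the retrieved sequences play no role in the final distribution.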
FiD (Fusion-in-Decoder) is a sequence-to-sequence T5-based neural network that can work with any retriever (Izacard and Grave, 2021). Given the statements returned by the retriever, the encoder encodes the query paired with each retrieved statement separately. Afterwards, the decoder attends over the resulting representations of all the retrieved statements. In this paper, we investigate DPR (Dense Passage Retriever) as the retriever for FiD (Karpukhin et al., 2020). DPR retrieves the most similar documents based on the inner product of the representations of the query and the documents (i.e., the embeddings of the [CLS] token) from two independently trained BERT encoders.
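The encode-separately, decode-jointly pattern can be sketched as follows. Here encoder and decoder are stand-ins for the T5 encoder and decoder, and the input formatting is a simplification of FiD's actual preprocessing:

```python
import torch

def fid_forward(encoder, decoder, query_ids: torch.Tensor,
                passage_ids_list: list):
    """Fusion-in-Decoder sketch: each (query, passage) pair is encoded
    independently, then the decoder cross-attends over all encodings at once."""
    encoded = []
    for passage_ids in passage_ids_list:
        pair = torch.cat([query_ids, passage_ids], dim=-1)  # one query+passage input
        encoded.append(encoder(pair))                       # (1, len_i, d) encoding
    fused = torch.cat(encoded, dim=1)                       # (1, sum_i len_i, d)
    return decoder(encoder_hidden_states=fused)             # joint cross-attention
```

Since passages never attend to one another in the encoder, any cross-passage reasoning must happen entirely inside the decoder's cross-attention.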
Finally, ATLAS is a pretrained retrieval-augmented language model that finetunes the retriever and the language model jointly using very few training examples (Izacard et al., 2022b). The language model is a sequence-to-sequence model similar to FiD's architecture, and the retriever is based on Contriever, which consists of a dual BERT-based encoder architecture (Izacard et al., 2022a). The retrieved documents are selected based on the inner product of the representations of the query and the documents, computed as the average of the hidden representations of the encoder's last layer.
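The averaging step can be made concrete with a short sketch of standard mean pooling over non-padding positions (the tensor shapes are our assumptions):

```python
import torch

def mean_pool_embedding(last_hidden: torch.Tensor,
                        attention_mask: torch.Tensor) -> torch.Tensor:
    """Average last-layer hidden states over real (non-padding) tokens,
    yielding one dense vector per text, as in Contriever-style encoders.

    last_hidden:    (batch, seq_len, d) final-layer hidden states
    attention_mask: (batch, seq_len), 1 for tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).float()   # (batch, seq_len, 1)
    summed = (last_hidden * mask).sum(dim=1)      # sum over real token positions
    counts = mask.sum(dim=1).clamp(min=1e-9)      # number of real tokens per text
    return summed / counts                        # (batch, d) text embedding
```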

Reasoning of Language Models
Eliciting the reasoning abilities of language models has recently attracted the attention of many NLP researchers (Wei et al., 2022; Zelikman et al., 2022; Chung et al., 2022). For instance, Wei et al. (2022) investigate how reasoning abilities emerge in large language models when they are prompted with a few intermediate reasoning steps, known as a chain of thought.
Moreover, Flan-T5 is an instruction-finetuned T5 model that has been shown to have strong reasoning abilities, outperforming the T5 model (Chung et al., 2022; Raffel et al., 2020). Although this finetuned model was not initially constructed for retriever-based language modeling, it can be coupled with DPR to complete the language modeling and question answering tasks using the retrieved statements.
In this paper, we study the reasoning ability of REALM, kNN-LM, FiD with DPR, and ATLAS with Contriever as retriever-based language models, and Flan-T5, a reasoning language model, coupled with DPR as its retriever. While retrievers in the literature generally select statements from a huge common corpus, we accompany each query with a data-specific collection of statements, as illustrated in Figure 2, since we want more control over the statements and the reasoning abilities of the models. Although many recent papers focus on commonsense or arithmetic reasoning, we evaluate the models on reasoning via entailment and logical reasoning (Wei et al., 2022; Lewkowycz et al., 2022; Mishra et al., 2022).

Problem Definition
In our retriever-augmented language model reasoning setting, we provide the model with a complete set of statements $S = \{s_1, s_2, \ldots, s_m\}$ for each sample. In some cases, only some of these statements are necessary to predict the answer, while the others contain distracting information. For a fixed number of retrieved statements $k$, the model should retrieve the subset $S_r = \{r_1, r_2, \ldots, r_k\} \subseteq S$ that it finds most related and necessary, and solve the target task by reasoning over it. A visualization of the general task is presented in Figure 2. We study REALM, kNN-LM, FiD, ATLAS, and Flan-T5 in this paper. Based on the implementation details stated in Appendix A, most of these models have almost the same size, as presented in Table 1, and hence model size does not play an important role in the results.

Language Modeling (LM). In the language modeling task setup, we measure the performance of the retriever-augmented language models with two metrics: 1) predicting the desired token correctly; and 2) assigning a higher likelihood to the gold sentence than to a similar but incorrect sentence. We use the latter task as a proxy to compare different autoregressive or masked language models. To this end, we build two alternative sentences by defining an alternative target for the true target entity mention, and compare the score the model assigns to each sentence while attending to the statements retrieved by the retriever. A perfect reasoning model should allocate a higher score to the sentence containing the true target 100% of the time. An example of this task is illustrated in Table 3. The score of a target entity mention $T = t_1 t_2 \ldots t_M$ for an input query $Q = q_1 q_2 \ldots q_N$ is computed in each model as follows:

• FiD, ATLAS, and Flan-T5: $-\mathrm{loss}(T)$, where $\mathrm{loss}$ is the output loss of the model on the target tokens $T$.
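To illustrate how the $-\mathrm{loss}(T)$ score can be computed for the seq2seq models, the sketch below uses Hugging Face Transformers with flan-t5-base as an illustrative checkpoint (not necessarily the size used in our experiments); the returned loss is the mean cross-entropy over the target tokens, and the target formatting (e.g., sentinel tokens in the labels) is simplified here.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")   # illustrative checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").eval()

def target_score(query: str, target: str) -> float:
    """Score a candidate target T as -loss(T): the negative mean
    cross-entropy the model assigns to the target given the query."""
    inputs = tok(query, return_tensors="pt")
    labels = tok(target, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(**inputs, labels=labels).loss
    return -loss.item()

# Model preference: the true target should score higher than the alternative.
query = ("Mars is a kind of planet. Phobos orbits mars. Moons orbit planets. "
         "Phobos is a kind of <extra_id_0>.")
print(target_score(query, "moon"), target_score(query, "star"))
```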
More detailed information about query formats for each model is described in Appendix B.
Question Answering (QA). In the question answering task setup, we report the token-overlap F1 score of the models' predicted answers on the reasoning question answering datasets.

Exploring Weaknesses in Reasoning
The shortcomings of retriever-augmented language models are embedded in different parts of their architectures. In this section, we explain the weaknesses of these models caused by the retriever as well as the accompanying language model.

Retriever Shortcomings
The purpose of the retriever is to select k statements as the "retrieved statements" using a relevance score between the query and each statement, as explained in detail in Section 2.1. These similarity metrics can be the inner product or the L2 distance between representations of the query and the statements (or spans within statements), derived from one or two different encoders. This naive way of selecting related statements does not take into account the relations between the statements. In particular, for some data samples and their corresponding sets of statements, the relations between the statements themselves can lead to meaningful knowledge that is more relevant to the query than any single candidate statement. Moreover, similarity to the query is a poor indicator of how much information a statement carries for reasoning over the statements to solve the language modeling or question answering task.

Language Model Shortcomings
In this subsection, we discuss why current retriever-augmented language models do not consider the complicated relations between statements, resulting in poor reasoning ability. Suppose we have a perfect retriever that can retrieve the statements necessary and sufficient for solving the target task, which we call the "gold statements". As described in greater detail below, the language models we study in this paper cannot perfectly solve tasks that require reasoning over the retrieved gold statements $S_r$ due to inherent design weaknesses.
REALM. In the language modeling task, one way to predict a token with REALM is to predict the token using each statement in $S_r$ separately and then perform majority voting to select the final output. This approach does not take reasoning into account at all: the relations between statements are a key factor in effective reasoning, yet here the statements in $S_r$ are processed in isolation from each other. Another way to use REALM for masked language modeling is to pass all the retrieved statements $S_r$ to the language model at once. In this case, as opposed to the previous option, the connections between statements can be considered. However, BERT's Transformer-based architecture is not trained to reason over sentences stacked together. Similarly, in question answering, REALM does not consider the complicated relations between statements: the model outputs the most probable span from the retrieved statements without combining information from multiple statements.
kNN-LM. As described in Section 2.1, given the gold statements $S_r$, the language model's predictions change based on the next tokens of similar token sequences in $S_r$. Once again, the model does not consider the relations between the statements; instead, it takes into account only the frequency (i.e., probability) of the next tokens of the nearest sequence candidates.
FiD and ATLAS. In FiD and ATLAS, the decoder takes the encoded representations of all the retrieved statements as input. However, while the models perform well in simple settings, they are not trained to reason over these representations.
Flan-T5. Flan-T5, like REALM when all the retrieved text is concatenated, generates an answer using all the retrieved facts together with the query as input. Although this model is finetuned on chain-of-thought reasoning datasets, it still cannot perfectly solve the tasks by reasoning, especially when the retriever is imperfect and retrieves some wrong statements.

Evaluation and Results
In this section, we first introduce the datasets and metrics used in the experiments for both the language modeling and question answering tasks explained in Section 2.1. Afterwards, we demonstrate the weaknesses of the retriever-augmented language models in detail.

Datasets
We compare the reasoning ability of the models based on their performance on the following datasets in different formats, with the detailed dataset preparation mechanism explained in Appendix C and results shown in Figure 3 and Figure 4.
• EntailmentBank (EB, Dalvi et al., 2021) contains a multi-step entailment tree for a given hypothesis. This dataset consists of three parts, each with a specific characteristic. In EB-1, only the required statements are provided in each data sample. In EB-2, 25 statements are given for each data sample, including relevant and irrelevant ones. In EB-3, similar to EB-2, 25 relevant and irrelevant statements are given, sampled from a large corpus with no ground-truth entailment tree. Each data sample in this dataset consists of some statements, a question, an answer, and a hypothesis rephrasing the question and answer in a declarative form.
• StrategyQA (Geva et al., 2021) contains yes/no question answering samples, where the required reasoning steps are implicit in the question, along with supporting and further relevant evidence from Wikipedia. For evaluating the models in language modeling, we convert each question to a declarative format.

Evaluation Metrics
We evaluate the performance of the different models by measuring the model preference accuracy as well as the accuracy in predicting the first target token, and present the results in Figure 3 and Figure 8. Note that kNN-LM, a decoder-only model, does not see the input tokens appearing after the masked tokens.
In question answering, we report the token-overlap F1 score of the generated answer, with the results shown in Figure 4.
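For reference, a simplified version of this token-overlap F1 (omitting the SQuAD-style normalization of articles and punctuation) can be computed as follows:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted answer and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```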
When analyzing the retrievers' performance, we report the ground-truth statement retrieval F1 score to assess how well the retrievers find the relevant statements.

Can Retriever-Augmented Language Models Generally Reason? (Q1)
In this section, we show the weaknesses of retriever-augmented language models in reasoning from different perspectives. We present the overall behavior of the models on the development sets in Figure 3 and Figure 4. The results show that although ATLAS and REALM perform well in language modeling, they perform poorly in question answering. One reason is that REALM is an extractive QA model, but another, more important reason is that REALM tends to extract the answer from the first few retrieved statements, leading to inferior performance. This phenomenon is illustrated in Figure 1 and Table 3, and can also be seen in Figure 4, where REALM's performance stays the same after retrieving about 3 statements. The experiments on ATLAS show that this model does not reason deeply over the statements, and instead generally copies simple answers. This phenomenon is illustrated in the qualitative examples in Appendix E.
Additionally, Flan-T5, an originally non-retriever-based model, shows considerable sensitivity to distracting statements in both the EntailmentBank-2 and EntailmentBank-3 datasets as more and more distracting statements are provided. On the contrary, the other models appear more robust in both the language modeling and question answering experiments, as represented in Figure 3 and Figure 4. Our hypothesis is that the supporting evidence used in Flan-T5's finetuning procedure was always necessary for solving the tasks, and hence Flan-T5 has not been finetuned to ignore related but distracting information. The other models, on the other hand, have been trained with retrievers, which are imperfect and retrieve incorrect statements in some cases. As a result, they are more robust to distracting information.

Table 2: Experimental results of the best retriever-augmented models in LM and QA on the test sets. The two best models are highlighted in green. ATLAS is coupled with Contriever, and FiD and Flan-T5 are augmented with DPR as the retriever. The results show that ATLAS and REALM are superior in LM, and FiD and Flan-T5 are the superior models in QA tasks.
It is worth mentioning that the models have almost the same number of parameters, as shown in Table 1, and hence model size does not have a meaningful impact on the models' performance. We show some of the models' failures in the LM and QA experiments in Table 3. These examples pinpoint the faulty module in each case, as described in Section 4. More comprehensive qualitative results are available in Appendix E. Table 2 reports the performance of the best retriever-augmented models (selected based on performance on the dev sets) on the test sets in both language modeling and question answering tasks. It can be observed that the same conclusions about the behavior of the models hold on the test sets.
Note that each data sample has a set of at most 25 statements; however, kNN-LM's retriever does not retrieve one statement at a time. Instead, it may retrieve several different sequences of tokens from the same statement as the relevant token sequences. This is why the performance of the other models stays the same after $k = 25$, while this is not the case for kNN-LM.

How Much Do the Retrievers Contribute to the Performance? (Q2)
In this subsection, we study the efficacy of the retrievers in reasoning by evaluating the models on the EntailmentBank-2 dataset, which includes both gold and distracting statements. Figure 5 demonstrates that although Contriever is superior among the studied retrievers, all four retrievers (REALM, kNN-LM, DPR, and Contriever) fall short in retrieving the most relevant and necessary statements for reasoning. For instance, the top statement retrieved by DPR is among the gold statements only 15% of the time. Note that in the language modeling setting, REALM, DPR, and Contriever have access to all the tokens of the query, while kNN-LM is autoregressive and only sees the tokens before the masked tokens. Another observation is that even when kNN-LM is allowed to retrieve 100 sequences, it does not cover all the gold statements and tends to retrieve sequences from the same statement. Some failures of the retrievers are demonstrated in Table 3. These examples show the weaknesses of the retrievers in selecting the statements necessary for reasoning: although the missed statements do not appear similar to the query, they carry important information required for reasoning. Overall, the retrievers perform poorly at selecting the statements required for reasoning tasks, given that they do not account for the relationships between statements.

Table 3: Some examples of models' failures rooted in the retriever or language model modules. The correctly retrieved statement and the one that had to be retrieved in order for the model to solve the task correctly are highlighted in green and red, respectively. The true answer (or the sequence of tokens leading to the true answer) in each data sample's statements is marked in bold.

How Much Do the Language Models Contribute to the Performance? (Q3)
This subsection illustrates the shortcomings of the language models by providing them exclusively with gold statements. Figure 6 demonstrates the performance of the language models given all statements or only the ground-truth, gold statements. According to the solid lines in this figure, even with the ground-truth statements in hand, the language models do not perform perfectly, at least in question answering. For instance, the best question answering performance is around 0.4, achieved by Flan-T5, which is instruction-finetuned on chain-of-thought data. Furthermore, in language modeling, the results show that kNN-LM, which is trained for language modeling, performs substantially worse than REALM, FiD, and ATLAS, and its performance does not differ much even when given only the ground-truth statements. Experimental results in Figure 7 show how much better the language models perform when no reasoning is needed to complete the task. In this visualization, the solid lines refer to the performance of retriever-augmented language models given ground-truth statements, and the dotted lines refer to the performance when only one statement is given and no reasoning is needed. This given statement is in fact the hypothesis sentence of the EntailmentBank-2 data sample, which can be inferred from the gold statements using reasoning and is sufficient to answer the question. From this figure, we can infer that language models perform better when reasoning is not required and a similar form of the answer is mentioned in the statements. Additionally, we observe that Flan-T5 performs best in question answering when reasoning over ground-truth statements, while FiD is much more promising when the hypothesis is given as a statement.

Figure 7: The dotted lines and the solid lines refer to experiments given single statements and gold statements (when reasoning is required), respectively. Results illustrate that the language models are not good at answering questions by reasoning.
Some failures of the language models are demonstrated in Table 3. These examples show that even when the query's answer exists in the retrieved statements, the models sometimes have difficulty finding the correct answer. For instance, the results show that the QA version of REALM tends to select the answer span from the first or second retrieved statement, which can also be observed in Figure 4 and Figure 7. Furthermore, we sometimes observe that Flan-T5 answers the queries regardless of the provided retrieved statements, as demonstrated in Table 3.

Conclusion
In this paper, we analyzed to what extent retriever-augmented language models, including REALM, kNN-LM, ATLAS coupled with Contriever, and FiD and Flan-T5 coupled with DPR as the retriever, are capable of solving downstream tasks where reasoning is required. Our analysis is rooted in the three leading questions outlined in Section 1 that characterize reasoning ability in retriever-based language models. With respect to Q1, the results show that these models have difficulty completing both language modeling and question answering tasks. Regarding Q2, the experimental results demonstrate that the retrievers do not retrieve the statements necessary for reasoning and instead select statements based on similarity to the query, which is often not what reasoning requires. These incorrectly retrieved statements are shown to be harmful to some of the models, especially Flan-T5. With regard to Q3, we conclude that although most of these models perform reasonably well when the answer is mentioned in one of the statements in a similar form, they generally cannot reason over different statements, even when all the given statements are among the ground-truth "gold statements". Instead, these models exhibit other behaviors, such as copying a piece of information from the highest-scoring statement or, in question answering tasks, generating an answer regardless of the retrieved statements. We also observed deficiencies in preferring the true targets over alternative ones in language modeling tasks. Overall, our qualitative results demonstrate that the language models do not take the relations between the statements into account, and therefore cannot reason over them properly.
These results suggest opportunities for improving the reasoning ability of retriever-augmented language models: first, by improving the retrievers so that they select statements using a heuristic better suited to reasoning than a simple similarity score; and second, by enhancing the language models so that they combine the information from the retrieved statements while considering the statements' relations more effectively.

B Model and Task-Specific Query Format
This section describes the model-specific query formats for each target task. As stated in Section 3, we aim to study the reasoning abilities of retriever-augmented language models in the language modeling and question answering tasks. A sample of what the queries to each model look like in language modeling is presented in Table 4. These examples are all specified for next-token prediction. From these examples, it can be observed that, unlike REALM, FiD, ATLAS, and Flan-T5, kNN-LM cannot see the tokens appearing after the target tokens in the sentence. We resolve this problem using the model preference proxy task.
Table 4: Model-specific query formats in language modeling, with [MASK] and <extra_id_0> as the special masking tokens in BERT-based and T5-based models, respectively.

In the question answering setting, on the other hand, we feed the whole question to the model and take the generated output as the answer.

C Dataset Preparation Details
In order to prepare the datasets for the language modeling experiments, we keep the data samples that include at least one entity mention and mask out the last entity mention in the hypothesis sentences of StrategyQA and the different EntailmentBank variants using spaCy (Honnibal and Montani, 2017). We also randomly pick another entity mentioned in the data sample's statements as the alternative target (as described in Section 3) and compare the models' scores for each target. For the question answering experiments, we use EntailmentBank's question and answer formats. The reason we do not use StrategyQA in these experiments is that comparing FiD, ATLAS, and Flan-T5 with REALM, an extractive QA model, and kNN-LM, a memory-augmented model, on a yes/no question answering dataset like StrategyQA would not be fair. We evaluate the retriever-augmented language models for different numbers of retrieved statements on the development sets of the datasets, and for each model we report the result of the best setting on the test sets. For the EntailmentBank dataset, we run the experiments on the same development and test sets as the original data. For StrategyQA, however, since we do not have access to the answers in the test split, we cannot convert those samples to declarative form; we therefore pick 25% and roughly 35% of the training data as the development and test sets, respectively.
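A minimal sketch of this masking step is shown below, assuming spaCy's en_core_web_sm as an illustrative NER pipeline (the exact pipeline and entity filtering may differ):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed pipeline; any NER-capable model works

def mask_last_entity(hypothesis: str, mask_token: str = "[MASK]"):
    """Replace the last entity mention in a hypothesis with a mask token,
    returning the masked sentence and the removed mention as the gold target."""
    doc = nlp(hypothesis)
    if not doc.ents:
        return None                       # drop samples without an entity mention
    last = doc.ents[-1]                   # last entity mention in the sentence
    masked = (hypothesis[:last.start_char] + mask_token
              + hypothesis[last.end_char:])
    return masked, last.text
```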

D Quantitative Results
This section includes more visualizations and detailed results. Figure 8 demonstrates the performance of the retriever-augmented language models on the development sets in language modeling, measured by next-token prediction accuracy. As expected, the results show that REALM still performs reasonably well at predicting the first masked token. By comparing Figure 3 and Figure 8, we observe that although kNN-LM ranks third among the models in next-token prediction on the EntailmentBank datasets, it frequently fails to prefer the ground-truth target over the alternative one.

E Qualitative Results
We demonstrate a failure example for each of the retrievers and language models in Table 5. In this table, the correctly retrieved statement and the one that had to be retrieved in order for the model to solve the task correctly are highlighted in green and red, respectively. The true answer (or the sequence of tokens leading to the true answer) in each data sample's statements is marked in bold. These examples illustrate how failing to retrieve the statements necessary for reasoning, or failing to reason over the correct statements, leads to incorrect answers.

Table 5: Failure examples, including:
• "The robot will weigh less on mars than earth but will have the same [MASK]." (mass vs. mars), with statements such as "As the force of gravity decreases, the weight of the object will decrease." and "The gravitational force of a planet does not change the mass of an object on that planet or celestial body."
• "What allows two students standing ten feet apart to hear each other talk?", with statements such as "Talking is when a human produces sound to communicate." and "Sound can travel through air by vibrating air."; predicted answer: "a microphone".
• FiD + DPR: "Which energy conversion happens when a person shivers and the energy is transferred to make the muscles and joints move?", with statements such as "A person is a kind of animal.", "When an animal moves, chemical energy is converted to mechanical energy.", "Shivering is a kind of shaking.", and "Shaking is a kind of moving."
• ATLAS + Contriever: "Wave energy from the ocean can be harnessed to power generators to make electricity. Energy from ocean tides can also be used to make electricity. How would you categorize these two sources of energy?", with statements such as "Tidal energy means energy from ocean tides.", "Tidal energy is a renewable resource.", and "Wave energy is a renewable resource."; predicted answer: "Wave energy".
• "Which changes will most likely have a negative effect on an ecosystem?", with statements such as "Humans changing ecosystems usually has a negative impact on an ecosystem / organisms living in an ecosystem." and "Humans building roads in an ecosystem causes that ecosystem to change."
• kNN-LM: "The mass of earth causes the pull of gravity on [MASK]." (earth vs. newton), with the statement "The mass of a planet causes the pull of gravity on that planet."