Inferring Implicit Relations in Complex Questions with Language Models

A prominent challenge for modern language understanding systems is the ability to answer implicit reasoning questions, where the required reasoning steps for answering the question are not mentioned in the text explicitly. In this work, we investigate why current models struggle with implicit reasoning question answering (QA) tasks, by decoupling inference of reasoning steps from their execution. We define a new task of implicit relation inference and construct a benchmark, IMPLICITRELATIONS, where given a question, a model should output a list of concept-relation pairs, where the relations describe the implicit reasoning steps required for answering the question. Using IMPLICITRELATIONS, we evaluate models from the GPT-3 family and find that, while these models struggle on the implicit reasoning QA task, they often succeed at inferring implicit relations. This suggests that the challenge in implicit reasoning questions does not stem from the need to plan a reasoning strategy alone, but to do it while also retrieving and reasoning over relevant information.


Introduction
A longstanding goal of language understanding has been to develop systems that can reason, i.e., integrate multiple pieces of information to reach a conclusion (McCarthy, 1959; Clark et al., 2021). This has sparked interest in question answering (QA) benchmarks that require such reasoning (Welbl et al., 2018; Yang et al., 2018; Talmor and Berant, 2018). One particularly challenging case is questions that require implicit reasoning, that is, where the evidence for answering the question is not mentioned explicitly. Consider the question "Does Santa Claus work during summer?". This question requires implicit reasoning since it involves knowing when the holiday associated with Santa occurs, but this is not evident from the question.
Answering implicit reasoning questions can be viewed as a two-step process: (a) inferring simple sub-questions necessary for answering the question, and (b) retrieving the relevant knowledge pieces (i.e., answering sub-questions) and reasoning over them to derive the answer. Figure 1 illustrates this decoupling. To answer the shown question, we need to use knowledge about the Lusotitan dinosaur and almond trees to infer that the relevant sub-questions concern their heights. We refer to the relation height, which is not mentioned in the question, as an implicit relation. Once implicit relations are inferred, we can retrieve the relevant facts and deduce that the answer is 'False', as a Lusotitan is much taller than an almond tree.

Table 1: Example annotations of concept-relation pairs from IMPLICITRELATIONS along with the question source dataset and answer. Each source exhibits different facets of implicit reasoning questions.
In this work, we put a spotlight on implicit relations and investigate the ability of language models to infer them as a necessary (albeit insufficient) step for answering implicit reasoning questions. We first define implicit relations, and show that they can be reliably annotated through crowdsourcing (example annotated implicit relations are in Figure 1 and Table 1). To show implicit relations are common, we curate and annotate implicit reasoning questions from three existing datasets, STRATEGYQA, CREAK, and COMMONSENSEQA 2.0, which results in IMPLICITRELATIONS, a new evaluation benchmark containing 615 questions and 2,673 annotations of implicit relations.
We use our benchmark to evaluate the ability of large LMs to infer implicit relations, since they are known to acquire substantial amounts of knowledge and common sense with scale (Roberts et al., 2020; Liu et al., 2021; Smith et al., 2022), but struggle with implicit reasoning questions. Specifically, we evaluate models from the GPT-3 family using in-context learning, where the model is fixed and only a few examples are given as context.
We find that large LMs perform well on this task, with a 175B parameter model recovering 0.53-0.59 of the implicit relations across datasets, outperforming a baseline by 21-40 points. This is robust across methods for sampling in-context examples, and even in cross-data scenarios where in-context examples are sampled from a different dataset than the target question. However, inferring implicit relations does not improve accuracy on the downstream QA task, even when gold relations are provided. This suggests that the challenge of implicit reasoning questions is not primarily due to implicit relation inference, but possibly due to the need to also retrieve information and reason over it.
To conclude, in this work we propose the notion of implicit relations, and construct the IMPLICITRELATIONS evaluation benchmark for testing the ability of models to infer them from questions. We evaluate large LMs and show that they infer implicit relations fairly well, while still falling short of answering implicit reasoning questions. Our work facilitates future work on improving implicit relation inference, and sheds light on the factors relevant for developing models that can handle implicit reasoning. From a broader perspective, our work joins recent community efforts to highlight the ubiquity of missing and implicit elements in natural language (Cheng and Erk, 2018; Pyatkin et al., 2020; Elazar et al., 2022).

Implicit Relations
We now define the notion of implicit relations in the context of complex question answering.
Complex questions are questions that require multiple steps of reasoning in order to be answered (Yang et al., 2018; Talmor and Berant, 2018; Welbl et al., 2018; Khot et al., 2021). For example, the question "Was Linnaeus alive when Darwin published Origin of Species?" involves fetching two dates and then comparing them (Figure 2). A prominent challenge in complex QA, which has attracted substantial attention recently (Mihaylov et al., 2018; Khashabi et al., 2020; Lin et al., 2021; Geva et al., 2021; Yasunaga et al., 2021; Wei et al., 2022), is cases where the reasoning steps are implicit and should be inferred from the question. For instance, "Did Linnaeus edit Darwin's draft of Origin of Species?" involves the same reasoning steps, but they are not mentioned explicitly. Thus, the former question is an explicit reasoning question, while the latter is an implicit one (Figure 2).

Figure 2: An example of explicit and implicit reasoning questions that share the same question decomposition, along with the implicit relations derived from the retrieval steps in the decomposition.

Wolfson et al. (2020) proposed QDMR as a meaning representation for complex questions, where a complex question $q$ is decomposed into a sequence of $m$ reasoning steps $D = (s_1, \ldots, s_m)$, and each step $s_i$ corresponds to a simple natural language question. Answering the simple questions one-by-one yields the final answer (see decomposition in Figure 2). Geva et al. (2021) collected decompositions for implicit reasoning questions as part of the STRATEGYQA dataset, where importantly, inferring the sequence of reasoning steps is challenging due to their implicit nature. In addition to generating effective decompositions for implicit reasoning questions, we find an additional challenge in evaluating these decompositions when represented as a sequence of sub-questions. Specifically, Geva et al. (2021) distinguished two types of reasoning steps in a decomposition: retrieval steps, which require retrieval of facts ($s_1$ and $s_2$ in Figure 2), and logical steps, which perform logical reasoning over previous results ($s_3$ in Figure 2).
In this work, we observe that a key ingredient in inferring decompositions is to identify the implicit relations that are necessary for answering the question. Concretely, each retrieval step in a question decomposition can typically be represented as a concept-relation pair $\langle c, r \rangle$, where $c$ is a sequence of tokens from the question that refer to a concept, and $r$ is a relation of that concept. For example, the concept $c$ in step $s_2$ in Figure 2 is "Origin of Species", and the relation $r$ is its publication year.
Based on this observation, we provide the following definition for implicit relations in complex QA. Let $q$ be an implicit reasoning question, and denote by $D = (s_1, \ldots, s_m)$ its decomposition into a sequence of $m$ reasoning steps. Let $\{s_{i_1}, \ldots, s_{i_n}\}$ be the subset of retrieval steps in the decomposition $D$. We define the implicit relations for answering $q$ as the set $I = \{\langle c_1, r_1 \rangle, \ldots, \langle c_n, r_n \rangle\}$ of concept-relation pairs, where each concept-relation pair corresponds to a particular retrieval step.
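To make the definition concrete, the following minimal sketch represents a decomposition as a list of steps and reads off the set I from its retrieval steps. The `Step` class, the `implicit_relations` helper, the wording of the sub-questions, and the relation "year of death" for the first step are illustrative assumptions and are not taken from the benchmark; only the pair for "Origin of Species" (publication year) is stated in the text.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One reasoning step s_i of a decomposition D (hypothetical container)."""
    text: str                       # the simple sub-question
    kind: str                       # "retrieval" or "logical"
    concept: Optional[str] = None   # c: token span copied from the question
    relation: Optional[str] = None  # r: short phrase naming the relation


def implicit_relations(question: str, steps: List[Step]) -> List[Tuple[str, str]]:
    """Return the concept-relation pairs of the retrieval steps, in step order."""
    pairs = []
    for step in steps:
        if step.kind != "retrieval":
            continue  # logical steps contribute no pair
        if step.concept is None or step.relation is None:
            raise ValueError(f"retrieval step lacks a pair: {step.text!r}")
        if step.concept not in question:
            # concepts must be token sequences taken from the question
            raise ValueError(f"concept {step.concept!r} is not a span of the question")
        pairs.append((step.concept, step.relation))
    return pairs


question = "Did Linnaeus edit Darwin's draft of Origin of Species?"
decomposition = [
    # Sub-question wording and the first relation are assumed for illustration.
    Step("When did Linnaeus die?", "retrieval", "Linnaeus", "year of death"),
    Step("When was Origin of Species published?", "retrieval",
         "Origin of Species", "publication year"),
    Step("Is #2 before #1?", "logical"),
]
pairs = implicit_relations(question, decomposition)
```

The logical step is skipped, so the result has one pair per retrieval step, mirroring n retrieval steps out of m total steps in the definition.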
In the next sections, we will use this definition to construct a task for probing the ability of models to infer implicit relations (§3-§5), and investigate why they struggle on the downstream QA task (§6).

The IMPLICITRELATIONS Benchmark
In this section, we describe the process of creating IMPLICITRELATIONS, a benchmark for evaluating the ability of models to infer implicit relations.

Data Collection
We curate questions that require inferring implicit relations from three recent datasets:
• STRATEGYQA (Geva et al., 2021): A dataset of yes/no questions that require implicit multi-step reasoning. STRATEGYQA (STGQA) questions can be answered from Wikipedia, and are diverse in terms of the required reasoning skills and question topic. Large language models, such as GPT-3 (Brown et al., 2020), were shown to struggle on STGQA (BIG-bench collab., 2021).
• CREAK (Onoe et al., 2021): A dataset containing true/false statements that require common sense and knowledge about real-world entities.
• COMMONSENSEQA 2.0 (CSQA2; Talmor et al., 2021): A dataset containing yes/no questions and true/false statements. CSQA2 questions involve generic commonsense reasoning. Most questions do not require knowledge about particular entities.
Examples from each dataset are shown in Table 1.
Collecting questions from three sources serves two purposes. First, it demonstrates that inferring implicit relations is necessary for many question types: in multi-step questions (STGQA) and single-step questions (CREAK), in entity-focused questions (CREAK) and generic common sense questions (CSQA2). Second, these datasets were created using different protocols: STGQA and CSQA2 use a model-in-the-loop during data collection, while CREAK does not. STGQA and CREAK ask annotators to author questions freely, while CSQA2 employs a gamification mechanism. Having questions from different data collection pipelines increases the likelihood that empirical conclusions are not tied to a particular dataset.

Question curation
We chose questions that satisfy two properties: (a) answering the question requires inferring an implicit relation, and (b) the question is feasible, that is, it can be answered using real-world facts (provided as part of the benchmark in STGQA and CREAK) or using generic common sense (in CSQA2). We sampled examples from the training set of each of the three datasets and kept questions that satisfy the two properties.

Annotating implicit relations
We use Amazon Mechanical Turk to annotate implicit relations over curated questions. We qualify 15 crowdworkers to identify concepts, which are token sequences from the question, relevant for answering the question, and the corresponding relation for each concept (see annotation guidelines in Appendix C). Annotators can specify up to four concept-relation pairs per question, but in practice, 98.9% consist of ≤ 2 pairs. Concepts must be extracted directly from the input questions, and relations are phrased using concise natural language phrases.
For STGQA and CREAK, which often require uncommon knowledge about entities, we provided additional context from the original data source. For STGQA, we provided facts along with the full question decomposition. For CREAK, we provided an explanation for why the claim is true or false.
We collected 5 annotations per example in CREAK and CSQA2, and 3 annotations in STGQA. Due to the availability of facts and question decompositions, STGQA showed high agreement between annotators (see Table 2). CREAK and CSQA2 showed more variability, and thus, we collected more annotations per example. To ensure quality, we manually verified all examples created during the annotation process, filtering out annotations that do not fit the task requirements.

Data Analysis
Annotator Agreement
We manually analyzed 20 random examples from each data source to evaluate agreement between annotators for concepts and relations. We declared concept agreement when at least three annotators identified the same concept in an example. We found that annotators agreed on 77% of the concepts, and less than 10% of concepts were extracted only by a single annotator. See Table 2 for a breakdown by data source.
Assessing agreement on relations is more challenging, since relations are short phrases that can differ lexically. We marked for each example whether annotated relations differed only lexically or due to multiple reasoning strategies for answering the question. In 76% of the examples, the relations across all annotators were identical or differed lexically, e.g., the relations "average price" and "cost". In 24% of the examples, multiple reasoning strategies were used, which would result in different reasoning and retrieval steps (see Table 2).
Overall, our analysis suggests that implicit relations can be annotated reliably.
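The concept-agreement criterion above can be sketched as a simple count. This is only an illustration under stated assumptions: the analysis in the paper was manual, the helper name `concept_agreement` is hypothetical, concepts are compared after lowercasing, and the rates are computed over the distinct concepts of one example.

```python
from collections import Counter
from typing import List, Tuple


def concept_agreement(annotations: List[List[str]],
                      min_annotators: int = 3) -> Tuple[float, float]:
    """Return (share of concepts with agreement, share named by one annotator).

    `annotations` holds one list of concepts per annotator for a single example.
    """
    counts = Counter()
    for concepts in annotations:
        # count each concept at most once per annotator
        for concept in {c.strip().lower() for c in concepts}:
            counts[concept] += 1
    if not counts:
        return 0.0, 0.0
    agreed = sum(1 for n in counts.values() if n >= min_annotators)
    single = sum(1 for n in counts.values() if n == 1)
    return agreed / len(counts), single / len(counts)


# Five hypothetical annotators for one question.
example = [
    ["Lusotitan", "almond tree"],
    ["Lusotitan", "almond tree"],
    ["lusotitan"],
    ["Lusotitan", "tree"],
    ["almond tree"],
]
agreed_rate, single_rate = concept_agreement(example)
```

Here "lusotitan" is named by four annotators and "almond tree" by three, so both count as agreed, while "tree" is named by a single annotator.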

Experimental Setting
We now turn to evaluate the ability of large LMs to infer implicit relations in questions. To this end, we use examples from IMPLICITRELATIONS in a few-shot in-context learning setting (Brown et al., 2020), where given several input-output examples and a test input, the LM is expected to generate the required output. We focus on this setup following the recent progress in in-context learning, specifically for tasks that involve general commonsense reasoning (Da et al., 2021; Chowdhery et al., 2022).

Task and Model Specification
Given a test question $q^*$, we prepend $k$ annotated examples $\{\langle q^{(i)}, I^{(i)} \rangle\}_{i=1}^{k}$ to it, where $q^{(i)}$ is the $i$-th question in this set and $I^{(i)} = \{\langle c^{(i)}_1, r^{(i)}_1 \rangle, \ldots, \langle c^{(i)}_{n_i}, r^{(i)}_{n_i} \rangle\}$ is the set of corresponding concept-relation pairs. Every example and the test question follow a fixed input format; example inputs are given in Appendix A. The prefixes "Question" and "Implicit Reasoning" were chosen arbitrarily and remained fixed throughout all experiments. For each test question, $k$ examples are randomly sampled from the development set, and for each example, a single random annotation is selected. In our experiments, we use $k = 16$.

Models
We evaluate models from the GPT-3 family, which are known to exhibit broad factual knowledge (Brown et al., 2020). In particular, we use text-davinci-002, a 175B-parameter LM, which to the best of our knowledge was not trained on any of the target benchmarks. Furthermore, to assess the scaling behaviour of models, we experiment with other models from the GPT-3 family; see §5.2 for details. In all experiments, outputs are predicted using greedy decoding.

Baseline
To account for correlations originating from the concepts that appear in the question or reasoning shortcuts, we define a 'Concept-only' baseline, where instead of testing whether the model can infer the implicit relations from the full question, we test its ability to infer them from the set of gold concepts that appear in the question. For this baseline, we use the same inputs as before, but replace every question $q^{(i)}$ and test question $q^*$ with its set of annotated concepts.
While the identity of the gold concepts provides useful information for inferring implicit relations, we expect models that have access to the full question to perform better.
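The prompt construction described in this section can be sketched as follows. The exact input format is given in Appendix A and is not reproduced here, so the line layout, the "concept: relation" rendering, the separators, and the helper names `format_example`, `build_prompt`, and `concept_only_input` are assumptions made for illustration. Only the two prefixes, the random sampling of k development examples, and the choice of a single random annotation per example come from the text.

```python
import random
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]  # (concept, relation)


def format_example(question: str, pairs: Optional[List[Pair]] = None) -> str:
    """Render one block; the test question is left open for the LM to complete."""
    if pairs is None:
        return f"Question: {question}\nImplicit Reasoning:"
    rendered = "; ".join(f"{c}: {r}" for c, r in pairs)  # assumed rendering
    return f"Question: {question}\nImplicit Reasoning: {rendered}"


def build_prompt(test_question: str, dev_set: List[Dict], k: int = 16,
                 seed: int = 0) -> str:
    """Sample k development examples, one random annotation each, then append q*."""
    rng = random.Random(seed)
    blocks = []
    for example in rng.sample(dev_set, k):  # requires k <= len(dev_set)
        annotation = rng.choice(example["annotations"])
        blocks.append(format_example(example["question"], annotation))
    blocks.append(format_example(test_question))
    return "\n\n".join(blocks)


def concept_only_input(pairs: List[Pair]) -> str:
    """Concept-only baseline: the question is replaced by its gold concepts."""
    return ", ".join(c for c, _ in pairs)


# Toy development set with hypothetical annotations.
dev_set = [
    {"question": "Is a Lusotitan shorter than an almond tree?",
     "annotations": [[("Lusotitan", "height"), ("almond tree", "height")]]},
    {"question": "Was Linnaeus alive when Origin of Species appeared?",
     "annotations": [[("Linnaeus", "year of death"),
                      ("Origin of Species", "publication year")]]},
    {"question": "Can a piano fit in a backpack?",
     "annotations": [[("piano", "size")], [("piano", "dimensions")]]},
]
prompt = build_prompt("Does Santa Claus work during summer?", dev_set, k=2)
```

Passing a fixed seed keeps the sampled context reproducible, which matters when results are averaged over several seeds.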

Evaluation
Inferring implicit relations involves identifying concepts and relations. We now define evaluation metrics for this task.

Concept extraction
Our output is a set of concept-relation pairs. Let $C_{pred}$ be the set of concepts predicted by the LM, and let $C^{i}_{gold}$ be the gold set annotated by annotator $i$. Given that annotated concepts are tokens from the original question, we can use edit distance to match each predicted concept $c \in C_{pred}$ with a concept from $C^{i}_{gold}$, declaring a match when the edit-distance score of the best match is above 0.8. Following concept matching, we compute recall and precision in the typical fashion and take the max over all annotators. A post-hoc manual analysis validated that no incorrect concept matching occurred due to the use of edit distance.
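A minimal sketch of the concept-matching step, under stated assumptions: the score is taken to be a normalized similarity, 1 minus the Levenshtein distance divided by the longer string length, and the maximum over annotators is taken separately for precision and recall. The text does not specify either detail, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint ones."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def match_concepts(pred: List[str], gold: List[str],
                   threshold: float = 0.8) -> Dict[str, str]:
    """Map each predicted concept to its best gold concept if the score passes."""
    matches = {}
    for p in pred:
        if not gold:
            break
        best = max(gold, key=lambda g: similarity(p, g))
        if similarity(p, best) > threshold:
            matches[p] = best
    return matches


def concept_scores(pred: List[str], gold_per_annotator: List[List[str]],
                   threshold: float = 0.8) -> Tuple[float, float]:
    """Return (precision, recall), each maximized over annotators."""
    best_p = best_r = 0.0
    for gold in gold_per_annotator:
        matches = match_concepts(pred, gold, threshold)
        precision = len(matches) / len(pred) if pred else 0.0
        recall = len(set(matches.values())) / len(gold) if gold else 0.0
        best_p, best_r = max(best_p, precision), max(best_r, recall)
    return best_p, best_r


precision, recall = concept_scores(
    ["Lusotitan", "almond tree"],
    [["Lusotitan", "almond trees"], ["almond tree"]],
)
```

In the example, "almond tree" and "almond trees" differ by one character out of twelve, so the pair passes the 0.8 threshold and the first annotator yields full precision and recall.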
Relation coverage
Since relations are short phrases with high lexical variability, string matching is not a viable strategy. To overcome this, we leverage our ability to align concepts and use relation embeddings rather than relation strings. Figure 3 depicts our evaluation procedure for two sets of predicted ($R_{pred}$) and annotated ($R_{gold}$) relations. First (Figure 3, step 1), we align predicted and gold concept-relation pairs, using the concepts (as done for concept evaluation). Then (Figure 3, step 2), we embed every relation phrase, using the sentence transformer all-mpnet-base-v2 (Reimers and Gurevych, 2019), and compute cosine similarity between the embeddings of matched relations (defined as zero if no relation was matched). Last (Figure 3, step 3), we consider a gold relation $r_{gold}$ covered if cosine similarity is higher than a threshold $\tau$, and compute relation coverage, that is, the fraction of gold relations that were covered. With this procedure, we evaluate model predictions against each annotation and take the maximum as the final relation coverage.

Table 3:
                        STGQA         CREAK         CSQA2
# of gold pairs         1.9           1.2           1.2
# of generated pairs    2.0 ± 0.04    1.2 ± 0.02    1.2 ± 0.01
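The three steps of the relation coverage computation can be sketched as below. To keep the example self-contained, the embedding function is a toy bag-of-words stand-in for the all-mpnet-base-v2 sentence transformer, and concept alignment is reduced to case-insensitive exact matching rather than edit distance. Both substitutions, the tiny vocabulary, and all function names are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Pair = Tuple[str, str]  # (concept, relation)
VOCAB = ["height", "price", "cost", "year"]  # toy vocabulary, illustration only


def toy_embed(phrase: str) -> List[float]:
    """Stand-in for a sentence encoder: word counts over a tiny vocabulary."""
    tokens = phrase.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_exact(gold_concept: str, pred_concepts: List[str]) -> Optional[str]:
    """Step 1 (simplified): case-insensitive exact concept alignment."""
    for p in pred_concepts:
        if p.lower() == gold_concept.lower():
            return p
    return None


def relation_coverage(gold_pairs: List[Pair], pred: Dict[str, str],
                      embed: Callable[[str], Sequence[float]] = toy_embed,
                      tau: float = 0.51) -> float:
    """Fraction of gold relations whose matched prediction has similarity > tau."""
    if not gold_pairs:
        return 0.0
    covered = 0
    for concept, gold_relation in gold_pairs:
        matched = align_exact(concept, list(pred))
        sim = 0.0  # similarity is zero when no relation was matched
        if matched is not None:
            sim = cosine(embed(gold_relation), embed(pred[matched]))  # step 2
        covered += sim > tau                                          # step 3
    return covered / len(gold_pairs)


def final_coverage(annotations: List[List[Pair]], pred: Dict[str, str]) -> float:
    """Evaluate against each annotation and keep the maximum."""
    return max(relation_coverage(gold, pred) for gold in annotations)


prediction = {"Lusotitan": "height", "almond tree": "price"}
annotation_a = [("Lusotitan", "height"), ("almond tree", "height")]
annotation_b = [("Lusotitan", "height")]
coverage_a = relation_coverage(annotation_a, prediction)
coverage = final_coverage([annotation_a, annotation_b], prediction)
```

Because coverage is computed over gold relations only, extra predicted pairs do not lower the score, which is why the measure behaves like recall rather than precision.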
We focus on coverage (rather than precision) since we care about whether a model can reveal implicit relations, but predicting additional ones is mostly harmless. Moreover, since we use in-context learning, the average number of concept-relation pairs generated is similar to the average number of gold concept-relation pairs (Table 3).
To set a threshold $\tau$ on relation embedding similarity, we annotated whether a predicted relation is semantically equivalent to a matched gold relation for 100 development examples. We choose $\tau = 0.51$, which results in a 5% false-positive rate (predicting that two different relations are equivalent) and 12% false-negative rate (predicting that two equivalent relations are different). All reported results are an average over three random seeds.
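As an illustration of how a threshold can be checked against manual equivalence labels, the sketch below computes false-positive and false-negative rates for a given value of τ. The normalization is an assumption (each rate is taken within its own true class), the text does not state how the reported 5% and 12% were normalized, and the labeled similarities are invented toy values.

```python
from typing import List, Tuple


def error_rates(labeled: List[Tuple[float, bool]], tau: float) -> Tuple[float, float]:
    """Return (false-positive rate, false-negative rate) at threshold tau.

    Each item is (cosine similarity, manually judged equivalent?).
    """
    negatives = [s for s, equivalent in labeled if not equivalent]
    positives = [s for s, equivalent in labeled if equivalent]
    false_pos = sum(1 for s in negatives if s > tau)    # different, judged covered
    false_neg = sum(1 for s in positives if s <= tau)   # equivalent, judged uncovered
    fpr = false_pos / len(negatives) if negatives else 0.0
    fnr = false_neg / len(positives) if positives else 0.0
    return fpr, fnr


# Invented toy labels, not data from the benchmark.
toy_labels = [(0.9, True), (0.6, True), (0.4, True),
              (0.7, False), (0.3, False), (0.2, False), (0.1, False)]
fpr, fnr = error_rates(toy_labels, tau=0.51)
```

Sweeping tau over a grid and inspecting both rates is one simple way to pick an operating point that trades the two error types.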

Large LMs Can Infer Implicit Relations
Table 4 shows results on implicit relation inference. First, the model successfully identifies the relevant concepts in the question, achieving high concept recall and precision across all datasets. This is not limited to named entities but is also achieved when concepts are more abstract, as in CSQA2 (Table 1). More importantly, GPT-3 infers the implicit relations well, achieving relation coverage scores of 0.53, 0.54, and 0.59 on STGQA, CREAK and CSQA2, respectively. Moreover, a model exposed to the full question dramatically outperforms a model exposed only to the gold concepts by 21, 21, and 40 points on the three datasets. This indicates that concepts contain relevant information, but access to the full question allows the LM to infer the reasoning strategy and in turn the implicit relations. We provide examples for predicted vs. gold concept-relation pairs along with additional qualitative analysis in Appendix A.
Next, we perform additional experiments to (a) further substantiate the ability of LMs to infer implicit relations (§5.1), and (b) test the effect of model scale on performance (§5.2).

Effect of In-Context Examples
While the aforementioned results are encouraging, there are two potential (non-disjoint) causes for them: (a) the LM "understands" the task of implicit relation inference, or (b) the LM observes in-context examples and uses them to guess implicit relations for the target question ("soft copying"). We study the effect of these causes.

Similar vs. dissimilar in-context examples
To quantify the effect of in-context examples, rather than choosing them randomly, we use examples that are similar or dissimilar to the target question in terms of their implicit relations. We first represent each example as an embedding vector, by (a) concatenating all annotated relations, i.e., $r_1, \ldots, r_n$, and computing a vector representation using a sentence transformer, and (b) averaging the embedding vectors of all annotators. Then, for each example, we select two sets of in-context examples: (a) Similar: the top-$k$ most similar examples (using cosine similarity), and (b) Dissimilar: we discard the 33% most similar examples, and randomly sample from the rest. In both cases, we use gold implicit relations at test time, and thus this experiment is for analysis only.
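The two selection strategies can be sketched as follows, assuming each candidate already comes with its averaged relation embedding. The function names, the toy two-dimensional vectors, and the rounding of the 33% cut-off (rounded up here) are assumptions; the text only specifies top-k selection for Similar and random sampling after discarding the 33% most similar examples for Dissimilar.

```python
import math
import random
from typing import List, Sequence, Tuple

Candidate = Tuple[str, Sequence[float]]  # (example id, averaged relation embedding)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(target: Sequence[float], pool: List[Candidate]) -> List[Candidate]:
    """Most similar candidates first."""
    return sorted(pool, key=lambda cand: cosine(target, cand[1]), reverse=True)


def select_similar(target: Sequence[float], pool: List[Candidate], k: int) -> List[str]:
    """Similar: the top-k most similar examples."""
    return [name for name, _ in rank_by_similarity(target, pool)[:k]]


def select_dissimilar(target: Sequence[float], pool: List[Candidate], k: int,
                      discard_fraction: float = 0.33, seed: int = 0) -> List[str]:
    """Dissimilar: drop the most similar fraction, then sample randomly."""
    ranked = rank_by_similarity(target, pool)
    n_discard = math.ceil(discard_fraction * len(ranked))  # rounding is assumed
    rest = ranked[n_discard:]
    rng = random.Random(seed)
    return [name for name, _ in rng.sample(rest, min(k, len(rest)))]


target_vec = [1.0, 0.0]
pool = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.0, 1.0]),
        ("d", [0.1, 0.9]), ("e", [0.5, 0.5]), ("f", [-1.0, 0.0])]
similar = select_similar(target_vec, pool, k=2)
dissimilar = select_dissimilar(target_vec, pool, k=2)
```

Note that both strategies need the gold relations of the target question to build its embedding, which is why the setting is suitable for analysis only.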
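Table 5 shows relation coverage for the different sets of in-context examples and the fraction of cases where one of the implicit relations predicted by the LM is copied from the in-context examples. When Dissimilar examples are presented, there is a slight performance degradation, most notably in STGQA. However, results are still dramatically higher compared to Concept-only. Moreover, the model succeeds in predicting implicit relations while hardly copying from in-context examples.
In the Similar setting, performance increases across all datasets, along with a much higher rate of copying. This hints that designing methods for retrieving similar prompts can lead to gains in performance (Rubin et al., 2022).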
To further investigate the relation between copying and performance, we label every example for whether the model copies from in-context examples and the coverage of the inferred implicit relations. We then compute the point-biserial correlation (Tate, 1954) to check if copying is correlated with performance and find that correlation is low (< 0.1 for all datasets), showing that copying does not explain model performance.
Overall, this experiment suggests that while models can leverage examples from context to improve performance, the LM does more than copy: it performs implicit relation inference.
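For reference, the point-biserial correlation used in this analysis relates a binary variable (here, whether the model copied) to a continuous one (here, relation coverage). The sketch below implements the standard formula $r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}}$, where $s_n$ is the population standard deviation of the continuous variable; the function name and the toy inputs are illustrative and are not data from the experiments.

```python
import math
from typing import Sequence


def point_biserial(binary: Sequence[int], values: Sequence[float]) -> float:
    """Point-biserial correlation between a 0/1 variable and a continuous one."""
    if len(binary) != len(values):
        raise ValueError("inputs must have the same length")
    ones = [v for b, v in zip(binary, values) if b]
    zeros = [v for b, v in zip(binary, values) if not b]
    if not ones or not zeros:
        raise ValueError("both groups must be non-empty")
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    if std == 0:
        raise ValueError("continuous variable has zero variance")
    m1 = sum(ones) / len(ones)
    m0 = sum(zeros) / len(zeros)
    return (m1 - m0) / std * math.sqrt(len(ones) * len(zeros) / n ** 2)


# Toy inputs: copied flags and relation coverage per example.
r_perfect = point_biserial([1, 1, 0, 0], [1.0, 1.0, 0.0, 0.0])
r_none = point_biserial([1, 0, 1, 0], [1.0, 1.0, 0.0, 0.0])
```

The value equals the Pearson correlation computed with the binary variable coded as 0 and 1, so values near zero indicate that copying carries little information about coverage.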

Cross-dataset in-context examples
If LMs can infer implicit relations, we should expect high performance even when in-context examples and target questions are taken from different datasets.
To test this, we evaluate performance on questions from CREAK and CSQA2 when in-context examples originate from all 3 datasets. Testing on STGQA does not work well because the number of implicit relations in an example is typically two, while in CREAK and CSQA2 it is typically one (see Table 3), and thus the LM outputs a single implicit relation, leading to poor relation coverage.
Table 6 shows that, overall, relation coverage remains high for all sources, suggesting that the LM indeed infers implicit relations regardless of the

Effect of Model Size
Recent work (Kaplan et al., 2020; Smith et al., 2022; Chowdhery et al., 2022) has shown that reasoning abilities of LMs improve with model size.
We evaluate this effect on models from the GPT-3 family: ada, babbage, curie, and davinci, which are estimated to have 350M, 1.3B, 6.7B, and 175B parameters, respectively (Gao, 2021; Black et al., 2022). text-davinci-002, the model evaluated thus far, is a more recent LM that was trained differently (Ouyang et al., 2022).

Implicit Relations for QA
Given that LMs infer implicit relations well, a natural question is whether they improve performance on answering implicit reasoning questions.
To examine this, we created three experimental setups:
• Question + predicted: in-context examples are triples of the question, the implicit relations, and the True/False answer; the model is given a question and asked to return the implicit relations and the answer.
• Question + gold: similar to Question + predicted, except that the model is given the target question and gold implicit relations and asked to return the answer.
• Question only: in-context examples are pairs of questions and answers, and the model is given a question and asked to provide the answer.
We report an average over 7 seeds. See Figure 4 for results.
Overall, access to either gold or predicted relations does not improve accuracy. This suggests that additional capabilities are missing from LMs to handle implicit reasoning questions, such as retrieval and reasoning. This agrees with work on chain-of-thought prompting (Wei et al., 2022), which found that adding an explanation of the reasoning process to a question does not improve performance on STGQA for both GPT-3 and LaMDA (Thoppilan et al., 2022). Nevertheless, recently Chowdhery et al. (2022) achieved improvements on STGQA using chain-of-thought prompting, but with the larger 540B-parameter PaLM.
On STGQA, adding gold relations (Question + gold) does not improve QA performance. Additionally, no significant differences were observed when the model inferred implicit relations on its own (Question + predicted). For CREAK, adding gold implicit relations did not improve accuracy compared to Question only, and none of the experiments showed any significant difference. Last, in CSQA2 adding gold implicit relations (Question + gold) did not improve the QA performance, but we observed a statistically significant accuracy improvement of 4.5% when the model inferred the implicit relations (Question + predicted).
To further analyze the results, we computed the point-biserial correlation coefficient between the relation coverage score and the binary outcome (correct/incorrect) for each question. We found that relation coverage score and answer accuracy are essentially uncorrelated, with $r_{pb}$ coefficients of 0.03, −0.02 and 0.06 for STGQA, CSQA2 and CREAK, respectively. Overall, our results indicate that inferring implicit relations correctly is not sufficient to answer implicit reasoning questions.

Related Work
Recent work utilized the ability of large LMs to generate intermediate reasoning steps for improving performance on QA tasks (Wei et al., 2022; Wang et al., 2022; Zelikman et al., 2022; Nye et al., 2022). Wei et al. (2022) introduced 'chain-of-thought' prompting to elicit intermediate reasoning steps along with answers from LMs, which improved performance on several reasoning tasks. Conversely, we propose a task and benchmark for evaluating the ability of LMs to infer the intermediate reasoning steps themselves.
Prior work has dealt with reasoning abilities in LMs (Talmor et al., 2020; Khashabi et al., 2020; Gu et al., 2021) by fine-tuning LMs to generate additional knowledge for reasoning questions. We contribute to this effort by evaluating the in-context ability to infer implicit relations with large LMs.
Implicit relations are closely related to question decompositions, which have been used in past work to improve performance on questions that require reasoning (Min et al., 2019; Wolfson et al., 2020; Perez et al., 2020; Khot et al., 2021). We contribute to this research direction by defining implicit relation pairs, which provide a structured representation of the decomposed sub-questions and allow us to examine how language models infer reasoning steps. Several works in narrative understanding (Rajpurkar et al., 2018; Mostafazadeh et al., 2020; Lal et al., 2021) have attempted to assess a model's implicit reasoning capabilities using different methods, such as assessing the solution path to unanswerable questions and narrative understanding through question answering. Despite their different approaches, these studies are relevant to our work.

Conclusion
We propose the task of implicit relation inference, which decouples inference of reasoning steps from their execution. We introduce IMPLICITRELATIONS, a benchmark that includes more than 2,000 annotated implicit relations. We show large LMs can infer implicit relations across multiple types of questions and reasoning skills, but this success does not translate to an improvement in answering implicit reasoning questions. Our work sheds light on capabilities missing from large LMs for addressing implicit reasoning questions, and provides a valuable resource for improving the ability of models to infer implicit relations.

Limitations
This research has some limitations, which are typical for work on text generation with large language models.
First, we demonstrated that large LMs can infer implicit relations from complex questions, but we also showed that they may fail to answer those questions correctly. It is unclear how LMs can use implicit relations to improve QA accuracy or what is the path that leads from inferring implicit relations to actually answering the questions.
Second, evaluating relation coverage requires comparing to free texts, and therefore may be prone to error. Despite the fact that an analysis performed manually exhibited a high degree of consistency with the automatic one, we cannot guarantee the same result for datasets or parameters that have not been tested.
Finally, our research was conducted utilizing OpenAI's GPT-3 family of models, which are not publicly available. Despite our best efforts to eliminate confounding factors, there is a lack of transparency regarding the training methods and the data composition used for pretraining those models.

A Examples and Qualitative Analysis
A.1 Qualitative analysis of evaluation
In addition to the evaluation described in §4 and used throughout our study, we performed manual qualitative analysis to assess the relation coverage metric on model outputs. We sampled 50 examples randomly from STGQA together with GPT-3 predictions and manually labeled whether the implicit relations output by the model are semantically correct.
Results in Table 8 show that our automatic relation coverage was slightly more conservative than our manual evaluation (0.53 vs. 0.6). Out of the 20 examples that we marked as incorrect, the automatic evaluation scored 13 examples (65%) with 0 relation coverage and 6 examples (30%) with partial coverage (scored 0.5); the latter indeed included a partially correct prediction, but not enough to cover the needed reasoning process. Only one example out of the 20 was a false positive.

B Number of Prompt Examples
We investigate how the number of examples influences implicit relation inference. We run the original experiment with $k \in \{8, 16, 32\}$ examples in the prompt for 3 random seeds. The results (Table 12) show that the number of examples in the prompt has little effect on all evaluation metrics, justifying our choice to use $k = 16$ in all experiments.

Figure 1: An example implicit reasoning question, where answering requires inferring implicit relations that are not explicitly mentioned. Besides implicit relations, answering the question also requires reasoning over the relevant retrieved facts. In this work, we focus on the first step of inferring implicit relations.

Figure 3: Given predicted concept-relation pairs (P) and gold concept-relation pairs (G), we evaluate relations by aligning predicted and gold concepts using edit distance (1), and computing the cosine similarity between matched relation embeddings (2). Relation coverage is the fraction of gold relations with similarity > τ (3).

Figure 4: Test QA accuracy under all conditions, averaged over 7 seeds. Providing the gold implicit relations did not contribute to model performance.

Tables 9, 10, and 11 present examples from each source dataset accompanied by the question, gold annotated implicit relation pairs, predicted pairs generated by the model and the answer to the question.

Table 2: Statistics on IMPLICITRELATIONS. The benchmark contains 615 questions, ∼200 per dataset, where exactly 100 examples from each data source are used as a test set, and the rest are used as a development set. Development set examples are in Appendix A.

Table 3: The mean number of concept-relation pairs in the development set of each data source. For model-generated pairs, we report an average over 3 seeds.

Table 4: Test set performance for concept-only (CO) and full-question (FQ) for all datasets.

Table 5: Development set performance. Controlling the set of in-context examples with Dissimilar, Similar, Random relations, and Concept-only baseline.

Table 7 presents results on STGQA. Increasing model size improves relation coverage and concept recall, but does not significantly change concept precision. Moving from curie to davinci leads to a modest gain in relation coverage. Comparing this to the order of magnitude difference in parameters between curie and davinci suggests that inferring implicit relations does not explain performance improvement in many reasoning and commonsense QA benchmarks. The smallest model, babbage, tends to produce structural errors, indicating it did not properly learn the task.

Table 7: Model size and performance comparison on the development set of STGQA. The relation coverage improves as model size is increased.

Table 8: Manual and automatic evaluation of 50 examples from STGQA.

Table 12: Development set results for 8/16/32 examples in the prompt, averaged for 3 seeds.