
Many real-world problems require the combined application of multiple reasoning abilities—employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, “How much would the sea level rise if all ice in the world melted?” FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) A collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, helping in supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, on average making estimates that are off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single, new challenge that we hope will spur further advances in building systems that can reason.


Introduction
How long is the drive from Seattle to NYC? How big of an emergency fund do I need? We frequently encounter such questions in our daily lives. Likewise, scientists are often faced with questions such as, How much would the ocean surface rise if the ice caps melted? Known as Fermi Problems (FPs), these questions are problems whose answers can only be estimated within reasonable limits of error, as precisely measuring the required quantity is either impossible or impractical.
Solving an FP requires considerable life experience, the ability to think through long chains of reasoning, and mathematical intuition, as illustrated by fig. 1 for an FP about rising sea levels. Answering an FP correctly requires multiple different facts, and the correct estimate can be arrived at via various reasoning paths; this open-ended nature further adds to their challenge. Unsurprisingly, these questions are often used to test candidates in science Olympiads and interviews. Due to the complexity of reasoning that is required to answer these questions, we propose solving FPs as a new task to drive progress in AI reasoning systems.
A core skill required for solving FPs is that of estimation, e.g., "How much money do I need for a medical emergency?", "What is the thickness of ice sheets in Antarctica?", etc. This crucially requires abstracting out details to simplify a complex real-world problem, similar to the (in)famous metaphor of the spherical cow. For instance, when estimating the volume of Mt. Everest, abstracting that the mountain is a conical object significantly simplifies the estimation problem at hand. To the best of our knowledge, our proposed FP challenge is the first of its kind that requires reasoning of this nature. Through this challenge, we hope to spur research towards building AI reasoning systems that are capable of performing such abstractions, a key reasoning skill that is natural to humans.
The complex reasoning involved in solving FPs (see fig. 1) often means the question must be creatively decomposed into simpler ones. These simpler questions are often themselves open-ended Fermi problems. We thus hope our challenge will encourage advances in recursive reasoning models. Further, FPs require the combined application of multiple reasoning strategies to solve the problem. Unlike existing datasets and tasks geared towards specific reasoning skills (e.g., commonsense reasoning or question decomposition), we hope our work drives progress in not just the ability of AI to employ suitable abstractions and estimations, but also in models that can combine various reasoning abilities to produce a coherent solution.

Contributions.
1. We introduce Fermi Problems (FPs) as a task to drive progress in AI reasoning systems, specifically testing their ability to make reasonable abstractions, creatively decompose questions into solvable chunks, and employ commonsense reasoning.
2. We collect a set REALFP of 1k real-world FPs aggregated from numerous websites, quizzes and Olympiads. Further, we provide a synthetic dataset SYNTHFP of 10k questions with the aim of serving as a bank of more accessible problems of intermediate complexity and, hopefully, aiding the development of AI models for the harder real-world setting. Both datasets are available at https://allenai.org/data/fermi.
3. Based on the FP datasets, we propose three tasks of increasing hardness and establish baselines built around state-of-the-art language models. We find that FPs are well beyond the reach of such systems even after substantial fine-tuning: on average, they make predictions that are off by two orders of magnitude and only slightly better than predicting a constant value. Further, we provide an analysis of both the dataset and baselines to illustrate the hardness of the proposed tasks and motivate future advances.

Fermi Problems
The following properties of FPs and their solutions make them an ideal candidate for evaluating and advancing AI reasoning.
(1) Recursive Nature of Sub-Problems. As mentioned previously, problem decomposition is an important aspect of FPs. An interesting property of FPs is that decomposed sub-problems are also FPs, e.g., "How many dump trucks to empty Mt. Everest?" requires answering "What is the volume of Mt. Everest?" and "What is the volume of a dump truck?", which are, in turn, FPs. We employ this property of FPs to create a richer synthetic dataset (see section 4 for more details).
(2) Creativity in FP solutions. Problem decomposition for FPs is not only recursive but also requires a considerable amount of creativity. For the above FP about emptying Mt. Everest, an alternative decomposition is "How many dump trucks to empty Mt. Rainier?" and "How many Mt. Rainiers fit in Mt. Everest?". Note that the decomposition still retains the recursive nature but now follows an alternate path. The exact decomposition is a function of the knowledge and life experiences of a person, and in the case of an AI, the information accessible to it, either through information stored in its parameters, a retrieval mechanism or a knowledge base. As an accurate estimate is sought at the end, creativity in problem decomposition is closely intertwined with the problem of what can be estimated. In addition to practical scenarios (e.g., "How many port-a-potties are needed for a gathering of 1 million people?"), FPs often concern (a) unrelated objects (e.g., Mars bars and Olympic pools), (b) unusual attributes of common objects (e.g., volume of a Mars Bar as opposed to its calorific value) and (c) hypothetical scenarios (e.g., "Consider the earth and moon are at two ends of the school oval, how far is the sun?"). Thus, FPs require going beyond biases encountered in the real world or in previous problems. Estimating the answer to such questions requires the ability to think creatively, a thorough understanding of the underlying process and the intent of the question.
(3) Need for Reasonable Abstractions. Despite taking a creative approach, one can be unsuccessful at solving FPs without the ability to make reasonable abstractions. Returning to our running example of emptying Mt. Everest, a creative decomposition leads us to consider the volume of Mt. Everest relative to that of Mt. Rainier. However, we still need to address the issue of computing the volume of Mt. Rainier; here, assuming it to be a conical object helps us in computing a reasonable estimate. We humans employ various abstractions regularly in our daily lives, e.g., spatial ("Is the road wide enough to turn my car?"), temporal ("Do I have enough time to grab lunch before the next meeting?") and causal ("Pressing the gas pedal makes my car rush forward"). We would require such a key skill to be well within the reach of AI systems and, to this end, the proposed FP challenge is an ideal downstream task to evaluate this.
Arriving at the correct answer requires one to make reasonable abstractions at each step. This requires a sufficiently accurate working model of the world and is broadly categorized as life experience. For example, the fact that a Mars Bar can be eaten in a few bites can help determine its volume. Similarly, understanding that pizza shops usually cater to homes within a radius of a few miles helps in estimating the number of pizza delivery persons in Chicago. Further, domain-specific reasoning might be required to solve some FPs; for example, the FP illustrated in fig. 1 requires an understanding of physics to infer that only land ice leads to an increase in sea levels.
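To make the running example concrete, the following minimal sketch combines the conical abstraction with the two decompositions discussed above. All dimensions and the truck capacity are illustrative placeholders chosen for this sketch, not values from the datasets, and the helper name cone_volume is hypothetical.

```python
import math


def cone_volume(height_m: float, base_radius_m: float) -> float:
    """Volume of a cone in cubic metres: (1/3) * pi * r^2 * h."""
    return math.pi * base_radius_m ** 2 * height_m / 3.0


# Illustrative placeholder dimensions (assumptions, not dataset values).
EVEREST_M3 = cone_volume(height_m=4_000, base_radius_m=10_000)
RAINIER_M3 = cone_volume(height_m=2_500, base_radius_m=6_000)
TRUCK_M3 = 10.0  # assumed capacity of one dump truck

# Decomposition 1: volume of Mt. Everest / volume of a dump truck.
direct = EVEREST_M3 / TRUCK_M3

# Decomposition 2: (trucks to empty Mt. Rainier) * (Mt. Rainiers in Mt. Everest).
via_rainier = (RAINIER_M3 / TRUCK_M3) * (EVEREST_M3 / RAINIER_M3)

print(f"{direct:.1e} trucks (direct), {via_rainier:.1e} trucks (via Mt. Rainier)")
```

Both paths give the same estimate here only because they share the same placeholder numbers; in practice each path relies on different facts, so the resulting estimates can differ.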

Related Work
Mathematical Reasoning. In the area of mathematical reasoning, several projects have probed the limits of transformers to solve pure math problems (Saxton et al., 2019; Lample and Charton, 2019; Hendrycks et al., 2021). FPs differ from these problems in two important ways. First, due to the heuristic nature of their solutions, FPs do not have a unique, precise answer with formal proof, in the way that normal mathematical problems do. Second, FPs are stated in natural language (NL) rather than a formal, mathematical notation. FPs are perhaps closer to algebra word problems, where an NL question, e.g., "How many cookies were left?", is asked about a simple NL story (Amini et al., 2019; Ling et al., 2017; Koncel-Kedziorski et al., 2015). However, in algebra word problems, answers are again uniquely defined and provable. In addition, all required information is provided in the story, while in FPs the solver must find or recall the required information. Finally, in story problems, the space of possible solution equations is typically small and well-defined enough that it can be exhaustively searched, while FPs can have arbitrarily complex solutions (e.g., Figure 1).
Question Decomposition. FPs require problem decomposition, in a way loosely similar to multihop inference. However, for FPs, the appropriate decomposition is not explicit in the question itself, unlike early multihop datasets such as HotpotQA (Yang et al., 2018) or WebQuestions (Berant et al., 2013). Later multihop datasets, e.g., OBQA (Mihaylov et al., 2018), contained questions where the decomposition was not explicit in the question (e.g., "Does a suit of armor conduct electricity?", implicitly requiring a subquestion about materials), but typically into just two (or at most three) steps. In contrast, FPs typically require multiple levels of decomposition, significantly increasing complexity. This in turn requires identifying a solution strategy, namely how to factor an unknown quantity into a function of known (or recursively factorable) quantities. The StrategyQA (Geva et al., 2021) dataset illustrates this problem but for a different task, namely true/false questions about whether something is possible, and without recursive decomposition, a key feature of FPs.
Commonsense. In addition to mathematical reasoning, FPs require significant commonsense knowledge, both for estimating quantities and for decomposing problems. For example, "How many pizza delivery trucks are in Chicago?" requires significant commonsense about human behavior (How often do people order pizza? How many deliveries can a truck make per day?) to even begin to decompose the problem, let alone estimate basic quantities (Population of Chicago?). While new resources of commonsense knowledge are becoming available, e.g., (Bosselut et al., 2019; Sap et al., 2019), substantial development is still needed for the kind of world modeling that many FPs require.
Numeric Estimation. Large-scale language models trained on web-scale data have been shown to contain common numerical facts, e.g., number of days in a year, distance from earth to moon, number of hairs on a human head, etc. We leverage one such model (T5 (Raffel et al., 2020)) for our baselines. More recently, researchers have shown that models can also perform estimation to some degree (Zhang et al., 2020), and have proposed novel encoding strategies to improve number prediction and estimation (Spithourakis and Riedel, 2018; Berg-Kirkpatrick and Spokoyny, 2020). Such techniques would be valuable for improved solutions to FPs.

Datasets and Tasks
We present two datasets, REALFP and SYNTHFP, which are collections of real-world and synthetic Fermi problems, respectively. We then define three FP challenge tasks, with varying difficulty levels.

Dataset Elements
Each instance in our datasets consists of a Fermi question Q and its answer A, standardized using the International System of Units (SI). Further, we add two extra elements to each question Q: supporting facts and explanations.
Supporting Facts F: Each question Q is paired with F, a set of supporting facts, which are sentences describing quantities relevant to Q. This enables two aspects of our Fermi challenge: (a) defining certain tasks where the output must include F as part of an explanation, to encourage justifiable reasoning (see below); and (b) defining simpler FP tasks where F (or a noisy version of it) is provided as part of the input (as question "context") to help drive progress on the FP challenge under the familiar Reading Comprehension setting.

program → statement*
statement → comp-expr | support-expr
comp-expr → qn-id "->" {math-expr | value-expr}
math-expr → operator "(" qn-id* ")"
operator → "Add" | "Sub" | "Mul" | "Div"
value-expr → val-id "because" fact-id
support-expr → question-expr | fact-expr | val-expr
question-expr → qn-id ": " question
fact-expr† → fact-id ": " sentence
val-expr → val-id ": " number [units]

Figure 2: Grammar for FP explanation programs. † The FP tasks (proposed in section 4.2.3) separate out fact-expr from the program to either provide them as part of the input or expect them in the output.

Explanations P: In the case of FPs, the reasoning behind an answer is as important as the answer itself and therefore, each question is paired with an explanation in the form of an executable program describing the facts, values, and mathematical computations needed to arrive at an answer; see fig. 3 for an example. The explanation programs that can be expressed are captured by a simple, recursive grammar shown in fig. 2.
As seen from the grammar, an FP program is a sequence of statements, where each statement is either a computation expression or a support (or explanation) expression. A computation expression can be either a mathematical operator applied to one or more recursively spawned sub-questions (e.g., Q0 → Mul(Q1, Q2) in Figure 3), or a value expression pointing to the identifier of a numerical value along with the identifier of a fact supporting that value (e.g., Q1 → A1 because F1). A support expression defines a sub-question, supporting fact, or numerical value, and associates it with a unique identifier for reference in the rest of the program (e.g., Q1: What is . . ., F1: There are . . ., and A1: 7 in the example in Figure 3).
A program P that respects this grammar can be "executed" or evaluated to obtain a numerical answer, using only the computation and value expressions contained in P. The sub-question and fact expressions included in P act as provenance for the numerical computation captured by P. In the datasets, P evaluates to A, i.e., is an explanation of A. However, as we show later, if we train a model to predict A, and to also predict P, the evaluation of P (called PAns) is typically different from A. We can view these as two alternative ways to predict an answer, either directly or via explicit program synthesis. While the synthesis approach is more interpretable, it is not obvious which is better as far as answer prediction is concerned. We evaluate this shortly in Section 5.
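As an illustration of how such a program could be evaluated, here is a minimal executor sketch for the grammar in fig. 2. It is not the authors' executor: it assumes one statement per line, identifiers of the form Q<n>, F<n> and A<n> (as in the Q1, F1, A1 examples above), and a root question named Q0. The function name execute_program is hypothetical.

```python
import re
from functools import reduce

# Binary operators named in the grammar; n-ary calls are folded left to right.
OPS = {
    "Add": lambda a, b: a + b,
    "Sub": lambda a, b: a - b,
    "Mul": lambda a, b: a * b,
    "Div": lambda a, b: a / b,
}

COMP = re.compile(r"^(Q\d+)\s*->\s*(.+)$")          # comp-expr
MATH = re.compile(r"^(Add|Sub|Mul|Div)\((.*)\)$")   # math-expr
VALUE = re.compile(r"^(A\d+)\s+because\s+(F\d+)$")  # value-expr
SUPPORT = re.compile(r"^([QFA]\d+)\s*:\s*(.*)$")    # support-expr


def execute_program(program: str, root: str = "Q0") -> float:
    """Evaluate an FP program; raises an exception if the program is invalid."""
    comps = {}   # qn-id -> ("op", operator, [qn-ids]) or ("val", val-id)
    values = {}  # val-id -> number (any trailing units are ignored here)
    for raw in program.strip().splitlines():
        line = raw.strip()
        if not line:
            continue
        comp = COMP.match(line)
        if comp:
            qid, rhs = comp.group(1), comp.group(2).strip()
            math_expr, value_expr = MATH.match(rhs), VALUE.match(rhs)
            if math_expr:
                args = [a.strip() for a in math_expr.group(2).split(",") if a.strip()]
                comps[qid] = ("op", math_expr.group(1), args)
            elif value_expr:
                comps[qid] = ("val", value_expr.group(1))
            else:
                raise ValueError(f"bad computation expression: {line!r}")
            continue
        support = SUPPORT.match(line)
        if not support:
            raise ValueError(f"unparseable statement: {line!r}")
        ident, text = support.groups()
        if ident.startswith("A"):
            values[ident] = float(text.split()[0])
        # Question and fact expressions are provenance only; they are not evaluated.

    def evaluate(qid: str) -> float:
        entry = comps[qid]
        if entry[0] == "val":
            return values[entry[1]]
        _, name, args = entry
        return reduce(OPS[name], [evaluate(a) for a in args])

    return evaluate(root)


example = """
Q0 -> Mul(Q1, Q2)
Q1 -> A1 because F1
Q2 -> A2 because F2
Q0: How many days are there in 52 weeks?
Q1: How many days are there in a week?
F1: There are 7 days in a week.
A1: 7
Q2: How many weeks are being considered?
F2: The question asks about 52 weeks.
A2: 52
"""
print(execute_program(example))  # 364.0
```

Only the computation and value expressions affect the result, which mirrors the description above: questions and facts are carried along as provenance.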

REALFP: Real-World Fermi Problems
The REALFP dataset contains 928 FPs, collected from various internet pages, quizzes, and Fermi problem Olympiads. The questions cover a wide variety of topics requiring domain-specific reasoning (such as physics, basic mechanics of Poker, etc.), commonsense reasoning, and most importantly, estimating various physical quantities such as volume, speed, density, etc.
As discussed in Section 4.1, each instance in REALFP consists of four elements: a question Q, an answer A in SI units, supporting facts F, and an explanation P in the form of an executable program referring to facts in F; fig. 3 shows a sample question from REALFP. While Q and A were collected from various sources, F and P were added as part of this work using expert annotation. It should be noted that the supporting facts and numerical estimates provided in this dataset are a function of the annotator's life experiences and information available on the Internet. As a result, they are not always fully accurate. Due to this, as well as the inherent variance in the answers to FPs, our annotations are best viewed as informing us of one potential way of approaching the solution.
We split the REALFP dataset into train, validation and test splits containing 185, 185 and 558 questions respectively. Reserving a majority (∼60%) of FPs for testing is in line with our objective of using the dataset primarily as a test bench to evaluate and drive progress in AI reasoning. The baseline models we provide use the train set to fine-tune large-scale models and report performance on the test set.
Data Analysis. The questions Q in REALFP have a median length of 14 tokens. The entire dataset has 892 unique nouns with each question containing 3.7 nouns on average. Further, the facts and subquestions collected as part of the dataset contain, on average, an additional ∼4 nouns. This indicates that the decomposition for FPs is not trivial and requires recalling or finding information about objects often not mentioned in the original question. The executable program P provided in the dataset typically contains 2 subquestions; however, 176 questions in REALFP contain a deeper chain of reasoning requiring up to 10 subquestions.
Further, we analyse the questions in REALFP based on the core reasoning skill required to solve them. For example, the Fermi question in fig. 1, "How much would the sea level rise if all ice melted?", is an illustrative example requiring causal and spatial reasoning along with knowledge of science. We considered six reasoning types: spatial abstraction, causal abstraction, temporal abstraction, presence of unusual attributes or relationships, commonsense reasoning and science. The frequency of their occurrence in REALFP is summarized in fig. 4. Perhaps not surprisingly, commonsense reasoning and science knowledge are required to solve nearly half of the questions. Other reasoning types like abstraction or presence of unusual attributes appear in nearly 25% of the dataset with potential overlap, i.e., one question requiring multiple types of reasoning.

Table 1: Example templates used for creating the SYNTHFP dataset along with sample questions for each.
Program: Div($y.volume, $x.volume)
Templated question: How many $x fit in $y?
Example: How many Olympic pools fit in Lincoln Memorial Reflecting pool?

Program: Mul($y.density, $x.volume)
Templated question: If $x were to have the same density as $y, how much would it weigh?
Example: If tennis balls were to have the same density as bones, how much would it weigh?

Program: Div(Div($y.area, 2), $x.area)
Templated question: Assume $x's area is half its value. How many $y have the same area as $x?
Example: Assume Indianapolis's area is half its value. How many Dublin International Airport (DUB) have the same area as Indianapolis?

SYNTHFP: Synthetic Fermi Problems
The complexity of REALFP questions and the relatively small size of the dataset make it difficult to get started with Fermi-style questions. To address this, we introduce a larger dataset of 10k synthetic questions that span a limited set of entities and lines of reasoning, to serve as a sandbox for researchers to help tackle the real-world challenge set. After inspecting questions in the REALFP dataset, we manually selected a few recurring themes to create 12 templates, a few examples of which are shown in table 1. Each template consists of a Fermi-style question with objects represented as variables ($x, $y, etc.), and an associated mathematical formula referencing properties of these object variables (e.g., Div($y.volume, $x.volume)).
To illustrate the process of generating a synthetic question from such a template, consider the following FP: "How many basketballs fit in a schoolbus?". The broad template for this question is "How many $x fit in $y?". Multiple questions that adhere to this template can be generated by replacing $x and $y with objects as long as $x.volume and $y.volume are available. This question generation approach, in addition to ensuring solvability, also provides an easy way to generate an executable program respecting the grammar discussed earlier (e.g., with statements such as Div(Q1, Q2), Q1: "Volume of Y?", Q2: "Volume of X?", etc.). We provide the full list of 12 templates used to generate the SYNTHFP dataset in appendix A.
Further, we employ the recursive nature of FPs (see section 2) to generate more complex solutions for the templated questions in SYNTHFP. For instance, see the last template in table 1. First, the question requires halving the area of Indianapolis and further, in our database, its area is provided in terms of the area of Nauru island. Therefore, solving this question requires a further decomposition, i.e., "What is the ratio of the area of Indianapolis and that of Nauru?" In our dataset, we decompose the solution w.r.t. another object present in the database for roughly half of the 10k generated templated FPs.
At its core, the synthetic dataset uses a knowledge base K, collected via API calls to The Measure of Things resource. K contains ∼500 objects. For each object, it contains information (when applicable and available) about eight common attributes: length, area, volume, weight, density, speed, time, and information (data). Starting with K, we generate a dataset whose questions have an equal representation of all the templates. In total, SYNTHFP contains 10K FPs with 8K for training, and the remaining 2K FPs equally divided for validation and testing.
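The template-filling step described above can be sketched as follows. This is a minimal illustration rather than the generation code used for SYNTHFP: the three-object knowledge base, its rough volume values, the fact wording, and the function name generate_fit_questions are all assumptions made for this example.

```python
from itertools import permutations
from string import Template

# Hypothetical mini knowledge base: object -> {attribute: value in SI units}.
# The numbers are rough placeholders, not values from the dataset.
KB = {
    "basketballs": {"volume": 0.007},
    "school buses": {"volume": 70.0},
    "Olympic pools": {"volume": 2500.0},
}

QUESTION = Template("How many $x fit in $y?")  # program: Div($y.volume, $x.volume)


def generate_fit_questions(kb):
    """Instantiate the template for every ordered pair of objects with a known volume."""
    examples = []
    for x, y in permutations(kb, 2):
        if "volume" not in kb[x] or "volume" not in kb[y]:
            continue  # solvability: both attributes must be available
        vx, vy = kb[x]["volume"], kb[y]["volume"]
        program = "\n".join([
            "Q0 -> Div(Q1, Q2)",
            "Q1 -> A1 because F1",
            "Q2 -> A2 because F2",
            f"Q0: {QUESTION.substitute(x=x, y=y)}",
            f"Q1: What is the volume of {y}?",
            f"Q2: What is the volume of {x}?",
            f"F1: The volume of {y} is {vy} m**3.",
            f"F2: The volume of {x} is {vx} m**3.",
            f"A1: {vy} m**3",
            f"A2: {vx} m**3",
        ])
        examples.append({
            "question": QUESTION.substitute(x=x, y=y),
            "answer": vy / vx,
            "program": program,
        })
    return examples


for ex in generate_fit_questions(KB)[:2]:
    print(ex["question"], "->", f"{ex['answer']:.3g}")
```

Because each question is built from attributes that are known to exist, every generated instance comes with an answer and a program by construction.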

Challenge Tasks
We introduce three tasks that each build up to the full complexity of FPs, allowing researchers to make progress in a measurable and principled manner. For each task, we consider two ways of solving it: (a) generating an answer A directly, where any reasoning is implicit in the parameters of the model, or (b) generating an explanation program P, which can then be executed to produce an answer PAns. Note that A and PAns are distinct, reflecting different ways of answering, one directly from the model and one via program synthesis and execution, analogous to "Thinking, fast and slow" (Kahneman, 2011).
Task 1, perfect-context: Q, F → A | P. To help make progress, we define an easier FP task where all and only the relevant facts F are also supplied as the input, along with Q. An example of this I/O is shown in Figure 3. We define two alternative outputs, namely predicting A directly or predicting a program P (which is then evaluated to produce an answer PAns).
Task 2, distractor-context: Q, {F ∪ F_d} → A | P. This setting extends Task 1 by adding F_d, a set of distractor facts, to the input, bringing the total number of facts to 20. This requires the model to also identify which facts are actually useful for the solution. F ∪ F_d here is akin to the "context" in the typical Reading Comprehension setting studied in the QA literature. It should be noted that the distractor facts F_d are chosen from facts corresponding to similar questions in the dataset; similarity is defined using the question embedding as given by a sentence transformer (Reimers and Gurevych, 2019).
Task 3, full: Q → A | P. When the input is only the question, we are in the original Fermi problem setting. Again, we define two subtasks: (a) generate an answer A directly, or (b) synthesize a program P which is then used to compute its implied answer PAns. Note that when the explanation program P needs to be output, the model is not presented with any facts F, unlike the previous tasks. Therefore, the model has the freedom to draw on information from any other source, e.g., a knowledge base or information already stored in its parameters. Given the unconstrained nature of this task, there are many possible programs and facts (the gold program and facts in the dataset represent just one possible solution), making fully automatic evaluation out of reach. Instead, we indirectly evaluate P by (a) requiring it to be executable and (b) scoring its derived answer PAns with respect to the gold answer. Even these are a high bar due to the challenging nature of FPs. Human-in-the-loop evaluation tools, such as GENIE (Khashabi et al., 2021), could also be used to directly assess P and F when performance on other metrics reaches a non-trivial level.
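One way the distractor-context input could be assembled is sketched below. It assumes question embeddings have already been computed (for instance with a sentence transformer) and are passed in as plain lists of floats; the function name distractor_context, the greedy fill order, and the final shuffle are assumptions of this sketch, not details confirmed by the text.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def distractor_context(idx, facts, embeddings, total=20, seed=0):
    """Return the gold facts of question `idx` plus distractors, `total` facts at most.

    facts[i] is the list of gold facts for question i, and embeddings[i] is a
    precomputed embedding of question i.
    """
    gold = list(facts[idx])
    # Rank all other questions from most to least similar to the target question.
    ranked = sorted(
        (j for j in range(len(facts)) if j != idx),
        key=lambda j: cosine(embeddings[idx], embeddings[j]),
        reverse=True,
    )
    distractors = []
    for j in ranked:
        for fact in facts[j]:
            if len(gold) + len(distractors) >= total:
                break
            if fact not in gold and fact not in distractors:
                distractors.append(fact)
    context = gold + distractors
    random.Random(seed).shuffle(context)  # assumed: hide the position of gold facts
    return context


toy_facts = [["a1", "a2"], ["b1", "b2"], ["c1", "c2"]]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(sorted(distractor_context(0, toy_facts, toy_embeddings, total=3)))
```

In the toy example, the second question is closest to the first, so its facts are drawn on before those of the third.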

Metrics
Answer Evaluation: FPs do not have precise answers, because of the underlying ambiguity in terminology and context of both the original FP and sub-questions which may need to be answered. Therefore, in Fermi Science Olympiads, participants are awarded full points for obtaining an answer in the same order of magnitude as a reference gold answer, and lose 1/3 of the points for each order of magnitude they are off by. In line with this evaluation scheme, we use the following continuous scoring metric:

$$\mathrm{score}(A', A) = \max\left\{0,\; 1 - \frac{1}{3}\left|\log_{10}\frac{A'}{A}\right|\right\}$$

where A' and A are the predicted and the reference gold answers in SI units, respectively. During evaluation, we convert the output A' of all models to SI units before comparing with A and therefore, the model is free to output units that are most natural for the question. The score thus ranges from 1 when producing precisely the gold reference answer, to 0 when the prediction is off by three or more orders of magnitude.
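The scoring rule described above (1 for an exact match, one third less per order of magnitude, 0 at three or more orders of magnitude) can be sketched in a few lines. The function name fp_score and the choice to return 0 for non-positive inputs, where the logarithm is undefined, are assumptions of this sketch.

```python
import math


def fp_score(predicted: float, gold: float) -> float:
    """Continuous order-of-magnitude score in [0, 1] for answers in SI units."""
    if predicted <= 0 or gold <= 0:
        return 0.0  # assumed handling: log10 is undefined for non-positive values
    return max(0.0, 1.0 - abs(math.log10(predicted / gold)) / 3.0)


for pred in (100.0, 1_000.0, 10_000.0, 100_000.0):
    print(pred, round(fp_score(pred, gold=100.0), 3))
```

The score is symmetric in log space, so over-estimating and under-estimating by the same factor are penalized equally.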
Program Evaluation: When we operate in the synthesis setting, i.e., care about outputting an explanation program P that executes to a numerical estimate PAns, we evaluate explanations (programs) along three axes:
Validity: Is the program syntactically valid? This is assessed by seeing whether it successfully evaluates to a number. For this, we use a program executor, written in Python, that evaluates FP programs as described earlier and returns a numerical result, or throws an error. If execution is successful (independent of the result), validity=1, else 0.
Evaluated Answer accuracy: If the program successfully evaluates, how accurate is the resulting evaluated answer? We use the same answer evaluation metric described earlier. Note that the evaluated program's answer may (and likely will) differ from the model's direct answer A. However, if the program is not valid, the model gets a credit of 0. Further, it is important to note that this metric assigns credit to outputs that do not necessarily correspond to the explanation program present in the collected dataset. As FPs can be solved via multiple possible explanation programs, if a model arrives at the correct answer by such an alternative approach, the evaluated answer accuracy still provides a noisy estimate of the effectiveness of the output program.
Fact Identification: For tasks that include the gold facts F as input (possibly with distractor facts), did the program P include all and only the gold facts F? We compute an F1 measure by comparing the gold fact IDs with the fact IDs used.
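The fact-identification measure can be illustrated with a standard set-based F1 over fact IDs. This is a sketch under assumptions: the function name fact_f1 is hypothetical, and the edge-case conventions (1.0 when both sets are empty, 0.0 when there is no overlap) are choices made for this example.

```python
def fact_f1(gold_ids, used_ids) -> float:
    """F1 between the gold fact IDs and the fact IDs referenced by a program."""
    gold, used = set(gold_ids), set(used_ids)
    if not gold and not used:
        return 1.0  # assumed convention: nothing needed, nothing used
    overlap = len(gold & used)
    if overlap == 0:
        return 0.0
    precision = overlap / len(used)  # fraction of used facts that are gold
    recall = overlap / len(gold)     # fraction of gold facts that were used
    return 2 * precision * recall / (precision + recall)


print(fact_f1(["F1", "F2"], ["F1", "F2"]))  # 1.0
print(fact_f1(["F1", "F2"], ["F1", "F3"]))  # 0.5
```

A program that uses all and only the gold facts scores 1; missing a gold fact lowers recall, and citing a distractor lowers precision.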

Experiments
We describe some baseline approaches to solve the FP Challenge tasks, and report their performance on the test sets of the two datasets.

Baselines
These FP Challenge tasks require predicting both the final answer A and the reasoning involved P. We fine-tune a pre-trained T5 model (Raffel et al., 2020), taken from the Huggingface library, as a seq2seq model that, for each FP challenge task, takes in the corresponding inputs (Q and possibly F) to produce the corresponding outputs (A, P, and possibly F). For each of these tasks, we evaluate the performance on REALFP after (1) fine-tuning on the SYNTHFP train set, (2) fine-tuning on the REALFP train set, and (3) fine-tuning on the SYNTHFP train set in addition to the REALFP train set.

Results.
Based on the predicted answer (A), we find that the T5 model fine-tuned only on REALFP performs slightly better than other variants. However, the best score is achieved when the explanation program is evaluated, highlighting the utility of outputting the chain of reasoning as opposed to directly predicting an estimate. Further, when predicting the program, we observe that fine-tuning on SYNTHFP is useful as it improves other metrics associated with outputting an accurate program (i.e., validity and fact F1 measure). Not surprisingly, the full setting of the FP challenge is significantly challenging, and models perform very poorly on it compared to other settings where relevant facts are provided to the model.
A few interesting failure modes of the T5 model (FT both), when trained on the distractor-context task, are discussed in table 3. These examples illustrate the importance of evaluating both the explanation (P) and the final estimate (A).

Discussion
Regression baselines on full task.
We try out some trivial baselines, i.e., constant prediction and regression, to model the full setting of the FP challenge. We find interesting trends where even such trivial baselines outperform existing large-scale language models like T5 on the FP challenge.
Constant Prediction. This is a trivial baseline that predicts a constant value irrespective of the question. By performing a logarithmic sweep between $10^{-10}$ and $10^{10}$, we find that the constant prediction of 1000 (for every FP) achieves an average score of 0.22, indicating that this prediction is, on average, two to three orders of magnitude off.
Regression. This baseline uses a 3-layer MLP, which regresses to a number, given an encoding of the question (obtained using a pre-trained BERT model (Devlin et al., 2019)). We train this model in three settings: (1) on SYNTHFP, (2) on REALFP, and (3) on both training sets. From table 4, we can see that this model performs best when trained on both datasets (achieving a score of 0.32). However, this is only slightly better than predicting a constant and on average is still off by roughly two orders of magnitude from the correct estimate.

Table 3: Some interesting failure modes of the T5 model (FT both) trained on the distractor-context task (format changed for presentation purposes). The first example highlights gaps in the model's reasoning ability: it is not able to identify the relevant facts from the set of presented facts. The second example, on the other hand, identifies the relevant facts but performs the wrong operation (division instead of multiplication) and fails to generate one of the sub-questions and its estimate.
Q: Imagine the earth is at one end of the school oval and the moon is at the other end. How far away is the sun?
Predicted P: Mul(Distance of earth-sun? 2e+11 km, Distance of earth-moon? 2e+11 km)
Target P: Mul(Length of school oval? 0.1 km, Div(Distance from earth-sun? 151e+6 km, Distance from earth-moon? 384400 km))
Scores: Valid?: Yes, Ans: 0, Facts: 0.67

Q: How many punctuation marks are in a book?
Predicted P: Div(Number of sentences in a book? 5000, Avg. number of punctuation in a book? 0.005)
Target P: Mul(Number of sentences in a book? 5000, Avg. number of punctuation in a sentence? 3)
Scores: Valid?: Yes, Ans: 0.392, Facts: 1
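The constant-prediction sweep described in this section can be sketched as follows. The order-of-magnitude score is re-implemented here so the block is self-contained; the one-candidate-per-power-of-ten grid, the function names, and the toy gold answers are assumptions of this sketch rather than details given in the text.

```python
import math


def fp_score(predicted: float, gold: float) -> float:
    """Order-of-magnitude score: 1 for an exact match, 0 at three or more orders off."""
    if predicted <= 0 or gold <= 0:
        return 0.0
    return max(0.0, 1.0 - abs(math.log10(predicted / gold)) / 3.0)


def best_constant(gold_answers, low_exp=-10, high_exp=10):
    """Sweep constants 10**low_exp .. 10**high_exp; return (best constant, mean score)."""
    best = None
    for exp in range(low_exp, high_exp + 1):
        constant = 10.0 ** exp
        mean = sum(fp_score(constant, g) for g in gold_answers) / len(gold_answers)
        if best is None or mean > best[1]:
            best = (constant, mean)
    return best


# Toy gold answers for illustration only (not taken from REALFP).
toy_gold = [1e2, 1e3, 1e4]
constant, mean_score = best_constant(toy_gold)
print(constant, round(mean_score, 3))
```

On a real test set, the gold answers would be the SI-unit reference answers, and the winning constant simply reflects where the answers cluster in log space.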
Our REALFP dataset includes only one explanation program for a given FP whereas in practice, there can be multiple creative decompositions that lead to the correct answer. To encourage models that are capable of capturing this diversity in the output space, it would be interesting to (a) collect alternative solutions, similar to, say, image captioning datasets where it is the norm to train and evaluate against multiple ground truth candidates, and (b) increase the number of templates in the SYNTHFP dataset, thereby biasing the model towards exploring multiple solutions by pre-training on a richer synthetic dataset. Further, the work does not include other variants of FPs, e.g., binary yes/no questions, comparisons, or FPs involving probability and risk quantification. Finally, note that our real-world dataset, by virtue of how it is collected, has a high US-centric bias, both in terms of cultural context and vocabulary.

Modeling Improvements.
In terms of modeling, we establish baselines by fine-tuning existing large-scale language models. However, it might be interesting to incorporate them as part of a bigger framework that is developed specially to solve FPs, for instance, a neuro-symbolic system that intelligently searches the space of FP decompositions by interleaving the question decomposition and estimation (predicting the numerical answer for sub-questions) phases. Further, both the estimation phase and the decomposition phase can be improved by giving the model the ability to access a knowledge base that contains various numerical, commonsense or science facts.

Conclusion
In this work, we propose Fermi Problems (FPs) as a reasoning challenge for AI systems. Apart from introducing abstraction as a crucial reasoning skill, our work requires the combined application of various reasoning skills including creative decomposition of problems, commonsense reasoning, mathematical reasoning, etc. We collect two datasets: REALFP with ∼1k real-world questions and SYNTHFP with 10k templated questions. Based on these datasets, we propose three concrete tasks of increasing difficulty that encompass the FP challenge. The baseline models we provide, despite being based on state-of-the-art language models and even with substantial fine-tuning, struggle on our challenge tasks. They are, on average, off by two orders of magnitude from the correct estimate and perform only slightly better than predicting a constant number. We thus hope to establish Fermi problems as a hard reasoning challenge that motivates further advances in AI reasoning systems.

Figure 1: Humans solve FPs by employing sophisticated reasoning skills including abstraction (ice on land ≈ ice on Antarctica), problem decomposition (Volume of ice = Area of ice × thickness of ice) and commonsense reasoning (only ice on land causes rise in sea levels).
Figure 3: Example I/O for Task 1. The input is a Fermi question Q and relevant facts F. The output is the answer A and an explanation P in the form of a program.

Figure 4: Distribution of questions in REALFP based on the type of reasoning required to arrive at the correct explanation program for a Fermi question.

Table 2: Results on FPs with explanations (programs), for T5 fine-tuned on the synthetic FPs (train), the real FPs (train), or both. Ans A is the model's direct answer. Explanation (program) P is evaluated on whether it executes (Valid?), and if so, whether that execution produces a correct answer (PAns) and whether it uses the needed (gold) facts F included in the input for Tasks 1 and 2 (measured as F1 score).

Table 4: Performance of the MLP-based regression models for the full task of our FP challenge with results reported on the REALFP test set.