TheoremQA: A Theorem-driven Question Answering dataset

The recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in solving fundamental math problems like GSM8K by achieving over 90% accuracy. However, their capabilities to solve more challenging math problems which require domain-specific knowledge (i.e., theorems) have yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' capabilities to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, etc.) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that GPT-4's capabilities to solve these problems are unparalleled, achieving an accuracy of 51% with Program-of-Thoughts prompting. All the existing open-source models score below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can serve as a better benchmark to evaluate LLMs' capabilities to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.


Introduction
A long-standing goal of AI systems is to help human beings solve challenging problems, especially more domain-specific problems. To benchmark the progress towards this goal, researchers propose to evaluate AI systems' performance on different math word problem (MWP) datasets. In recent years, there has been a plethora of MWP datasets (Lu et al., 2023c), which we list in Table 1. Most of these datasets target fundamental questions aimed at Grade 1-12 students on a narrow subject. Moreover, these datasets do not involve much domain-specific knowledge, i.e., theorems. Due to these two deficiencies, we believe these datasets are not ideal for benchmarking the existing powerful LLMs (Brown et al., 2020; Tamkin et al., 2022; Chen et al., 2021b; Chowdhery et al., 2022; Hoffmann et al., 2022; Taylor et al., 2022) due to their simplicity. In fact, on the popular GSM8K dataset (Cobbe et al., 2021), GPT-4 (OpenAI, 2023) and PaLM-2 (Google, 2023) have both already achieved 92% accuracy. Similarly, we tested GPT-4 (OpenAI, 2023) on subsets of several other datasets listed in Table 1 and observed 90+% accuracy in most cases. The only exception is MATH (Hendrycks et al., 2021), which contains high-school math competition problems with SoTA performance around 50% (Zheng et al., 2023). However, MATH is focused on math skills rather than theorems.
In this paper, we propose the first theorem-driven QA dataset built on university-level theorems across Math, Physics, EE&CS, and Finance. The collection process takes two steps: (1) we first enumerate roughly 400 theorems in different subfields like algebra, number theory, graph theory, information theory, etc.; (2) we ask domain experts to search for questions regarding these theorems from different sources like the Internet and textbooks. The domain experts adjust these questions to ensure the answers follow the desired format for ease of automatic evaluation. Through this careful construction process, we collected 800 high-quality question-theorem-answer triples as our final release.
In the course of our experiments, several notable observations were made. First, GPT-4 (OpenAI, 2023) significantly outperformed all existing models, reaching an accuracy of 51% when combined with Program-of-Thoughts prompting. Trailing behind GPT-4, the second most effective model was ChatGPT, achieving an accuracy of 35% with the same prompting method. Additionally, our human evaluation determined that half of GPT-4's errors are caused by minor mistakes like calculation or rounding errors, suggesting there is still significant headroom for GPT-4 with more deliberate prompting strategies or human intervention to rectify them. Secondly, we found that all open-source, instruction-tuned language and code models scored below 15% accuracy, barely exceeding the random-guess baseline of 10%. Our human evaluation reveals that open-source models like Alpaca make errors mainly due to ignorance of the theorem, with 90% of the errors not rectifiable. This stark gap between GPT models and open-source models suggests that further enhancement strategies, such as science-focused pre-training or fine-tuning, are needed to narrow the performance disparity. Thirdly, we explored the potential of theorem-augmented generation. However, the simple strategy of concatenation did not yield a significant improvement; we conjecture that a more sophisticated integration strategy is needed to achieve further gains. Lastly, we examined the performance of various multimodal instruction-tuned models on the multimodal subset of the TheoremQA dataset. Surprisingly, these models did not demonstrate significant performance gains over their text-only counterparts. This is mainly due to the unnatural form of the images, which consist largely of diagrams and text; such images are not well captured by existing visual encoder models.
To sum up, our contributions are threefold: • We propose the first theorem-driven question-answering dataset to understand LLMs' capabilities to apply science theorems.
• We comprehensively evaluate a wide spectrum of 16 LLMs on TheoremQA.
• We perform different analyses in the theorem integration and multimodal understanding aspects to provide detailed insights.
Related Work

Math Word Problems
Mathematical reasoning skills are crucial for general-purpose intelligent systems, garnering significant interest from the research community. In the past, studies have explored the ability of NLP models to solve arithmetic and algebraic problems (Hosseini et al., 2014; Koncel-Kedziorski et al., 2015; Roy and Roth, 2015; Ling et al., 2017).
More recently, researchers have introduced increasingly challenging datasets (Saxton et al., 2019; Miao et al., 2020; Amini et al., 2019; Hendrycks et al., 2021; Lu et al., 2021b; Patel et al., 2021b) aimed at enhancing difficulty, diversity, and adversarial robustness. LiLA (Mishra et al., 2022) proposes to assemble a vast collection of mathematical datasets into a single, unified dataset. LiLA also annotates Python programs as target outputs for solving mathematical problems. However, the existing datasets are mostly focused on simple grade-school mathematics. To further investigate LLMs' capabilities to assist humans in solving challenging math problems, we propose TheoremQA as the first benchmark to enable research in this direction.

Large Language Models
In recent years, there has been a surge of research and development in the area of large language models (LLMs) that has significantly advanced the field of natural language processing. GPT-3 (Brown et al., 2020) demonstrated a strong capability to perform few-shot predictions, where the model is given a description of the task in natural language with a few examples. By using reinforcement learning from human feedback, InstructGPT (Ouyang et al., 2022) has shown unprecedented capabilities to follow human instructions; such abilities only emerge as the model scales up (Wei et al., 2022a). More recently, GPT-4 (OpenAI, 2023) shows tremendous progress on many complex reasoning tasks spanning mathematics, coding, vision, medicine, law, psychology, and more. Bubeck et al. (2023) show that GPT-4 already demonstrates more general intelligence than previous AI models. To further validate GPT-4's capability to solve challenging reasoning tasks, we propose TheoremQA as a new benchmark to probe LLMs' upper limit.

Reasoning with Large Language Models
To better unleash large language models' capabilities to solve complex reasoning tasks, Chain-of-Thought prompting (Wei et al., 2022b; Kojima et al., 2022; Wang et al., 2022) was proposed, which prompts the large language model to generate the 'thought process' before outputting the answer. Later on, several other works (Drozdov et al., 2022; Zhou et al., 2022; Nye et al., 2021) proposed different approaches to utilize LLMs to solve reasoning tasks with intermediate steps. Our method can be seen as an extension of CoT that adds an extra step of symbolic execution. Another line of work (Gao et al., 2022; Chen et al., 2022b) adopts Python programs as the demonstration of the 'thought process' to solve different reasoning tasks.

Dataset
Our dataset collection pipeline contains two steps:
Theorem Enumeration Our aim was to encompass a wide range of theorems. To this end, we began by prompting Large Language Models (LLMs), specifically GPT-4 (OpenAI, 2023), to enumerate theorems in the targeted subfields. Ultimately, we collected approximately 400 theorems, encapsulating a diverse range of topics within these fields. We then delegated these theorems to nine domain experts, instructing them to locate question/answer pairs from varied sources. During the annotation process, a small number of theorems were discarded due to their evaluation complexity.
Question Annotation Our problems were sourced from websites and books, or devised by the experts themselves. One challenge we encountered was that questions found online might have been included in the training data. To mitigate this 'data contamination' issue, we encouraged domain experts to modify these questions. Another challenge arose from questions with answers in symbolic form, matrix form, figure form, etc., which presented significant obstacles for automatic evaluation. To overcome this, we instructed domain experts to alter the questions so the answers would be limited to the following forms: (1) integer, (2) float, (3) list of integers/floats, (4) boolean, and (5) multiple-choice options. For instance, if the original question concerned a matrix, we would revise it to ask about the trace of the answer matrix. This modification significantly streamlined the evaluation process. An example can be found in Figure 2. The majority of the questions in TheoremQA have float or integer answers, which is more realistic than existing multiple-choice datasets like ScienceQA (Lu et al., 2022) or AQuA (Ling et al., 2017). Therefore, the models are unlikely to take shortcuts to achieve high accuracy.

Human-Level Performance To provide a rough but informative estimate of human-level performance, we randomly selected 20 questions and assigned them to four Math&CS undergraduate students (average GPA) who had taken the required courses covering these questions. The participants were given 24 hours with internet access to solve the questions. The four undergraduate students achieved scores of 12/20, 15/20, 18/20, and 19/20 on these randomly sampled questions. From this experiment, we are more confident that expert-level performance should approach 100%.
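The answer-format restriction above is what makes automatic grading tractable. As a minimal sketch of how a checker over the five allowed answer forms could look (our illustrative assumption, not the released evaluation code), consider:

```python
import math

def is_correct(prediction, ground_truth, rel_tol=1e-2):
    """Compare a prediction against a TheoremQA-style answer.

    Covers the five allowed forms: integer, float, list of
    integers/floats, boolean, and multiple-choice option string.
    Hypothetical sketch -- the released grader may differ.
    """
    # Boolean and multiple-choice answers: normalized exact match
    if isinstance(ground_truth, (bool, str)):
        return str(prediction).strip().lower() == str(ground_truth).strip().lower()
    # Lists: element-wise numeric comparison of equal-length lists
    if isinstance(ground_truth, list):
        if not isinstance(prediction, list) or len(prediction) != len(ground_truth):
            return False
        return all(is_correct(p, g, rel_tol) for p, g in zip(prediction, ground_truth))
    # Integers and floats: small relative tolerance absorbs rounding errors
    try:
        return math.isclose(float(prediction), float(ground_truth), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return False
```

The relative tolerance is a design choice: it accepts answers that differ only in the final rounding step while still rejecting genuinely wrong values.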

Method
Our method for addressing these demanding questions in TheoremQA comprises several distinct modules, as outlined in Figure 1:
• Chain-of-Thought Prompting (Wei et al., 2022b): This strategy prompts the language model to initially generate a step-by-step thought process, eventually leading to the final answer.
• Program-of-Thought Prompting (Chen et al., 2022b;Gao et al., 2022): This strategy prompts the language model to progressively generate a program.The final answer is then derived by executing this program.
By delegating computational tasks to an external executor, the problem-solving process becomes considerably more reliable. This improvement has led to remarkable advancements on existing math datasets, as reported in (Chen et al., 2022b).
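The delegation step can be sketched as follows; the `ans` variable convention and the tiny harness are our assumptions for illustration, not the exact TheoremQA implementation:

```python
# Minimal Program-of-Thoughts sketch: the LLM emits a Python program
# whose result is assigned to `ans`; we execute it in a fresh namespace
# and read the answer back. Hypothetical harness for illustration.
def run_program_of_thought(program: str):
    namespace = {}
    try:
        exec(program, namespace)   # delegate all arithmetic to the executor
    except Exception:
        return None                # program was not executable
    return namespace.get("ans")

# A program a model might generate for "compute 12 choose 3":
generated = """
from math import comb
ans = comb(12, 3)
"""
print(run_program_of_thought(generated))  # -> 220
```

Because the arithmetic runs in Python rather than in the model's sampled text, calculation slips of the kind CoT suffers from are eliminated for executable programs.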

Answer Extraction
We observed that parsing the output from Large Language Models (LLMs) can be challenging due to two main issues: (1) the answer is often embedded within a sentence, making it difficult to extract using regular expressions, and (2) the answer may not be normalized, such as 'pi / 3' or '2*10 -e', which complicates comparison with the ground truth. To tackle these problems, we initially employ ChatGPT to identify the answer span within the model's output, then forward this string to WolframAlpha (Inc.) for normalization into a float, integer, or list.
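The pipeline described here sends the extracted span to WolframAlpha. As an illustrative local stand-in (our assumption, not the authors' code), a symbolic span such as 'pi / 3' can be normalized by evaluating it against Python's math namespace:

```python
import math

def normalize(answer_span: str):
    """Normalize an extracted span like 'pi / 3' into a float.

    Local stand-in for the WolframAlpha call described in the paper;
    only names from the math module are available to the expression.
    """
    allowed = {name: getattr(math, name) for name in dir(math) if not name.startswith("_")}
    try:
        # Empty __builtins__ keeps the evaluation restricted to math names
        return float(eval(answer_span, {"__builtins__": {}}, allowed))
    except Exception:
        return None  # leave unparseable spans to manual review

print(normalize("pi / 3"))  # -> approx 1.0472
```

A real deployment would need a hardened parser rather than `eval`, but the sketch shows why normalization matters: without it, 'pi / 3' and '1.047' would never compare equal.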
Theorem Augmentation We explored the potential of enhancing large language models with retrieved theorem descriptions to assess their effect on performance. One approach is to retrieve descriptions of the given theorems from the Internet to supplement the LLMs' prompt. Another experiment involved prompting GPT-4 to generate text descriptions of the theorem, which are then used as an additional augmentation signal.
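The concatenation strategy can be sketched as a simple prompt builder; the field layout and example content are illustrative assumptions, not the exact template used in the experiments:

```python
# Sketch of the concatenation strategy: the theorem description is
# simply prepended to the question in the prompt. Field names and the
# example theorem text are hypothetical, not the released data schema.
def build_augmented_prompt(question: str, theorem_name: str, theorem_desc: str) -> str:
    return (
        f"Theorem: {theorem_name}\n"
        f"Description: {theorem_desc}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_augmented_prompt(
    "What is the maximum number of edges in a simple graph with 10 vertices?",
    "Complete graph edge bound",
    "A simple graph on n vertices has at most n*(n-1)/2 edges.",
)
```

As the results below suggest, this flat concatenation leaves it entirely to the model to connect the symbolic statement with the question, which may explain the limited gains.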
Multimodal Input A small portion of our data (50 instances) includes images, such as diagrams, as supplemental input, particularly in geometry questions. Since current LLMs do not support such multimodal inputs, we adopt the captioning solution used in Chameleon (Lu et al., 2023a): captions describing the image are appended to the LLMs' prompt as an additional signal.

Model Descriptions
In our experiments, we mainly investigate the following models:
• GPT3/3.5/ChatGPT/GPT4: These are instruction-tuned models from OpenAI.
• Alpaca-13B: This model is based on LLaMA (Touvron et al., 2023). Alpaca is instruction-tuned on the 52K instruction data generated from GPT-4.
• Vicuna-13B: This model is based on LLaMA (Touvron et al., 2023). Vicuna is instruction-tuned on the 100K ShareGPT data generated by different GPT-based models.
• MOSS-instruct-16B: This model is based on CodeGen (Nijkamp et al., 2022), further instruction-tuned on an instruction-following dataset distilled from GPT.

Main Results
We present our main results in Table 2 and summarize the different findings below. Closed-source Models Since GPT-3 (text-davinci-002) and GPT-3.5 are not chat-based models, we provide one demonstration example to help them generate outputs in the desired format. With CoT prompting, GPT-3 (text-davinci-002) and GPT-3.5 achieve only 16.6% and 22.8% accuracy, respectively. By adopting the program as the intermediate reasoning form, both models gain reasonable improvements.
For Claude-v1, we found that it matches the performance of GPT-3.5. ChatGPT outperforms GPT-3.5 and Claude-v1 significantly, by 8%, which indicates ChatGPT's capability to perform complex numerical reasoning. GPT-4 is the strongest model evaluated, beating all other models by a huge margin. With Chain-of-Thoughts prompting, GPT-4 outperforms ChatGPT by 13%; with Program-of-Thoughts prompting, by 16%. Though some other models have been shown to match GPT-4 on simple tasks, GPT-4's capability to solve challenging tasks seems unparalleled.

Open-source Models
For the open-source models, we found that their performance lags far behind. To contextualize their accuracy, we also provide the random-guess baseline of 10%. We test both prompting strategies; however, their results consistently lie in the range of 10-14%. The results indicate that these open-source LMs still struggle with the more complex mathematical reasoning tasks in TheoremQA. Given that ChatGPT, of a similar size, achieves much higher performance, we believe parameter size is not the only cause. A significant amount of effort during pre-training or supervised fine-tuning is still needed to instill enough scientific knowledge into the models' parameters to close the gap.
Program of Thoughts Analysis From Table 2, we observe that PoT brings consistent improvement over CoT on GPT-* models, normally yielding a gain of 5-8% accuracy. In contrast, Claude-v1 and StarChat obtain almost the same accuracy under both strategies. To better analyze where the gains come from, we plot Figure 5 to show how many of the generated Python programs are actually executable. As can be seen, both StarChat and CodeT5+ have trouble generating runnable programs, with only 40% of programs being executable. Claude-v1 increases the validity of the generated programs to 60%. In contrast, GPT-3.5 and ChatGPT further increase the ratio to around 80%, and GPT-4 is extremely accurate in generating programs, with 92% of its generated programs runnable. Such a high executable ratio explains why the gain brought to GPT-* models is much higher than for Claude-v1 and StarChat.
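The executable ratio in Figure 5 can be measured by simply attempting to run each generated program and counting the ones that raise no exception; a minimal sketch, assuming programs are self-contained Python strings:

```python
# Sketch of the 'executable ratio' measurement: run each generated
# program in a fresh namespace and count those that raise no exception.
def executable_ratio(programs):
    ok = 0
    for program in programs:
        try:
            exec(program, {})  # isolated namespace per program
            ok += 1
        except Exception:
            pass               # syntax errors, NameErrors, etc. count as failures
    return ok / len(programs)

samples = [
    "ans = 1 + 1",
    "ans = undefined_name + 1",          # NameError -> not executable
    "import math\nans = math.sqrt(2)",
]
print(executable_ratio(samples))  # -> 0.666...
```

Note this measures only whether a program runs, not whether its answer is correct, which is exactly why the executable ratio can exceed the accuracy numbers in Table 2.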

Additional Result
Theorem Augmentation We also investigate whether feeding the theorem as an additional text condition helps the model better solve the problem. Specifically, we ask GPT-4 to generate a paragraph describing each theorem, which we post-process to ensure correctness. We feed the theorem in the prompt to different language models and report the performance change in Table 3. For all the evaluated scenarios, we found that the improvement is limited to within 1%. Unlike text or KB knowledge, theorem knowledge is more abstract and symbolic, and simply concatenating the theorem definition is not enough. We believe a more sophisticated augmentation scheme is needed to truly help the model understand and apply theorems to solve problems.
Multimodal Questions Our aim was to assess how effectively the current method could tackle multimodal questions (those with image inputs) in the TheoremQA dataset. An example is illustrated in Figure 6, where an image is converted into 'captions' by BLIP (Li et al., 2022). We plot the results on the 50-question multimodal subset in Figure 7. Notably, this subset posed substantial challenges: none of the models achieved even 10% accuracy. This is primarily due to information loss during the captioning process.
In light of this, we conducted further evaluations on two multimodal instruction-tuned models, LLaVA-13B (Liu et al., 2023) and VisualGLM-6B (Zeng et al., 2022) (https://github.com/THUDM/VisualGLM-6B). These models utilize a visual encoder (either CLIP (Radford et al., 2021) or BLIP (Li et al., 2022)) to encode image input, which is then integrated with language models for multimodal conversation. However, these models demonstrated performance similar to their text-only equivalent, Alpaca, with the addition of a visual encoder not significantly enhancing the results. We hypothesize that the current visual encoding modules may not be suited for representing these diagrammatic images, resulting in these less-than-ideal outcomes. We believe these multimodal questions remain a challenge for the research community, and we eagerly anticipate further advancements in addressing multimodal scientific questions.
Error Analysis We conduct a detailed error analysis on 200 erroneous cases from different models to analyze their error distribution. Specifically, we pick GPT-4, ChatGPT, and Alpaca to understand their error sources. We define the following error types: (E1) the model does not know the theorem at all; (E2) the model knows the theorem but uses the wrong formula or algorithm; (E3) the model knows the theorem and the formula, and the error is only caused by minor calculation mistakes. The severity of the error decreases as the error number increases. We plot our findings in Figure 8, where the bars indicate the percentage of each error type. We observe that almost half of the errors made by GPT-4 are non-critical, caused by minor calculation mistakes. This error analysis suggests that there is still significant headroom for GPT-4 to improve with more deliberate prompting strategies or human intervention to mitigate these minor errors. In contrast, Alpaca's errors are mainly caused by not knowing the theorem at all.

Case Study We list a few successful and failed examples generated by GPT-4 in Figure 9 to compare chain-of-thoughts prompting and program-of-thoughts prompting side by side. In the first example, the question concerns the 'orthogonal projection theorem'. As can be seen, chain-of-thoughts prompting requires a very long paragraph to generate the results. We prompted GPT-4 a few times with the same input, and the results seem unstable: sometimes the model makes tiny computation mistakes in the middle and derives the wrong answer. In contrast, the program solution is brief and concise, which leads to rather stable outputs. For the second example, the computation requires a 'for loop' to iteratively compute delta values for a Riemann sum.

We found that such problems are also more natural for programs to solve.Through these examples, we can see GPT-4's unprecedented capabilities to solve these difficult math problems even without any demonstration or hints.
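To illustrate why loop-based computations favor PoT, here is the kind of short program a model could emit for the Riemann-sum question; the exact integrand and interval are hypothetical, not taken from the dataset:

```python
# Left-endpoint Riemann sum: the iterative delta computation mentioned
# above is a few lines of code, versus a long hand-computed derivation.
def left_riemann_sum(f, a, b, n):
    """Approximate the integral of f on [a, b] with n left-endpoint rectangles."""
    delta = (b - a) / n
    total = 0.0
    for i in range(n):
        total += f(a + i * delta) * delta
    return total

# Example: the integral of x^2 on [0, 1] is 1/3; the sum converges to it.
approx = left_riemann_sum(lambda x: x * x, 0.0, 1.0, 10_000)
```

For a program, increasing n to tighten the approximation is free; for a chain-of-thought derivation, every extra term is another chance for an arithmetic slip.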
We also show some examples in Figure 9 to compare the results of CoT and PoT prompting.We can see that the PoT can significantly shorten the output sequence length.By leveraging the additional tool, PoT is able to significantly lower the task difficulty.

Conclusion
In this paper, we propose the first theorem-driven science question-answering dataset and evaluate different LLMs on it. Though GPT-4 achieves strong performance on our new dataset, the existing open-source LLMs still struggle to achieve reasonable performance. We conjecture it is essential to leverage more science-related pre-training or fine-tuning to close the gap. On the other hand, we found that multimodal science questions are still extremely challenging for the existing visual LLMs. We believe more specialized visual encoding models are needed to better represent the diagrams in these science questions.

Limitations
In this work, we explore the possibility of utilizing different large language models to solve challenging theorem-driven questions. There are still some limitations: (1) Our answer extraction is not perfect. There are some cases where our answer extractor is unable to locate the answer, so the final accuracy is an approximate lower bound. (2) In our dataset collection, we specifically avoid the hard-to-evaluate cases where the answer is a formula, figure, or matrix. Our choice of questions can thus be biased in terms of evaluating overall ability. (3) For the multimodal questions in TheoremQA, we have investigated different existing models, but none of them succeed in achieving reasonable performance.

Figure 2: Examples from TheoremQA. The first question requires using Stokes' theorem to transform the double integral into a line integral. The second question requires knowing the properties of the Wiener process.

Figure 5: Ratio of executable Python programs for different models with PoT prompting.

Figure 9: Case study of GPT-4 generations with both prompting strategies.

Table 1: List of existing Math and STEM QA datasets.
Scaling model size, data, and compute is crucial to enable this learning ability. Later, Rae et al. (2021); Chowdhery et al. (2022); Zhang et al. (2022); Touvron et al. (2023); Chen et al. (2021b) proposed to train different types of LLMs with different training recipes. The capability to follow few-shot exemplars to solve unseen tasks does not exist in smaller LMs and only emerges as the model scales up.

Table 2: Results for CoT and PoT prompting on TheoremQA. We report the accuracy over different fine-grained question types and scientific fields.

Table 3: Results for CoT and PoT prompting with additional theorem conditions.