Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark for Large Language Models

The performance of large language models (LLMs) on existing reasoning benchmarks has improved significantly in recent years. In response, we present JEEBench, a considerably more challenging benchmark dataset for evaluating the problem-solving abilities of LLMs. We curate 515 challenging pre-engineering mathematics, physics and chemistry problems from the highly competitive IIT JEE-Advanced exam. Long-horizon reasoning on top of deep in-domain knowledge is essential for solving problems in this benchmark. Our evaluation of various open-source and proprietary models reveals that the highest performance, even after applying techniques such as self-consistency, self-refinement and chain-of-thought prompting, is less than 40%. The typical failure modes of GPT-4, the best-performing model, are errors in algebraic manipulation, difficulty in accurately grounding abstract concepts into mathematical equations, and failure to retrieve relevant domain-specific concepts. We also observe that, through mere prompting, GPT-4 is unable to assess the risk introduced by negative marking for incorrect answers. To address this, we develop a post-hoc confidence-thresholding method over self-consistency, which enables effective response selection. We hope that our challenging benchmark will guide future research in problem solving using LLMs.


Introduction
The capabilities of large language models (LLMs) have been improving over the past decade on a plethora of tasks, including reasoning. Most recently, GPT-4 demonstrates significant improvements over GPT-3 on tasks such as code generation, arithmetic and commonsense reasoning (Bubeck et al., 2023), exhibiting impressive performance on standard reasoning and STEM benchmarks such as GSM-8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), MMLU (Hendrycks et al., 2020) and ScienceQA (Lu et al., 2022).

Figure 1: An example problem from JEEBench.

Rising capabilities of LLMs call for harder benchmarks. We introduce JEEBench, a benchmark consisting of 515 problems that require complex logical and mathematical reasoning on top of deep in-domain knowledge of pre-engineering level Physics, Chemistry and Mathematics. Problems have been curated from the past 8 editions of the Joint Entrance Examination (JEE)-Advanced, held annually in India as an entrance test to India's premier engineering institutes, the IITs. The exam is designed to be time-consuming and difficult, and has a low selection rate (approximately 5%).
The problems in the dataset require a complex interplay of skills: employing multiple high-level domain-specific concepts, grounding them into mathematical equations or constraints, and then performing algebraic manipulation and arithmetic operations. Figure 1 shows a problem from the dataset along with an expert's solution. In this problem, the ideal solution involves retrieving the appropriate concepts (the rules of static equilibrium), grounding those concepts into mathematical equations for the specific problem instance, and then solving the equations to find the final answer. Other instances of domain-specific concepts include balancing of redox reactions (Chemistry), current into a junction equals current out of the junction (Physics) and integration by parts (Mathematics). More such examples can be found in Appendix A.2.
We conduct a qualitative and quantitative study of contemporary open-source and proprietary LLMs on these problems and highlight avenues for further research. Our analysis indicates that GPT-4 is unparalleled in performance compared to other models. It demonstrates long-horizon reasoning and the ability to manipulate complex algebraic equations in quite a few problems. We observe that chain-of-thought prompting (Kojima et al., 2023) and self-consistency (Wang et al., 2023), two recent proposals for improving LLM performance, are indeed effective on our dataset.
We also explore self-critique (Madaan et al., 2023; Shinn et al., 2023), where an LLM (the verifier) is instructed to improve the outputs of the same LLM (the generator). We find that this approach is not helpful on JEEBench. The verifier is weak at spotting conceptual errors and, like the generator, is itself prone to hallucinations. It would be interesting to explore the class of problems on which this approach of self-refinement is (not) helpful.
We further conduct a critical analysis of the limits of GPT-4's reasoning abilities, and highlight major areas that require considerable improvement. A detailed error analysis suggests that it frequently struggles to retrieve the relevant concepts required to solve a problem, and to perform algebraic manipulation and arithmetic. Its inability to perform even simple algebra raises an important question: can we build LLMs faithful to mathematical logic? Another important question is how to estimate GPT-4's performance in comparison to humans. The JEE-Advanced exam comes with the bane of negative marking for incorrectly answered questions. This makes the exam even more challenging: in addition to advanced problem-solving skills, it also requires accurate risk assessment and computing a good answering policy based on it. Our experiments demonstrate that when prompted with the marking scheme, GPT-4's performance actually drops. To mitigate this, we employ a simple method: thresholding over self-consistency. Self-consistency generates multiple responses for each question, and the relative frequency of an option in this set of responses can be treated as a proxy for its confidence score. The threshold on the confidence score is tuned on a validation set. We find that GPT-4's score, after augmenting it this way, lies in the top 10-20 percentile of human scores in the 2023 edition of the exam.
Overall, we hope that this benchmark serves as a strong and reliable test-bed and fosters future research on problem solving with LLMs.

Related Work
Reasoning has been studied in various contexts such as logical reasoning, commonsense reasoning, mathematical reasoning, and theorem proving. We summarize key works in the two sub-areas most closely related to ours: mathematical reasoning and science QA.

Mathematical problem solving: GSM8K (Cobbe et al., 2021), Dolphin18K (Huang et al., 2016), AQuA-RAT (Ling et al., 2017), MATH (Hendrycks et al., 2021) and Ape210K (Zhao et al., 2020) are datasets that contain mathematical reasoning questions. Dolphin18K, GSM8K, and AQuA-RAT consist of elementary problems requiring only basic arithmetic and problem comprehension; consequently, there is a general lack of variety in the underlying reasoning steps across problems. In terms of difficulty, MATH, containing problems from AMC, AIME and the Olympiads, comes close to JEEBench in terms of complexity. However, compared to MATH, the mathematics questions in our dataset span many additional topics such as differential and integral calculus, differential equations, 3D geometry, and conic sections. Also, the problems in JEEBench are harder, as we discuss later in the paper. miniF2F (Zheng et al., 2021) consists of mathematics problems from the MATH dataset and other sources expressed in a formal language; in contrast, the problems in our dataset are in natural language.

General science: In the context of Physics and Chemistry, ScienceQA (Lu et al., 2022), SciQ (Welbl et al., 2017) and MMLU (Hendrycks et al., 2020) are prominent datasets. ScienceQA and SciQ, built from elementary and high-school science curricula, mainly test factual knowledge of the subject. The skills required to solve such problems are primarily information extraction, reading comprehension and commonsense reasoning. In contrast, questions in our dataset require long-horizon reasoning and grounding of complex scientific concepts into equations and arithmetic.
Problems present in JEEBench are significantly harder than those in other contemporary datasets.
To verify this, we sample 50 questions each from JEEBench and from the test sets of MATH and the high-school Physics, Chemistry and Mathematics sections of MMLU, and conduct zero-shot evaluations with GPT-4. The results are shown in Figure 2. GPT-4 easily solves more than 80% of the problems from MMLU. The MATH dataset is harder, with performance of approximately 60%. However, GPT-4 struggles on JEEBench-Math, solving close to a mere 20% of the problems.

The JEEBench Dataset
The dataset consists of 515 problems extracted from the past 8 editions of JEE-Advanced, from 2016 to 2023. The problems are harvested from publicly available sources. The exam consists of 2 papers held every year, each containing 50-60 questions equally distributed among Physics, Chemistry, and Mathematics. We use online tools to extract problems from the PDF-format exam papers into LaTeX. We remove all problems containing diagrams in their description (approximately 40%). Manual quality checks are performed to fix or eliminate possible errors in pre-processing. Figure 3 shows representative problems from the final dataset. The problems are categorised by subject (Physics, Chemistry and Mathematics) and by the format of the expected response: multiple choice questions (MCQs) with a single correct option, MCQs with multiple correct options, Integer-type and Numeric-type. In Integer-type questions, the answer is an unbounded non-negative integer, whereas for Numeric-type, the answer is a floating point number up to 2 digits after the decimal point. The breakdown of the problems by answer type and subject is shown in Table 1. The questions in the dataset belong to diverse sub-topics (for example, Math questions could belong to Calculus, Algebra, Combinatorics, etc.). The breakdown of the entire dataset into sub-topics can be found in Appendix A.1.

Experimental Setup and Results
We wish to investigate the following research questions:
1. How well do LLMs perform on JEEBench?
2. How effective are methods such as chain-of-thought prompting and self-consistency, which have been proposed to improve the reasoning abilities of LLMs?
3. What are the main sources of errors that limit the performance of these models?
4. Can LLMs be used to verify their own generations in the context of JEEBench? What are the limitations of this behaviour?
5. How would they perform in an exam setting, where each question could potentially incur negative marks when answered incorrectly?

Metrics
For Single-Correct MCQs and Integer-type questions, we use accuracy as the metric: a score of 1 if the model response matches the gold response, and 0 otherwise. For Numeric-type questions, we award a score of 1 if the model response differs from the gold response by at most 0.01. For Multi-Correct MCQs, we award a score of 1 if the model response matches all the correct options. If any of the options selected by the model is incorrect, we award 0. If the model selects some of the correct options and no incorrect option, it receives a score of 0.25 for each correct option in its output. For example, if the gold response is ABD and the output response is BD, a score of 0.5 is awarded. This reflects the actual scoring method of JEE-Advanced, which incentivizes a student not to guess.
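For concreteness, here is a minimal sketch of this scoring rule in Python; the function name and question-type labels are ours, not from any released evaluation code.

```python
def score_response(qtype: str, gold: str, pred: str) -> float:
    """Score one model response against the gold answer (see Metrics)."""
    if pred in (None, "None"):
        return 0.0
    if qtype in ("MCQ(single)", "Integer"):
        return 1.0 if pred == gold else 0.0
    if qtype == "Numeric":
        # Full credit if within 0.01 of the gold answer.
        return 1.0 if abs(float(pred) - float(gold)) <= 0.01 else 0.0
    if qtype == "MCQ(multiple)":
        gold_set, pred_set = set(gold), set(pred)
        if pred_set == gold_set:
            return 1.0
        if pred_set - gold_set:         # any incorrect option selected
            return 0.0
        return 0.25 * len(pred_set)     # partial credit for a correct subset
    raise ValueError(f"unknown question type: {qtype}")

assert score_response("MCQ(multiple)", "ABD", "BD") == 0.5
```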
To obtain the model's response, each model is prompted with the expected response type concatenated with the problem description. The exact system and user prompts can be found in Appendix A.3. The final answer is extracted manually from the response generated by the LLM. Sometimes the LLM response is gibberish, and sometimes the model responds that none of the options are correct; in both cases, we record "None" as the answer.
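As an illustration, prompt assembly might look like the following sketch; the instruction strings below are placeholders, not the exact prompts (those are in Appendix A.3).

```python
TYPE_HINTS = {  # placeholder phrasings, not the actual prompts
    "MCQ(single)": "Exactly one option is correct.",
    "MCQ(multiple)": "One or more options are correct.",
    "Integer": "The answer is a non-negative integer.",
    "Numeric": "The answer is a number, correct to 2 decimal places.",
}

def build_prompt(problem: str, qtype: str, cot: bool = False) -> str:
    prompt = f"{TYPE_HINTS[qtype]}\n\n{problem}"
    if cot:
        prompt += "\n\nLet's think step by step."  # Kojima et al. (2023)
    return prompt
```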
We do not conduct few-shot evaluations because of the restrictive cost of the OpenAI API. Additionally, the average length of the prompt combined with the response in the GPT-4+CoT setting is approximately 900 tokens. Since inference time grows as O(n^2) in sequence length, few-shot evaluation would take significantly more time.
All proprietary models were prompted between May 17, 2023 and June 23, 2023. The maximum response length is set to 2048 tokens and the decoding temperature to 0. Table 2 contains the results obtained with various LLMs, aggregated by subject and question type.

General trends:
We observe that open-source models perform no better than random and, in general, lag behind proprietary models. Performance on JEEBench increases consistently with newer versions of the GPT model family. GPT-3 exhibits near-random performance, but GPT-3.5 and GPT-4 perform significantly better. GPT-4 is far superior to GPT-3.5, by a large margin of 12.9 points, but its overall performance still remains close to 30%. The performance boost is highest for Chemistry, followed by Physics, and lastly Maths, probably because the complexity of reasoning in JEEBench is highest in Mathematics questions and lowest in Chemistry. These results highlight the difficulty that the benchmark poses to both open-source and proprietary models.
Chain-of-Thought prompting: The original prompt is concatenated with the phrase "Let's think step by step", as proposed by Kojima et al. (2023). We observe that this approach leads to a significant improvement in performance, improving vanilla GPT-4 by 4.2 points.
Self-Consistency (SC): This method samples multiple responses from the LLM at a non-zero temperature. For Integer-type, Numeric-type and Single-Correct MCQs, we use a majority vote (over all responses which are not "None") as the proposed answer. For Multi-Correct MCQs, we make the simplifying assumption that all options are independent: if an option occurs in at least 50% of the responses, we select it; otherwise we do not. We use τ = 0.5 and set the number of responses to 8. Self-consistency helps considerably, improving over the GPT-4+CoT baseline by +3.9 points.
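The aggregation can be sketched as follows; note that "None" responses are dropped from the majority vote but, mirroring the confidence computation described later, still count in the denominator of an option's relative frequency.

```python
from collections import Counter

def aggregate(responses: list[str], qtype: str, tau: float = 0.5) -> str:
    """Aggregate sampled responses into one answer (sketch, names ours)."""
    valid = [r for r in responses if r != "None"]
    if not valid:
        return "None"
    if qtype != "MCQ(multiple)":
        # Majority vote for Integer, Numeric and Single-Correct MCQs.
        return Counter(valid).most_common(1)[0][0]
    # Multi-Correct: keep each option whose relative frequency over all
    # sampled responses is at least tau (independence assumption).
    counts = Counter(opt for r in valid for opt in r)
    chosen = sorted(o for o, c in counts.items() if c / len(responses) >= tau)
    return "".join(chosen) or "None"

# 8 samples, tau = 0.5: an option is kept iff it appears in >= 4 of them.
print(aggregate(["AB", "A", "AB", "None", "B", "ABD", "A", "AB"], "MCQ(multiple)"))  # AB
```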

Error Analysis of System Responses
To assess GPT-4's weaknesses, we conduct a manual inspection of the errors it makes in its reasoning chains. We perform this study on the errors made by GPT-4+CoT on a random subset of 100 problems; the score obtained on this subset is 27.25. We ask the following questions about the model response for each problem instance:
1. Is GPT-4 able to retrieve the concepts/facts required to solve the problem? Failures here contribute to conceptual errors.
2. If the relevant concepts are retrieved, are they grounded correctly as equations/constraints? Failures here contribute to grounding errors.
3. Is the algebraic manipulation and arithmetic correct? Failures here contribute to computation errors.
Refer to Figure 4 for an illustration of each type of error. In one case, we find that GPT-4 misunderstands the question. The overall results of this analysis are shown in Table 3. Our error analysis indicates that most errors arise either from failing to retrieve important concepts (34 out of 80) that are critical to making progress in the solution, or from computation errors (30 out of 80). Moreover, of the 27 questions answered correctly, the explanation is also correct in only 20; that is, in roughly a quarter of the correctly answered questions (7 of 27), the model reaches the right answer for the wrong reasons.

Can GPT-4 find and correct its mistakes?
Can GPT-4 be used to grade its own outputs? Using an LLM to critique its own output has been proposed recently by multiple works (Shinn et al., 2023; Madaan et al., 2023) and has shown improvements on some datasets. A good verifier should be able to catch and fix all errors in a solution; note that even when the final answer is correct, the intermediate reasoning steps are not necessarily correct.
We put the idea of self-critique to the test on JEEBench. After a CoT response has been generated, we prompt another GPT-4 instance by first describing the problem and GPT-4's solution, and then appending the instruction: "Find problems (if any) with the given solutions. If there are any errors, correct it and give the new answer." We re-evaluate the new answer suggested by GPT-4. The results clearly show that this approach does not lead to improvement; in fact, it produces poorer results than GPT-4+CoT, with performance dropping from 35% to 33.9%.
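The loop itself is simple; in the sketch below, `chat` stands in for a GPT-4 chat-completion call and is hypothetical.

```python
CRITIQUE = ("Find problems (if any) with the given solutions. "
            "If there are any errors, correct it and give the new answer.")

def self_critique(problem: str, chat) -> str:
    # Generator pass: produce a chain-of-thought solution.
    solution = chat(problem + "\n\nLet's think step by step.")
    # Verifier pass: a fresh instance critiques the solution; its new
    # answer is what gets re-evaluated.
    return chat(f"{problem}\n\n{solution}\n\n{CRITIQUE}")
```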
To develop a deeper understanding of the repairs suggested by the verifier GPT-4, we perform a manual inspection, using the same subset of 100 problems picked earlier for categorizing error types. For each generated solution and suggested edit, we pose the following questions:
• Can the verifier find problems in the solution?
• Can the verifier fix problems if it finds them?
• Is the problem identified by the verifier actually a valid problem?

Table 4: Breakdown of the kinds of errors the verifier GPT-4 makes while suggesting edits.
Our results can be seen in Table 4. It is evident that, contrary to observations in other works, on JEEBench GPT-4 is mostly (46/80 = 57.5%) unable to find errors in the solutions it proposes. Even when it can, it is often unable to fix them. In only 2 out of 80 questions was GPT-4 able to make a meaningful edit to an erroneous solution, and this gain is more than offset by the solutions it degrades by suggesting edits to parts that are already correct. Figure 5 provides examples of errors made by the verifier; the complete responses, along with other examples, can be found in Appendix A.5. This experiment raises an interesting question: for what class of problems is the approach of self-critique (not) helpful?

Comparison with human performance
The JEE exam features negative marking. For instance, Single-Correct MCQ questions are awarded +3 marks when answered correctly, -1 when answered incorrectly, and 0 when not answered. For Multi-Correct MCQs, +4 is awarded when all correct options are included in the final answer; if any of the selected options is wrong, -2 is awarded; and if only some of the correct options are selected, +1 is awarded for each correct option. The skills an examinee needs to maximize their score therefore include assessing their own confidence in a response and deciding, based on that confidence, whether to answer at all. Given the former skill, the latter is a simple decision-theoretic computation under uncertainty, as the derivation below illustrates.
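Here is the break-even derivation for a Single-Correct MCQ, under the idealizing assumption (ours) that the examinee's confidence p in the chosen option is well calibrated:

```latex
% Expected marks for a Single-Correct MCQ under the +3 / -1 / 0 scheme,
% assuming a calibrated confidence p in the chosen option:
\begin{align*}
\mathbb{E}[\text{marks} \mid \text{answer}] &= 3p - (1 - p) = 4p - 1,\\
\mathbb{E}[\text{marks} \mid \text{skip}]   &= 0,
\end{align*}
% so answering beats skipping exactly when 4p - 1 > 0, i.e. p > 1/4.
```

This break-even confidence of 1/4 is consistent with the thresholding results reported later, where the tuned τ_single falls below 0.25, meaning that always answering with the majority-vote option is optimal for Single-Correct MCQs.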

Deciding whether to answer
To attain a good score in the examination, it is important that the model does not answer when it is unsure of its solution. Can LLMs assess this risk and plan accordingly when prompted with the marking scheme? To investigate this, we prompt the model with the exact marking scheme for each MCQ question type along with the problem statement, and then ask it either to generate an answer or to skip the question altogether. The complete prompt is in Appendix A.6. We re-run inference with these prompts on all the MCQ questions. The results are shown in Table 5.

Table 5: Marks obtained when GPT-4 is prompted with the marking scheme v/s without, on MCQ questions. These marks are out of a total of 1074.
The results indicate that prompting is not helpful in this case: GPT-4 cannot effectively decide when not to answer. In response, we develop a post-hoc confidence-thresholding method on the self-consistency responses.

Calibration
For Single-Correct and Multi-Correct MCQs, we compute the confidence score for each option as its relative frequency in the set of sampled responses. Note that GPT-4 is often unable to answer the question at all, or arrives at a conclusion that is not supported by any option (a "None" response); such responses contribute nothing to any option's count, although they still count towards the total number of attempts. For instance, if the model's responses over 4 attempts on a Multi-Correct MCQ are "AB", "None", "B" and "AC", then the confidences are A: 1/2, B: 1/2, C: 1/4, D: 0. Figure 6 shows the calibration curve of GPT-4 on JEEBench. The maximum calibration error (MCE) is 0.136 and the average calibration error (ACE) is 0.098. The plot suggests that the model is slightly overconfident at high confidence levels (accuracy falls below confidence there) but slightly underconfident at low and medium confidence levels.
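A sketch of how MCE and ACE can be computed from per-option (confidence, correctness) pairs follows; it assumes equal-width bins and the common convention of averaging over non-empty bins (definitions of ACE vary across the literature, so this is illustrative rather than the exact computation used in the paper).

```python
import numpy as np

def calibration_errors(confidence, correct, n_bins: int = 10):
    """Return (MCE, ACE) over equal-width confidence bins (sketch)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    gaps = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidence > lo) & (confidence <= hi)
        if in_bin.any():
            # Gap between mean confidence and empirical accuracy in the bin.
            gaps.append(abs(confidence[in_bin].mean() - correct[in_bin].mean()))
    return max(gaps), float(np.mean(gaps))
```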

Thresholding with Self-Consistency
Our objective is to decide whether or not to include an option in the final response. We wish to compute a parameter τ such that an option is included in the final response if and only if its confidence is at least τ. We compute separate thresholds, τ_single and τ_multiple, for Single-Correct and Multi-Correct MCQs respectively. We compute confidence scores for GPT-4's response to each question as in Section 4.5.2. Questions from 2016-2021 are used as the validation set and questions from 2022-2023 as the test set. The best τ_single and τ_multiple thresholds are found by a simple hyper-parameter search, sketched below. Figure 7 plots the positive, negative and total scores on the validation set over the range of possible values of τ_single and τ_multiple. The optimal value of τ_multiple is 0.75 and of τ_single is 0.125. That τ_single is less than 0.25 indicates that taking a majority vote is the best strategy for Single-Correct MCQs. However, this is not true for Multi-Correct MCQs, where a threshold of τ_multiple = 0.5 (as used originally) is sub-optimal. We assume that Integer and Numeric questions do not carry negative marking, and decide the final response for them by a majority vote over the sampled responses. Table 6 shows the scores with the optimal thresholds on the test set. We find that declining to answer when confidence is below the threshold increases the total score by about 4.3%.
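A minimal sketch of the search for τ_single follows; `val` pairs each question's per-option confidences with its gold answer, and all names here are ours. The search for τ_multiple is analogous, using the per-option inclusion rule and the +4/+1/-2 scheme.

```python
def tune_tau_single(val, candidate_taus):
    """Pick the tau maximizing total marks on the validation set (sketch)."""
    def total_marks(tau: float) -> int:
        marks = 0
        for confidences, gold in val:  # confidences: option -> frequency
            best = max(confidences, key=confidences.get)
            if confidences[best] >= tau:        # answer only if confident
                marks += 3 if best == gold else -1
            # otherwise skip the question: 0 marks
        return marks
    return max(candidate_taus, key=total_marks)

# With 8 samples, confidences are multiples of 1/8, so this grid suffices:
grid = [i / 8 for i in range(1, 9)]
# tau_single = tune_tau_single(val, grid)   # reported optimum: 0.125
```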

Estimating performance compared to humans
Finally, we wish to estimate the performance of GPT-4 compared to humans. For this, we use the score distribution of human candidates in the 2023 edition of the exam: GPT-4's confidence-thresholded score lies in the top 10-20 percentile of human scores.

Discussion
The general performance trend demonstrates the efficacy of high-quality data, instruction fine-tuning, RLHF and parameter scaling in improving the reasoning capabilities of LLMs. For many problems, GPT-4 is able to give a sketch of a correct, human-like solution, which is impressive given the extent of reasoning involved. However, our analysis also reveals major areas where progress is needed. Although GPT-4 performs flawless logical and mathematical reasoning in some instances, it sometimes commits grave errors in trivial steps. Algebraic manipulation and calculation are still hard for GPT models. It would be interesting to see LLMs that can leverage a blackbox calculator as an API, as done by Schick et al. (2023).
Errors in the retrieval and application of concepts suggest an interesting research question: can we augment an LLM such that its generation is constrained to be faithful to a set of facts? Such a system would demonstrate robustness in reasoning, which is critical for long-horizon tasks.
Physics problems in the benchmark often require spatial reasoning. We find that GPT-4's spatial reasoning is far from perfect; Appendix A.7 provides an example where GPT-4 commits errors that may be attributed to its inability to reason spatially. With the release of the multi-modal version of GPT-4, evaluating this aspect of Physics problems might become easier.
Finally, an LLM that understands its own confidence in an answer is a key missing piece, as highlighted by our experiments in the exam setting. Our simple post-hoc wrapper does slightly improve performance in this regard.

Conclusion
We present a challenging problem-solving benchmark for evaluating large language models. We perform a detailed analysis of the performance of various LLMs on the benchmark and identify areas of improvement for the best current LLMs. Our work raises interesting research directions, such as mathematical logic-augmented GPT, multi-modal evaluations of GPT-4, and the decision-making capabilities of GPT in an exam setting. We hope that JEEBench guides future research in reasoning using LLMs.

Limitations
Contamination is a major problem in the era of pre-trained language models trained on large web corpora: it is very hard to determine whether a model has seen a given dataset, and accurately determining the extent of contamination is also not easy. Evaluation against humans is also a slightly flawed comparison due to other factors, such as the time pressure humans face during the examination. Additionally, the dataset's distribution is fixed to pre-college Physics, Chemistry and Mathematics; there are further gradations and difficulty levels at which models could be evaluated that are not covered by our analysis.

Figure 4:
Figure 4: The different types of errors made by GPT-4 in its responses. (i) (top) exhibits a computation error, where the squaring operation performed is algebraically wrong; (ii) (middle) shows a conceptual error, where the model is unable to retrieve the relevant concepts required to solve the problem; (iii) (bottom) shows a grounding error, where the concept is correct but its application, in computing the number of lone-pair electrons on Br in BrF5, is wrong.

Figure 5:
Figure 5: [Top]: A question where GPT-4 identifies a mistake but is unable to fix it. The problem and part of the response are at the top; the bottom block contains the edit suggested by GPT-4. The manipulation in the suggested edit is mathematically wrong. [Bottom]: A question where GPT-4 is unable to identify an error. The problem and part of the response are at the top; the bottom block contains the edit suggested by GPT-4. It should be log_2(2 · 4^4) instead of log_2(2 · 16^4).

Figure 7:
Figure 7: Scores obtained at different threshold values on Single-Correct (top) and Multi-Correct (bottom) questions from the validation set; the optimal values are τ_single = 0.125 and τ_multiple = 0.75.

Figures
Figures 15 and 16 show example problems for which GPT-4's responses indicate an inability to ground physical concepts spatially.

Figure 15:
Figure 15: Error made by GPT-4 in understanding physical concepts. In this example, the correct form of equation 1 should be f = N_A cos 30°. GPT-4 fails to spatially ground the concept of direction in a 2D environment.

Figure 16:
Figure 16: Error made by GPT-4 in understanding which curves to take the area between. Here it takes the area between the curve and the x-axis, whereas the question intended the area between the curve and the line x = y. This indicates that GPT-4 might be weak in 2D reasoning from purely text-based prompts.

Table 1:
Number of questions for each subject and problem type.

Table 2:
Scores obtained by various open-source and proprietary models on JEEBench, aggregated by subject (Chemistry, Mathematics, Physics) on the left and by question type (Integer, Single-Correct, Multi-Correct, Numeric) on the right. The overall aggregate scores are in the last column.

Table 3:
The variety of errors GPT-4 makes in its solutions.