Conic10K: A Challenging Math Problem Understanding and Reasoning Dataset

Mathematical understanding and reasoning are crucial tasks for assessing the capabilities of artificial intelligence (AI). However, existing benchmarks either require just a few steps of reasoning, or only contain a small amount of data in one specific topic, making it hard to analyse AI's behaviour with reference to different problems within a specific topic in detail. In this work, we propose Conic10K, a challenging math problem dataset on conic sections in Chinese senior high school education. Our dataset contains various problems with different reasoning depths, while only the knowledge from conic sections is required. Since the dataset only involves a narrow range of knowledge, it is easy to separately analyse the knowledge a model possesses and the reasoning ability it has. For each problem, we provide a high-quality formal representation, the reasoning steps, and the final solution. Experiments show that existing large language models, including GPT-4, exhibit weak performance on complex reasoning. We hope that our findings could inspire more advanced techniques for precise natural language understanding and reasoning. Our dataset and codes are available at https://github.com/whyNLP/Conic10K.


Introduction
Mathematical understanding and reasoning ability is an important component of human intelligence. Such an ability is the foundation of data analysis, financial applications and scientific research. Though there have been many studies (Lample and Charton, 2020; Wei et al., 2022b), mathematical reasoning is far from being solved by existing methods (Lu et al., 2022), even with symbolic reasoners (Hopkins et al., 2019) and large language models (LLMs) (Lightman et al., 2023). To evaluate and analyse mathematical ability, various datasets and benchmarks have been proposed in recent years (Zhao et al., 2020; Hendrycks et al., 2021; Mishra et al., 2022b,a). However, these datasets and benchmarks often suffer from the following problems: (1) the problems can be solved with only a few reasoning steps, so language models may rely on shallow heuristics to achieve high performance (Patel et al., 2021); (2) the dataset covers a wide range of topics and hence contains only a small amount of data for each topic, which makes it hard to distinguish whether a model fails because of a lack of background knowledge or due to weak reasoning ability.

To address the above issues, we propose CONIC10K, an open-ended math problem dataset on conic sections in Chinese senior high school education. The dataset contains 10,861 carefully annotated problems, each of which has a formal representation, the corresponding text spans, the answer, and natural language rationales. Figure 1 shows an example problem from our dataset. To evaluate mathematical understanding and reasoning ability, we perform two different tasks on existing LLMs: semantic parsing and mathematical question answering (mathQA). Semantic parsing assesses a language model's ability to understand mathematics: the model is required to translate a math problem in natural language into its formal meaning representation. MathQA jointly evaluates a language model's mathematical understanding and reasoning: the model needs to generate the answers to the questions. Since the topic of CONIC10K is restricted to conic sections, the knowledge required to solve different problems is the same, while the only difference is the difficulty of reasoning. Therefore, if a model is able to solve simple problems but not hard ones, we are assured that the failure lies in a lack of mathematical reasoning ability.
Our experiments show that current models obtain good performance in semantic parsing. However, in mathQA, these models are far from satisfactory. When performing zero-shot chain of thought (CoT) (Wei et al., 2022b) prompting, the best model, GPT-4 (OpenAI, 2023), achieves only 15.5% accuracy under human evaluation. When finetuning is further applied, the best model, ChatGLM-6b (Du et al., 2022), still obtains a poor accuracy of 22.5% under human evaluation. When we translate the problems into English and apply zero-shot CoT to reason in English, the accuracy of GPT-4 is 26.0%, which is still far below the 57.5% achieved by human experts with a 3-minute time limit per problem. This shows that the poor performance is not due to the language being used but to a deficiency in reasoning ability. Therefore, we believe the mathematical reasoning ability of language models is still limited despite their huge success in natural language understanding. We summarise our contributions as follows: 1) we propose CONIC10K, a challenging math problem dataset on conic sections in Chinese senior high school education, with high-quality annotations of formal representations; 2) we perform experiments to inspect the mathematical understanding and reasoning abilities of LLMs separately; 3) we give a detailed analysis of model behaviour and conduct comprehensive case studies. We hope that our work can help the community better analyse LLMs in mathematical understanding and reasoning and inspire more advanced techniques to enhance the mathematical reasoning ability of LLMs.

Related Work
There has been a wide range of datasets on math problems in the literature. MATHQA (Amini et al., 2019) and GSM8K (Cobbe et al., 2021) are math word problem datasets. They focus on open-domain understanding, where the objective is to extract a single equation based on the information about quantities in the problem, rather than on mathematical reasoning. Similarly, Math23K (Wang et al., 2017) and Ape210K (Zhao et al., 2020) are popular datasets of Chinese math word problems with open-domain scenarios and simple reasoning steps. Geometry3K (Lu et al., 2021) is a geometry problem-solving dataset that provides formal representations, but the dataset is small and the problems do not require complex reasoning. AQuA (Ling et al., 2017), NumGLUE (Mishra et al., 2022b) and Lila (Mishra et al., 2022a) are large-scale datasets of various math problems. They have been used as benchmarks for solving math word problems and for mathematical reasoning tasks, but we find that these datasets require only a few reasoning steps. MATH (Hendrycks et al., 2021) is the one with the longest reasoning steps among these datasets. It has been used as a standard benchmark in recent work on LLMs (Lewkowycz et al., 2022; Lightman et al., 2023). However, while it covers a wide range of problems, it contains limited data in each specific topic, making it hard to analyse model behaviour in detail with reference to one topic. It also does not provide any formal representations. Our proposed CONIC10K contains problems with long reasoning steps using closed-domain knowledge and has high-quality annotations with formal representations. A detailed comparison between the aforementioned datasets and CONIC10K is shown in Table 1.

Formal Representation
We design a formal representation that avoids ambiguity and is close to natural language. Specifically, our representation is built upon Assertional Logic (Zhou, 2017). Assertional Logic (AL) is a powerful knowledge representation that is more expressive than first-order logic while being easier for humans to read and write. In this work, we use a variant of AL with three components: declarations, facts and queries. Declarations define individuals with their types (e.g. G: Ellipse). Facts are assertions that describe the conditions in the problem (e.g. Focus(G)={F1, F2}). Queries are the terms that represent the goal of the problem (e.g. Range(Eccentricity(G))). See more details in Appendix A.
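Putting the three example fragments above together, a complete representation for a problem asking for the eccentricity range of an ellipse with foci F1 and F2 might look like the following sketch. Note that the typing of F1 and F2 as Point, and the exact surface layout, are our assumptions for illustration; the dataset's actual annotation format may differ in detail:

```
Declarations:
  G: Ellipse
  F1: Point, F2: Point

Facts:
  Focus(G) = {F1, F2}

Queries:
  Range(Eccentricity(G))
```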

Dataset Format
An example is presented in Figure 1. For each question, we give 1) the question text in natural language with math formulas in LaTeX, 2) the rationale in natural language, 3) the answer to the question, 4) the formal representation and 5) the text span corresponding to each sentence in the formal representation. (The question in Figure 1 reads: let $P(x_0, y_0)$ be an arbitrary point on the ellipse $\frac{x^2}{a^2}+\frac{y^2}{b^2}=1$ ($a > b > 0$), where $F_1$ and $F_2$ are the two foci of the ellipse and $\angle F_1PF_2 \le 90^\circ$; what is the range of values for the eccentricity of the ellipse? The rationale begins by noting that the angle $\angle F_1PF_2$ is at its maximum when the point $P$ is located at $(0, b)$ or $(0, -b)$.)

Data Collection
To construct the dataset, we first collect approximately 20,000 open-ended problems about conic sections, in image format, from two websites that focus on Chinese high school education. Each problem image contains the problem text, rationale, and answer. We then use Mathpix (https://mathpix.com/) to convert these images into text. Since our dataset is focused on conic sections, we filter out problems that involve knowledge from other topics, such as sequences and solid geometry. After that, we remove duplicate problems using fuzzy matching. Once this process is finished, the size of the dataset is reduced from around 20,000 to approximately 14,000.
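The paper does not specify the fuzzy-matching procedure; a minimal sketch of such a deduplication step, assuming a character-level similarity ratio and an illustrative threshold of 0.9, could look like:

```python
from difflib import SequenceMatcher

def is_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Treat two problem texts as duplicates if their similarity
    ratio exceeds a chosen threshold (0.9 is an illustrative value)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def deduplicate(problems: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first occurrence of each near-duplicate group."""
    kept: list[str] = []
    for p in problems:
        if not any(is_duplicate(p, q, threshold) for q in kept):
            kept.append(p)
    return kept
```

A real pipeline over ~20,000 problems would likely need blocking or hashing first, since this pairwise comparison is quadratic in the number of problems.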

Annotation
To ensure the correctness of the data and avoid ambiguities, we apply strict quality control during the annotation process. The complete process is as follows: Initiation We first build a small dataset with hundreds of samples, write the annotation guidelines and design a rule-based AI assistant for annotation.
The rule-based AI assistant is able to recognize LaTeX math expressions and complete simple formal representations, which greatly accelerates the annotation process and reduces annotation errors. Verification We select the annotators from a group of candidates by their performance on the small dataset. These annotators are provided with the annotation guidelines along with hundreds of samples. The annotators with the best performance take part in the rest of the annotation process.
Annotation We ask the annotators to further filter out problems about other topics, write the formal representations, select the corresponding text spans and fix incorrectly recognized problem texts and answers. Each problem is annotated by two annotators and then validated by another validator with an automated comparison tool. We also randomly check 3% of the annotations. This process takes 4 months in total.
Finalization After the annotation is finished, we train a language model through 5-fold cross-validation, manually check inconsistencies between the model predictions and the annotated formal representations, and fix the errors in the annotations. This helps us correct another 2% of the data. We then randomly split the dataset into train, validation, and test sets with a ratio of 7.5:1:2, giving 7,758 training, 1,035 validation, and 2,068 test problems. We proceed to the evaluation of LLMs with this split.
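The 7.5:1:2 split can be reproduced with a short helper. This is an illustrative sketch: the authors' exact shuffling and rounding are not specified, so the resulting validation/test sizes may differ from the reported 1,035/2,068 by a sample or two.

```python
import random

def split_dataset(ids, ratios=(7.5, 1.0, 2.0), seed=0):
    """Shuffle problem ids and split them with the given ratio."""
    rng = random.Random(seed)
    ids = list(ids)
    rng.shuffle(ids)
    total = sum(ratios)
    n_train = round(len(ids) * ratios[0] / total)
    n_val = round(len(ids) * ratios[1] / total)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test
```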

Dataset Statistics
Table 2 presents basic statistics of CONIC10K. The problems in our dataset tend to be long and complex. Besides these metrics, we also estimate the number of reasoning steps as the minimum number of rules required to obtain enough information to reach an answer. Since the process of applying rules is subjective, we ask two graduate students to individually annotate the rules used to solve the problems. We uniformly sample 30 problems from each of the datasets listed in Table 1 and ask the two students to annotate the reasoning steps. The results show that CONIC10K has the second largest number of reasoning steps among these datasets. The distribution of reasoning steps in CONIC10K is depicted in Figure 2. We show additional dataset statistics in Appendix B.
To facilitate model analysis, we divide the answers into 6 categories, as described in Table 3. Figure 3 shows the distribution over these categories.

Experiments
This section describes our experiments to evaluate the mathematical understanding and reasoning abilities of various models.

Tasks
Based on the data provided by CONIC10K, we introduce two tasks: semantic parsing and mathQA. Semantic parsing requires a model to translate math problems in natural language into formal representations, while mathQA requires a model to give correct solutions to math problems. The semantic parsing task solely assesses a model's ability to understand mathematics, while the mathQA task jointly evaluates the model's mathematical understanding and reasoning.

Models
We evaluate the performance of several popular pretrained models on the above two tasks. The models used for evaluation are listed in Table 4.

Evaluation Details
Due to limited computation resources, we conduct full finetuning on models with fewer than 4B parameters. For models of around 7B parameters, we perform parameter-efficient finetuning using LoRA (Hu et al., 2022) and 8-bit quantization (Dettmers et al., 2022). We also apply zero-shot CoT inference without finetuning for models with sizes between 7B and 13B. The models evaluated in the zero-shot CoT setting have all undergone instruction tuning or RLHF in their respective pretraining processes. When finetuning, we use instruction tuning (Wei et al., 2022a) to train the models. The instructions are architecture-specific and task-specific, as shown in Table 5.
When finetuning language models, we use the following hyperparameter settings. We use AdamW as the optimizer. The learning rate is selected from {8e-5, 2e-5}, with linear learning rate decay. For models using LoRA, we set the target modules to q, k, v for Falcon-7b and to q, v for the other models. The LoRA rank is set to 16 for models of around 7B parameters. To ensure a similar number of trainable parameters, we set the LoRA rank to 24 for Bloomz-3b and 32 for Bloomz-1b7. We use greedy decoding in all generations.
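For concreteness, the settings above can be summarised as plain configuration data. This is our paraphrase of the reported hyperparameters, not the authors' training code; the per-model entries in the LoRA table are partly inferred from the results tables and should be read as illustrative.

```python
# Summary of the finetuning hyperparameters reported above (paraphrase).
LORA_SETTINGS = {
    "Falcon-7b":  {"rank": 16, "target_modules": ["q", "k", "v"]},
    "Vicuna-7b":  {"rank": 16, "target_modules": ["q", "v"]},
    "ChatGLM-6b": {"rank": 16, "target_modules": ["q", "v"]},
    "Bloomz-3b":  {"rank": 24, "target_modules": ["q", "v"]},
    "Bloomz-1b7": {"rank": 32, "target_modules": ["q", "v"]},
}

OPTIMISATION = {
    "optimizer": "AdamW",
    "learning_rates": [8e-5, 2e-5],  # grid; one is selected per model
    "lr_schedule": "linear decay",
    "decoding": "greedy",
}
```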
In zero-shot CoT inference for mathQA, we use the same prompt as GAOKAO-Benchmark (Zhang et al., 2023) to instruct the models to give an answer together with a rationale. In mathQA, we also experiment on GPT-3.5-turbo with in-context learning (Min et al., 2022), which adds in-context demonstrations of the task to the prompt, and self-consistency (Wang et al., 2023), which conducts majority voting over sampled results. In semantic parsing, however, the formal representation is unknown to the above models. Since it takes more than 3,000 tokens to explain the syntax and semantics of each component of the formal language, which exceeds the context length limit of most models listed above, we do not evaluate zero-shot CoT performance on semantic parsing.
In addition to the methods mentioned above, we also evaluate the following two baselines on mathQA as references: (1) Guessing '2': predicting the most frequent answer in the train set, which is '2'.
(2) Human Experts: we randomly select 20 problems from the test set and ask two graduate students to answer them. Each problem has a 3-minute time limit. We report the average accuracy of the two students.

Semantic Parsing
For semantic parsing, we evaluate the model predictions by micro-F1, macro-F1 and accuracy. The accuracy is the proportion of problems that have a one-to-one match between all sentences in the prediction and the ground truth. Micro-F1 (mi-F1) and macro-F1 (ma-F1) are defined as follows:

$$\text{mi-F1} = \frac{2pr}{p+r}, \qquad \text{ma-F1} = \frac{1}{n}\sum_{i=1}^{n}\text{F1}_i,$$

where $n$ is the total number of problems, $p = \frac{\#\text{ of all matched sentences}}{\#\text{ of all predicted sentences}}$ is the overall precision, $r = \frac{\#\text{ of all matched sentences}}{\#\text{ of all gold sentences}}$ is the overall recall, and $\text{F1}_i$ is the F1 score of problem $i$.
To compute these metrics, we need to find the number of matched sentences between the prediction and the ground truth. Since the formal representation is insensitive to individual naming, we enumerate all possible individual name mappings between the prediction and the ground truth and select the mapping that achieves the maximum number of matched sentences. We optimize the evaluation script by only considering mappings between individuals of the same type, so that the evaluation time on the validation and test sets remains acceptable.
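A toy sketch of this evaluation procedure is shown below. It assumes sentences are normalised strings, that individual names match as whole identifiers, and that each type has at least as many gold individuals as predicted ones; the released evaluation script is more general and more optimised.

```python
import re
from itertools import permutations, product

def rename(sentence, mapping):
    """Apply an individual-name mapping, matching whole identifiers only."""
    if not mapping:
        return sentence
    pat = re.compile(r"\b(?:" + "|".join(map(re.escape, mapping)) + r")\b")
    return pat.sub(lambda m: mapping[m.group(0)], sentence)

def matched_count(pred_sents, gold_sents, pred_types, gold_types):
    """Maximum number of matched sentences over all type-respecting
    name mappings.  pred_types/gold_types map individual name -> type."""
    gold_by_type, pred_by_type = {}, {}
    for name, t in gold_types.items():
        gold_by_type.setdefault(t, []).append(name)
    for name, t in pred_types.items():
        pred_by_type.setdefault(t, []).append(name)
    per_type = []
    for t, pnames in pred_by_type.items():
        gnames = gold_by_type.get(t, [])
        opts = [dict(zip(pnames, perm))
                for perm in permutations(gnames, len(pnames))] or [{}]
        per_type.append(opts)
    best = 0
    for parts in product(*per_type):
        mapping = {k: v for d in parts for k, v in d.items()}
        renamed = {rename(s, mapping) for s in pred_sents}
        best = max(best, len(renamed & set(gold_sents)))
    return best

def micro_macro_f1(stats):
    """stats: per-problem (n_matched, n_predicted, n_gold) counts."""
    m = sum(s[0] for s in stats)
    p = m / sum(s[1] for s in stats)
    r = m / sum(s[2] for s in stats)
    mi = 2 * p * r / (p + r) if p + r else 0.0
    f1s = []
    for m_i, p_i, g_i in stats:
        prec = m_i / p_i if p_i else 0.0
        rec = m_i / g_i if g_i else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return mi, sum(f1s) / len(f1s)
```

Enumerating all mappings is exponential in the number of same-typed individuals, which is why restricting candidates to same-type individuals (as described above) matters in practice.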

Results and Discussions
In this section, we present and explain the results of our experiments. The main results on semantic parsing and mathQA are shown in Table 6 and Table 7, respectively.

Semantic Parsing
Language models show a good ability to understand math problems after proper training.

Finetuning The best model, mT5-xl, successfully translates 84.6% of the problems into formal representations.

For the problems it fails to translate accurately, the predictions differ from the ground truth only in minor details. The F1 scores and accuracy of the Bloomz family and Falcon-7b are much lower than those of the other models. The performance of finetuned instruction-tuned models is consistently better than that of finetuned base models.
Models pretrained on code show a strong ability to learn syntax. All models except the mT5 family have been pretrained on code. The syntax error rates of these models are on average lower than those of the mT5 family, even though their F1 scores and accuracy may be lower. Since the formal representation resembles programming languages in syntax, pretraining on code may help models learn the syntax of formal representations more easily.
Increasing model size effectively improves performance in semantic parsing. From the results of the mT5, mT0 and Bloomz model families, we find that increasing the model size from the smallest to the largest in our experiments improves accuracy by at least 7.4%.

MathQA
Language models generally show poor performance on mathQA in CONIC10K. Under the zero-shot CoT setting, most models achieve an accuracy close to 0. Even after finetuning, the accuracy of the best model is still 35.0 percentage points lower than that of human experts.
Problems that are simple in the finetuning setting may not be simple in the zero-shot CoT setting. Most models finetuned on CONIC10K perform best on the Simple Numbers answer category. However, in the zero-shot CoT setting, GPT-4 and GPT-3.5-turbo obtain the best accuracy on Coordinate. One possible reason is that after sufficient training on CONIC10K, a model can develop a shallow understanding of the task (Patel et al., 2021), including the frequent answers to a specific kind of question. Since Simple Numbers are simpler in form and have fewer potential answers than Coordinates, being familiar with the answer distribution can effectively increase the probability of hitting the correct answer. In the zero-shot CoT setting, however, the model is unaware of these distributions, so it has no advantage on difficult problems that have simple answers.
The accuracy is close to 0 in zero-shot CoT. Under the zero-shot CoT setting, Bloomz-7b1 and Falcon-7b-inst show extremely poor performance, with 0 accuracy on all problems. These models tend to generate repetitive patterns and, in most cases, fail to give an answer. The other models, except for GPT-4, generate text that looks like a valid rationale, but the majority of the reasoning steps are incorrect. They often hallucinate premises and rules and derive wrong results. As Table 9 shows, even with in-context demonstrations or majority voting, the performance remains low. We showcase some failing cases in Table 10.
The scaling law is less clear than in semantic parsing. Although increasing the model size continuously and effectively improves performance in semantic parsing, this phenomenon disappears in mathQA. In the mT5 and mT0 series, larger models do not necessarily outperform smaller ones. Similar observations have been made on MATH (Hendrycks et al., 2021), where the authors find that accuracy on math problems increases only modestly with model size.
Chinese-oriented language models perform better on mathQA in CONIC10K. In the zero-shot CoT setting, the two Chinese-oriented models, Ziya-13b and ChatGLM-6b, achieve the best performance after GPT-3.5-turbo. In the LoRA finetuning setting, ChatGLM-6b achieves an accuracy of 22.5% and outperforms the other models by a large margin.
Translating the problems into English does not bring the performance of GPT-4 on par with human experts in mathQA. We translate the problems into English and evaluate GPT-4 in the zero-shot CoT setting to determine whether the poor performance is due to the language or to the long reasoning steps. The results in Table 8 show that translating the problems into English significantly improves the accuracy from 15.5% to 26.0%. However, this is still low compared to the 57.5% of human experts. Therefore, the primary challenge of mathQA in CONIC10K still lies in performing mathematical reasoning correctly.

Case Study
We inspect and analyse both success and failure cases in the experiment, which leads us to some interesting findings.
LLMs have limited ability in understanding long LaTeX expressions. 9.7% of the incorrect predictions from mT0-xl are due to errors in translating simple but long LaTeX expressions. Common failures include missing terms, flipped signs and incorrect copies. For example, the LaTeX expression in one problem is x^2+y^2+2\sqrt{2}x-4\sqrt{2}y+10-r^2=0, but the translated sentence becomes -4*sqrt(2)*y+2*sqrt(2)*x+x^2+y^2+2=-r^2. In this example, we observe both a flipped sign and an incorrect constant. We do not observe similar errors in relatively short LaTeX expressions.

Table 7: Results on mathQA in CONIC10K. ChatGLM-6B achieves the best overall accuracy after finetuning using LoRA among all the models. In the full finetuning setting, mT0-xl shows the strongest performance. In the zero-shot CoT setting, GPT-4 has the highest overall accuracy. However, the performances of the above models are significantly lower than that of human experts. GPT-4 is evaluated on 200 randomly sampled problems. Human Expert is evaluated on 50 randomly sampled problems. The Text accuracy of Human Expert is empty because the sampled problems do not contain answers of the Text category.

Table 9: Results on mathQA in CONIC10K using GPT-3.5-turbo with in-context learning (ICL) or self-consistency (SC).

  Model                        Overall Accuracy
  GPT-3.5-turbo + CoT          6.2
  GPT-3.5-turbo + CoT + ICL    5.9
  GPT-3.5-turbo + CoT + SC     6.8

Models can hardly find shortcuts in reasoning in mathQA. We observe that models usually employ naive approaches to solve problems and fail to find shortcut solutions, which leads to more complicated computation and longer reasoning steps. The additional reasoning steps and computation make the models more likely to make mistakes during reasoning. Some examples of naive solutions from GPT-4 and the corresponding shortcut solutions are listed in Table 11 and Table 12.

GPT-4 and GPT-3.5-turbo probably lack knowledge about certain concepts. When asked problems about focal distance, GPT-4 and GPT-3.5-turbo keep giving incorrect answers and often give a value that is half of the ground truth. Based on these observations, we suspect that these two models lack knowledge about focal distance. We ask GPT-4 and GPT-3.5-turbo to explain what focal distance is in both Chinese and English, and they keep defining it as the distance between the center of an ellipse or hyperbola and one of its foci instead of the correct definition, the distance between the two foci. A probable reason is that 'focal distance' is not a commonly used term in the English corpus, making the models unlikely to acquire correct knowledge about it.

Conclusion
We present CONIC10K, a math problem understanding and reasoning benchmark. It provides problems that require complex reasoning while only involving knowledge about conic sections in Chinese senior high school education. We test popular LLMs on both semantic parsing and math question answering, inspecting model performance and behaviour. Results show that existing LLMs, including GPT-4, perform poorly in mathematical reasoning, while most models achieve good (though not yet perfect) performance in mathematical understanding. We analyse the model predictions in detail and find that LLMs tend to hallucinate in reasoning, often fail to find shortcut solutions, and may lack the knowledge needed to solve problems. We hope our dataset, CONIC10K, can help uncover the weaknesses of LLMs in mathematical understanding and reasoning and inspire more advanced techniques to enhance the mathematical reasoning ability of LLMs.

Limitations
CONIC10K is a dataset with high-quality formal representation annotations, but some limitations remain: • We design the formal representation to be accurate, unambiguous and close to natural language, but such a representation is not in common use and does not fit any existing symbolic reasoner. Our conclusions may not apply to other formal representations, such as propositional logic and first-order logic, or to rationales such as executable programs.
• In conic sections, the commonly used mathematical reasoning strategies may be limited. For example, our problems may require solving systems of simultaneous equations, but rarely mathematical induction. Therefore, our dataset cannot evaluate some reasoning strategies, such as mathematical induction.

Ethics Statement
CONIC10K is a dataset that requires massive data sources and heavy annotation. We claim that our work is free of ethical risks from the following perspectives: Data Source The problems in CONIC10K are collected from two websites that do not limit the use of their data for education and research purposes. We strictly follow the terms of use and manually check all the data during the annotation stage to avoid inappropriate content.
Annotation We hire a group of 14 annotators for formal representation annotation and sign contracts prescribing the rights of both sides. We clearly state the purpose of our study and the future use of the data. The annotators are well paid for their work. The authors take responsibility for maintaining the annotation website, providing the necessary documents, answering questions from the annotators and cleaning up the data.
A.1 Assertional Logic
Here, we briefly introduce the syntax of AL. Given a specific domain, the syntactic structure of AL is composed of three components: individuals, concepts and operators. Individuals represent objects in the domain, concepts represent groups of objects, and operators represent relationships and connections among individuals and concepts. Operators are similar to functions and predicates in first-order logic (FOL), but they can accept higher-order constructs (e.g., concepts and concepts of concepts), which leads to the strong expressiveness of AL.
An assertion is of the form a = b, where a and b are two terms (individuals, either atomic or compound). A knowledge base in AL is simply a set of assertions.

A.2 Our Representation
We adopt AL as our formal representation because of its strong readability. Our principles are that the formal representation should 1) avoid ambiguity: it should resolve the ambiguity in natural language, and with the information inside the annotations alone, it should be possible to work out the solution by hand; 2) be close to natural language: it should be able to represent the problem without rephrasing it; 3) be simple and clear: a representation with thousands of operators would certainly be expressive and powerful, but it would sacrifice the strength of the logic and fail to capture the common knowledge underneath.
Therefore, we use only 94 operators and 20 concepts (see Table 2) to represent all the problems in the dataset. To better accommodate natural language, we also design 3 pseudo-operators: OneOf, WhenMin and WhenMax. These operators do not fit the semantics of AL, but they greatly simplify the representation and are closer to natural language. It is also trivial to convert these operators into terms in AL.
There has also been evidence that rephrasing significantly impacts learning (Kwiatkowski et al., 2013). To avoid rephrasing, we write detailed documents for the annotators, ask them to raise questions when they are not confident, and frequently check the data during annotation.
We design our representation in three components: declarations, facts and queries.
Declarations The declarations define individuals with their types.It has the format of var: type, where var is an individual and type is a concept.These sentences are a special representation of the assertion Is(var, type) = True.For simplicity, we allow defining multiple individuals in one sentence, with commas separating different individuals.
Facts The facts are assertions that describe the conditions in the problem. For clarity, we allow the use of syntactic sugar, which includes $<$, $\le$, $>$, $\ge$, $+$, $-$, $\times$, $\div$ and $\frac{a}{b}$. That is, a sentence could be an inequality such as a > b, which denotes the assertion (a > b) = True.
Queries The queries are the terms that represent the target of the problem. Formally, a query ought to be an assertion whose left-hand side (LHS) is the query term and whose right-hand side (RHS) is an unknown individual in AL, but we use the simpler format during annotation.

A.3 Annotation Quality Control
Our preliminary study shows that annotating formal language is extremely hard for humans: it is difficult even for an experienced annotator to reach an accuracy above 50%. As a result, we employ multiple measures to control the dataset quality, including: 1. We provide a rule-based AI assistant that completes most of the annotations with high precision.
2. We only hire annotators with the highest performance on the small dataset we built in advance.
3. During the annotation, we ask the annotators to raise questions whenever they are not confident about how to annotate.We provide detailed documents and dedicated help to ensure the correctness of the annotation.
4. In addition to formal representations, we ask the annotators to annotate the text spans.We find it helps to increase the annotation accuracy.
5. Each problem is annotated by two annotators individually and then passed to a validator. We design a web UI that automatically compares the two annotations and extracts the differences. The validator determines which one is correct, or requests a third annotation. 6. Every time the annotators finish 1,000 annotations, we randomly sample 10 problems for additional checks. After all the annotations are finished, we randomly sample 200 problems for additional checks. In each additional check, we independently annotate the sampled problems and then compare them with the existing annotations. We ask the annotators to do a thorough check if the accuracy is below 80%.
7. We provide competitive payments (> 150k CNY in total, ≈ 20k USD) to the annotators.We allow adequate time for the annotation process.
8. After the annotation is finished, we finetune a zh-en translation model for further validation. We split the whole dataset into five random splits of the same size. Then we pick four of them to finetune the model and collect predictions on the remaining split. We manually check all the problems whose predictions do not match the annotations. We repeat this process five times to obtain the final dataset.

B Additional Dataset Statistics
We show the frequency of keywords in Figure 4, and the distribution of question length in Figure 5.
When computing question length, we count each LaTeX command, such as \frac or \leftarrow, as one token.
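One illustrative way to implement this counting is sketched below. It assumes, beyond what is stated above, that every other non-whitespace character also counts as one token; the paper does not specify its tokenisation beyond the LaTeX-command rule.

```python
import re

# A LaTeX command (backslash followed by letters) counts as one token;
# any other non-whitespace character counts as one token.
TOKEN = re.compile(r"\\[a-zA-Z]+|\S")

def count_tokens(text: str) -> int:
    """Count tokens in a question string under the convention above."""
    return len(TOKEN.findall(text))
```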

C Additional Results
We showcase some interesting outputs from the models. Models including Bloomz-7b and ChatGLM-6b often generate redundant text and hallucinate when reasoning. We provide some samples in Table 10.
GPT-4 shows a strong ability to perform reasoning in natural language, but its solutions often fail to find the shortcuts designed into the problems. Such solutions often require complicated calculations. Due to its limited ability in arithmetic calculation, the final answer from GPT-4 is then likely to be incorrect. We provide some samples in Tables 11 and 12.

Problem, Rationale & Solution
Given that M is a point on the parabola $x^2 = 4y$, F is its focus, and point A … Hallucination in reasoning by ChatGLM-6b: According to the problem, we know that the eccentricity of the ellipse is $\frac{1}{2}$, and one of its foci coincides with the focus of the parabola $y^2 = 8x$. First, let's find the coordinates of the focus of the parabola $y^2 = 8x$. According to the definition of a parabola, the x-coordinate of the focus is $\frac{p}{2}$, where $p$ is the focal length. Since the standard equation of the parabola is $y^2 = 4px$, we can determine that $p = 2$. Therefore, the x-coordinate of the focus is $\frac{2}{2} = 1$...

Problem, Rationale & Solution
Given a point P on the parabola with equation $y^2 = 4x$, let the distance from point P to the y-axis be $d$. For a given point $A(4, 5)$, what is the minimum value of $|PA| + d$?

Figure 2: Distribution of reasoning steps in 50 sampled problems from CONIC10K. All numbers are rounded to their nearest integers.

Figure 4: Word cloud of the keywords in CONIC10K.

…Therefore, the minimum value of $|PA| + d$ is $\frac{144}{17}$.

Reference shortcut solution: Since the distance between a point on a parabola and its focus is the same as the distance between the point and the directrix of the parabola, $|PA| + d$ can be rewritten as $|PA| + d_1 - d_2 = |PA| + |FP| - d_2$, where $d_1$ is the distance between $P$ and the directrix, $d_2$ is the distance between the directrix and the y-axis, and $F$ is the focus. By the definition of the parabola, $d_2 = 1$ and $F = (1, 0)$. Since $|PA| + |FP| - d_2 \ge |AF| - d_2 = \sqrt{(4-1)^2 + (5-0)^2} - 1 = \sqrt{34} - 1$, the minimum value of $|PA| + d$ is $\sqrt{34} - 1$.

Table 3: Answer categories with examples and descriptions.

Table 4: Models used in our experiments. Chinese-oriented refers to whether methods such as increasing the proportion of Chinese data or designing a tokenizer for Chinese are used to improve performance on Chinese tasks. IT stands for instruction tuning and RLHF stands for reinforcement learning from human feedback.

Table 5: Instructions used in finetuning. The placeholder problem is replaced by the problem text during training.

Table 6: Results on semantic parsing in CONIC10K. The fully finetuned mT0-xl achieves the highest accuracy, while the LoRA-finetuned Vicuna-7b achieves the lowest syntax error rate.

Table 8: Results on Chinese problems and on problems translated into English in mathQA in CONIC10K using GPT-4 with zero-shot CoT. Both are evaluated on the same 200 sampled problems.

Table 10: Translated failing cases. The red text marks the reasoning step where the hallucination takes place.

Table 11: Translated examples of solutions from GPT-4 and shortcut solutions. The red text marks the reasoning step where the solution goes wrong.