TRIGO: Benchmarking Formal Mathematical Proof Reduction for Generative Language Models

Automated theorem proving (ATP) has become an appealing domain for exploring the reasoning ability of the recent successful generative language models. However, current ATP benchmarks mainly focus on symbolic inference and rarely involve the understanding of complex number combination reasoning. In this work, we propose TRIGO, an ATP benchmark that not only requires a model to reduce a trigonometric expression with step-by-step proofs but also evaluates a generative LM's reasoning ability on formulas and its capability to manipulate, group, and factor number terms. We gather trigonometric expressions and their reduced forms from the web, annotate the simplification process manually, and translate it into the Lean formal language system. We then automatically generate additional examples from the annotated samples to expand the dataset. Furthermore, we develop an automatic generator based on Lean-Gym to create dataset splits of varying difficulties and distributions in order to thoroughly analyze the model's generalization ability. Our extensive experiments show that TRIGO poses a new challenge for advanced generative LMs, including GPT-4, which is pre-trained on a considerable amount of open-source formal theorem-proving language data, and provides a new tool to study a generative LM's ability on both formal and mathematical reasoning.


Introduction
Automated theorem proving (ATP) requires formal reasoning and deduction from a conclusion back to axioms or known theorems. This task requires general and flexible reasoning and is easy to validate, making it an appealing domain for exploring the reasoning ability of the recent successful pre-trained generative language models. These models show strong proof generation capabilities (Lample et al., 2022; Jiang et al., 2022), but their ability to perform formal mathematical proof reduction, which involves complex numerical reasoning, has not been thoroughly explored.

Figure 1: The key is to rewrite π/12 into (1/2) * (π/6) (the green part) and apply the half-angle formula (the orange part). Both steps need an understanding of numbers and formulas.
Current ATP benchmarks (Wu et al., 2021a; Han et al., 2021; Zheng et al., 2022) mainly focus on symbolic inference but rarely involve the understanding of complex number combination reasoning, such as term grouping, term factorization, and equivalent substitution. For advanced mathematical proving such as trigonometric expression reduction, ATP can be beneficial for evaluating this crucial complex number combination. For example, as shown in Figure 1, to correctly reduce the left-hand-side expression, one must recognize specific angles or terms such as cos(π/12), be capable of applying the half-angle formula, and know that the result is cos((1/2) * (π/6)).

Figure 2: An example of GPT-4 struggling to solve TRIGO. GPT-4 corrects its response after the second prompt (box on the right), but continues generating erroneous tactics (highlighted in red with a cross mark).
Such proof steps can be automatically denoted by the formal language deduction as shown in the right-hand-side, and then verified via the interactive theorem-proving environment in ATP.
To develop such a profound ATP evaluation for current generative LMs, we propose the task of Trigonometric Expression Reduction (TRIGO). Given a trigonometric expression, a model is required to accept formal input in the Lean formal language and then perform step-by-step proof reduction. The proposed TRIGO poses a new challenge for current state-of-the-art generative LMs. Figure 2 presents an illustrative example of a proof generated by GPT-4 (OpenAI, 2023). Inspired by self-refine (Madaan et al., 2023), we first provide the prompt "Please help me prove this problem using Lean: lemma Trigo_0 : sin(107 * pi) = 0 :" to GPT-4, and GPT-4 generates non-existent tactics and gives incorrect equations. We further prompt it with "We require to use lemma from Lean's standard library for the proof and to ensure the correctness of the equation" to correct it. However, in the second attempt, GPT-4 still applies the tactic "sin_periodic_pi", which does not exist in the Lean standard library, and comments it with "−− available in Lean standard library". Surprisingly, GPT-4 produced the correct equation "have h: sin(107*pi) = sin(1*pi)" in the second proof attempt, even though its proof of this subgoal is incorrect. This example demonstrates both the potential of GPT-4 in accurately manipulating numbers and formulas and the challenge of strict formal reasoning posed by the TRIGO task.
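For intuition, a correct reduction of this goal can combine an arithmetic rewrite with a periodicity lemma. The sketch below is illustrative only: the lemma name sin_add_two_pi_mul is a hypothetical stand-in for a periodicity lemma and has not been checked against mathlib.

```lean
-- Illustrative sketch, not a mathlib-verified proof: rewrite 107π as π + 2π·53,
-- reduce with 2π-periodicity, and close with sin π = 0.
lemma Trigo_0 : sin (107 * pi) = 0 :=
begin
  have h : (107 : ℝ) * pi = pi + 2 * pi * 53, by ring,
  rw h,
  rw sin_add_two_pi_mul,  -- hypothetical lemma: sin (x + 2π·k) = sin x
  exact sin_pi,           -- sin π = 0
end
```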
To construct the TRIGO dataset, we collect trigonometric expression reduction problems and corresponding answers from high school exercises and exams. We then develop interactive annotation software to manually label the reduction steps and formalize the processes into the "Lean" formal language. Finally, based on this manually formalized data, we develop an automatic proof generation program to expand the dataset with real-world data and create datasets of artificially generated samples. Specifically, we generate 3 types of samples by controlling their proof length and generating trigonometric functions with larger numerical values to assess the models' ability to generalize to out-of-distribution data.
Our contributions are three-fold:
• We propose the new trigonometric expression reduction task, which is the first to explore formal mathematical reasoning abilities with regard to both formula and numerical understanding.
• We construct the new TRIGO dataset with manually labeled reduction steps and convert them to the formal language Lean (de Moura et al., 2015). We also generate extra samples with controlled difficulties and distributions to further evaluate different aspects of generative LMs.
• We conduct extensive experiments and detailed analysis of a broad range of methods, identifying new challenges for current state-of-the-art generative LMs.
Related Work

A large body of work studies mathematical reasoning with informal, natural-language benchmarks (Saxton et al., 2019; Hendrycks et al., 2021; Cobbe et al., 2021; Shen et al., 2021). Our work is most similar to Wu et al. (2021a), who use formal mathematical reasoning to reduce equations and inequalities and employ programs to automatically generate proofs to explore combinational generalization. However, they lack real-world problems to assess the model's generalization to real distributions and do not involve complex combinations of numerical operations.
Compared with previous work (Wu et al., 2021a), which has 18 axioms and 9 transformations, our generation process has a total of 85 transformation rules and diverse sampled parameters. Our proposed TRIGO generates complex samples with controlled difficulties and distributions, and includes manually annotated samples and proof steps from real-world problems.

Background on Lean Environment
Formal language systems are effective tools for strictly verifying the correctness of each proof step generated by the model. In this work, we use Lean (de Moura et al., 2015) as the formal environment.

TRIGO Dataset
In this section, we first introduce how we collect trigonometric expression reduction problems from "tiku"1, annotate the step-by-step reduction processes, and transform them into the Lean formal language to create the TRIGO-real and TRIGO-web datasets. Then we introduce how to automatically generate data to construct TRIGO-gen.

Problem Collection
We collect the trigonometric expression reduction problems from "tiku", a large-scale math problem set drawn from textbooks and exams. Specifically, we collect problems and their answers from the "trigonometry" topic, eventually obtaining 427 problems, which we denote as TRIGO-real. To expand our dataset, we further collect additional trigonometry reduction problems from different websites found through search engines; these sources contain high school math exam questions with standard answers. After manually filtering duplicate problems, we obtain an additional 453 samples as TRIGO-web and use them as a test set. Throughout the collection process, we aim to gather data as randomly as possible, ensuring diversity in the distribution of the test set so that it reflects the model's performance on real human exam questions.

Interactive Proof Annotation
The collected problems have only the final results, without step-by-step reduction processes. To facilitate the annotation of these crucial processes, we develop interactive software specifically tailored for this purpose. The annotation process has the following steps:
Step 1. The software shows an expression to the annotator.
Step 2. The annotator inputs a transformation equation that will be applied in the next step.
Step 3. The software checks whether the equation is valid by matching it with a rule in a predefined bank. If no rule is matched, the software reports "No Matched Rule" and goes back to Step 2.
Step 4. The software applies this transformation to the current problem. If it succeeds, the software outputs the new expression. Otherwise, the software reports "Rule Failed" and goes back to Step 2.
Step 5. Repeat Steps 2-4 until the expression equals the answer.
Equation-Rule Matching. In Step 3, each annotated equation must match a predefined rule to ensure its correctness. We define a total of 85 rules that cover most trigonometric transformations. Some examples are shown in Table 1.
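As a toy illustration of this Step 3 check (the real system uses 85 rules and sympy-based parsing; the rule names, patterns, and tolerance below are our own assumptions), a rule can pair a pattern over the annotated left-hand side with callables that verify the equation numerically:

```python
import math
import re

# Hypothetical miniature rule bank. Each rule: (name, LHS pattern, and two
# callables used to check that the rule's two sides agree on the parsed args).
RULES = [
    ("sin_add",
     re.compile(r"sin\((.+)\+(.+)\)"),
     lambda x, y: math.sin(x + y),
     lambda x, y: math.sin(x) * math.cos(y) + math.sin(y) * math.cos(x)),
    ("cos_double",
     re.compile(r"cos\(2\*(.+)\)"),
     lambda x: math.cos(2 * x),
     lambda x: 2 * math.cos(x) ** 2 - 1),
]

def match_rule(lhs_str, args):
    """Return the first rule whose pattern matches the annotated LHS and whose
    two sides agree numerically on `args`; otherwise report no match."""
    compact = lhs_str.replace(" ", "")
    for name, pattern, lhs_fn, rhs_fn in RULES:
        if pattern.fullmatch(compact) and abs(lhs_fn(*args) - rhs_fn(*args)) < 1e-9:
            return name
    return "No Matched Rule"

print(match_rule("sin(pi/12 + pi/6)", (math.pi / 12, math.pi / 6)))  # sin_add
print(match_rule("tan(x)", (0.5,)))                                  # No Matched Rule
```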

Lean Formalization
After obtaining the stepwise reduction annotation, we manually transform the annotated equations into the Lean formal language. Since we use Lean-Gym with a mathlib (mathlib, 2020) backend as our formal environment, Lean-Gym can only accept tactics defined inside mathlib. To ensure that our defined trigonometric rules are correctly accepted and processed in Lean-Gym, we derive these rules from mathlib theorems and convert them into tactics before adding them to mathlib.
We then construct the framework of the proof script. The script begins with the keyword "lemma", followed by a name and the premises of the lemma, and then presents the goal equation, where the left-hand side (LHS) represents the original expression and the right-hand side (RHS) represents its reduced result. Lastly, we add the "begin", "sorry", and "end" keywords, where "sorry" is a placeholder that will be replaced in the following steps.
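Before any annotated steps are converted, such a script skeleton looks like the following (the lemma name and statement are illustrative, not taken from the dataset):

```lean
-- Skeleton only: the goal equates the original expression (LHS) with its
-- reduced result (RHS); `sorry` is the placeholder later replaced by the
-- converted tactic steps.
lemma Trigo_example : sin (13 * pi / 6) = 1 / 2 :=
begin
  sorry
end
```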
Given the empty proof script, we convert the annotated steps into Lean tactics. Recall that during the annotation phase, each annotated step is matched with a predefined rule, which can be further converted to a Lean-Gym tactic using a Python program. Thus, we only need to apply the corresponding rule with the proper arguments parsed by sympy, taking, for example, the equation sin(13π/6) = sin(13π/6 − 2π). Although the above step can complete most of the transformation to Lean, we still need to manually fix the Lean proof. For example, Lean does not reduce the new goal state sin(13π/6 − 2π) to sin(π/6), and a rewriting tactic "rw sin(x+y) = • • •" fails when applied to "sin(y + x) • • • = • • •" because Lean cannot match sin(x + y) with sin(y + x). Thus, we manually add further steps so that Lean-Gym can correctly process the entire proof.

Generated Data
To comprehensively analyze the performance of models across various levels of difficulty and different ranges of numbers, as well as to study the gap between generated and real-world data, we automatically generate trigonometric problems and proofs by repeatedly applying randomly chosen predefined rules. Specifically, we randomly choose a rule r from our predefined rule bank, select corresponding value arguments X and Y from the value list C, and initialize K with an integer value between 0 and 100 to construct our initial goal expression G. Then, at each step, we sample a rule r and try to match it with the LHS or RHS of the goal expression G. If either side is matched, we replace the corresponding part of G by applying rule r's corresponding set of Lean tactics.
During the replacement, we need to determine the value arguments X, Y, and K given the goal G. For example, consider the expression sin(3π/4) and the rule sin(X + Y) = sin(X)cos(Y) + sin(Y)cos(X), where the argument X + Y must equal 3π/4. We first sample the value of X from the value list C, then calculate Y = 3π/4 − X.
To obtain the parameter K for some rules, such as cos(X) = sin(2πK − X + π/2), we uniformly choose an integer in [0, 100] as its value.
To control the difficulty of the generated samples, under the assumption that the difficulty of a problem increases with the number of sampled rules, we sample and apply 1, 2, and 3 rules to construct TRIGO-gen, denoted as TG-1, TG-2, and TG-3, respectively. For each rule length, we generate 9,000 training samples, 1,000 validation samples, and 1,000 test samples. To close the gap between generated and real-world samples, we also use the trigonometric expressions in TRIGO-real as initial goal expressions G, then sample and apply exactly 3 rules to generate the set TG-E as additional generated training data.
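A minimal sketch of this generation loop, with a hypothetical two-rule bank and string-level replacement standing in for the paper's 85 rules and Lean tactic emission (rule templates and the angle list are illustrative assumptions):

```python
import random

# Hypothetical rule bank: each rule maps a template containing argument "{X}"
# to an equivalent but more complex form; the real generator also emits the
# Lean tactics justifying each replacement.
RULES = [
    ("sin({X})", "sin({X} + 2*pi)"),           # periodicity rewrite
    ("cos({X})", "(1 - 2*sin({X}/2)**2)"),     # half-angle expansion
]
ANGLES = ["pi/6", "pi/4", "pi/3", "3*pi/4"]    # stand-in for the value list C

def generate(num_rules, rng):
    """Start from a sampled atomic goal and try `num_rules` random rules,
    recording each successful replacement as one proof step."""
    x = rng.choice(ANGLES)
    goal = rng.choice(["sin({X})", "cos({X})"]).format(X=x)
    steps = [goal]
    for _ in range(num_rules):
        lhs_tpl, rhs_tpl = rng.choice(RULES)
        lhs, rhs = lhs_tpl.format(X=x), rhs_tpl.format(X=x)
        if lhs in goal:                 # rule's LHS matches part of the goal
            goal = goal.replace(lhs, rhs, 1)
            steps.append(goal)
    return steps

for s in generate(3, random.Random(0)):
    print(s)
```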

Data Statistics
Finally, TRIGO-real has 427 problems and a total of 10,574 proof tactics. We divide TRIGO-real into train, validation, and test splits with a 7:1:2 ratio, resulting in 299 training samples, 42 validation samples, and 86 test samples. The average proof step sizes for TRIGO-real, TG-1, TG-2, TG-3, and TG-E are 37, 22, 35, 49, and 81, respectively. Since Lean-Gym only accepts one tactic at a time, the tactic length of each problem in the dataset typically matches its proof step size. Figure 4 displays the sample proportions with respect to tactic length. We observe that the generated samples have similar tactic lengths, while the real-world data spans a wider but more evenly distributed range of lengths. More statistics are in Appendix G.

Baseline Models
Recent works utilize GPT-based language models for automated theorem proving and have made significant improvements (Polu and Sutskever, 2020; Han et al., 2021; Polu et al., 2023; Jiang et al., 2022; Zheng et al., 2023). In this work, we use GPT-2 (Radford et al., 2019) with a proof search algorithm as a baseline method for our dataset.

Data Preparation
Lean-Gym (Polu et al., 2023) provides an interactive formal environment that returns a new goal state given the previous state and a tactic, as shown in Figure 3. During training, at each step we obtain the (state, tactic) pairs from Lean-Gym and the training samples, respectively, and concatenate them into a sequence with the "GOAL" and "PROOFSTEP" special tokens: GOAL ⟨state⟩ PROOFSTEP ⟨tactic⟩.

Table 2: Pass rates of benchmark models and baselines.
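Concretely, the training sequences can be assembled as below. This is a trivial sketch: the special-token strings follow the paper, while the example state and tactic are illustrative.

```python
def make_example(state: str, tactic: str) -> str:
    """Concatenate a (state, tactic) pair into one training sequence using the
    GOAL / PROOFSTEP special tokens. At inference time, everything up to and
    including PROOFSTEP is the prompt and the tactic is what the model predicts."""
    return f"GOAL {state} PROOFSTEP {tactic}"

# Illustrative pair; the tactic name here is hypothetical.
print(make_example("⊢ sin (13 * pi / 6) = 1 / 2", "rw sin_sub_two_pi"))
```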
We take the above sequence as input and train the GPT-2 models to predict the "tactic" sequence with the autoregressive loss (Bengio et al., 2000):

L(θ) = − Σ_i log p_θ(x_i | x_1, ..., x_{i−1}),

where θ indicates the model parameters and x_i is the i-th token of the input sequence.

Proof Search. After training GPT-2 to generate a tactic given a goal state, we search for the complete proof by expanding the most probable state at each step. We employ Breadth-First Search (BFS) in this paper. Specifically, we define the probability of a goal state as the cumulative logarithm of the probabilities of its generated tactics:

log p(state_N) = Σ_{i=1}^{N} log p(tactic_i),

where p(tactic_i) is the probability of the tactic generated by GPT-2, and Lean-Gym outputs the new state state_N by applying tactic_i to the previous goal state state_{N−1}. At each proof search step, we select the goal state with the highest probability and feed the sequence "GOAL ⟨state⟩ PROOFSTEP" into the trained GPT-2 to generate tactics. We sample 8 tactics based on the GPT-2 output probability.
The generated tactics, together with the goal state, are input to Lean-Gym to obtain a new valid goal state if possible. We repeat the search process until the "no goals" state is reached, the queue becomes empty, or the maximum of 512 search steps is reached.
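The search loop can be sketched as follows; `generate_tactics` and `apply_tactic` stand in for the trained GPT-2 and the Lean-Gym environment, and the toy versions at the bottom are purely illustrative:

```python
import heapq
import math

def bfs_proof_search(init_state, generate_tactics, apply_tactic,
                     max_steps=512, n_samples=8):
    """Best-first search over goal states: each state is scored by the
    cumulative log-probability of the tactics that produced it, and the
    highest-scoring state is expanded first."""
    # heapq is a min-heap, so store negated log-probabilities.
    queue = [(0.0, init_state, [])]
    for _ in range(max_steps):
        if not queue:
            return None
        neg_logp, state, proof = heapq.heappop(queue)
        if state == "no goals":
            return proof
        for tactic, prob in generate_tactics(state, n_samples):
            new_state = apply_tactic(state, tactic)
            if new_state is not None:        # environment accepted the tactic
                heapq.heappush(queue, (neg_logp - math.log(prob),
                                       new_state, proof + [tactic]))
    return None

# Toy environment: the proof "A" -> "B" -> "no goals" takes two good steps.
def toy_gen(state, n):
    return [("step_good", 0.9), ("step_bad", 0.1)]

def toy_apply(state, tactic):
    table = {("A", "step_good"): "B", ("B", "step_good"): "no goals"}
    return table.get((state, tactic))

print(bfs_proof_search("A", toy_gen, toy_apply))  # ['step_good', 'step_good']
```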

Experiment
In this section, we evaluate the performance of GPT-2-BASE (GPT-2-B), GPT-2-LARGE (GPT-2-L), and GPT-2-L-PACT, a GPT-2-LARGE pre-trained on the formal proof dataset PACT (Han et al., 2021). Furthermore, we evaluate the models' out-of-distribution generalization ability by examining performance across various levels of difficulty and different ranges of numbers, while also evaluating the impact of generated data distributions beyond those observed in real-world data. Additionally, we conduct a comprehensive analysis of the models, including an evaluation of GPT-4's performance.

Implementation Details
All models are trained with the Adam optimizer (Kingma and Ba, 2015), a learning rate of 2.5 × 10^−4, a batch size of 512, and a cosine schedule. More implementation details are provided in Appendix A.

Main Results

From Table 2 we find that: (1) Models with different parameter sizes (GPT-2-B, GPT-2-L) achieve similar performance on both TRIGO-real and TRIGO-web when trained on the smaller dataset TRIGO-real; on TRIGO-gen, which has more training samples, larger parameter scales lead to better performance; (2) GPT-2-L-PACT achieves the best results on each test set, indicating the significant improvement from pre-training on PACT and raising the question of whether we can achieve a similar improvement by fine-tuning on TRIGO-gen.
To study the above question, we merge the training splits of the generated datasets TG-1, TG-2, and TG-3 to fine-tune GPT-2-B and GPT-2-L and to further train GPT-2-L-PACT. We denote the resulting models as GPT-2-B-D, GPT-2-L-D, and GPT-2-L-PACT-D and evaluate their pass rates. We find that only GPT-2-L-PACT-D, pre-trained on PACT, obtains a significant improvement on TRIGO-gen. However, when we continue to train GPT-2-L-PACT-D on TRIGO-real, the performance does not improve significantly, achieving accuracies of only 23.25% and 13.02% on TRIGO-real and TRIGO-web, respectively.
To explore the gap between TRIGO-real and TRIGO-gen, we train GPT-2-L-PACT on TG-E, whose samples are generated starting from expressions in TRIGO-real; we denote the result as GPT-2-L-PACT-E. Compared with GPT-2-L-PACT, GPT-2-L-PACT-E achieves a 2.33% improvement on the TRIGO-real test set but a 0.22% decrease on TRIGO-web. These results suggest that solely increasing the proof length of the training data with the generation program does not enhance model performance on TRIGO-web. We posit that this is due to the significant distribution gap between TRIGO-real and TRIGO-web, which makes it challenging for data generated from TRIGO-real to generalize to TRIGO-web.
To investigate the out-of-distribution generalization ability across datasets of varying difficulty, we present the results in Table 3, which shows the results of training the model only on TG-1, TG-2, or TG-3 and testing on the other test sets with different distributions. All models perform consistently worse than in the in-distribution setting. On TG-1, the best GPT-2-L-PACT trained on the more complex TG-3 dataset is still 20.1% lower than the same model trained on TG-1 alone. On TRIGO-real and TRIGO-gen, however, the GPT-2-L-PACT models trained on the three generated datasets separately perform worse than GPT-2-L-PACT-E trained on TG-E. This demonstrates that initializing the automatic theorem generation program with input derived from real-world data effectively bridges the distribution gap between real-world and generated data.

Model Analysis
In this section, we perform a comprehensive analysis of GPT models on TRIGO under various settings. We mainly evaluate the PACT pre-trained model as it achieves the best overall performance.
Stepwise Evaluation. We first evaluate single-step generation performance. We obtain all (goal state, tactic) pairs and select the pairs whose tactic is not "have" as the set "w/o have". We compare the model's top-1 and top-8 output tactics with the ground truth and consider the model correct if any of the generated tactics exactly matches the ground truth. The results are shown in Table 4.
GPT models achieve high performance on tactics other than "have", with EM scores above 86% for the top-8 generated tactics. However, the prediction of "have" tactics poses a significant challenge in overall proof generation, especially on the TRIGO-real dataset, where there is an 8.45% gap in EM@8 between "all" tactics and tactics without "have". On the TRIGO-gen datasets, this gap increases as the number of proof steps increases.
To explore the impact of model size on the accuracy of single-step proofs, we evaluate GPT-4. We believe that LLMs such as GPT-4 have already demonstrated their ability to process formal language, particularly in translating informal proofs to formal proofs (Jiang et al., 2023; Wu et al., 2022). Furthermore, LLMs have shown promise in theorem proving with proof assistants such as Lean (Yang et al., 2023), which reports that GPT-4 can generate accepted proofs in a zero-shot manner, establishing GPT-4 as a strong baseline. These results suggest that GPT-4 may have been trained on Lean examples, as such proofs were publicly accessible on GitHub prior to GPT-4's data cutoff of September 2021 (OpenAI, 2023).
We also evaluate the single-step performance of GPT-4, focusing specifically on the "have" tactics. Since the "have" tactic only requires generating the sub-goal equation without additional Lean knowledge, evaluating GPT-4's performance on these tactics effectively reflects its ability in complex number combination reasoning. For each generation, we randomly select 8 (goal state, tactic) pairs from the training set as in-context learning examples for GPT-4. Table 5 presents the exact match scores obtained. The experimental results highlight a significant performance gap between GPT-4 and the fine-tuned smaller GPT-2 models across all settings. Table 6 showcases several one-step proofs generated by GPT-4.

Table 6: One-step proofs generated by GPT-4 given in-context learning (ICL) or natural language instruction (Instruction) and a new goal (Goal), compared with the ground truth (GT).

Search Evaluation
We conduct three experiments to study the effects of tactic decoding and proof search methods. We first compare beam search and sampling. When generating tactics with beam search, we use a beam size of 16 and expand the proof goal with the top-8 tactics. For sampling, we select each token according to the model's output probability and sample 8 tactics. Table 7 shows the proof pass rates; sampling achieves better performance. After inspecting the model outputs, we further observe that sampling produces more diverse tactics, explores various search paths, and is particularly effective at discovering number combinations.
We then explore the effect of increasing the sampling temperature. With a temperature of 1.5, the model generates many illegal characters but outputs more diverse tactics. Reducing the temperature to 1.25 significantly decreases the illegal characters, improving the model's pass rate on TRIGO-real, TG-1, and TG-2. These results suggest a future direction of developing decoding methods that generate diverse yet valid tactics.
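Temperature scaling divides the logits before the softmax, flattening or sharpening the next-token distribution. A toy sketch of the effect (the logit values are illustrative):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature before normalizing: higher temperatures
    flatten the distribution (more diverse but more invalid tokens), lower
    temperatures sharpen it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [4.0, 2.0, 0.5]
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 1.5)
# The top token's probability mass shrinks as temperature grows.
print(round(sharp[0], 3), round(flat[0], 3))
```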
To compare different search algorithms, we lastly implement breadth-first search (BFS) and Monte Carlo tree search (MCTS) following previous work (Silver et al., 2017). Surprisingly, MCTS does not excel on our dataset: it performs significantly worse than BFS on the artificially synthesized TRIGO-gen. We suppose this is due to the lack of a well-developed value function, which should be addressed in future work.
Expert Iteration. Figure 5 demonstrates model performance across expert iterations. Generated samples in TRIGO-gen usually have multiple proof paths, so we apply expert iteration (Polu et al., 2023) to discover diverse and better proof paths. Specifically, we train the GPT-2-L-PACT model on TG-i, where i ∈ [1, 3]. We then employ the trained models to prove the training set samples of TG-i; if a proof passes, we add it to the original TG-i, expanding it into a new training set TG^1-i, where 1 indicates the first iteration. We retrain GPT-2-L-PACT on TG^1-i, generate new proofs, add them to TG^1-i, and obtain TG^2-i. We repeat this process 7 times, obtaining TG^1-i through TG^7-i. We train the model on the seven datasets and evaluate them on the original test set of TG-i. As shown in Figure 5, the model's pass rate improves significantly across all three TRIGO-gen datasets, with the largest improvement of 10.9% on TG-2. This highlights the importance of proof-path diversity in the training data for enhancing model performance.

Large Angle Values Evaluation

To evaluate the model's generalization ability on numerical reasoning, we expand the range of angle values C that can be sampled during generation to a more complex set, and extend the maximum value of K from 100 to 1000. We generate harder test sets for TG-1/2/3, respectively. We find that all models, including the strongest baseline GPT-2-L-PACT-D, achieve a pass rate of 0 on these OOD test sets, revealing the limitation of current language models on numerical reasoning.

Conclusion
In this paper, we introduce TRIGO, a dataset focusing on trigonometric expression reduction for formal mathematical reasoning with both real-world and generated samples. To the best of our knowledge, TRIGO is the first Lean-based dataset with manually annotated and automatically generated reduction proofs for exploring the formal mathematical ability of current language models.
Our comprehensive experiments reveal that, in comparison to generated data, pre-training on PACT significantly enhances performance on real-world problems. Furthermore, expanding the data scale by utilizing real-world data as the starting point for the theorem generation program can effectively boost the model's performance on the real-world test set. Additionally, we reveal the model's inability to generalize numeric operations to larger unseen numbers, and find that both the diversity of tactics and of search paths have significant impacts on the final proof pass rate.

Ethics Statement
The trigonometric expression reduction dataset TRIGO is obtained from the Internet. After collecting the data, we formalize it in Lean, which verifies the correctness of each proof without any bias involved.
When annotating TRIGO-real, we utilize not only the publicly available answers from "tiku", but also compose some of the answers ourselves. As for collecting unlabeled data for TRIGO-web, we make efforts to gather solutions from diverse sources, encompassing blogs, documentation, Q&A communities, and even videos.

Limitations
Our evaluation metric focuses solely on verifying the correctness of the model's proofs: we consider the "no goals" output in the interactive environment Lean-Gym as the indication of success, and this indicator serves as our metric for assessing model performance. In future work, we aim to introduce improved evaluation metrics that assess the model's ability to generate a wider range of proof paths. Due to regional constraints, we cannot access the services offered by OpenAI, such as GPT-4 and GPT-3.5. Therefore, the evaluation of GPT-4 and GPT-3.5 in our paper was entrusted to researchers from a research institution outside the restricted region, who conducted this part of the assessment.

A Implementation Details.
We employ identical hyperparameters to train GPT models on both the PACT and TRIGO datasets. The Adam optimizer (Kingma and Ba, 2015) is utilized with a learning rate of 2.5 × 10^−4 and a cosine schedule, while the batch size is set to 512. During training on TRIGO, we set a maximum epoch limit of 20 and select the epoch that achieves the lowest validation set loss. For PACT training, we conduct an initial pre-training epoch on the mathlib, mix1, and mix2 datasets provided by PACT. During the proof search phase, beam search is applied to generate tactics with a beam size of 16, and we consider the top 8 outputs. Additionally, a maximum budget of 512 search steps is allotted. All experiments are executed on 8 Nvidia Tesla V100 GPUs.

B Case Study
We provide an example of the model's search on TG-1 in this section. As shown in Figure 8, this example demonstrates a correct proof step, and we can see that the model employs the "have" tactic to make multiple assumptions. However, the goals of many hypotheses are trivial to prove, so determining how to make useful assumptions is a crucial factor in the model's proof accuracy. Furthermore, the model's ability to combine the numbers in the "have" tactic also determines whether the model can reach the correct proof path.
As shown in Figure 9, this example demonstrates a proof step where the model fails. The model appears to struggle to output diverse hypotheses to explore more proof paths when forming a search tree, instead generating a large number of identical "have" tactics. This highlights the importance of using a variety of exploration strategies to improve accuracy.
As shown in Table 10, GPT-4 with the in-context learning approach performs well in both the second and third examples, while the method using natural language instructions fails completely. These examples illustrate that GPT-4 is capable of learning the compositional relationships of numbers through in-context learning in one-step proofs.

C Experimental Details of Monte Carlo Tree Search
We provide here the details of our implementation of Monte Carlo Tree Search. Following the PUCT rule of Silver et al. (2017), we select each tactic t* as:

t* = argmax_{t ∈ A} [ Q(g, t) + c · P_θ(t | g) · sqrt(Σ_{b ∈ A} C(g, b)) / (1 + C(g, t)) ],

where Q(g, t) denotes the value of sampling tactic t in proof state g, A denotes all the tactics that can be sampled, P_θ(t | g) denotes the prior probability, and C(g, t) denotes the visit count of tactic t in state g. In our experiments, we always set the constant c to 1, and we use the cumulative probability of the model's output tactics as the value of Q(g, t).
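A minimal sketch of this PUCT-style selection rule (the tactic names and statistics below are illustrative, not taken from our experiments):

```python
import math

def select_tactic(stats, c=1.0):
    """PUCT-style selection: pick the tactic t maximizing
    Q(g,t) + c * P(t|g) * sqrt(total visits) / (1 + C(g,t)).
    `stats` maps tactic -> (Q, prior P, visit count C) for one proof state g."""
    total = sum(cnt for _, _, cnt in stats.values())
    def score(item):
        q, p, cnt = item[1]
        return q + c * p * math.sqrt(total) / (1 + cnt)
    return max(stats.items(), key=score)[0]

stats = {
    "rw sin_pi": (0.5, 0.6, 10),   # higher value, but heavily visited
    "have h":    (0.2, 0.3, 1),    # lower value; exploration bonus applies
}
print(select_tactic(stats))  # have h
```

With these numbers the exploration term outweighs the value gap, so the less-visited tactic is chosen; as its visit count grows, selection shifts back toward the higher-value tactic.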

D Informal Mathematics Benchmarks
In contrast to formal benchmarks, informal benchmarks lack data annotated in a formal theorem-proving language. Constructing formal mathematics is time-consuming and demands a high level of mathematical expertise from contributors, whereas informal math problem datasets, represented in natural language, are more convenient to construct. Math word problems (Koncel-Kedziorski et al., 2016; Wang et al., 2017; Patel et al., 2021; Cobbe et al., 2021; Xiong et al., 2022, 2023; Yu et al., 2023) target elementary students, querying an unknown variable given a natural language situation description. MATH (Hendrycks et al., 2021) contains 12,500 high school math competition problems with natural language statements and solutions. These datasets, collected from real human problem-solving, better reflect the real distribution but lack strict formal verification to ensure correctness. NaturalProofs (Welleck et al., 2021) uses natural language to describe mathematical statements and proofs, while Saxton et al. (2019) synthetically generate sequence-to-sequence math problems represented as pure strings, covering various topics. However, due to the ambiguity of natural language, the correctness of the proof process in these works cannot be verified.

E Neural Theorem Proving
DeepHOL (Bansal et al., 2019b) first applies reinforcement learning to automatic theorem proving without human-written proofs, achieving the best performance on HOList. ASTactic (Yang and Deng, 2019) treats tactics as programs and composes abstract syntax trees (ASTs) during tactic generation. LIME (Wu et al., 2021b)

F More Details on Automatic Sample Generation
As illustrated in Algorithm 1, we have devised our automated sample generation program by drawing inspiration from the manual annotation process of problem-solving, as depicted in Figure 7.
Lean-Gym implements a strict replacement strategy and does not perform reduction operations like sympy. Specifically, when performing the expression replacement step, since we use sympy to parse the expression tree, sympy automatically reduces the expression to a new equation eq_t, causing misalignment with the equation eq_lean in Lean's proof goal. To solve this problem, we use the "have" tactic.

G Dataset Statistics
In this section, we present the average proof length for TRIGO, along with the split of the training, validation, and test sets in TRIGO-real. We show the proportion of different tactics in Figure 6. Additionally, we display the proportion of proofs containing the "have" tactic in Table 11. Table 12 shows the average proof lengths of the datasets in TRIGO.
Table 13 shows our split on TRIGO-real.
We observe that TG-2 has the average proof length closest to TRIGO-real. From Table 11, the proportion of proofs containing the "have" tactic in TG-2 is smaller than in TRIGO-real. Although the average proof lengths differ within TRIGO-gen, the occurrence of the "have" tactic is roughly the same. Additionally, upon closer examination of the data, we find that the "have" tactics in TG-3 are comparable in complexity to the "have" tactics in some of the TRIGO-real data. These statistics reflect the disparity between the distribution of generated data and real-world data.

H General Tactic
In this section, we present several tactics we generated along with their corresponding annotations. Table 14 shows several typical tactics, such as "field_simp", a high-level tactic that can handle many expressions involving field operations such as addition, multiplication, and inverses. We do not consider "tidy" when generating the data because this tactic easily times out.
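As a minimal sketch of the kind of goal field_simp handles (assuming mathlib's field_simp and ring tactics; this example is illustrative, not taken from the dataset):

```lean
-- Clear the denominators with `field_simp`, then close the
-- resulting polynomial identity with `ring`.
example (a b : ℝ) (ha : a ≠ 0) (hb : b ≠ 0) :
  1 / a + 1 / b = (a + b) / (a * b) :=
by { field_simp, ring }
```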

I Rule Specifications
In this section, we present all the manually defined rules in Tables 15-17. For each rule, we create corresponding tactics to ensure that the proof of the problem can be collected backward after generating the corresponding tactics forward. We demonstrate the mapping of some rules to their corresponding tactics in Table 38. Figure 7 illustrates the complete process of annotating missing reduction steps: an input equation is matched with the rule specifications, undergoes a one-step transformation, and results in a new problem state.
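As an illustrative sketch of such a rule-to-tactic correspondence (assuming mathlib's real.sin_two_mul lemma; the dataset's actual rule and tactic names may differ):

```lean
-- A double-angle rule realized as a single rewrite tactic (illustrative).
example (x : ℝ) : real.sin (2 * x) = 2 * real.sin x * real.cos x :=
by rw real.sin_two_mul
```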

J Annotator Demographics
Our annotation team consists of four Master's students and three PhD students.

K Data Example
We show examples from our dataset in this section.

L Large Language Model test examples
In this section, we demonstrate examples of in-context learning and zero-shot methods on large language models. In contrast to the single-step proof in-context learning in Table 6, the in-context examples presented in this section directly provide the entire proof. We find that it is challenging for the model to provide a correct proof. Tables 30, 32, 26, and 28 show examples of our in-context learning approach tested on our dataset using the GPT-3.5 and GPT-4 models. Tables 31, 33, 27, and 29 show the outputs of these models. We find that LLMs have difficulty learning to use the correct "have" tactic, which suggests that LLMs may not be able to manipulate numbers and lack generalization abilities such as the commonly used techniques of grouping and factoring in trigonometric reduction.
We conduct numerous zero-shot tests on our dataset using GPT-3.5 and GPT-4, employing the prompt "Please help me prove the following lemma using Lean: lemma Trigo_0 : [PROBLEM] :=". However, we find that GPT-4 tends to output many meaningless tactics, as in the case shown in Table 35. Moreover, GPT-4 often generates tactics that do not exist in the dependencies, as also seen in Table 35. The example in Table 35 further reveals that GPT-4 is prone to making trivial assumptions. By comparing the outputs of GPT-3.5 and GPT-4 in Tables 34 to 37, we observe that GPT-4 is more inclined to generate "have" tactics, which is closer to the proof pattern of our dataset.
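A minimal sketch of how such a zero-shot prompt can be assembled (the function name and lemma name default are illustrative; the prompt template itself is the one quoted above):

```python
def build_zero_shot_prompt(problem: str, name: str = "Trigo_0") -> str:
    """Format the zero-shot prompt used to query GPT-3.5/GPT-4.

    `problem` is the Lean statement to be proved, e.g.
    "sin(pi/3) + 2 * cos(pi/12) ** 2 - cos(pi/2) = sqrt(3) + 1".
    """
    return (
        "Please help me prove the following lemma using Lean: "
        f"lemma {name} : {problem} :="
    )

prompt = build_zero_shot_prompt("real.cos (real.pi / 2) = 0")
print(prompt)
```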
All the experiments above indicate the limitations of LLMs on our dataset. We leave to future work how to enable large language models to acquire the ability to manipulate complex number combinations and to reduce hallucinations.

field_simp — The goal of field_simp is to reduce an expression in a field to the form n / d, where neither n nor d contains any division symbol.

simp — In Lean, simp is a tactic that stands for "simplification". It simplifies expressions and goals by applying a set of predefined rewrite rules and simplification procedures, automatically transforming complex or convoluted expressions into simpler forms that are easier to work with and reason about.

ring_exp — A tactic for solving equations in commutative (semi)rings where the exponents can also contain variables.

ring — Evaluates expressions in the language of commutative (semi)rings.

assumption — Looks through the assumptions in the context of the current goal and, if one matches the conclusion, applies it.

repeat assumption — Looks through the assumptions in the context of all goals and, whenever an assumption in the context of the current goal matches the target, applies it.

left — Tries to solve the left disjunct immediately by assumption; if that fails, it tries to focus on the right disjunct; and if that does not work, it invokes the assumption tactic.

refl — In Lean, refl is an abbreviation for "reflexivity". It automatically proves goals of the form a = a, where a is any term or expression, asserting the fundamental property that any term is equal to itself.

have — A keyword used in proof scripts to introduce a new intermediate goal or hypothesis. It allows the user to assert a proposition and prove it separately before continuing with the rest of the proof.

conv — Short for "conversion"; allows step-by-step rewriting and manipulation of expressions within a proof, providing a flexible way to apply rewrite rules, simplify expressions, and rearrange terms.

to_lhs — A modifier typically used within a tactic block such as conv or rewrite to specify the side of the equation or expression to be modified. When to_lhs is used, the tactic focuses on the LHS and performs the specified operations on that side.

rw — Stands for "rewrite". Applies a specific rewrite rule to an expression or goal within a proof, replacing occurrences of a specified term or pattern with a different one according to a given equality.

apply — Applies a theorem or hypothesis as a rule to prove a goal or to generate new subgoals, allowing the user to use an existing proposition to infer or establish other propositions.

congr_arg — Applies congruence to a function applied to an argument; it is used to prove equalities by reasoning about the effect of a function on its arguments.
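A few of these tactics in action, as a minimal illustrative sketch (standard Lean 3 with mathlib; these are not proofs from the dataset):

```lean
-- `refl` closes a goal of the form a = a.
example (a : ℕ) : a = a := by refl

-- `have` introduces an intermediate fact; `rw` rewrites with equalities.
example (a b : ℕ) (h : a = b) : a + 0 = b :=
by { have h0 : a + 0 = a := nat.add_zero a, rw [h0, h] }
```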

Figure 1 :
Figure 1: The task of trigonometric expression reduction. The key is to rewrite π/12 into (1/2) * (π/6) (the green part), and apply the half-angle formula (the orange part). Both steps need an understanding of numbers and formulas.

Figure 3 :
Figure 3: The proof flow produced by the interactive Lean-Gym environment. The language model generates proof steps given the formal prompts until it reaches "no goals". More details of the tactics used are given in Appendix H. Lean-Gym (Polu et al., 2023) is an interactive environment that allows language models to interact with formal systems. As depicted in Figure 3, we begin by acquiring the initial goal state G_1 as "⊢ sin(π/3) + 2 * cos(π/12) ** 2 − cos(π/2) = sqrt(3) + 1". This goal state is input to the language model with the prompt "GOAL G_1 PROOF-STEP". Subsequently, GPT-2 generates the corresponding tactic T_1 as "rw cos_pi_div_two,". Given the goal state and tactic, Lean-Gym outputs a new goal state "⊢ sin(π/3) + 2 * cos(π/12) ** 2 − 0 = sqrt(3) + 1" for the language model to obtain the next tactic. We iteratively perform this process until Lean-Gym returns "no goals", which indicates the proof is complete.
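The loop above can be sketched as follows (a schematic stand-in, assuming callable stubs for the model and Lean-Gym rather than their real APIs):

```python
def prove(model, gym, init_goal, max_steps=64):
    """Alternate between tactic generation and Lean-Gym until 'no goals'.

    `model` maps a prompt string to a tactic string; `gym` maps a
    (goal, tactic) pair to the next goal state. Both are stand-ins.
    """
    goal = init_goal
    proof = []
    for _ in range(max_steps):
        tactic = model(f"GOAL {goal} PROOF-STEP")
        goal = gym(goal, tactic)  # apply the tactic, get the next goal state
        proof.append(tactic)
        if goal == "no goals":
            return proof          # proof complete
    return None                   # search budget exhausted

# Toy stand-ins for demonstration: the "model" always emits one tactic
# and the "gym" immediately closes the goal.
toy_model = lambda prompt: "rw cos_pi_div_two,"
toy_gym = lambda goal, tactic: "no goals"
print(prove(toy_model, toy_gym, "⊢ sin(π/3) + 2 * cos(π/12) ** 2 − cos(π/2) = sqrt(3) + 1"))
# → ['rw cos_pi_div_two,']
```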

Figure 4 :
Figure 4: Tactic length distribution based on the number of tactics.

Figure 5 :
Figure 5: The accuracy of the GPT-2-L-PACT model on the test set during expert iteration.

Figure 7:
Figure 7: The interactive annotation system for trigonometry reduction. (a) The interface of our system. Region ① shows the problem to be annotated. Region ② is the main interaction area where annotators input an equation for the current step. The system then matches it with the rule bank, performs a one-step transformation, and outputs a new problem state. Region ③ shows the annotation history, and region ④ includes interactive buttons for annotators to change or reset the problem, check examples, and consult trigonometry knowledge to help their annotation. (b) The workflow of our annotation system.

Table 1 :
Examples of our pre-defined rule bank.

Table 3 :
Pass rates of models on OOD test set.

Table 2
presents the pass rates of different models on TRIGO-real (training, validation, and test sets), TRIGO-web (test set only), and TRIGO-gen (training, validation, and test sets). The pass rate indicates the percentage of problems for which a model outputs a correct proof within the maximum number of search steps by interacting with Lean-Gym. All models are trained and evaluated on the corresponding training and test splits, except that for TRIGO-real the models are trained on the TRIGO-real training split and tested on both the TRIGO-real test split and TRIGO-web.

Table 4 :
The single-step performance of GPT-2-L-PACT on different datasets. EM@k represents the exact match score of the top-k generated tactics.

Table 5 :
Exact match scores of single-step performance on the "have" tactic. We obtain the tactic with the highest probability from GPT-2-L-PACT, and provide GPT-4 with an 8-shot prompt (randomly sampled from the (state, tactic) pairs in the training set that contain the "have" tactic).

Table 7 :
Pass rate at different decoding methods.

Table 8 :
Pass rate with different sampling temperatures.

Table 9 :
Pass rate at different search methods.
To evaluate the model's out-of-distribution (OOD) generalization

Table 11 :
Statistics for the ratio of the "have" tactic.

Table 12 :
Average number of tactics per dataset.
Tables 18 to 25 show typical examples in TRIGO. These examples compile correctly within Lean-Gym, and after compilation we interact with Lean-Gym to obtain the final training data.

Table 14 :
Examples of general tactic.