Chaining Simultaneous Thoughts for Numerical Reasoning

Given that rich information is hidden behind ubiquitous numbers in text, numerical reasoning over text should be an essential skill of AI systems. To derive precise equations to solve numerical reasoning problems, previous work focused on modeling the structures of equations, and has proposed various structured decoders. Though structure modeling proves to be effective, these structured decoders construct a single equation in a pre-defined autoregressive order, potentially placing an unnecessary restriction on how a model should grasp the reasoning process. Intuitively, humans may have numerous pieces of thoughts popping up in no pre-defined order; thoughts are not limited to the problem at hand, and can even be concerned with other related problems. By comparing diverse thoughts and chaining relevant pieces, humans are less prone to errors. In this paper, we take this inspiration and propose CANTOR, a numerical reasoner that models reasoning steps using a directed acyclic graph: diverse reasoning steps are produced simultaneously without pre-defined decoding dependencies, and relevant ones are compared and chained to reach a solution. Extensive experiments demonstrate the effectiveness of CANTOR under both fully-supervised and weakly-supervised settings.


Introduction
Numerical reasoning over text is an essential skill for a neural model to help analyze rich numerical information from large-scale textual data (Chen et al., 2021). Many question answering benchmarks (Dua et al., 2019; Patel et al., 2021) have been created to promote the numerical reasoning ability of neural models, where, typically, models are required to answer questions about given contexts with numerical answers. This is challenging, as it requires comprehensive structural analyses of text as well as precise and possibly complex deduction. Existing models mostly decode the equations and return the execution results. To better exploit structures of equations, many complex structured decoders (Xie and Sun, 2019; Cao et al., 2021) have been proposed and significantly outperform sequential decoding (Tan et al., 2021). However, all these methods construct a single equation in a pre-defined order (e.g., top-down or bottom-up order), which may place an unnecessary restriction on how a model should grasp the reasoning process.
Intuitively, after reading a reasoning problem, humans may have several pieces of thoughts which pop up in no pre-defined order, and finalize a solution by comparing and chaining relevant pieces. Take Fig 1(a) for example. Possible thoughts include necessary reasoning steps (e.g., how to get the number of boys and girls separately) and loosely-relevant ones (e.g., those learned from previous similar questions like "how many more girls than boys are in the school?"). There is arguably no pre-defined strict order in which a thought should conditionally emerge after some other thoughts. By comparing these diverse thoughts, we finally select and chain proper ones to reach a solid solution, which will be less prone to mistakes.
In this paper, we propose CANTOR, which compares and Chains simultANeous ThOughts for numerical Reasoning. As in Fig 1(b), CANTOR constructs a Directed Acyclic Graph (DAG) of diverse reasoning steps in a non-autoregressive way: all vertices are produced simultaneously, which correspond to operations like addition, and edges in the graph are constructed by chaining operations with their best-matched operands; the final equation is a selected sub-graph in the whole DAG. With no pre-defined decoding order, logical dependencies among reasoning steps are freely captured by the model internally. With our training methods, CANTOR captures diverse reasoning steps at different vertices, and learns to prune away possibly-distracting candidates during both training and inference, resulting in chaining reasoning steps that are more consistent with given problems.
To summarize, compared with previous models with structured decoding, CANTOR has no pre-defined restrictions on the decoding dependencies while also benefiting from modeling the structures of equations. Besides, by comparing diverse reasoning steps and chaining logically consistent ones, our model is less prone to errors. Our model establishes a new state-of-the-art record on two math word problem datasets under the fully-supervised setting, and is also applicable to weakly-supervised scenarios (where problems are only annotated with final answers, and the equations are unavailable) with significant improvements over baselines. Though not directly comparable, on two numerical reasoning datasets, fully-supervised CANTOR achieves even higher accuracies than hundreds-of-times larger language models (e.g., PaLM-62B (Chowdhery et al., 2022)) that use the effective chain-of-thought prompting technique (Wei et al., 2022), demonstrating CANTOR's great potential.

Related Work
Numerical Reasoning Numerical reasoning tasks can be formulated in many ways (Mishra et al., 2022), such as (1) question answering with numerical answers directly derived from arithmetic operations (Koncel-Kedziorski et al., 2016; Wang et al., 2017; Dua et al., 2019; Amini et al., 2019; Miao et al., 2020; Patel et al., 2021), or (2) other tasks like quantitative natural language inference (Ravichander et al., 2019) whose expected outputs are non-numerical but which require implicit arithmetic reasoning. In this work, we focus on the former type of task, which is widely studied. To generate equations precisely, previous work proposed to enhance number-related representations in problem encoding (Zhang et al., 2020; Shen and Jin, 2020; Liang et al., 2021), re-rank equation samples with a verifier (Shen et al., 2021; Cobbe et al., 2021), or exploit the structures of equations with complex top-down tree-structured decoding (Xie and Sun, 2019; Li et al., 2022) or bottom-up DAG-structured decoding (Cao et al., 2021; Jie et al., 2022). Our numerical reasoner also models equations with DAGs, but with three major differences: (1) there is no pre-defined decoding order, which may place an unnecessary burden on how a model should learn the dependencies among operations; (2) the decoding process is largely simplified, reduced to simultaneous predictions of an operator and operands at each vertex of the graph; (3) our model explores diverse operations in a DAG and is trained to compare and chain relevant ones, so that the logical consistency between given problems and equations is better captured during both training and inference.
Non-Autoregressive Decoding Our model is also relevant to non-autoregressive decoding. For machine translation, non-autoregressive translation (Gu et al., 2018; Ghazvininejad et al., 2020; Du et al., 2021) aims at fast inference; the recently proposed DA-Transformer (Huang et al., 2022), which utilizes a DAG to capture diverse translations, has made great progress in bridging the performance gap with autoregressive models. Recent work has also proposed non-autoregressive models for efficient task-oriented semantic parsing (Babu et al., 2021; Shrivastava et al., 2021), which achieved comparable performance with autoregressive parsers. All these methods model a target as a sequence and adopt token-wise decoding (one token at a position). By contrast, we model a target as a DAG and adopt step-wise decoding (one complete reasoning step at each vertex), which facilitates structure modeling and learning meaningful vertex representations. Experimental results show that our model significantly outperforms both autoregressive and non-autoregressive baselines. Notably, for open text generation, autoregressive methods are probably still the better choice for strong probabilistic modeling of diverse targets. However, for the numerical reasoning task we focus on, it is the logical relationships among quantities (both known and unknown in a given problem) that matter, and non-autoregressive methods, with proper designs, suffice to decode equations precisely and can provide new perspectives on how numerical reasoning can be better grasped by neural models.

Task Definition
Given a problem description X which mentions a list of numbers N = {n_1, n_2, ..., n_|N|}, our task is to return the numerical answer A, which is derived from an equation Y that takes arithmetic operations (e.g., addition, subtraction, multiplication, division, and exponentiation) on N as well as a set of pre-defined constants C = {c_1, c_2, ..., c_|C|}.
For scenarios that consider only binary operators, a ground-truth equation Y can be formally defined as follows:

Y = {y_1, y_2, ..., y_|Y|},  y_i = ⟨y_i^f, y_i^a, y_i^b⟩,  y_i^f ∈ F,  y_i^a, y_i^b ∈ C ∪ N ∪ {y_1, ..., y_{i-1}},

where F is the set of pre-defined operators, and y_i is an operation that applies the operator y_i^f to the two operands y_i^a and y_i^b. Y can be directly transformed into a DAG with c_i, n_i, and y_i being vertices, and y_i → y_i^a and y_i → y_i^b being edges. The final operation (the root vertex) y_|Y| returns the answer.
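As a hedged illustration (the class and variable names below are ours, not the paper's), an equation Y can be stored as a list of ⟨operator, operand, operand⟩ triples and executed bottom-up over the DAG:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Operation:
    f: str   # operator symbol from the pre-defined set F
    a: str   # first operand: "c0" (constant), "n0" (number), or "y0" (earlier operation)
    b: str   # second operand

OPS = {"+": lambda x, y: x + y, "-": lambda x, y: x - y,
       "*": lambda x, y: x * y, "/": lambda x, y: x / y,
       "^": lambda x, y: x ** y}

def execute(equation: List[Operation], numbers, constants):
    """Evaluate the DAG bottom-up; the last operation is the root."""
    values = {f"n{i}": v for i, v in enumerate(numbers)}
    values.update({f"c{i}": v for i, v in enumerate(constants)})
    for i, op in enumerate(equation):
        values[f"y{i}"] = OPS[op.f](values[op.a], values[op.b])
    return values[f"y{len(equation) - 1}"]

# Hypothetical problem with numbers N = [5, 3, 2] and constants C = [1, 10]:
# y0 = n0 + n2, y1 = y0 - n1; the root y1 returns the answer.
eq = [Operation("+", "n0", "n2"), Operation("-", "y0", "n1")]
print(execute(eq, numbers=[5, 3, 2], constants=[1, 10]))  # → 4
```

Because operands may only reference constants, numbers, or earlier operations, the structure is acyclic by construction.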

Overview
We propose to model diverse reasoning steps with a DAG. Vertices of the graph correspond to reasoning steps which are decoded in parallel. This is analogous to humans' burst of thoughts after reading a reasoning problem. No pre-defined restriction is placed on how a reasoning step should conditionally depend on others; logical dependencies among reasoning steps are captured by the model internally. Our DAG also allows the model to explore diverse reasoning steps at different vertices, including necessary or wrong ones; the model is trained to compare the semantics of diverse reasoning steps and chain the most proper ones to be the final equation, which benefits model performance.

Architecture
Our model (Fig 2) comprises a pre-trained Transformer encoder (e.g., RoBERTa) and a shallow Transformer-based DAG decoder. The encoder encodes a problem X; from the encoder outputs, we obtain the representations of mentioned numbers N = [n_1, n_2, ..., n_|N|] ∈ R^{d×|N|} (d is the hidden size). The DAG decoder, with positional embeddings as inputs and cross attention over encoder outputs, produces representations for L vertices V = {v_1, v_2, ..., v_L} in a non-autoregressive way, denoted as V = [v_1, v_2, ..., v_L] ∈ R^{d×L}. Each vertex representation encodes the semantics of a reasoning step, including its operator, the expected operands, and the meaning of the resulting quantity. We then verbalize the operator for each vertex and chain it with its best-matched operands in parallel, and finally select one root vertex and return its execution result. The selected root vertex along with its vertex descendants constitutes a decoded sub-graph, which is also the DAG representation of an equation. Let Z be the decoded sub-graph, which can be formulated as

Z = {z_1, z_2, ..., z_|Z|},  z_j = ⟨z_j^f, z_j^a, z_j^b⟩,

where z_j is the operation for the vertex at position p_j, with z_j^f being the operator, and z_j^a and z_j^b being its operands. p_|Z| is the index of the root vertex.

Definition of P_θ(Z|X): P_θ(Z|X) can be further decomposed based on operations in Z:

P_θ(Z|X) = P_r(p_|Z| | X) · Π_{j=1}^{|Z|} P_z(z_j | p_j, X),  P_z(z_j | p_j, X) = P_f(z_j^f | p_j, X) · P_a(z_j^a | p_j, X) · P_b(z_j^b | p_j, X),

Figure 2: Overview of CANTOR. CANTOR models diverse operations using a DAG. Each vertex corresponds to an operation, which is chained with its operands via edges in the graph. We decode an equation by simultaneously verbalizing operators at each vertex, chaining operations with operands, and selecting the root vertex; the selected root vertex along with all its descendants is the resulting equation in DAG format. In this example, the ground-truth equation Y can be represented by the decoded sub-graph Z, as mapping y_1 to v_2 and y_2 to v_4 produces Z exactly.
where P_r(·) and P_z(·) are the probability functions of the root vertex and an operation, respectively; P_f(·) and P_a(·) (P_b(·)) are for operator verbalization and operand matching, respectively.
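A toy numeric sketch of this factorization: the log-probability of a decoded sub-graph is assembled from per-vertex distributions. All probability values below are made up for the sketch:

```python
import math

# Hypothetical per-vertex distributions for a toy DAG, indexed by vertex
# position; each vertex has an operator distribution (p_f) and two operand
# distributions (p_a, p_b). "v1" refers to the operation at vertex 1.
p_f = {1: {"+": 0.9, "-": 0.1}, 2: {"-": 0.8, "+": 0.2}}
p_a = {1: {"n0": 0.95, "n1": 0.05}, 2: {"v1": 0.7, "n0": 0.3}}
p_b = {1: {"n2": 0.9, "n1": 0.1}, 2: {"n1": 0.85, "n2": 0.15}}
p_root = {1: 0.1, 2: 0.9}  # P_r: which vertex is the root

def log_prob_subgraph(operations, root):
    """log P(Z|X) = log P_r(root) + sum over operations of
    [log P_f(operator) + log P_a(operand 1) + log P_b(operand 2)]."""
    lp = math.log(p_root[root])
    for pos, (f, a, b) in operations.items():
        lp += math.log(p_f[pos][f]) + math.log(p_a[pos][a]) + math.log(p_b[pos][b])
    return lp

# Z: v1 = n0 + n2, v2 = v1 - n1, root vertex = 2
Z = {1: ("+", "n0", "n2"), 2: ("-", "v1", "n1")}
print(log_prob_subgraph(Z, root=2))
```

Because the factorization is a plain product over vertices, all terms can be computed in parallel at decode time.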

Verbalizing Operators
We verbalize an operator for each vertex based on its representation:

P_f(· | p_j, X) = softmax(W_f v_{p_j}),

where W_f ∈ R^{|F|×d} is a trainable parameter matrix.

Chaining Operations with Operands
Each operation is connected with its best-matched operands, chosen from all available quantities (including the other operations, constants, and mentioned numbers). Let C = [c_1, c_2, ..., c_|C|]^⊤ be the embedding matrix for pre-defined constants.
Then the representation matrix for all quantities, Q = [q_1, q_2, ...], is obtained by concatenating the constant, number, and vertex representations. The probability distribution over candidates when predicting the first operand for the vertex at position p_j can be computed as

P_a(q_i | p_j, X) = softmax_i((W_a q_i)^⊤ (W_q v_{p_j})),

and the probability for predicting the second operand, P_b(·), is computed likewise with W_b. W_q, W_a, W_b ∈ R^{d×d} are trainable parameters. To avoid cycles in the graph, we apply probability masking so that a vertex cannot use itself or vertices with larger indices as its operands.
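A sketch of operand matching with the acyclicity mask. The bilinear matching score is our reading of the score form; the weights and representations are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L, n_fixed = 8, 4, 3   # hidden size, #vertices, #constants + #numbers

V = rng.normal(size=(L, d))          # vertex representations v_1..v_L
fixed = rng.normal(size=(n_fixed, d))  # constants and mentioned numbers
Q = np.concatenate([fixed, V], axis=0)  # all candidate quantities

W_q = rng.normal(size=(d, d))
W_a = rng.normal(size=(d, d))

# scores[j, i]: match between query vertex j and candidate quantity i,
# i.e., (W_q v_j) dot (W_a q_i)
scores = (V @ W_q.T) @ (Q @ W_a.T).T

# Acyclicity mask: vertex j may not pick itself or any later vertex;
# candidates at offset >= n_fixed + j (vertex j and beyond) are forbidden.
for j in range(L):
    scores[j, n_fixed + j:] = -np.inf

# Row-wise softmax: masked candidates get exactly zero probability.
P_a = np.exp(scores - scores.max(axis=1, keepdims=True))
P_a /= P_a.sum(axis=1, keepdims=True)

print(P_a.shape, np.allclose(P_a.sum(axis=1), 1.0))
```

Masking with -inf before the softmax guarantees the decoded graph stays acyclic without any sequential dependency between vertices.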

Selecting the Root Vertex and Finalizing the Equation
The final equation is represented by a sub-graph of the whole DAG, which comprises a selected root vertex and all its descendants. We introduce a special vertex v_{L+1} at position L+1 of the decoder, whose representation v_{L+1} (computed the same way as the other vertex representations) is matched against all vertex representations to select the best-matched root vertex, yielding P_r(· | X).

Training
To capture diverse reasoning steps at different vertices, we explore four training methods, namely, naïve mapping, hard EM, MML, and hard EM with annealing. Notably, as will be discussed in Section 5.7.3, in practice one need not consider all four training methods; hard EM with annealing should be the default choice.

Naïve Mapping
A naïve way of mapping from Y to V is to map y_i to v_i, i.e., the i-th operation to the i-th vertex. Let Z′ be the resulting sub-graph; the training objective is then

L_naïve = − log P_θ(Z′ | X),

which leaves {v_j | |Y| < j ≤ L} unused.

Hard EM
Hard EM optimizes the probability of the Z* that best aligns with Y:

Z* = argmax_{Z∈Γ} P_θ(Z | X),  L_hard-EM = − log P_θ(Z* | X),

where Γ is the set of sub-graphs that denote valid mappings from operations in Y to decoding positions. As |Γ| can be quite large, we use beam search to find Z* approximately, which is feasible as P_θ(Z|X) can be factorized into the probabilities of the constituent operations of Z (Eq 3). Notably, the probability of an operation depends on which vertices its operands (if they are operations) are mapped to. We search for Z* by iteratively determining where to map y_i, until y_|Y| is settled.
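The search for Z* can be sketched by brute force on a tiny instance (the paper uses level-wise beam search instead); the scoring table below is a hypothetical stand-in for the model's operation probabilities:

```python
import random
from itertools import permutations

# Ground-truth equation Y: "yk" refers to an earlier operation, "nk" to a number.
Y = [("+", "n0", "n2"), ("-", "y0", "n1")]
L = 4   # graph size: number of decoder vertices
random.seed(0)

# Toy stand-in for the model: probability of realizing operation y_i at
# vertex p. A real scorer multiplies operator and operand probabilities,
# where operand probabilities depend on where operand operations were mapped.
score = [[random.random() for _ in range(len(Y))] for _ in range(L)]

def valid(mapping):
    """Operand operations must be mapped strictly before their users."""
    pos = {f"y{i}": p for i, p in enumerate(mapping)}
    return all(pos[ref] < mapping[i]
               for i, (_, a, b) in enumerate(Y)
               for ref in (a, b) if ref in pos)

def hard_em_target():
    """Brute-force argmax over all valid mappings of Y onto L vertices."""
    best, best_score = None, float("-inf")
    for mapping in permutations(range(L), len(Y)):
        if not valid(mapping):
            continue
        s = 1.0
        for i, p in enumerate(mapping):
            s *= score[p][i]
        if s > best_score:
            best, best_score = mapping, s
    return best, best_score

mapping, s = hard_em_target()
print(mapping, s)
```

The validity check mirrors the acyclicity mask: an operation can only consume vertices at smaller positions.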

MML
MML optimizes the marginal likelihood over Z:

L_MML = − log Σ_{Z∈Γ} P_θ(Z | X).

Marginalization is expensive due to the large size of Γ. We therefore adopt a strong (but risky) assumption so that we can use dynamic programming to marginalize P_θ(Z|X) in polynomial time. Specifically, for any operation y_i, we assume that the two sub-graphs rooted at y_i^a and y_i^b respectively (in the DAG counterpart of Y) are independently mapped to {v_1, v_2, ..., v_L}. Notably, with this assumption, we in fact marginalize P_θ(Z|X) over a superset of Γ and even allow mapping multiple operations to a single vertex. However, we empirically found that MML (with this assumption) works well on short equations, and it can be used to warm up hard EM.

Hard EM with Annealing
To avoid prematurely committing to the model's early decisions, we follow Min et al. (2019) and apply annealing to hard EM: we optimize the model with MML for the first τ training steps and with hard EM afterwards.

Inference
During inference, we adopt greedy decoding, which conducts the argmax operation for operator prediction, operand matching, and root vertex selection in parallel. The execution result at the root vertex is returned as the numerical answer.
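A minimal sketch of extracting the final equation from greedy predictions; the argmax outputs below are made up:

```python
# Greedy predictions per vertex (argmax already taken): operator, first
# operand, second operand. "vk" refers to the operation at vertex k.
pred = {
    1: ("+", "n0", "n2"),
    2: ("*", "n0", "n1"),   # a distracting operation, not part of the answer
    3: ("-", "v1", "n1"),
}
root = 3  # argmax of the root-selection distribution

def subgraph(root):
    """The final equation = the root vertex plus all its descendants."""
    keep, stack = set(), [root]
    while stack:
        v = stack.pop()
        if v in keep:
            continue
        keep.add(v)
        for ref in pred[v][1:]:          # follow operand edges
            if ref.startswith("v"):
                stack.append(int(ref[1:]))
    return {v: pred[v] for v in sorted(keep)}

print(subgraph(root))   # vertex 2 is pruned away
```

Everything before the sub-graph extraction is a single parallel argmax, which is what makes inference fast.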

Datasets
We applied CANTOR to Math Word Problem (MWP) solving under the fully-supervised setting (on MathQA and SVAMP) and to discrete reasoning under the weakly-supervised setting. (a) DROP_num (Dua et al., 2019) consists of all problems with numerical answers from the reading comprehension dataset called DROP. Problems are only annotated with final answers but not the corresponding equations.
(b) DROP is a reading comprehension dataset consisting of problems with different types of answers, e.g., number, date, and span(s).

Metrics
For MWP solving, we evaluated models with value accuracy and equation accuracy. Previous work evaluated equation accuracy with string matching, failing to credit correct equations that are structurally different from the ground truth. In our evaluation, an equation is considered correct if it has results consistent with the annotated equation for 100 random replacements of the numbers mentioned in the problem. For discrete reasoning on DROP, we followed previous work and use F1.
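The value-equivalence check can be sketched as follows; the tolerance and sampling range are our choices, not the paper's:

```python
import random

def equations_equivalent(pred, gold, n_numbers, trials=100, seed=0):
    """Treat each equation as a function of the numbers mentioned in the
    problem; count it correct if both evaluate identically under random
    replacements of those numbers."""
    rng = random.Random(seed)
    for _ in range(trials):
        nums = [rng.uniform(1.0, 100.0) for _ in range(n_numbers)]
        try:
            if abs(pred(nums) - gold(nums)) > 1e-6:
                return False
        except ZeroDivisionError:
            return False
    return True

# (a + b) - c vs. a - (c - b): different strings, same underlying function,
# so string matching would wrongly reject the prediction.
gold = lambda n: (n[0] + n[1]) - n[2]
pred = lambda n: n[0] - (n[2] - n[1])
print(equations_equivalent(pred, gold, n_numbers=3))  # → True
```

Random substitution makes the metric robust to algebraically equivalent rewritings that defeat string matching.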

Baselines
We considered the following three categories of baselines. Sequential Models generate an equation sequentially based on a given problem; mBERT2Seq (Tan et al., 2021) is a representative sequential model.
Structured Models utilize structured autoregressive decoders to generate an equation.Graph2Tree (Zhang et al., 2020) and DEDUCTREASONER (Jie et al., 2022) are the representative tree-structured model and DAG-structured model, respectively.
Tagging-based Models refer to the arithmetic modules of the modular networks dominant on DROP, which assign plus, minus, or zero to each constant and number mentioned in a problem, and return the sum of the signed numbers. TASE (Segal et al., 2020) is a representative modular network which consists of modules specialized for different types of answers, e.g., a tagging-based arithmetic module, a count module, and modules for span-typed answers. We refer to a TASE model with only an arithmetic module as TASE_arith.

Implementation Details
For all experiments, we used two Transformer blocks (Vaswani et al., 2017) as the DAG decoder, which was trained with random initialization.
For MWP solving, we used RoBERTa-base as the problem encoder. We experimented with different training methods, whose effects on model performance are discussed in Section 5.7.3, with the graph size L set to 60 and the beam size B for hard EM set to 20. We further investigated the effect of the graph size L (Table 9 in Section 5.7.3) and the beam size B (Table 13 in Appendix D.1). The best model on MathQA used hard EM with annealing (τ = 2,000, B = 20) with L = 80, and the best model on SVAMP used MML with L = 60. Following previous work, all experiments on SVAMP were run with 5 random seeds, and we report both the average performance and the standard deviation.

Results for MWP Solving
As shown by Table 2 and Table 3, CANTOR establishes a new state-of-the-art record on MathQA and SVAMP with large improvements. The fine-grained analyses in Table 4 and Table 5 show that CANTOR (1) outperforms the best baseline on nearly all problems across levels of complexity, measured by the number of operations needed; (2) is better at exploiting equation templates seen in training (equation templates are equations with numbers replaced by placeholders; e.g., const_10 + num@7 adds 10 to the 7th number in a problem) or creating novel ones to solve problems; and (3) is more robust to different types of variations, including those that evaluate question sensitivity (whether questions asked in problems are ignored in prediction), reasoning ability (how predictions are adjusted to subtle changes in given problems), and structural invariance (whether predictions are invariant to structural changes of given problems that preserve the reasoning logic). CANTOR is also applicable to weakly-supervised scenarios where only final answers are annotated.

Results for Discrete Reasoning
Given problem-answer pairs {⟨X, A⟩}, if it is feasible to find Y that evaluates to A, we can adapt hard EM, MML, and hard EM with annealing for weakly-supervised training by simply re-defining Γ for the objective functions: Γ becomes the set of sub-graphs denoting valid mappings from any equation Y with P(A|Y) = 1, where P(A|Y) is 1 if and only if Y evaluates to A.
For weakly-supervised training on DROP_num, we followed TASE to enumerate Y by searching over additions and subtractions of two numbers, and used MML for training (more advanced weakly-supervised training methods exist for discrete reasoning on DROP (Chen et al., 2020; Shao et al., 2021); investigating how CANTOR is compatible with them is left for future work). As each Y has only one operation, MML conducts exact marginalization over Γ.
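The TASE-style enumeration of candidate equations can be sketched as follows (the example problem is made up):

```python
from itertools import permutations

def enumerate_equations(numbers, answer, tol=1e-6):
    """Search single binary operations (addition or subtraction of two
    mentioned numbers) that evaluate to the answer."""
    found = []
    for i, j in permutations(range(len(numbers)), 2):
        # keep one ordering for commutative addition
        if i < j and abs(numbers[i] + numbers[j] - answer) < tol:
            found.append(("+", i, j))
        if abs(numbers[i] - numbers[j] - answer) < tol:
            found.append(("-", i, j))
    return found

# Hypothetical problem: "The team scored 35 points in the first half and 21
# in the second. How many more points did they score in the first half?"
print(enumerate_equations([35, 21], answer=14))  # → [('-', 0, 1)]
```

All enumerated equations that hit the answer form the supervision set Γ over which MML marginalizes.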
As shown by Table 6a, CANTOR significantly outperforms TASE_arith on DROP_num. Using CANTOR as a drop-in replacement for the arithmetic module of TASE yields further improvements on DROP (Table 6b).

No Pre-defined Order Restrictions
To investigate the effect of removing restrictions on decoding dependencies, we considered a variant of CANTOR called vanilla CANTOR, which also produces all operations in parallel, but is not designed to have diverse and possibly redundant operations for comparison in both operand matching and root vertex selection. Specifically, instead of using a pre-specified value of L, vanilla CANTOR predicts the number of operations needed to solve a given problem as L (using the [CLS] representation from the encoder), and is trained with naïve mapping; the last vertex v_L is the root vertex. As shown by Table 7, vanilla CANTOR already outperforms the best baseline, which adopts a pre-defined decoding order, indicating that our model does well in capturing the structures of equations internally, and that imposing a pre-defined decoding order may be an unnecessary burden on model learning.

Structure Modeling
Previous work has proposed non-autoregressive models for semantic parsing, but without explicit structure modeling.

Interpretations of Different Operations
Figure 3: A test case from SVAMP. Operations leading to the same quantity are marked with the same color; purple ones are operations evaluating to the correct answer. For a clear presentation of our DAG, we only retain the top-5 root vertices along with their descendants, and we also show the probabilities of predicted operators, operands, and root vertices. The best baseline DEDUCTREASONER overlooks bonus points in its prediction; while the same prediction appears as a sub-graph in our DAG, CANTOR succeeds in filtering it out and recognizing the correct one.

Capturing Diverse Reasoning Steps
CANTOR decodes an L-sized DAG that encompasses diverse reasoning steps, which are necessary or possibly redundant. Comparing diverse choices is beneficial for picking out the proper one. In this section, we investigate how well CANTOR captures diverse reasoning steps and its effect on model performance. As it is pointless to merely have different operations at different vertices, we focused on the quality of the top-k root vertices (top-k P_r(p_|Z| | X)) and evaluated the recall of answers (Val.@k).
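Val.@k can be computed as in this sketch, with hypothetical root probabilities and per-vertex execution results:

```python
def val_at_k(root_probs, root_values, answer, k, tol=1e-6):
    """Val.@k: is the gold answer among the execution results at the
    top-k root vertices ranked by the root distribution P_r?"""
    top_k = sorted(root_probs, key=root_probs.get, reverse=True)[:k]
    return any(abs(root_values[v] - answer) < tol for v in top_k)

# Hypothetical root distribution and execution results at each vertex.
root_probs = {1: 0.05, 2: 0.6, 3: 0.25, 4: 0.1}
root_values = {1: 42.0, 2: 17.0, 3: 600.0, 4: 3.0}

print(val_at_k(root_probs, root_values, answer=600.0, k=1))  # → False
print(val_at_k(root_probs, root_values, answer=600.0, k=2))  # → True
```

A rising Val.@k for k > 1 indicates that correct reasoning chains exist in the DAG even when the top-ranked root is wrong.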
Training Methods Compared with vanilla CANTOR, CANTOR trained with methods that leverage more vertices than necessary (for ground-truth equations) achieved higher Val.@k most of the time (Table 8). One exception was applying MML on MathQA, which led to much worse performance. We conjecture that this is because our assumption in MML is incompatible with the complex equations in MathQA (please refer to Appendix C for a detailed discussion of the limitations of our MML). However, it is still helpful to warm up hard EM with MML, as demonstrated by the improvements of hard EM with annealing over hard EM. Notably, CANTOR trained with naïve mapping outperforms vanilla CANTOR on SVAMP; this is because the former was trained to leverage more vertices than necessary at test time (since max |Y| on the train set is larger than max |Y| on SVAMP) and compares different vertices for root vertex selection, while the latter has no access to extra vertices and uses the last vertex as the root without comparisons.

Table 9: Val.@k with varying graph sizes L. Models were trained using hard EM with annealing (τ = 2000) on MathQA and MML on SVAMP. Val.@k is the answer recall over execution results at top-k root vertices.
In practice, hard EM with annealing should be the default training method; as shown in Table 8, it always outperforms naïve mapping and hard EM, and is at least competitive with MML. As shown by Table 8 and Table 13, the two hyperparameters to tune, i.e., the number of warm-up steps τ and the beam size B, are robust across a wide range of values.
Graph Size L A larger DAG can encompass more reasoning steps, but it also increases the difficulty of operand matching and root vertex selection. Training methods like hard EM may even suffer from suppressing false-negative operations. Table 9 shows the effect of varying the graph size L: model performance improves until L reaches 80 on MathQA and 60 on SVAMP.

Case Study
Fig 3 presents a test case from SVAMP. For a clear presentation of our DAG, we only show the top-5 root vertices along with their descendants. By comparing diverse operations and chaining relevant ones, CANTOR succeeds in discriminating logically correct operations from distracting ones (e.g., the one predicted by DEDUCTREASONER, which overlooks bonus points), even though the final equation is structurally different from the annotated reference.

CANTOR vs. LLMs with Chain-of-Thought Prompting
Recently, Wei et al. (2022) proposed chain-of-thought prompting, which endows large language models with the ability to generate a series of intermediate reasoning steps to reach the final answer of a given problem, achieving state-of-the-art performance on a wide range of reasoning tasks. Table 10 compares CANTOR and chain-of-thought prompting. Though 392× smaller, CANTOR with RoBERTa-base already outperforms PaLM-62B on SVAMP; using RoBERTa-large gives a further substantial improvement, demonstrating CANTOR's great potential.
CANTOR is also applicable to the challenging GSM8K dataset (Cobbe et al., 2021), which was created to probe the reasoning ability of large language models and has high diversity among problems. As GSM8K is annotated with natural language solutions, the extracted equations are noisy and incomplete; we ended up with 6,312 (out of 7,473) noisy training examples. As shown in Table 10, CANTOR is close to the 175B GPT-3 model fine-tuned on the whole training set, and is on a par with PaLM-62B with chain-of-thought prompting.

Conclusion
We propose a numerical reasoner called CANTOR. Unlike previous structured decoders that model a single equation with pre-defined restrictions on the decoding dependencies, CANTOR models diverse reasoning steps using a directed acyclic graph without a pre-defined decoding order, and derives equations by comparing and chaining relevant reasoning steps. With our training methods, CANTOR is capable of capturing the logical dependencies among reasoning steps internally, and produces equations that are more consistent with the reasoning problems by comparing diverse reasoning steps. CANTOR achieves state-of-the-art results on two math word problem datasets under the fully-supervised setting, and is applicable to weakly-supervised scenarios with significant improvements.
In future work, we plan to extend CANTOR for general structured prediction tasks, e.g., sequence labeling and parsing.

Limitations
Though CANTOR significantly outperforms baselines, there is still large room for improvement in solving numerical reasoning problems with novel equation templates and in being robust to variations in the problems. For example, our value accuracy on SVAMP problems with unseen equation templates is lower than 20% (Table 4), and the value accuracy on problems that evaluate question sensitivity barely reaches 30% (Table 5). We also argue for more benchmarks that expose the weaknesses of existing models, as we observe that more than half of the test problems in MWP datasets can be solved with equation templates seen in training, which may overestimate the numerical reasoning ability of neural models.

We trained CANTOR for up to 100k training steps for the MWP task and up to 20 epochs for the discrete reasoning task, using the hyperparameters specified in Table 11. All experiments were conducted with V100 GPUs.

B.1 Hard EM
The training objective of hard EM is formulated as

L_hard-EM = − log P_θ(Z* | X),  P_θ(Z | X) = P_r(p_|Y| | X) · Π_{i=1}^{|Y|} P_z(z_i | p_i, X),

where z_i is the operation y_i mapped to the vertex at position p_i.
As {p_1, ..., p_|Y|} defines a valid mapping from Y to Z, finding Z* is equivalent to finding the optimal mapping {p_1, ..., p_|Y|}, which we search for via beam search. For convenience of illustration, we define the level of an operation in Y as the length of the longest path from its corresponding vertex in the DAG counterpart of Y to a leaf vertex (a constant or a number mentioned in the problem). Let D_l be the set of indices of operations at level l. For any Z ∈ Γ, P_θ(Z|X) can be factorized level by level, so we can use beam search to approximately find the optimal mapping level-by-level. To guarantee valid mappings, we restrict each operation to be mapped behind the positions of its operand operations. To find the B-best mappings for D_l according to Π_{i∈D_l} P_z(z_i | p_i, X), we utilize an open-source implementation of Murty's algorithm (Miller et al., 1997), whose worst-case complexity is O(B|D_l|^3).
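The level bookkeeping used by the level-by-level search can be sketched as follows; the operand-reference convention ("yk" for operations, anything else for leaves) is illustrative:

```python
def operation_levels(Y):
    """Level of y_i = length of the longest path from y_i to a leaf
    (a constant or mentioned number) in the DAG counterpart of Y.
    Returns D: level -> set of operation indices (the sets D_l)."""
    level = {}

    def depth(ref):
        # operation operands "yk" contribute their level; leaves are level 0
        return level[int(ref[1:])] if ref.startswith("y") else 0

    for i, (_, a, b) in enumerate(Y):
        level[i] = 1 + max(depth(a), depth(b))

    D = {}
    for i, l in level.items():
        D.setdefault(l, set()).add(i)
    return D

# y0 = n0 + n1, y1 = n2 * n3, y2 = y0 - y1
Y = [("+", "n0", "n1"), ("*", "n2", "n3"), ("-", "y0", "y1")]
print(operation_levels(Y))  # → {1: {0, 1}, 2: {2}}
```

Operations in the same D_l share no dependencies, so the search can assign all of them positions jointly at each level.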

B.2 MML
The training objective of MML is formulated as

L_MML = − log Σ_{Z∈Γ} P_θ(Z | X).

We adopt a strong (but risky) assumption so that we can use dynamic programming to marginalize P_θ(Z|X) in polynomial time. Specifically, for any operation y_i, we assume that the two sub-graphs rooted at y_i^a and y_i^b respectively (in the DAG counterpart of Y) are independently mapped to {v_1, v_2, ..., v_L}. Let M_{i,j} be the marginal probability of the sub-graph rooted at y_i being mapped to {v_1, ..., v_j}, and let G(y_i) be the set of indices of the constituent operations in that sub-graph. M_{i,j} can then be computed recursively (omitting X for brevity), with one recursion for the case where both operands are operations (y_i^a = y_u ∈ Y and y_i^b = y_v ∈ Y, combining the operand marginals M_{u,·} and M_{v,·} with P_a and P_b) and simpler recursions for the remaining cases. Finally, the training objective is computed from the marginals at the root operation y_|Y|.

C Limitations of Our MML
For our MML method, we impose an independence assumption for efficient marginalization of P_θ(Z|X) over all Z that denote valid mappings from operations in Y to decoding positions, but at the cost of failing to compute the exact marginalization and giving a noisy training objective when the target equation Y is complex, like those in MathQA. As shown by Table 13, model performance is insensitive to the beam size when using hard EM on MathQA. To investigate whether the choices of Z matter for optimization, we considered a baseline called random mapping, which optimizes a model on random Z ∈ Γ. We observed that hard EM outperforms random mapping substantially, indicating that beam search finds effective Z for training.
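Whether Y is "complex" in the sense that breaks exact marginalization can be checked by counting branches (operations whose operands are both operations; see the Table 12 discussion). A small sketch with our operand-reference convention:

```python
def count_branches(Y):
    """A branch is an operation taking two operations as operands;
    MML with the independence assumption marginalizes exactly only
    when Y has no branches (i.e., Y is linear)."""
    return sum(1 for (_, a, b) in Y
               if a.startswith("y") and b.startswith("y"))

# Linear: y1 uses y0 as only one of its operands -> 0 branches.
linear = [("+", "n0", "n1"), ("-", "y0", "n2")]
# Branched: y2 takes two operations as operands -> 1 branch.
branched = [("+", "n0", "n1"), ("*", "n2", "n3"), ("-", "y0", "y1")]
print(count_branches(linear), count_branches(branched))  # → 0 1
```

This matches the empirical pattern in Table 12: MML degrades as the branch count grows.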

E Inference Efficiency
Due to non-autoregressive decoding, CANTOR is significantly faster than previous autoregressive baselines in terms of inference efficiency. For example, on a single V100 32G GPU, CANTOR achieves a 7× speedup over DEDUCTREASONER on the dev set of MathQA.

Figure 1 :
Figure 1: (a) Possible pieces of human thought that pop up in no pre-defined order; (b) how our model captures the reasoning process similarly. Reasoning steps inside solid frames and dashed frames are necessary and loosely-relevant ones, respectively.
and 387 girls. 290 more boys joined the school. How many more boys than girls are in the school?

Fig 4
Fig 4 presents two test cases from MathQA. In the upper case, the baseline DEDUCTREASONER misunderstands "increase" and "decrease", and conducts wrong operations. In the lower case, which mentions numerous quantities in the problem, DEDUCTREASONER, despite arriving at the correct value, operates on wrong quantities at the second and the third reasoning steps. By contrast, our proposed model CANTOR produces precise reasoning processes with proper choices of quantities to operate on.

Table 1 :
Data statistics. Note that DROP_num and DROP are only annotated with answer texts but not equations Y; we followed previous work to enumerate binary operations that evaluate to the answers (max |Y| = 1).

Table 4 :
Breakdowns of performance on the MWP solving task. Baseline refers to the previous best model DEDUCTREASONER. Equ. and Val. are equation accuracy and value accuracy, respectively.

Table 5 :
A breakdown of robustness evaluation w.r.t. different variations in SVAMP. Baseline refers to the previous best model DEDUCTREASONER. Equ. and Val. are equation accuracy and value accuracy, respectively.

Table 6 :
F1 scores on DROP_num and DROP. w/ CANTOR is a TASE model that replaces the original tagging-based arithmetic module with CANTOR; all modules share one problem encoder.

Problem: Melissa scored 109 points in each game. She also got 82 bonus points in each game. How many points did she score in 79 games?

Table 8 :
Comparisons among different training methods.
Val.@k is the recall of answers over execution results at top-k root vertices (top-k P_r(p_|Z| | X)).

Table 10 :
Value accuracy on SVAMP and GSM8K. CoT is short for chain-of-thought prompting.

Table 12 :
Value accuracy breakdown on the test set of MathQA w.r.t. the number of branches (# Branch) and the number of operations (# Operation) in annotated gold equations. Naïve stands for naïve mapping.

When Does Our MML Conduct Exact Marginalization, and What Are the Effects on Model Performance? Our MML conducts exact marginalization only if Y has a linear structure, i.e., Y has no branches; we define a branch in Y to be an operation taking another two operations as operands. If Y has branches, our MML will include the probability of invalid Z where different operations share one decoding position, which may mislead the model. As validated by Table 12, (a) our MML works well on test problems whose gold equations have no branches (# Branch = 0: value accuracy = 83.8%), even when equations are long (# Branch = 0 and # Operation >= 4: value accuracy = 79.5%); (b) however, it becomes poor when equations have more branches (# Branch >= 2: value accuracy = 35.0%). Empirically, our MML works well when most equations are linear, and short equations are likely linear in existing datasets (e.g., SVAMP and DROP). When target equations are complex, hard EM should be more suitable, but we can still benefit from using our MML for warm-up.

Effect of Beam Size B on Hard EM

Table 13 :
Value accuracy of models trained with hard EM using different beam sizes. Random Mapping is a baseline which uses random Z ∈ Γ for training.