Multi-View Reasoning: Consistent Contrastive Learning for Math Word Problem

Solving math word problems requires both precise relational reasoning about the quantities in the text and reliable generation of diverse equations. Current sequence-to-tree and relation-extraction methods approach the task from a single fixed view and struggle to handle complex semantics and diverse equations simultaneously. Human solving, however, naturally involves two consistent reasoning views, top-down and bottom-up, just as a math equation can be expressed in multiple equivalent forms: pre-order and post-order. We propose multi-view consistent contrastive learning for a more complete semantics-to-equation mapping. The solving process is decoupled into two independent but consistent views, top-down decomposition and bottom-up construction, and the two reasoning views are aligned at multiple granularities for consistency, enhancing both global generation and precise reasoning. Experiments on multiple datasets across two languages show that our approach significantly outperforms existing baselines, especially on complex problems. We also show that, after consistent alignment, the multi-view model absorbs the merits of both views and generates more diverse results that remain consistent with mathematical laws.


Introduction
The math word problem (MWP) is a significant and challenging task with a wide range of applications in both natural language processing and general artificial intelligence (Bobrow, 1964). The task is to predict the mathematical equation and the final answer from a natural language description of a scenario and a math question. It requires mathematical reasoning over the text (Mukherjee and Garain, 2008), which is very challenging for conventional methods (Patel et al., 2021).

Figure 1: Human solving has multiple reasoning views, and a math equation can likewise be expressed in multiple orders. Pre-order traversal can be seen as a top-down reasoning view; post-order traversal corresponds exactly to the bottom-up reasoning view. Consistent contrastive learning aligns the two views in the same latent space.
MWP tasks have attracted a great deal of research attention. In the early days, the MWP was treated as a sequence-to-sequence (seq2seq) translation task, translating human language into mathematical language. Then, Xie and Sun (2019) and Faldu et al. (2021) proposed that tree or graph structures were more suitable for the MWP. These generation methods (Seq2Tree and Graph2Tree) further improved generation capability through specific structures. Although very flexible at generating complex equation combinations, the fixed structure of the decoder also limits fine-grained mapping. Recently, Jie et al. (2022) introduced an iterative relation-extraction approach, providing a new solving view for the MWP. It performs well at capturing local relations, but lacks global generation capability, especially for complex mathematical problems.
From seq2seq translation to seq2tree generation and relation extraction, these methods are essentially seeking a suitable solving view for the MWP. However, the MWP is more challenging than that: it requires both precise relational reasoning about quantities and reliable generation of diverse equation combinations. Both are necessary for mathematical reasoning. Existing methods all consider the MWP from a single view and thus suffer certain limitations.
We argue that multiple views are required to solve the MWP comprehensively. As shown in Figure 1, the process of human solving inherently involves multiple reasoning views, i.e., top-down decomposition and bottom-up construction. The two reasoning views are reversed in process but consistent in result. Meanwhile, a mathematical equation can be expressed in multiple traversal orders, i.e., pre-order (−, ×, +, 2, 3, 4, 5) and post-order (2, 3, +, 4, ×, 5, −). The two sequences are quite dissimilar in form but equivalent in logic. The two traversal orders correspond exactly to the two reasoning processes: the pre-order equation reflects the top-down reasoning view, while the post-order equation can be seen as the bottom-up reasoning view.
Inspired by this, we design multi-view reasoning using multi-order traversal. MWP solving is decoupled into two independent but consistent views: a top-down process that uses pre-order traversal to decompose the problem from global to local, and a bottom-up process that follows post-order traversal to construct relations from local to global. Pre-order and post-order traversals are mathematically equivalent, just as top-down decomposition and bottom-up construction should be consistent. As shown in Figure 1, we add multi-granularity contrastive learning to align the intermediate expressions generated by the two views in the same latent space. Through consistent alignment, the two views constrain each other and jointly learn an accurate and complete representation for math reasoning.
Besides, math operators must conform to mathematical laws (e.g., the commutative law). We devise a knowledge-enhanced augmentation to incorporate mathematical rules into the learning process, making multi-view reasoning more consistent with these rules.
Our contributions are threefold:
• We treat multi-order traversal as a multi-view reasoning process, comprising a top-down decomposition using pre-order traversal and a bottom-up construction following post-order traversal. Both views are necessary for the MWP.
• We introduce consistent contrastive learning to align the two views' reasoning processes, fusing flexible global generation with accurate semantics-to-equation mapping. We also design an augmentation process for rule injection and understanding.
• Extensive experiments on multiple standard datasets show that our method significantly outperforms existing baselines. Our method can also generate equivalent but non-annotated math equations, demonstrating the reliable reasoning ability of our multi-view framework.

Related Work
Reliable reasoning is a necessary capability on the path toward general-purpose AI. How to achieve human-like reasoning has been extensively researched in areas such as natural language processing, reinforcement learning, and robotics (Zhang et al., 2022a). In particular, mathematical reasoning is an important manifestation of intelligence. Automatically solving mathematical problems has been studied for a long time, from rule-based methods (Fletcher, 1985; Bakman, 2007; Yuhui et al., 2010) with hand-crafted features and template-based methods (Kushman et al., 2014; Roy and Roth, 2018) to deep learning methods (Ling et al., 2017) with the encoder-decoder framework. The introduction of the Transformer (Vaswani et al., 2017) and pre-trained language models (Devlin et al., 2019; Liu et al., 2019b) greatly improved performance on MWPs. From the perspective of proxy tasks, we divide recent work into three categories: seq2seq-based translation, seq2structure-based generation, and iterative relation extraction.

Seq2seq-based translation MWPs are treated as a translation task, translating human language into mathematical language. Wang et al. (2017) proposed the large-scale dataset Math23K and used a vanilla seq2seq method (Chiang and Chen, 2019). Li et al. (2019) introduced a group attention mechanism to improve seq2seq performance. Huang et al. (2018) used reinforcement learning to optimize the translation task, and Huang et al. (2017) incorporated semantic-parsing methods to solve MWPs. Although seq2seq-based methods have made great progress, their performance is still unsatisfying, since generating mathematical equations requires relational reasoning over quantities beyond what natural language translation demands.
Seq2structure-based generation Liu et al. (2019a) and Xie and Sun (2019) introduced tree-structured decoders to generate mathematical expressions. This explicit tree-based design rapidly came to dominate the MWP community. Other researchers began to explore reasonable structures for the encoder: Li et al. (2020) and Zhang et al. (2022b) used graph neural networks to extract effective logical information from the natural-language problem. Others adopted a teacher model with contrastive learning to improve the encoder. Several researchers attempted to extract multi-level features from the problems using hierarchical encoders and pre-trained models (Yu et al., 2021). Many auxiliary tasks have been used to enhance symbolic reasoning ability (Qin et al., 2021), and Wu et al. (2021) introduced mathematical knowledge to tackle difficult mathematical reasoning. These structured generation approaches show strong generation capability on complex mathematical reasoning tasks.

Iterative relation extraction Recently, some researchers have borrowed ideas from information extraction (Shen et al., 2021b) and designed iterative relation-extraction frameworks that predict math relations between two numeric tokens. Kim et al. (2020) designed an expression-pointer transformer model to predict expression fragments. Others introduced a DAG structure to extract numeric-token relations from bottom to top, and Jie et al. (2022) further treated the MWP task as an iterative relation-extraction task, achieving impressive performance. These works provide a new perspective that tackles the MWP from a local relation-construction view, improving fine-grained relational reasoning between quantities.
The above proxy tasks are designed from different solving views. Seq2seq takes a left-to-right sequential view, seq2tree a tree view, and the relation-extraction method emphasizes a local relation view. Unlike these single-view methods, our approach employs multiple consistent reasoning views to address the challenges of the MWP.

Overview
The MWP is to predict the equation Y and the answer from a problem description T = {w_1, w_2, ..., w_n} containing n words, among them m quantity words Q = {q_1, q_2, ..., q_m}. The equation Y is a sequence of constant words (e.g., 3.14), mathematical operators op ∈ {+, −, ×, ÷, ...}, and quantity words from Q. Solving the MWP means finding the optimal mapping T → Ŷ such that the predicted Ŷ derives the correct answer. Existing methods learn this mapping from a single view, e.g., seq2tree generation or iterative relation extraction. Our consistent contrastive learning approach instead reasons from multiple views; both the top-down and the bottom-up view are necessary for a complete semantics-to-equation mapping.

Multi-View using Multi-Order
We use the labeled in-order equation to generate two different sequences, Y_pre = {y^f_1, y^f_2, ..., y^f_L} and Y_post = {y^b_1, y^b_2, ..., y^b_L}, via pre-order and post-order traversal. As shown in Figure 1, we treat Y_pre as the label for training the top-down process and Y_post as the label for the bottom-up process.
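To make the two training targets concrete, the conversion of the equation tree for (2 + 3) × 4 − 5 from Figure 1 into its pre-order and post-order sequences can be sketched as follows (our own illustration, not the authors' code):

```python
# Minimal sketch: derive the two training targets, the pre-order sequence
# Y_pre and the post-order sequence Y_post, from a labeled equation tree.

class Node:
    def __init__(self, token, left=None, right=None):
        self.token, self.left, self.right = token, left, right

def pre_order(node):
    """Root, left, right: the top-down decomposition target Y_pre."""
    if node is None:
        return []
    return [node.token] + pre_order(node.left) + pre_order(node.right)

def post_order(node):
    """Left, right, root: the bottom-up construction target Y_post."""
    if node is None:
        return []
    return post_order(node.left) + post_order(node.right) + [node.token]

# Equation tree for (2 + 3) * 4 - 5
tree = Node("-", Node("*", Node("+", Node("2"), Node("3")), Node("4")), Node("5"))
```

Here `pre_order(tree)` yields the sequence (−, ×, +, 2, 3, 4, 5) and `post_order(tree)` yields (2, 3, +, 4, ×, 5, −), matching the two orders in Figure 1.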
Globally shared embeddings First, we design three types of globally shared embedding matrices: text word embeddings E_w, quantity word embeddings E_q, and mathematical operator embeddings E_op. Text and quantity word embeddings are extracted from a pre-trained language model (Devlin et al., 2019; Liu et al., 2019b), while operator embeddings are randomly initialized. All constant word embeddings are also randomly initialized and added to E_q. As shown in Figure 2, the three global embedding matrices are shared by the two reasoning processes. The text embeddings E_w are fused into a target vector t_root by a bidirectional Gated Recurrent Unit (GRU) (Cho et al., 2014), where t_root represents the global target for top-down reasoning. The quantity embeddings E_q are used for quantity relation construction in bottom-up reasoning.
Top-down view using pre-order The top-down view is a global-to-local decomposition that follows the pre-order equation Y_pre (e.g., −, ×, +, 2, 3, 4, 5). This process is similar to Xie and Sun (2019). Starting from the root node, each node conducts node prediction, and each operator node additionally conducts node decomposition. For example, in Figure 1, the root node predicts its node type as "operator" with output token "−" and is then decomposed into two child nodes, which are predicted as "×" in step 2 and "5" in step 7.
Node prediction Each node has a target vector t_n decomposed from its parent (for the root node, t_n = t_root), from which we calculate the node embedding e_n and the node output y_n using the globally shared embeddings E_w, E_op, E_q:

e_n = MLP_e(t_n),
s(op_i) = MLP_s([e_n; e_{op_i}; e_n • e_{op_i}]),    (1)
s(q_j) = MLP_s([e_n; e_{q_j}; e_n • e_{q_j}]),

where ; denotes the concatenation operation and • denotes the element-wise product of two vectors. MLP_e computes the node embedding from the target vector, and MLP_s computes the score of a predicted output (y_n = op_i or q_j) from the node embedding and the corresponding candidate embedding (e_{op_i} or e_{q_j}). s(op_i) and s(q_j) are the scores of the current node being predicted as op_i and q_j, respectively.
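The node-prediction scoring pattern can be sketched as follows (our own simplification with made-up dimensions; the random matrices `W_e` and `w_s` stand in for the learned MLP_e and MLP_s, and the feature is the concatenation of the node embedding, a candidate embedding, and their element-wise product, as in the text):

```python
# Sketch of node prediction: map the target vector t_n to a node embedding
# e_n, then score every candidate operator/quantity embedding e_c from the
# concatenated feature [e_n; e_c; e_n * e_c].
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # assumed hidden size
W_e = rng.standard_normal((d, d))       # stands in for MLP_e
w_s = rng.standard_normal(3 * d)        # stands in for MLP_s

def node_embedding(t_n):
    return np.tanh(W_e @ t_n)

def score(e_n, e_c):
    # ';' is concatenation, '*' the element-wise product
    feat = np.concatenate([e_n, e_c, e_n * e_c])
    return float(w_s @ feat)

def predict(t_n, candidates):
    """Return the highest-scoring candidate token (operator or quantity)."""
    e_n = node_embedding(t_n)
    scores = {tok: score(e_n, emb) for tok, emb in candidates.items()}
    return max(scores, key=scores.get)
```

In the real model the scores over all operators and quantities are normalized by softmax to give the prediction probability used in the top-down objective.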
Node decomposition After node prediction, every operator node (y_n = op) is decomposed into two child nodes using its target vector t_n and the corresponding embedding E_op[y_n]:

[t_{n_l}; t_{n_r}] = MLP_d([t_n; E_op[y_n]]),    (2)

where MLP_d performs the decomposition into left and right child nodes, and t_{n_l} and t_{n_r} are the target vectors of the two child nodes. As shown in Figure 2, the top-down process repeats the two steps: each node first predicts its own output, then every operator node is decomposed into two child nodes, and the child nodes continue with node prediction. If a child node is again an operator node, the decomposition continues until reaching quantity nodes. The objective is to minimize the negative log-likelihood over the training data (T, Y) using the pre-order equation Y_pre = {y^f_1, y^f_2, ..., y^f_L}:

L_t2b = − Σ_{n=1}^{L} log P(y^f_n | T, y^f_{<n}),    (3)

where P(y^f_n | ·) is the predicted probability of y^f_n in node prediction, computed from all possible s(op) and s(q) (Equation 1) via softmax. The pre-order equation has L tokens, so the top-down process also requires L node predictions.

Bottom-up view using post-order The bottom-up view is a relation-construction process that follows the post-order expression Y_post (e.g., 2, 3, +, 4, ×, 5, −). Inspired by Jie et al. (2022), we devise a concise bottom-up process. A sub-expression is treated as a relation mapping, i.e., an operator is the math relation between two quantities, e.g., (2, 3) → +. Thus, as shown in Figure 2, in each iteration we map two quantities to a specific operator to form a sub-expression, and then use this sub-expression as a new quantity in the next iteration, e.g., (q_(2,3,+), 4) → ×. Specifically, in step t, a quantity pair (q_i, q_j) and an operator op_k form a relation mapping (q_i, q_j → op_k); we obtain their embeddings from E_q and E_op, i.e., e_{q_i}, e_{q_j} ∈ E_q^t and e_{op_k} ∈ E_op, where E_q^t is the embedding matrix of all quantity words at step t.
We first fuse the two quantity embeddings, and then fuse the result with the operator embedding:

h^j_i = MLP_h([e_{q_i}; e_{q_j}]),
e^{B2T}_{q_i,q_j,op_k} = MLP_m([h^j_i; e_{op_k}]),    (4)

where e^{B2T}_{q_i,q_j,op_k} is the embedding of the sub-expression, MLP_h fuses the two quantity embeddings into h^j_i, and MLP_m fuses h^j_i with the operator embedding e_{op_k}. Then, to select the best mapping from all possible combinations of quantity pairs and operators, we score each sub-expression based on its embedding:

s(e^{B2T}_{q_i,q_j,op_k}) = MLP_s'(e^{B2T}_{q_i,q_j,op_k}),    (5)

where s(e^{B2T}_{q_i,q_j,op_k}) is the score assigned to this sub-expression. Lastly, the selected sub-expression is added to E_q^t and treated as a new quantity for the next iteration, i.e., E_q^{t+1} = E_q^t ∪ {e^{B2T}_{q_i,q_j,op_k}}. During training, we obtain the gold mappings from Y_post = {y^b_1, y^b_2, ..., y^b_L} and select the highest-scoring mapping (q^m_i, q^m_j → op^m_k) from all combinations. The optimization maximizes the score of the gold mapping among all possible combinations:

L_b2t = − Σ_{t=1}^{K} log ( exp s(e^{B2T}_{gold,t}) / Σ_{(i,j,k)} exp s(e^{B2T}_{q_i,q_j,op_k}) ),    (6)

where K is the number of relation-extraction steps for equation Y_post.
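The iterative construction above can be illustrated by replaying a gold post-order sequence with plain numeric values standing in for learned embeddings and scores (our own illustration; when replaying the gold order, a simple stack suffices for the growing quantity pool):

```python
# Sketch of the bottom-up construction: each operator token consumes two
# quantities from the pool and the resulting sub-expression re-enters the
# pool as a new quantity, mirroring the (q_i, q_j) -> op_k relation mapping.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def bottom_up_eval(post_order_tokens):
    quantities = []  # the growing pool E_q^t, here holding plain values
    for tok in post_order_tokens:
        if tok in OPS:
            b = quantities.pop()
            a = quantities.pop()
            # the sub-expression becomes a new quantity for the next step
            quantities.append(OPS[tok](a, b))
        else:
            quantities.append(float(tok))
    return quantities[0]
```

For the running example, `bottom_up_eval(["2", "3", "+", "4", "*", "5", "-"])` returns 15.0, the value of (2 + 3) × 4 − 5.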

Consistent Contrastive Learning
Top-down reasoning provides a coarse-to-fine decomposition in a flexible manner, while bottom-up reasoning provides a local-to-global construction step by step. Although the two views are reversed in process, they should be consistent regardless of the observation view. To this end, we use consistent contrastive learning to constrain the representations of the sub-expressions generated in the two independent views.
Multi-view representation For the top-down view, we fuse the embedding of the parent node with those of its two child nodes in a sub-tree to form a sub-expression representation. First, we take the parent node embedding E_op[y_p], where y_p is the predicted operator of the parent node. Then, the left child node embedding e_l is calculated according to its node type: if the left child is a quantity node, its embedding is e_l = E_q[y_l], where y_l is the predicted quantity; if the left child is an operator node, the representation of the entire left sub-tree is used as its embedding, i.e., e_l = r^{T2B}_{sub-tree}. The embedding of the right child e_r is calculated analogously. Finally, we fuse the three embeddings:

r^{T2B}_{sub-exp} = MLP_f([E_op[y_p]; e_l; e_r]),    (7)

where r^{T2B}_{sub-exp} is the sub-expression representation; the entire sub-tree is then treated as a new fused node for subsequent calculations.
For the bottom-up view, we directly use the embedding of the sub-expression obtained from each relation mapping (Equation 4) as its representation:

r^{B2T}_{sub-exp} = e^{B2T}_{q_i,q_j,op_k}.    (8)

Multi-granularity alignment We align the two views of the same sub-expression at multiple granularities: the sub-expressions generated first are the minimum granularity, and the complete equation representation is the maximum granularity. First, we select the representations r^{B2T}_{sub-exp} and r^{T2B}_{sub-exp} of the same sub-expression from the two views. Then, we project them into the same latent space (h^T and h^B) and compute their similarity as the consistency loss L_ccl. Finally, we accumulate the consistency loss over every sub-expression:

L_ccl = − (1/K) Σ_{k=1}^{K} <h^T_k, h^B_k>,    (9)

where K is the total number of sub-expressions in the top-down view, and <·,·> is the dot product of two vectors, used as the similarity. Through alignment, the two reasoning processes constrain each other at multiple granularities and jointly learn a more accurate and complete representation. We provide a detailed example (Figure A2 in the Appendix) illustrating the whole process.

Augmentation We argue that external math rules are essential for understanding diverse equations; e.g., different questions with similar calculation logic are sometimes labeled (q_1 + q_2) × q_3 and sometimes q_1 × q_3 + q_2 × q_3. Training with such labels is challenging, and this inconsistency caused by equation diversity may impair performance. We therefore add a knowledge-enhanced augmentation (KE-Aug) process that actively injects math laws to alleviate the impact of diversity. Specifically, we apply a mathematical law to deform each equation, generating a new one; both the new and the original samples are then used for training. For example, we use the multiplicative distributive law to convert all equations containing (q_1 ± q_2) × q_3 into q_1 × q_3 ± q_2 × q_3. After that, the inconsistency is alleviated and the model can learn similar representations for equivalent equations.

Training and Inference
During KE-Aug, we use only the multiplicative distributive law as external knowledge for augmentation. All samples are then converted into pre-order and post-order expressions. During training, we minimize the loss function L = L_t2b + L_b2t + L_ccl, training the three processes (top-down reasoning, bottom-up reasoning, and consistent contrastive learning) simultaneously from scratch. During inference, we discard the bottom-up model and use top-down reasoning to compute the final prediction, since the top-down view is a generative model with more flexibility to produce diverse predictions than the classification-based bottom-up model, and it also attains higher accuracy within our multi-view training framework (discussed in Section 4.2).
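The KE-Aug rewrite with the distributive law can be sketched on the equation tree (our own illustration: it handles only the (a ± b) × c shape named in the text, rewriting it into a × c ± b × c to produce an extra, equivalent training equation):

```python
# Sketch of KE-Aug: rewrite any sub-tree of the form (a +/- b) * c into
# a * c +/- b * c, yielding an equivalent augmented equation.
class Node:
    def __init__(self, token, left=None, right=None):
        self.token, self.left, self.right = token, left, right

def distribute(node):
    if node is None:
        return None
    left, right = distribute(node.left), distribute(node.right)
    if node.token == "*" and left is not None and left.token in ("+", "-"):
        # (a +/- b) * c  ->  a * c +/- b * c
        return Node(left.token,
                    Node("*", left.left, right),
                    Node("*", left.right, right))
    return Node(node.token, left, right)

def infix(node):
    """Render a tree back to a fully parenthesized infix string."""
    if node.left is None:
        return node.token
    return f"({infix(node.left)} {node.token} {infix(node.right)})"
```

For example, the tree for (q1 + q2) * q3 is rewritten to ((q1 * q3) + (q2 * q3)); both the original and the augmented equation would then be used as training samples.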

Experiments
Datasets We evaluate our method on three standard datasets across two languages: MAWPS (Koncel-Kedziorski et al., 2016), Math23K (Wang et al., 2017), and MathQA (Amini et al., 2019). Math23K and MathQA are two widely used large datasets containing 23k Chinese and 20k English mathematical problems, respectively, while MAWPS contains only 1.9k English problems. Following Tan et al. (2021) and Jie et al. (2022), we preprocess the datasets to handle unsolvable problems. As in previous work, we consider five operators: addition, subtraction, multiplication, division, and exponentiation.
The number of mathematical operations measures reasoning complexity, and the text length measures semantic complexity. In Figure 3, we plot the average reasoning complexity (x-axis) and semantic complexity (y-axis) of the three datasets in a two-dimensional plane. MathQA is the hardest to solve, as it has the highest semantic and reasoning complexity. In contrast, MAWPS is the easiest, as almost all of its problems require only two mathematical operations. Math23K is the largest dataset, with moderate reasoning complexity.
Baselines We divide the baselines into the following categories: seq2seq, seq2structure, and iterative relation extraction (I-RE). We also consider methods that use contrastive learning for generation (CL-Gen).

Training Details We adopt Roberta-base and Chinese-BERT from HuggingFace (Wolf et al., 2020) for the multilingual datasets, as in previous work. We use the AdamW optimizer (Kingma and Ba, 2014; Loshchilov and Hutter, 2019) with a 2e-5 learning rate, a batch size of 12, and beam search of size 4. All experiments were run on an Nvidia GPU.

Results
As shown in Tables 1 and 2, our method achieves consistent improvements over the strong baselines across multiple datasets: +1.7% on Math23K, +1.9% on 5-fold Math23K, and +2.0% on MathQA. The improvement is particularly significant on larger and more complex datasets such as MathQA, which includes many GRE problems requiring complex reasoning; we achieve the greatest improvement on this most difficult dataset, demonstrating the reliable reasoning ability of our method. Additionally, although the MAWPS dataset is small and simple, we still obtain a slight boost (+0.1%) over the other baselines in Table 3.

[Table 3 (excerpt), model accuracies: Structure-Gen: GTS (Xie and Sun, 2019) 82.6; Graph2Tree 85.6; Roberta-GTS* 88.5; Roberta-G2T* 88.7; H-Reasoner* (Yu et al., 2021) 89.8; CL-Gen: T-Dis*.]

Compared with the three single-view paradigms (seq2seq, seq2structure, and I-RE), our method is more stable and outperforms all of them. Although the I-RE method performs best among the single-view methods, it still lags behind ours by almost 2% (RE-deduction) on average. In addition, the performance of the other two single-view paradigms is unstable: on the simpler but larger Math23K, seq2structure achieves accuracy comparable to seq2seq, but lags behind ours by 2.7% (BERT-Tree) and 1.7% (Gen-Rank), respectively. In contrast, on the more complex MathQA, seq2seq is better than seq2structure, yet still worse than ours by 3.5% (mBERT*) and 6.8% (BERT-Tree).
Furthermore, we observe that the method adopting contrastive learning (CL-Prototype) is considerably lower than ours, by 3.9% (Math23K) and 4.3% (MathQA). This suggests that our multi-view design is highly effective for math reasoning, and that contrastive learning plays a more significant role within our consistent multi-view framework. A fine-grained analysis is given in Section 4.3.

Ablation Experiments
Through the above experiments, we found that data augmentation can alleviate inconsistency between different instances, while multi-view contrastive learning alleviates inconsistency between different views of the same instance. To better illustrate the contribution of each module, we devise several variant models and evaluate them on Math23K.
As Table 4 shows, Multi-view means that the model contains both top-down and bottom-up reasoning processes and keeps the two views consistent through globally shared embeddings and contrastive learning. (3) In contrast, when we remove the data augmentation module, the two reasoning views can still learn more precise representations through consistent contrastive learning; in this case there is a slight decrease in the top-down view (−0.6%), while the accuracy of the bottom-up view is instead improved by +1.1%. (4) Moreover, after removing both KE-Aug and Multi-view, the model consists of two completely independent reasoning processes and can only be trained on the original, inconsistent dataset; the two views achieve 84.9% and 85.1% accuracy respectively, comparable to the other single-view baselines. These ablation experiments clearly reveal that data augmentation brings little or even negative improvement to single-view approaches, but multi-view alignment can maximize the effect of augmentation. We suspect this is because the bottom-up view focuses more on local features, and data augmentation introduces multiple local relations, making such local features harder to extract. Therefore, during training we use consistent contrastive learning and data augmentation to train the multi-view processes, while for inference we directly use the top-down view as the final prediction model.

Analysis Experiments
Fine-grained Comparison To verify that our method can handle more complex math problems, we conduct a fine-grained comparison with the best baseline (RE-deduction) on two challenging datasets (MathQA and Math23K). Specifically, we calculate the performance of the subset divided by the number of mathematical operators.
As shown in Figure 4, our proposed method gains consistent improvements over the baseline across all subsets. In particular, on the more complex MathQA, we still maintain high prediction accuracy (≥ 78%) on hard problems (number of operators ≥ 4), while the performance of the baseline drops dramatically; e.g., on the most complex subsets with 8 and 9 operators, we outperform the baseline by nearly 20% and 28%. A similar trend can be observed on Math23K: our method achieves more significant gains on the more difficult subsets, with improvements of 2.06% on the 3-operator subset, 4.08% on the 4-operator subset, and 7.7% on the 5-operator subset. The superiority on these difficult samples demonstrates strong global generation and accurate local mapping capabilities for math reasoning.
Performance Attribution Analysis To further demonstrate that our method achieves high prediction accuracy while also predicting diverse equations, we split the overall precision into two parts: equation precision and diversity. Equation precision is the proportion of test samples whose prediction is exactly the same as the labeled equation. Diversity, in contrast, counts samples whose prediction differs from the label but still derives the correct answer, e.g., Y_pred = {+, −, ×, 2, 4, 5, ×, 3, 4} versus Y_label = {−, ×, +, 2, 3, 4, 5}.
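This split can be sketched concretely (our own illustration): a prediction counts toward equation precision when it matches the labeled pre-order sequence exactly, and toward diversity when it differs but still evaluates to the same answer.

```python
# Sketch of the accuracy split: classify a predicted pre-order equation as
# "equation-precision" (exact match), "diversity" (different sequence, same
# answer), or "wrong".
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b,
       "*": lambda a, b: a * b, "/": lambda a, b: a / b}

def eval_pre_order(tokens):
    """Evaluate a pre-order token sequence to its numeric value."""
    def rec(i):
        tok = tokens[i]
        if tok in OPS:
            a, i = rec(i + 1)
            b, i = rec(i)
            return OPS[tok](a, b), i
        return float(tok), i + 1
    value, _ = rec(0)
    return value

def classify(pred, label):
    if pred == label:
        return "equation-precision"
    if eval_pre_order(pred) == eval_pre_order(label):
        return "diversity"
    return "wrong"
```

For the example from the text, Y_pred = (+, −, ×, 2, 4, 5, ×, 3, 4) and Y_label = (−, ×, +, 2, 3, 4, 5) both evaluate to 15, so the prediction counts toward diversity.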
As Figure 5 shows, our overall precision (87.1%) and diversity (12.4%) are both the highest among the seven methods, and our equation precision is second only to RE-deduction. We also plot equation precision (x-axis) against diversity (y-axis) on a two-dimensional plane. The I-RE methods (RE-deduction and DAG) have low diversity but relatively high equation precision, while the seq2structure methods (GTS, Teacher-Dis, BERT-Tree) generate more diverse results with low equation precision. Our method performs well on both diversity and equation precision.
We also provide some examples in the case study (Figure A3 in the Appendix). This experiment illustrates that each single-view approach has specific limitations, lacking either fine-grained mapping or globally diverse generation capability. Our multi-view approach incorporates the merits of both views, achieving precise and versatile solving.

Figure 5: We split the overall accuracy into two parts, equation accuracy and diversity. Our overall accuracy and diversity are the highest (left). The seven methods (Ours, RE-deduction, BERT-Tree, GTS, CL-Prototype, DAG, Teacher-Dis) are plotted in a 2D plane according to the two metrics (right). Our method performs well on both metrics.

Conclusion
We treat the pre-order traversal of a math equation as a top-down view and the post-order equation as a bottom-up view. The two reasoning views exist naturally, and both are necessary for complex mathematical reasoning. We design multi-view reasoning comprising a top-down decomposition and a bottom-up construction, and ensure the consistency of the two views through contrastive learning. This consistent multi-view design yields a complete and precise semantics-to-equation mapping. Experiments on standard datasets show that our framework achieves new state-of-the-art performance, demonstrating particularly reliable generation on long and complex problems.

Limitations
There are two main limitations of our work. First, although we train two reasoning processes, top-down and bottom-up, we discard the bottom-up process at inference time and adopt only the prediction from top-down reasoning. In future work, we will explore how to select the best prediction from both views. Second, our multi-view reasoning process is capable of generating more diverse equivalent equations, but this generation is not controllable, and it is not yet clear what underlying factors govern the different generation patterns.
A.1 Design Choices for Consistent Contrastive Learning
We investigate the design of consistent contrastive learning for multi-view alignment. As shown in Figure A1, we evaluate four factors on Math23K.
Metric for alignment. We consider two metrics for alignment: cosine similarity and L2 distance. The former is a simplification of the conventional contrastive metric using only positive instances.
Granularity of alignment. As shown in Equation 9, we use multi-granularity sub-expressions for alignment. For comparison, we also report results when the two views are aligned using only the global equation representation.
Top-down representation. We investigate how to obtain the representation of sub-expressions, designing two types of representation for the top-down view. First, as shown in Equation 7, we use sub-tree fusion to obtain the representation of each sub-expression, denoted sub-tree fusion. Alternatively, we treat the embedding of the parent node e_n (Equation 1) as the representation of the sub-expression, denoted parent embedding.
Bottom-up representation. For the bottom-up process, there are also two options. As shown in Equation 8, we can use the embedding of the relation mapping as the representation, denoted mapping embedding. Alternatively, we fuse the concatenation of the three embeddings with an MLP layer, r^{B2T}_{sub-exp} = MLP([e_{q_i}; e_{q_j}; e_{op_k}]), denoted triple fusion.

A.2 Visualization
In Figure A2, we show an example from MathQA. The top-down process breaks down the overall problem through 15 reasoning procedures, exactly matching the pre-order traversal. Each procedure includes two steps, node prediction and node decomposition, continuing until the leaf (quantity) nodes. Meanwhile, the bottom-up view predicts the entire equation after five relation extractions following the post-order equation. The two reasoning views work in reverse order.
Then, in the consistent contrastive learning process, the top-down view continuously computes sub-expression representations via sub-tree fusion, while for bottom-up reasoning we directly use the embeddings from each relation extraction as representations. Since the bottom-up process reuses previously constructed sub-expressions (steps 8 and 10), it generates fewer sub-expressions than the top-down process. Finally, all sub-expression representations from both views are aligned in the same latent space.

Figure A1: Evaluation on Math23K using multiple configurations of consistent contrastive learning (cosine similarity vs. L2 distance).

A.3 Case Study
We perform a case study to demonstrate the capability of generating diverse equations. As Figure A3 shows, the algorithm generates equivalent equations that differ from the labeled ones. Most of these predicted equivalent equations can be derived from the labeled equations by simple mathematical deformations, e.g., (57 + 43) × 24 and 57 × 24 + 43 × 24 in case 6. Beyond simple deformations, our algorithm can also solve complex problems using different solving ideas; e.g., in case 8 it starts from a simpler reasoning idea and solves the problem correctly. At the bottom of Figure A3, we also count the distribution of deformation patterns among all diverse predicted equations. We summarize six patterns: the addition and multiplication commutative laws, the multiplication and division distributive laws, different problem-solving ideas, and others. We manually identified the deformation pattern of each equivalent equation predicted by our model. More than half of the equivalent equations (≥ 60%) can be derived by commutative-law deformations of addition or multiplication, nearly 30% by distributive-law deformations, and about 8% stem from different solving ideas (e.g., cases 8 and 9). This shows that our multi-view method has genuine mathematical reasoning capability and can be applied to complex mathematical problems.

Question: In a division sum, the remainder is 8 and the divisor is 6 times the quotient and is obtained by adding 3 to 3 times the remainder. What is the dividend? Answer: 129.5. Equation: ((8*3 + 3) * (8*3 + 3) / 6) + 8.

Figure A2: A MathQA example of the multi-view reasoning and consistent contrastive learning process. It shows the independent reasoning processes of the two views, the computation of sub-expression representations, and multi-granularity alignment.
Example problems from Figure A3:

- Addition and Multiplication Commutative Law: There are 44 willow trees planted on the four sides of a square flower pond, with an interval of 20 meters between every two willow trees. What is the perimeter of the square?
- The cost of each piece of clothing is now 20% lower than in the past. The current cost is what percent of the past cost? Label equation: (1 - 20%) / 1. Predicted equation: 1 - 20%.
- An aqueduct has been repaired for 5.6 kilometers, and the unrepaired part is 2.7 times as long as the repaired part. How many kilometers long is the aqueduct in total? Label equation: 5.6 * (1 + 2.7). Predicted equation: 5.6 * 2.7 + 5.6.
- Multiplication and Division Distributive Law: In one day a supermarket sold 57 cases of Coke and 43 cases of mineral water, with 24 bottles per case. How many bottles of Coke and mineral water were sold in total?
- 96 primary school students visit the Technology Museum. They are divided into 4 teams, and each team into 3 groups. How many people are in each group?

Figure A3: Nine examples demonstrating the capability of our approach to generate equivalent but non-labeled equations. At the bottom, we count the distribution (%) of the six generation patterns among all equivalent equations. Each pattern represents a mathematical deformation using a specific mathematical law. This diverse generation indicates that our model understands the underlying mathematical relations and generates reasonable equations based on mathematical laws.