Investigating Math Word Problems using Pretrained Multilingual Language Models

In this paper, we revisit math word problems (MWPs) from the cross-lingual and multilingual perspectives. We build our MWP solvers on top of pretrained multilingual language models, using a sequence-to-sequence model with a copy mechanism, and compare how the solvers perform in cross-lingual and multilingual scenarios. To facilitate the comparison of cross-lingual performance, we first adapt the large-scale English dataset MathQA as a counterpart of the Chinese dataset Math23K. We then extend several English datasets to bilingual datasets through machine translation plus human annotation. Our experiments show that an MWP solver may not transfer to a different language even if the target expressions share the same numerical constants and operator set. However, a solver generalizes better when the problem types in the target language also exist in the source language.


Introduction
How to use machine learning and NLP techniques to solve Math Word Problems (MWPs) has attracted much attention in recent years (Hosseini et al., 2014; Kushman et al., 2014; Roy et al., 2015; Ling et al., 2017; Wang et al., 2017a, 2018; Amini et al., 2019). Given a math problem expressed in human language, an MWP solver typically first converts the input sequence of words into an expression tree consisting of math operators and numerical values, and then invokes an executor (such as the eval function in Python) on the expression tree to obtain the final numerical answer. Figure 1 shows an example math word problem, the correct expression tree, and the final answer.
Despite the relatively simple syntax of these expression trees, building MWP solvers is not a trivial task, and researchers have proposed various methods to tackle the different challenges of this problem, such as statistical methods (Kushman et al., 2014; Roy et al., 2015), parsing-based methods (Shi et al., 2015) and generation-based methods (Wang et al., 2018; Xie and Sun, 2019). However, an aspect that has been largely overlooked is cross-lingual and multilingual MWP solving, i.e., whether an MWP solver trained on one human language can still work on another human language, or whether an MWP solver trained on multiple human languages together is more effective than a solver trained on only one language. We believe this is an interesting aspect to study for the following reasons. First, in cognitive science, people have long studied the relationship between humans' numerical processing abilities and language abilities, finding that, on the one hand, the two are largely independent (Xu and Spelke, 2000), while on the other hand, "acquiring and mastering symbolic representations of exact quantities critically depends on language and instruction" (Van Rinsveld et al., 2015). It is therefore also intriguing to study whether machines acquire arithmetic and language abilities separately. Second, with pretrained large-scale multilingual language models such as mBERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), which presumably project different human languages into a common embedding space, we have seen some success in cross-lingual NLP tasks such as XNLI (Conneau et al., 2018) and MLQA (Lewis et al., 2020) in both zero-shot and few-shot settings (Wu and Dredze, 2019; Conneau et al., 2020). It is therefore reasonable to expect that a machine's capability of MWP solving can be transferred from one language to another by leveraging these pretrained multilingual language models.
In this paper, we conduct an empirical study to understand to what extent MWP solvers can work in cross-lingual and multilingual settings. Specifically, we ask the following questions: (1) Cross-lingual setting: given a model trained on a monolingual dataset, can the model solve MWPs in another language? (2) Multilingual setting: can combining datasets of different languages further boost the performance for each language? (3) Can we identify critical factors that may affect the results in (1) and (2)?
In order to empirically answer the questions above, we need multilingual MWP datasets, which are currently limited. We first use large-scale datasets, namely Math23K (Wang et al., 2017b) and MathQA (Amini et al., 2019), as monolingual MWP resources, and further adapt MathQA to have the same operator set and expression style as Math23K. To better evaluate the models with a parallel corpus, we extend some existing MWP datasets by translating them from English into Chinese. We then conduct three sets of experiments on the constructed datasets. We find that: (1) a cross-lingual MWP solver fine-tuned on one language cannot work on a second language, even if the two share the same decoding vocabulary; (2) a multilingual MWP solver may not boost performance for all training languages, but can improve on problems of similar types if one training language is close to the evaluation language; (3) combining (1) and (2), we conclude that for multilingual MWP solvers, performance relies heavily on domain similarity (problem types) rather than on language similarity alone.

Our work makes the following contributions: (1) To the best of our knowledge, we are the first to study cross-lingual and multilingual MWP solving, and we empirically demonstrate that cross-lingual MWP solving is still difficult, but multilingual MWP solving is effective to some extent. (2) We discover that multilingual MWP solving is mostly effective for questions with similar problem types. (3) Our constructed datasets can help other researchers further study cross-lingual and multilingual MWP solving.

Related work
Solving Math Word Problems (MWPs) has been attracting researchers since the emergence of artificial intelligence. STUDENT (Bobrow, 1964) is a rule-based math word problem solver built as a pipeline of heuristics for pattern transformation. Many researchers start with fundamental problem types such as addition and subtraction (Hosseini et al., 2014) or problems with only a single operator (Roy et al., 2015). Roy and Roth (2015) look at problems that require multiple steps using two or more operators. The question types of MWPs have also been expanding. Rather than focusing on problems that need only one variable, Kushman et al. (2014) propose ALG514, a dataset that includes problems with a system of equations. With the development of deep learning, there has been a demand for large-scale datasets with more variation. Dolphin18K (Huang et al., 2016) is a large-scale dataset that is more than 9 times the size of previous ones and contains many more problem types. Math23K (Wang et al., 2017a) contains math word problems for elementary school students in Chinese and is crawled from multiple online education websites. MathQA (Amini et al., 2019) is a large-scale, diverse dataset of 37k multiple-choice math word problems in English, in which each problem is annotated with an executable formula using a new operation-based representation language. HMWP (Qin et al., 2020) contains three types of MWPs: arithmetic word problems, equation-set problems, and non-linear equation problems.
Various approaches have been proposed to solve MWPs. Template-based approaches (Kushman et al., 2014; Zhou et al., 2015; Upadhyay et al., 2016; Huang et al., 2017) are widely adopted, since the numbers appearing in expressions are sparse in the representation space and many expressions fall into the same category. More recently, the community has also been paying more attention to training math solvers by fine-tuning pretrained language models. For example, EPT (Kim et al., 2020) adopts ALBERT (Lan et al., 2020) as the encoder for its sequence-to-sequence module.
The monolingual performance gains achieved recently have not been evaluated from cross-lingual and cross-domain perspectives. Therefore, we revisit MWPs using current state-of-the-art pretrained multilingual language models to construct a competitive math solver and conduct experiments over various bilingual evaluations.

The MWP Solver Task
We first formally define the task of building MWP solvers. Given a math word problem with n words W = (w_1, w_2, ..., w_n) and numerical values N = (n_0, n_1, ..., n_k), the model needs to generate a flattened tree representation using operators from a permitted operator set O and numerical values from the constants C and N. The generated tree should be executable by a compiler and executor to return a numerical value.
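To make the notation concrete, here is a minimal, hypothetical instance of the task in Python, using the chef problem from Figure 1; the prefix encoding of the flattened tree and the contents of the constant list are illustrative assumptions, not the datasets' exact annotation format:

```python
# Hypothetical instance of the MWP solver task, using the Figure 1 problem:
# "A chef needs to cook 9 potatoes. He has already cooked 7. If each potato
#  takes 3 minutes to cook, how long will it take him to cook the rest?"
W = ("A chef needs to cook 9 potatoes . He has already cooked 7 . "
     "If each potato takes 3 minutes to cook , how long will it take "
     "him to cook the rest ?").split()
N = [9.0, 7.0, 3.0]              # numerical values n0, n1, n2 from the text
O = ["+", "-", "*", "/", "**"]   # permitted operator set
C = [1.0, 3.141592653589793]     # constants (illustrative)
# Flattened (prefix) tree the solver should generate: (n0 - n1) * n2
tree = ["*", "-", "n0", "n1", "n2"]
answer = (N[0] - N[1]) * N[2]    # executing the tree yields 6.0
```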

Solution Framework
An MWP solver needs to generate executable code in a target programming language, which is then evaluated by an executor for that language.
Our MWP solver is built upon a sequence-to-sequence model with copy mechanism (Gu et al., 2016). Specifically, we use a pretrained multilingual model as the encoder to obtain contextualized representations of math word problems. Because of the word-piece tokenizer, the encoded context is not well aligned with the original input words, so we map the word pieces back to input words through mean pooling. We then pass the mean-pooled word representations to a bidirectional LSTM. Finally, we use an LSTM decoder with copy mechanism, which takes in the last decoded word vector and the intermediate reading states, to predict the next token one by one. When decoding finishes, we expect to obtain a linear tree representation. We give the full model details in Section A.
Given the decoded tree representation, we first convert the generated linear tree representation into a Python expression with basic operators (+, -, *, /, **), and then use Python's built-in function eval to execute the generated code.
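The decode-then-execute step described above can be sketched as follows. This is a minimal illustration, not the authors' released code, and it assumes the linear tree representation is in prefix order with placeholder tokens n0, n1, ... for the numerical values:

```python
# Minimal sketch: convert a prefix-order linear tree into a Python
# expression and execute it with eval, as described in the text.
BINARY_OPS = {"+", "-", "*", "/", "**"}

def prefix_to_infix(tokens, values):
    """Convert a prefix token list into a parenthesized Python expression.
    `values` maps placeholder tokens like 'n0' to their numeric literals."""
    def build(pos):
        tok = tokens[pos]
        if tok in BINARY_OPS:
            left, pos = build(pos + 1)
            right, pos = build(pos)
            return f"({left} {tok} {right})", pos
        return str(values.get(tok, tok)), pos + 1
    expr, _ = build(0)
    return expr

tokens = ["*", "-", "n0", "n1", "n2"]       # (n0 - n1) * n2
values = {"n0": 9, "n1": 7, "n2": 3}
expr = prefix_to_infix(tokens, values)
print(expr, "=", eval(expr))                # ((9 - 7) * 3) = 6
```

Operands that are neither operators nor placeholders (e.g. constants) pass through unchanged, so the same routine covers the full decoding vocabulary V = O ∪ C.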

Existing Datasets
We use two large-scale datasets for this cross-lingual research: Math23K (Wang et al., 2017a) in Chinese and MathQA (Amini et al., 2019) in English. Although the two datasets are similar in size and question types, there are still differences in terms of permitted operators and annotations.

Math23K
The dataset Math23K (Wang et al., 2017a) contains math word problems for elementary school students in Chinese (zh) and is crawled from multiple online education websites. The dataset focuses on single-variable arithmetic problems and contains 23,161 problems labeled with structured equations and answers.
MathQA The dataset is a large-scale, diverse dataset of 37k multiple-choice math word problems in English (en). Each question is annotated with an executable formula using a new operation-based representation language (Amini et al., 2019), and the dataset covers multiple math domain categories. To make MathQA a comparable counterpart of Math23K, we filter MathQA for the problems solvable with the shared permitted operators to create an adapted MathQA dataset.
Other datasets focusing on specific problem types These datasets are smaller in size but more focused on specific problem types. We follow the dataset naming conventions from MAWPS (Koncel-Kedziorski et al., 2016).
Specifically, AddSub (Hosseini et al., 2014) covers arithmetic word problems on addition and subtraction for third, fourth, and fifth graders. Its problem types include combinations of additions and subtractions, single-unknown equations, and U.S. money word problems. SingleOp (Roy et al., 2015) is a dataset of elementary math word problems with a single operation. MultiArith (Roy and Roth, 2015) includes problems with multiple steps; we list all seven problem types in Table 1. These datasets are all in English (en). We illustrate how we extend them into bilingual datasets in Section 4.2.

Cross-lingual and Multilingual MWP Solvers
In this work, as we are focusing on the cross-lingual and multilingual properties of MWPs, we need to train separate MWP solvers using different datasets.
Our cross-lingual MWP solver is trained on one language but evaluated on another, while our multilingual MWP solver can be trained on all available languages and evaluated on each separately. To achieve these goals, the problems in different languages should have comparable properties. Since we use pretrained multilingual language models as the sequence embedder of the encoder, all languages can be projected into a shared representation space. However, the candidate datasets also need to share a common operator set and numerical constants to make the decoding process consistent, whereas some categories in MathQA do not exist in Math23K or involve operators outside our permitted set. Therefore, we adapt MathQA as a counterpart of Math23K sharing the same decoding vocabulary, including operators and constants.

Adaptation of MathQA
We adapt MathQA as follows: 1) The annotated formulas in MathQA are calls to predefined functions, which can be converted into a tree using an abstract syntax tree (AST) parser. 2) To be consistent with Math23K, which covers only the basic arithmetic operators, i.e., addition (Add), subtraction (Sub), division (Div), multiplication (Mult) and exponentiation (Pow), we keep only the MathQA functions that can be expressed with these operators. For example, volume_sphere(r) from MathQA is equivalent to $\frac{4}{3}\pi r^3$ and is adapted using the method shown in Figure 2. Formulas containing operators not used in Math23K, such as sine and permutation, are not considered in this work. A full list of adapted operators can be found in Table 6 of Appendix A. 3) Upon constructing the trees using the permitted operators, we evaluate each sample to verify its correctness against its ground-truth answer. Cases that fail to produce the correct answer are discarded.
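The AST-based rewriting in steps 1) and 2) can be sketched with Python's ast module. The rewrite table below is illustrative: volume_sphere is the paper's own example, while the other entries and the exact rewriting logic are our assumptions rather than the released adaptation code:

```python
import ast

# Sketch: parse a MathQA-style annotated formula and recursively rewrite
# known function calls into basic-operator expressions shared with Math23K.
REWRITES = {
    # volume_sphere(r) == (4/3) * pi * r**3  (the paper's example)
    "volume_sphere": lambda a: f"(4 / 3 * 3.141592653589793 * {a[0]} ** 3)",
    "add": lambda a: f"({a[0]} + {a[1]})",        # illustrative
    "divide": lambda a: f"({a[0]} / {a[1]})",     # illustrative
}

def adapt_formula(formula: str) -> str:
    """Rewrite a formula string; raise if an operator is not permitted."""
    def rewrite(node):
        if isinstance(node, ast.Call) and node.func.id in REWRITES:
            return REWRITES[node.func.id]([rewrite(a) for a in node.args])
        if isinstance(node, ast.Constant):
            return str(node.value)
        if isinstance(node, ast.Name):
            return node.id
        raise ValueError("operator not in permitted set")
    return rewrite(ast.parse(formula, mode="eval").body)

print(adapt_formula("volume_sphere(2)"))
# (4 / 3 * 3.141592653589793 * 2 ** 3)
```

Formulas whose functions are absent from the table raise an error, mirroring step 2)'s filtering of unsupported operators, and the rewritten expression can be checked against the ground-truth answer with eval as in step 3).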
After the adaptation, we get an adapted MathQA dataset of solvable problems with sizes and question types comparable to Math23K. For Math23K, we further sample a development set of size 1000.

Extension to bilingual datasets We also reuse several smaller English datasets, including AddSub (Hosseini et al., 2014), SingleOp (Roy et al., 2015) and MultiArith (Roy and Roth, 2015). To extend these datasets for cross-lingual evaluation, we use online machine translation APIs to translate them into Chinese and further manually refine the translations to be more natural.

For each dataset, we list an example in Table 3, in both English (En) and Chinese (Zh).

Table 3: Examples from each dataset used for zero-shot cross-lingual evaluation.

Template-based Contrastive Training
Math word problems can be categorized by expression templates if we replace the numerical values in expressions with a special token. Such templates have been adopted for supervision in other math solver approaches (Wang et al., 2018; Xie and Sun, 2019). Different from these methods, we do not use templates directly for supervision; instead, we make the assumption that problems sharing the same template are closer to each other from the arithmetic point of view, regardless of the surface forms of their languages and descriptions.
To make use of this assumption, we introduce inter-language template-based contrastive training into our training process. Specifically, we first group math word problems by their templates. During training, we pair each problem with a random sample from a different language that shares the same template.
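A minimal sketch of this grouping-and-pairing step follows; the data layout and field names are our own assumptions for illustration. It also anticipates the "CL" fallback described below, in which a problem without a cross-lingual partner contrasts with itself:

```python
import random
from collections import defaultdict

# Group problems from both languages by expression template, then pair
# each problem with a random same-template problem from the other language.
def build_pairs(problems, seed=0):
    """problems: list of dicts with keys 'lang', 'template', 'text'."""
    rng = random.Random(seed)
    by_template = defaultdict(list)
    for p in problems:
        by_template[p["template"]].append(p)
    pairs = []
    for p in problems:
        candidates = [q for q in by_template[p["template"]]
                      if q["lang"] != p["lang"]]
        positive = rng.choice(candidates) if candidates else p  # CL: self
        pairs.append((p, positive))
    return pairs

problems = [
    {"lang": "en", "template": "(n0 - n1) * n2", "text": "A chef ..."},
    {"lang": "zh", "template": "(n0 - n1) * n2", "text": "一位厨师 ..."},
    {"lang": "en", "template": "n0 + n1", "text": "Some apples ..."},
]
pairs = build_pairs(problems)
```

Under the stricter "CL + TC" setting, problems whose candidate list is empty would simply be dropped instead of paired with themselves.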
With M being the representation learned by the encoder in Section A, we use its max-pooling with normalization as the latent representation of each problem and of its positive sample, denoted as z and z^+ respectively. We then conduct batch-level contrastive training similar to SimCLR (Chen et al., 2020) and use the NT-Xent loss (the normalized temperature-scaled cross-entropy loss):

$\mathcal{L}_i = -\log \frac{\exp(\langle z_i, z_i^{+} \rangle / \tau)}{\sum_{j=1}^{N} \exp(\langle z_i, z_j^{+} \rangle / \tau)},$

where $\langle \cdot, \cdot \rangle$ is the inner product of two vectors, $\tau$ is the temperature, and N is the batch size. It is worth noting that the distribution of templates is highly skewed. In our experiments, we therefore consider two settings: (1) CL, contrastive learning, where a problem that has no same-template candidate from another language contrasts with itself; (2) CL + TC, contrastive learning with template constraint, where we use only problems that have at least one same-template sample from another language.
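As a hedged illustration, the loss can be computed in NumPy for a batch of N latent vectors z and their positives z_pos. This is a simplified one-directional variant that contrasts each anchor against all positives in the batch, not the training code itself, and the temperature value is arbitrary:

```python
import numpy as np

# Simplified NT-Xent sketch: each row of z is an anchor, the matching row
# of z_pos is its positive; all other rows of z_pos act as negatives.
def nt_xent(z, z_pos, tau=0.1):
    z = z / np.linalg.norm(z, axis=1, keepdims=True)        # L2-normalize
    z_pos = z_pos / np.linalg.norm(z_pos, axis=1, keepdims=True)
    sim = z @ z_pos.T / tau          # pairwise scaled inner products, (N, N)
    # cross entropy with the matching positive on the diagonal
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

z = np.array([[1.0, 0.0], [0.0, 1.0]])   # two toy problem representations
loss = nt_xent(z, z, tau=1.0)            # positives perfectly aligned
```

The loss shrinks as each anchor becomes more similar to its own positive than to the other positives in the batch.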
Our contrastive learning approach differs from that of Li et al. (2022) in the following ways: (1) our method focuses on the cross-lingual setting, where the two examples of each pair come from different languages; (2) we use batch-level contrastive training, consistent with SimCLR.
Other works also exploit latent representations of math word problems to enhance the generalization ability of math solvers. For example, Liang and Zhang (2021) design a teacher module that makes the latent vector match the correct solution rather than its variations.

Experiment Setup
Evaluation metrics: The model is expected to be a math problem solver, so the generated expressions should be executable by a specific compiler and executor. During evaluation, a problem is counted as solved if the absolute error rate between the executed value and the target value is lower than a predefined threshold; in our experiments, we choose 1e-4 as the threshold. The final evaluation metric is the proportion of solved problems among all problems.
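The solved-problem check can be sketched as below. We interpret "absolute error rate" as relative error against the target value; the guard for a zero-valued target is our own assumption:

```python
# Sketch of the evaluation metric: a prediction counts as solved when its
# error rate relative to the ground truth is below the 1e-4 threshold.
def is_solved(predicted: float, target: float, threshold: float = 1e-4) -> bool:
    denom = abs(target) if target != 0 else 1.0   # assumption: guard zero targets
    return abs(predicted - target) / denom < threshold

def accuracy(predictions, targets):
    solved = sum(is_solved(p, t) for p, t in zip(predictions, targets))
    return solved / len(targets)
```

For example, a prediction of 6.000001 against a target of 6.0 counts as solved, while 6.1 does not.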
Other experiment settings: We use multilingual BERT (mBERT) (Devlin et al., 2019) for cross-lingual training. We train our models on one NVIDIA 2080 Ti GPU with a batch size of 160. The learning rate is set to 3e-5 with a polynomial-decay scheduler. Training lasts for at most 150 epochs and stops early after 30 epochs without improvement.

Results
We list the experimental results of all methods in Table 4.

Cross-lingual MWP Solver
The first research question is to what extent an MWP solver trained on one language can work on another language with the help of pretrained multilingual language models. Table 4 shows that the MWP solvers trained on either Math23K (mBERT-zh) or MathQA (mBERT-en) achieve impressive performance when tested on the same language. However, the performance on a different language drops drastically and is almost negligible. In short, an MWP solver trained on one language is almost non-transferable when evaluated on a second language, even with the same operator set.

Multilingual MWP Solver
The second research question is whether training an MWP solver on multiple languages improves its effectiveness compared with training on a single language. We can see that mixing the two languages during training gives us a more language-agnostic model, as the performance on the test split of both languages is competitive with the monolingual cases. Moreover, on the newly extended bilingual datasets, there are consistent improvements for most of the datasets, especially for English.
Considering the difficulty of the problems, these bilingual evaluation datasets are closer to Math23K (primary school) than to MathQA (GRE or GMAT). Given that mBERT-zh also outperforms mBERT-en on the English problems, we suspect that domain similarity matters more than language similarity for MWP solvers.

Template-based Contrastive Training
The last section of Table 5 shows how contrastive learning affects performance. First, adding contrastive learning further boosts performance on the test sets of both languages, with a significant increase (3 points) for Math23K. However, in the zero-shot evaluation settings, performance on English drops consistently. We suspect this is because the diversity of templates in MathQA is much larger than that in Math23K.
Therefore, we further conduct a template-constrained experiment that ensures each template can be found in both languages. Because the number of training cases is reduced, performance on the test sets drops by a large margin. However, the English problems in the zero-shot setting benefit most from this experiment, which further verifies that the performance of multilingual MWP solvers depends closely on the problem types of the training set.

Conclusion
In this paper, we revisit math word problems using a generation-based method built over pretrained multilingual models. To assist the analysis of cross-lingual properties of math solvers, we adopt two large-scale monolingual datasets and further adapt MathQA into the same annotation framework as Math23K. We also reuse earlier, smaller datasets and upgrade them into bilingual datasets by machine translation and manual checking. Our experiments show that an MWP solver may not transfer to a different language even if the target expressions share the same operator set and constants. However, in both cross-lingual and multilingual cases, solvers generalize better when problem types exist in both the source and target languages. Problems considered easy by humans may still be hard for a math solver trained on the same language but on a different domain. This tells us that for math word problem solvers, it might be beneficial to balance different question types and permitted operators during training.

A Method
In this section, we construct a generation-based MWP solver using a sequence-to-sequence model with copy mechanism. The whole model is visualized in modules in Figure 3. The detailed illustration of each module is given as follows. Encoder Our encoder is built upon a pretrained multilingual transformer, either BERT or XLM-R. Suppose an input word $w_i$ is tokenized into word pieces $(x_{i1}, x_{i2}, \dots)$, and let $h_{ij} \in \mathbb{R}^{d_h}$ denote the hidden vector produced by the pretrained model for $x_{ij}$. We use average pooling over the word pieces to get the representation of the word $w_i$, denoted as $h_i$. We then feed this contextualized representation of the math word problem into a two-layer bidirectional LSTM, whose output is the sequence of encoder hidden states for decoding, denoted as $M = (m_0, m_1, \dots, m_n)$.
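The word-piece-to-word pooling can be sketched in NumPy as follows; the shapes and the piece-to-word mapping are illustrative, not the model's actual tokenizer output:

```python
import numpy as np

# Sketch of mapping word-piece vectors back to word vectors by mean pooling.
def pool_word_pieces(piece_vectors, word_ids):
    """piece_vectors: (num_pieces, d_h); word_ids[j] = index of the word
    that piece j belongs to. Returns one mean-pooled vector per word."""
    word_ids = np.asarray(word_ids)
    n_words = word_ids.max() + 1
    return np.stack([piece_vectors[word_ids == i].mean(axis=0)
                     for i in range(n_words)])

H = np.array([[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]])  # 3 word pieces, d_h = 2
h = pool_word_pieces(H, [0, 0, 1])  # pieces 0,1 -> word 0; piece 2 -> word 1
# h[0] = [2.0, 1.0], h[1] = [5.0, 4.0]
```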
Decoder We use an LSTM cell as the decoding cell to predict the next token. At each decoding step $t$, the cell accepts the embedding of the previous word as input and outputs a decoder state $s_t \in \mathbb{R}^{d_s}$. Most of the numerical values in MWPs do not exist in the target vocabulary, so we need a copy mechanism (Gu et al., 2016) to facilitate the generation of numerical values during decoding.
Following Gu et al. (2016), the copy score of the $i$-th source position is calculated as

$\psi_c^{(t)}(i) = \tanh(m_i^{\top} W_c)\, s_t,$

where $W_c \in \mathbb{R}^{d_h \times d_s}$. However, the embedding of a copied token would be identical to that of an out-of-vocabulary token. To better capture the information from the last decoding step, we use the copy scores to further derive a state that selects from the source tokens, which is called Selective Read.
We use bilinear attention to attentively read information from $M$, producing the context vector

$c_t = \sum_{i} \alpha_{t,i}\, m_i, \quad \alpha_{t,i} = \mathrm{softmax}_i\big(m_i^{\top} W_a\, s_t\big),$

which is called Attentive Read, where $W_a \in \mathbb{R}^{d_h \times d_s}$.
From the problem definition, the target vocabulary is $V = O \cup C$. The generation score of the next token $v \in V$ is given by a learned projection of the decoder state onto the target vocabulary. The state-updating process of the decoding cell takes in a fusion of the last word embedding $e_t \in \mathbb{R}^{d_e}$, the selective-read state $b_t$ and the attentive-read state $c_t$:

$s_t = \mathrm{LSTM}\big(s_{t-1},\, W_s [e_t; b_t; c_t]\big),$

where $W_s \in \mathbb{R}^{d_s \times (d_e + d_h + d_h)}$.
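The copy score and attentive read can be sketched in NumPy as follows; random matrices stand in for the learned parameters, so this illustrates only the shapes and score computations, not the trained model:

```python
import numpy as np

# Sketch of the decoder-side reading operations with random stand-in weights.
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def copy_scores(M, s_t, W_c):
    # psi_c(i) = tanh(m_i^T W_c) s_t : one score per source position
    return np.tanh(M @ W_c) @ s_t

def attentive_read(M, s_t, W_a):
    # bilinear attention: alpha_i ∝ exp(m_i^T W_a s_t); c_t = sum_i alpha_i m_i
    alpha = softmax(M @ W_a @ s_t)
    return alpha @ M

rng = np.random.default_rng(0)
n, d_h, d_s = 4, 6, 5                       # illustrative dimensions
M = rng.normal(size=(n, d_h))               # encoder hidden states
s_t = rng.normal(size=d_s)                  # decoder state
W_c = rng.normal(size=(d_h, d_s))           # copy projection
W_a = rng.normal(size=(d_h, d_s))           # bilinear attention weights
psi = copy_scores(M, s_t, W_c)              # shape (n,)
c_t = attentive_read(M, s_t, W_a)           # shape (d_h,)
```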

Figure 1: Example of an MWP and its expression tree. Problem: "A chef needs to cook 9 potatoes. He has already cooked 7. If each potato takes 3 minutes to cook, how long will it take him to cook the rest?"

Figure 2: Adaptation of MathQA to Math23K. The part highlighted with dashed lines shows the adaptation of the function volume_sphere.

Figure 3: Sequence-to-sequence model with copy mechanism.

Table 1: Datasets focusing on specific problem types.

Table 4: Comparison of different cross-lingual models on the test set and zero-shot datasets.

Table 5: Performance of template-based contrastive training models on the test set and zero-shot datasets.

Table 6: Operators adapted in MathQA.