Towards Robust Numerical Question Answering: Diagnosing Numerical Capabilities of NLP Systems

Numerical Question Answering is the task of answering questions that require numerical capabilities. Previous works introduce general adversarial attacks to Numerical Question Answering, but do not systematically explore the numerical capabilities specific to the topic. In this paper, we propose to conduct numerical capability diagnosis on a series of Numerical Question Answering systems and datasets. A series of numerical capabilities are highlighted, and corresponding dataset perturbations are designed. Empirical results indicate that existing systems are severely challenged by these perturbations. E.g., Graph2Tree experienced a 53.83% absolute accuracy drop against the “Extra” perturbation on ASDiv-a, and BART experienced a 13.80% accuracy drop against the “Language” perturbation on the numerical subset of DROP. As a counteracting approach, we also investigate the effectiveness of applying perturbations as data augmentation to relieve systems’ lack of robust numerical capabilities. Through experimental analysis and empirical studies, we demonstrate that Numerical Question Answering with robust numerical capabilities remains to a large extent an open question. We discuss future directions of Numerical Question Answering and summarize guidelines for future dataset collection and system design.


Introduction
Numeracy is an essential part of real-world NLP applications (Sundaram et al., 2022; Thawani et al., 2021b; Sundararaman et al., 2020; Spithourakis and Riedel, 2018). Numerical QA (Question Answering) is one representative group of such number-dependent NLP tasks, e.g., Math Word Problem Solving (Zhang et al., 2020a; Miao et al., 2020; Koncel-Kedziorski et al., 2016), Discrete Reasoning (Dua et al., 2019; Hu et al., 2019; Al-Negheimish et al., 2021a), and Tabular Question Answering (Zhong et al., 2017; Chen et al., 2020b; Zhu et al., 2021; Chen et al., 2021). These Numerical QA tasks require NLP systems to arrive at a numerical answer from the numbers in the question and context. By studying how existing NLP systems perform on these Numerical QA tasks, we can take a glimpse at what capabilities are required for building the NLP systems of the future.
In an ad-hoc manner, a line of work has revealed robustness issues in how existing NLP systems handle Numerical QA. Through adversarial attacks with designed dataset perturbations, number-related limitations were exposed, e.g., utilizing spurious correlations in datasets (Patel et al., 2021; Kumar et al., 2021; Al-Negheimish et al., 2021b; Pi et al., 2022b), incorrectly representing numbers (Nogueira et al., 2021; Kim et al., 2021), and failing to extrapolate (Kim et al., 2021; Pal and Baral, 2021). This line of work inspires us to ask the following questions: 1) What is the overall landscape of robustness issues of numerical capabilities in existing NLP systems? Can we find a more systematic way to investigate the number-related limitations? 2) How can we diagnose each numerical capability and evaluate the severity of it not being captured by a system? Can we further develop new adversarial perturbation methods on Numerical QA for diagnosis and evaluation? 3) How can the numerical robustness issues be addressed? How do existing solutions work and what are possible future directions?
To answer the above questions, in this paper we propose the DNC (Diagnosing Numerical Capabilities) framework, as shown in Figure 1.
Most existing Numerical QA systems (see §2.1) take a two-stage approach to extract and manipulate numbers. As shown in the QA Stages part of Figure 1, systems usually first recognize numbers in the context and question and treat them as candidate operands. Then, with an understanding of the question semantics, they select the corresponding operands, and explicitly generate logical forms or implicitly execute operations to get the final result.

[Figure 1: Four capabilities are required to complete the stages, each mapping to two perturbations. Perturbations can be applied to the appropriate train / validation / test splits of Numerical QA datasets under the Attack or Defense setting. Models of the NLP systems are trained and then evaluated on the perturbed datasets as a diagnosis of their numerical capabilities.]
The above two stages correspond to the two groups of numerical capabilities (see §4.1) covered by our DNC Framework (as shown in Figure 1). In Stage 1, we focus on a system's capabilities to recognize different forms of numbers ("Number Detection"), and to parse and represent number values correctly ("Number Value Understanding"). In Stage 2, we focus on the capabilities to correctly choose operands ("Operand Selection") and operations ("Operation Reasoning") by understanding the context and question. For each of these four capabilities, we propose two perturbations (see §4.2) to diagnose the capability. Each perturbation is designed to be trivial to humans and thus cannot easily fool humans, but it can bring down existing NLP systems (under the "Attack" setting), and thereby expose the robustness issue of lacking the corresponding capability.
By applying the above diagnosis to various NLP systems and Numerical QA datasets (as shown in Figure 1), in §5 we find that existing systems experience significant performance drops, which verifies their lack of robust numerical capabilities. E.g., Graph2Tree experienced a 53.83% absolute accuracy drop against the "Extra" perturbation on ASDiv-a, and BART experienced a 13.80% accuracy drop against the "Language" perturbation on the numerical subset of DROP.
From another point of view, the perturbations are also applicable as data augmentation. Under the "Defense" setting (see §4.3), perturbations are applied to all splits of the dataset. A system's performance under the same perturbation in both the "Attack" and "Defense" settings is compared (in §5.2) to show whether the corresponding robustness issue can be relieved by augmenting the training data. Empirical results indicate that despite the recovery in most cases, the performance still falls below the original level.
Finally, based on the "Attack" and "Defense" results in §5 and additional experiments, in §6 we compare some existing design choices in Numerical QA, such as: Is it better to generate logical forms (and then execute the program/expression) or to predict answers directly in an end-to-end way? Shall we break numbers into subword tokens or substitute them with a placeholder that can later be re-substituted? We also discuss the open questions and future directions on the robust numerical capabilities of NLP systems, including recent relevant developments such as neural program execution and numerical data synthesizing.
In summary, our major contributions are: • We propose the DNC framework to systematically diagnose the robustness of NLP systems with respect to numerical capabilities. A series of number-related perturbation methods are designed for the capabilities.
• We conduct comprehensive diagnosing experiments on adversarial attacks and data augmentation with five systems over three Numerical QA tasks. We show the overall picture of the numerical robustness issues of the systems, and the partial effectiveness of our simple defense mechanism.


Numerical QA Tasks and Systems

Math Word Problem (Kushman et al., 2014; Upadhyay and Chang, 2017; Miao et al., 2020; Qin et al., 2020; Lan et al., 2022) concerns arithmetic questions collected from lower-grade elementary school coursework. Neural networks are employed with different architectures such as Seq2Seq (Wang et al., 2017; Chiang and Chen, 2019), Seq2Tree (Xie and Sun, 2019; Liang et al., 2021) and Graph2Tree (Zhang et al., 2020b; Shen and Jin, 2020). Recently, large end-to-end pretrained language models (Chowdhery et al., 2022; Pi et al., 2022a) have also been showing impressive results on Math Word Problem.
Discrete Reasoning (Dua et al., 2019; Al-Negheimish et al., 2021a; Hu et al., 2019) concerns questions requiring logical and arithmetic operations on real-world paragraphs. Discrete Reasoning systems are mainly based on Graph Attention Networks (Chen et al., 2020a) or the Transformer architecture (Ran et al., 2019).
Tabular QA and Semantic Parsing (Zhu et al., 2021; Chen et al., 2021; Zhong et al., 2017; Pasupat and Liang, 2015) concern question answering in the domain of tabular data, which often involves a large amount of numbers and requires arithmetic aggregations to arrive at the final answer. Tabular QA systems (Dong et al., 2022; Liu et al., 2022; Iida et al., 2021; Herzig et al., 2020; Yin et al., 2020) are mainly based on pretrained language models with Transformer backbones, and mainly aim at converting natural language utterances into executable expressions such as SQL commands.

Numeracy Limitations in NLP Systems
Efforts have been dedicated to revealing numeracy limitations in NLP systems (Patel et al., 2021; Kumar et al., 2021; Al-Negheimish et al., 2021b; Pi et al., 2022b; Nogueira et al., 2021; Kim et al., 2021; Pal and Baral, 2021). However, previous work mainly focused on borrowing adversarial attack methods from general QA, such as re-ordering sentences (Patel et al., 2021; Al-Negheimish et al., 2021b; Kumar et al., 2021), substituting synonyms (Kumar et al., 2021; Pi et al., 2022b), or adding irrelevant information (Patel et al., 2021; Pi et al., 2022b), while having limited exploration into capabilities specific to Numerical QA problems, such as understanding different number values, recognizing different number surface forms, or selecting related numbers.

Preliminaries
A Numerical Question Answering problem is defined to consist of a problem prompt (question) P and a problem body (context) B. Depending on the task type, the problem body takes the form of either a paragraph or a mixture of free-form text paragraphs and structured data such as tables. Let V be the vocabulary of the textual words, Q be the set of the numerical values in P ∪ B, and Q⁺ be the numerical values that can be arithmetically computed from Q. The problem prompt and body can then be formulated as

P = p_1 ⊕ p_2 ⊕ … ⊕ p_{|P|}
B = b_1 ⊕ b_2 ⊕ … ⊕ b_{|B|} ⊕ τ_{1,1} ⊕ τ_{1,2} ⊕ …

Here ⊕ denotes the concatenation operation, p_i and b_j are prompt and body textual words, and τ_{i,j} are the body tabular cells.
The target output T of the problem is either a numerical value T_ans that is an element of Q⁺, or a mathematical expression T_eq that consists of elements of the concerned numerical values Q and the simple operators O = {+, −, ×, ÷}, i.e.,

T ∈ {T_ans, T_eq},  T_ans ∈ Q⁺,  T_eq ∈ (Q ∪ O)*

With P and B as input and T as output, a trained Numerical QA system can be regarded as a mapping f such that

f(P, B) = T

Note that this formulation not only describes the Numerical QA tasks, but also generalizes to other numeracy-related NLP tasks such as Tabular Entailment (Chen et al., 2020b) and Timeseries-based Fraudulent Detection (Padhi et al., 2021).
In this paper, we design and apply perturbations to the samples in the dataset to form a perturbed prompt P⋆, perturbed body B⋆, and perturbed ground truth target T⋆. We show that existing systems are fragile against numerical perturbation by showing that on a large portion of the dataset, the previous mapping fails to generate the correct perturbed target, i.e.,

f(P⋆, B⋆) ≠ T⋆

DNC Framework
Our approach aims at diagnosing the numerical weakness of existing Numerical Question Answering models.We list out and explain a series of numerical capabilities that are critical to solving Numerical Question Answering problems in §4.1.
We then design numerical perturbations targeting these capabilities in §4.2. With the designed perturbations, we examine the weaknesses under two different perturbation settings in §4.3. These three sections are represented in Figure 1 as the "Capabilities" stripe, the "Perturbs" stripe, and the "Attack Setting" and "Defense Setting".

Numerical Capabilities
We classify numerical capabilities into four major categories, concerning different aspects of numerical understanding, as below: Number Detection is the capability of recognizing numbers in different surface forms. For instance, the English word "Forty-two" and the Arabic number "42.0" are regarded as the same number in Numerical QA and should not affect the final arithmetic answer of a question.
Number Value Understanding is the capability of understanding numbers of different value distributions. Systems are expected not only to apply arithmetic calculation to a specific set of numbers (e.g., integers with values smaller than 500, as included in the BERT tokenizer vocabulary). Robust Numerical QA systems are also expected to handle values such as float-point numbers and numbers larger than 500.
Operand Selection is the capability of deciding which numbers to select as the operands in the arithmetic process. One important aspect of selecting related values is to exclude numbers that are 1) irrelevant to the Numerical QA problem scenario, or 2) relevant to the problem scenario but not essential to solving the question. Systems are expected to distinguish the important values from the unimportant ones and select the former as operands.
Operation Reasoning is the capability of inferring operations from the logic pattern described in the text. In an arithmetic process, the operation is independent of the operands, so different operations can be applied to the same set of selected related numbers in different questions. Systems are expected to decouple operations from operands and select the operation in an operand-agnostic way.

Perturbations
Perturbations are designed for each numerical capability. In Table 1, an example problem is provided for each of the perturbations. The formal definitions of the perturbations are provided in Appendix A.
Language Perturbation targets the Number Detection capability and diagnoses how accurately systems can detect numbers in different surface forms. To perturb a number string n_s, we replace it with the English word form of the number using Num2Words. This perturbation changes number surface forms but not their values.
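The Language perturbation can be sketched as follows. The converter below is a self-contained stand-in for the Num2Words package used in the paper; it only covers integers below 100 for illustration.

```python
import re

# Toy integer-to-words converter standing in for Num2Words
# (illustration only; covers 0-99).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def num_to_words(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def language_perturb(text: str) -> str:
    """Replace each integer surface form with its English word form.
    Values are unchanged; only the surface form differs."""
    return re.sub(r"\b\d+\b", lambda m: num_to_words(int(m.group())), text)

print(language_perturb("Tom has 42 apples and eats 7."))
# Tom has forty-two apples and eats seven.
```

A real implementation would call `num2words` directly so that arbitrary values (and non-English locales) are covered.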
Type Perturbation targets the Number Detection capability and challenges systems to detect numbers in their float-point forms. To perturb a number string n_s, we concatenate it with the string ".0". Similar to the Language Perturbation, only the number detection capability is diagnosed by this perturbation. Contrary to the Noise perturbation in the next paragraph, the Type perturbation does not introduce additional calculation difficulty by changing number values.
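A minimal sketch of the Type perturbation, assuming simple regex-based number detection (the paper's filtering restricts this perturbation to integral numbers, so numbers that already carry a fractional part are left untouched):

```python
import re

def type_perturb(text: str) -> str:
    """Append ".0" to each bare integer token, turning it into a
    float-point surface form without changing its value.

    The lookaround assertions skip numbers that are part of an
    existing decimal such as "3.5"."""
    return re.sub(r"(?<!\.)\b\d+\b(?!\.\d)",
                  lambda m: m.group() + ".0", text)

print(type_perturb("Sara bought 3 pens for 12 dollars."))
# Sara bought 3.0 pens for 12.0 dollars.
```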
Noise Perturbation targets the Number Value Understanding capability and challenges systems to understand arithmetic operations on not only integers but also float-point numbers. To perturb a number n, we randomly attach a one-digit fractional part sampled from a uniform distribution. This perturbation introduces new float-point numbers and breaks the original number value distribution in the dataset by adding a random variable.

Distribution Perturbation targets the Number Value Understanding capability and challenges systems to conduct arithmetic with larger integers. To perturb a number n, we randomly offset the value with a normally distributed variable. Based on the observations in Wallace et al. (2019), we choose to perturb the majority of the numbers to values larger than 500. This perturbation introduces large numbers and breaks the original number value distribution in the dataset.
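The two value-changing perturbations can be sketched as below. The sampling parameters mirror those reported in Appendix A, while the function names are illustrative:

```python
import random

def noise_perturb(n: int, rng: random.Random) -> float:
    """Attach a uniformly sampled one-digit fractional part to an
    integer, e.g., 12 -> 12.7 (the gold equation and answer are
    updated accordingly in the perturbed dataset)."""
    return n + rng.randint(1, 9) / 10

def distribution_perturb(n: int, rng: random.Random) -> int:
    """Offset an integer with a normally distributed random variable
    so that most perturbed values exceed 500 (mu=1000, sigma=300 are
    the values reported in Appendix A)."""
    return n + round(rng.gauss(1000, 300))

rng = random.Random(0)
print(noise_perturb(12, rng))         # a value between 12.1 and 12.9
print(distribution_perturb(12, rng))  # typically a value near 1012
```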
Verbosity Perturbation targets the Operand Selection capability and challenges systems to select the correct quantity in the problem by adding explicitly irrelevant numbers to the problem. To perturb a number string n_s, we concatenate it with an irrelevant number in parentheses, preceded by "not". This perturbation introduces numbers without breaking the distribution of relevant numbers in the dataset.
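A minimal sketch of the Verbosity perturbation; for simplicity a fixed distractor value stands in for the normally sampled irrelevant number described in Appendix A:

```python
import re

def verbosity_perturb(text: str, distractor: int = 100) -> str:
    """After each number, insert an explicitly irrelevant number in
    parentheses preceded by "not". The fixed distractor value is
    illustrative; the paper samples it from a normal distribution
    (mu=100, sigma=30)."""
    return re.sub(r"\b\d+\b",
                  lambda m: f"{m.group()} (not {distractor})", text)

print(verbosity_perturb("Lily read 24 pages."))
# Lily read 24 (not 100) pages.
```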
Extra Perturbation targets the Operand Selection capability and challenges systems to exclude irrelevant numbers. To perturb a problem (B, P), an irrelevant sentence containing numbers randomly sampled from the corpus is added to the body B. This perturbation breaks the number distribution by introducing extra instances of different numbers for the same problem.
Logic Perturbation targets the Operation Reasoning capability and challenges systems to choose correct operations for the same set of numbers. For the two datasets described in §5.1, TATQA and ASDiv-a, this perturbation demands additional attention: on TATQA it is based on template matching via SpaCy and automatic conversions, while on ASDiv-a it is based on manual annotation due to the diversity of patterns in the ASDiv-a dataset. This perturbation introduces extra problems with different operations.
Order Perturbation targets the Operation Reasoning capability and challenges systems to choose correct operations for the same set of numbers. On ASDiv-a, the order of sentences in the problem body is manually altered in a manner that changes the order of number occurrence but not the problem logic. This perturbation does not break the operation distribution within the dataset.

Perturbing Settings
With the aforementioned perturbations, we construct perturbed datasets under different settings to investigate systems' numerical capabilities and the effectiveness of the perturbations from different perspectives. For a specific dataset with a training / validation / testing split, different splits are perturbed under different settings. In this paper we consider the two settings of Attack and Defense, as compared in Table 2:

Attack. By applying the perturbations to the testing split of the dataset, we construct a challenge set to evaluate the corresponding numerical capability of existing systems. Systems are trained on the original datasets and evaluated on the perturbed challenge set.
Defense. Under the Defense setting, perturbations are applied to all of the training, validation, and testing splits of the dataset. By comparing systems' performance under the Defense and Attack settings, we investigate to what extent the performance drop can be alleviated by using the perturbations as a data augmentation approach.

To perturb under the Attack or Defense setting, suitable samples are first filtered according to a series of conditions. The perturbations are applied only to these filtered samples. The filtered samples in the dataset split(s) are replaced with their perturbed versions to form the perturbed dataset. The filtering conditions and the formalized algorithm are provided in Appendix B.
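The split-selection logic of the two settings can be sketched as follows; the function and split names are illustrative, not the paper's actual implementation:

```python
from typing import Callable, Dict, List

Sample = dict  # a QA sample, e.g. {"body": ..., "prompt": ..., "target": ...}

def build_perturbed_dataset(
    splits: Dict[str, List[Sample]],
    accepts: Callable[[Sample], bool],   # filtering condition (Appendix B)
    perturb: Callable[[Sample], Sample],
    setting: str,                        # "attack" or "defense"
) -> Dict[str, List[Sample]]:
    """Under Attack only the test split is perturbed (training stays
    original); under Defense all splits are perturbed, turning the
    perturbation into data augmentation. Samples passing the filter
    are replaced in place by their perturbed versions."""
    targets = {"attack": {"test"},
               "defense": {"train", "valid", "test"}}[setting]
    return {
        name: [perturb(s) if name in targets and accepts(s) else s
               for s in samples]
        for name, samples in splits.items()
    }
```

For example, with `setting="attack"` the training split passes through unchanged while every accepted test sample is perturbed.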

Experiment Setup
Datasets. In this paper, we use ASDiv-a (Miao et al., 2020), DROP (Dua et al., 2019), and TATQA (Zhu et al., 2021) as our Numerical Question Answering datasets. For DROP and TATQA, we extract DROP-num and TATQA-a, their numerical subsets. The statistics of these datasets are shown in Table 4.
Systems. We selected representative systems for each dataset and tested their performance against the perturbations. For the ASDiv-a dataset, we use Graph2Tree (Patel et al., 2021). For the DROP dataset, we use BART-base and T5-base from Huggingface. For the TATQA dataset, we utilize TagOps with the RoBERTa backbone as described in the original paper.
Compute Environment. All experiments are run on a Linux machine equipped with 4 NVIDIA Tesla V100 16GB GPUs. The average runtime of our experiments ranges from one to three hours.
Hyperparameters. In our experiments, we adopt a general hyperparameter setting of epoch number = 40, learning rate = 1e-5, and batch size = 32. It is observed in our exploratory experiments that while hyperparameters such as learning rate and batch size do affect the absolute performance of the models, they have a modest effect on the general trend of the models' strengths and weaknesses against the numerical perturbations. The details and analysis are provided in Appendix C.

Experiment Results and Analysis
The experiment results are provided in Table 3. The metrics we report are 1) the metric on the original datasets (Original), and 2) the absolute change of the metric on the perturbed datasets, denoted by "∆". We additionally provide the raw metric and relative drop in Table 9 and Table 10 in the Appendix. The calculation details of the observations can be found in Appendix D.2.
Attack. As can be observed in Table 3 and Table 10, most systems were severely challenged under the Attack setting and experienced significant performance drops, ranging from 5% to 50% absolute and from 5% to 80% relative in answer denotation accuracy.

[Table 3: The results of the DNC Framework. Five NLP systems are evaluated on three Numerical QA tasks under both Attack and Defense settings. The symbol "∆" stands for the absolute metric difference between the current setting and the original setting. The color scale represents the distance from the original setting; deeper means further from the original setting. For ASDiv-a, Acc_eq and Acc_ans refer to the prediction accuracy of ground truth equations and the denotation accuracy of answers, respectively. For DROP-num and TATQA-a, Acc refers to the denotation accuracy of the answers. We provide the raw performance and relative change of the metrics w.r.t. the original setting in Appendix D.1. "-" denotes that automatic perturbation and automatic data augmentation as described in §4.3 are not applicable here; we provide a detailed explanation of the reasons in Appendix E.]

Perturbations stemming from the Semantic Parsing goal caused a larger relative drop, as compared to the 13.15% absolute drop and 19.66% relative drop by Numerical Parsing.
Among the considered systems, Transformer-based Seq2Seq systems (T5, BART, GPT2) are more sensitive than the task-specific Graph2Tree system against the perturbations stemming from the Numerical Parsing goal. The former resulted in a 17.42% absolute drop and 27.06% relative drop, while Graph2Tree only experienced a 3.07% absolute drop and 4.48% relative drop. The masking of numbers used by Graph2Tree allows it to remain unaffected by a portion of the perturbations targeting the Numerical Parsing goal.
Defense. As a counteracting approach, the defense mechanism helps alleviate systems' lack of the corresponding numerical capabilities by applying automatic perturbations to the training and validation sets. Via Defense, the weaknesses related to Semantic Parsing recover more than those related to Numerical Parsing (17.96% absolute improvement and 26.95% relative improvement vs. 6.52% absolute improvement and 11.42% relative improvement).
Among the considered systems, Transformer-based Seq2Seq systems benefit more from Defense than the Graph2Tree system (12.53% absolute improvement and 20.52% relative improvement vs. 11.58% absolute improvement and 16.88% relative improvement).
Despite the recovery from Defense, the challenge is still not solved, as the majority of the defense performance remains more than 10% below the original performance. This observation indicates that the lack of numerical capabilities is still an open question.
Summary. Our DNC framework provides insights into two major aspects of the diagnosis of Numerical QA systems: 1) It is demonstrated that severe numerical weaknesses exist in current Numerical QA systems ("Attack"), and they cannot be trivially eliminated by an automatic data augmentation process ("Defense"), although they do benefit from it.
2) The systems' weaknesses are explicitly profiled in a quantitative and interpretable manner through the models' susceptibility difference to a diversity of perturbations.

Guidelines and Open Directions
In this section, we summarize phenomena observed across different systems and datasets to provide a comparison of existing methods. We also discuss recent related efforts corresponding to these phenomena to point out open directions in the domain of Numerical QA.
Target: Logical Form Generation vs. Answer Predicting

One attribute specific to Numerical QA is the reasoning process leading to the numerical answer, which is usually described by logical forms. On datasets where the ground truth logical forms are provided as additional supervision (e.g., ASDiv-a and TATQA), systems have two options for the target: 1) Logical Form Generation, where systems generate the logical form, which is later fed to external symbolic executing systems such as Python scripts or SQL engines, and 2) Answer Predicting, where systems directly predict the output answer in an end-to-end manner. On datasets where ground truth logical forms are not provided (e.g., DROP), the latter is the most frequently adopted approach. Logical Form Generation and Answer Predicting differ in which actor conducts the executing step of the logical form insinuated by the question (external symbolic systems vs. neural systems). With Answer Predicting, systems are expected to possess the capability of executing the logical forms internally.
We investigate to what extent existing systems possess this execution capability by comparing the impact of the problem target T in Numerical QA on ASDiv-a. The systems are trained to predict two different targets: 1) the logical form (i.e., the MWP equation), and 2) the logical form and the execution result. Since most MWP-specific systems are incapable of predicting answers directly, we choose the Transformer-based systems GPT2, BART and T5. Results in Table 5 indicate that: 1) on existing systems, Logical Form Generation is beneficial for higher accuracy, and 2) even though models manage to compose equations with high accuracy, they struggle to faithfully execute an equation to get the correct answer.
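The division of labor under Logical Form Generation can be illustrated with a toy external executor. This single-operator sketch is ours, not the paper's implementation; real systems delegate much richer logical forms to Python scripts or SQL engines:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def execute(equation: str, numbers: dict) -> float:
    """Minimal external executor for the Logical Form Generation
    target: the model emits a (here: single-operator, space-separated)
    equation over number placeholders, and a symbolic system computes
    the answer, so arithmetic fidelity is guaranteed by the executor
    rather than the neural model."""
    left, op, right = equation.split()
    return OPS[op](numbers[left], numbers[right])

# The model only has to compose "N1 - N2"; execution is delegated.
print(execute("N1 - N2", {"N1": 42.0, "N2": 7.0}))  # 35.0
```

Under Answer Predicting, by contrast, the model itself must internalize what `execute` does here.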
Recent work also pays increasing attention to the execution capability. Systems such as TAPEX (Liu et al., 2022) and POET (Pi et al., 2022a) have been leveraging data synthesizing and intermediate pretraining to learn neural program executors, and have achieved state-of-the-art results over systems leveraging Logical Form Generation. This recent development shows the potential of neural systems with enhanced execution capability on the Numerical QA task.

Numbers: Tokenization vs. Replacement
We also investigate the impact of different ways of manipulating numbers. There are two mainstream existing methods to process and represent numbers, herein referred to as the Tokenization and Replacement methods. Tokenization methods such as WordPiece (Wu et al., 2016) and BPE (Sennrich et al., 2016), as adopted by existing Numerical QA systems, divide numbers into potentially multiple sub-word level tokens. E.g., the number 768 will be divided into the tokens 7 and 68 by T5's tokenizer. This approach stems from the fundamental fact that existing systems' vocabularies are finite, while the occurrences of numbers in a Numerical QA dataset can be too diverse to include in a finite vocabulary. Tokenization incurs extra representation cost and erases digit integrity by potentially introducing multiple tokens for a single number.
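The effect can be illustrated with a toy greedy longest-match tokenizer; the vocabulary below is made up for illustration and is not any real tokenizer's:

```python
def subword_tokenize(token: str, vocab: set) -> list:
    """Greedy longest-match subword splitting, a simplified stand-in
    for WordPiece/BPE: with a finite vocabulary, a number that is not
    stored as a whole falls apart into several pieces, erasing digit
    integrity."""
    pieces, i = [], 0
    while i < len(token):
        for j in range(len(token), i, -1):   # longest match first
            if token[i:j] in vocab:
                pieces.append(token[i:j])
                i = j
                break
        else:
            pieces.append(token[i])          # unknown single character
            i += 1
    return pieces

vocab = {"7", "68", "100"}                   # "768" itself is not stored
print(subword_tokenize("768", vocab))        # ['7', '68']
print(subword_tokenize("100", vocab))        # ['100']
```

The number 768 thus ends up represented by two tokens whose embeddings carry no guarantee of composing back to the value 768.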
Replacement substitutes numbers with special tokens in the input ([NUM1], [NUM2], etc.), which are later re-substituted with the original numbers in the output logical forms. This approach avoids multiple tokens by providing exactly one representation for each number, but has its own limitations in handling number diversity, since the recognition of numbers is usually performed with rule-based matching, which is often non-exhaustive.
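A minimal sketch of the Replacement round trip (mask numbers in the input, re-substitute placeholders in the predicted logical form); the regex and placeholder format are illustrative:

```python
import re

def mask_numbers(text: str):
    """Replace each number with a [NUMi] placeholder and remember the
    mapping, as Replacement-style systems (e.g., Graph2Tree) do."""
    mapping = {}
    def repl(m):
        key = f"[NUM{len(mapping) + 1}]"
        mapping[key] = m.group()
        return key
    return re.sub(r"\b\d+(?:\.\d+)?\b", repl, text), mapping

def unmask(logical_form: str, mapping: dict) -> str:
    """Re-substitute placeholders in the predicted logical form."""
    for key, value in mapping.items():
        logical_form = logical_form.replace(key, value)
    return logical_form

masked, mapping = mask_numbers("Tom has 42 apples and eats 7.")
print(masked)                              # Tom has [NUM1] apples and eats [NUM2].
print(unmask("[NUM1] - [NUM2]", mapping))  # 42 - 7
```

Note that the model only ever sees the placeholders, which is exactly why all value and format information is lost, as discussed below.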
In this paper, T5, BART, GPT2 and TagOps adopt Tokenization, while Graph2Tree adopts Replacement. We implement two variations of GPT2, GPT2-token and GPT2-replace, to compare their robustness against different perturbations on the ASDiv-a dataset. Results in Table 6 indicate that Replacement has an advantage when no perturbation is present or when the perturbation only involves changes in number surface form. However, when the perturbation changes number values, the Replacement-based system is more severely challenged.
We hypothesize that the Replacement method removes all numerical information, such as the format and value of the numbers in the problem, and thus loses numeracy capabilities: the system receives only textual signals such as number order or word frequency, which further encourages it to learn from spurious correlations, as stated in Patel et al. (2021). This hypothesis is consistent with the observations of a recent study (Thawani et al., 2021a) that investigates the mutual enhancement between numeracy and literacy.
The respective limitations of Tokenization and Replacement call for more numeracy-preserving number representation methods. Some studies have suggested changing number surface forms (Kim et al., 2021) or using dataset-agnostic representations (Sundararaman et al., 2020); however, they either create extra token loads or do not generalize well to large-scale real-world datasets. Numeracy-preserving number representation is another bottleneck for Numerical QA.

Conclusion
In this paper we aim at diagnosing the numerical capabilities of existing NLP systems. We list out a series of numerical capabilities and design corresponding dataset perturbations. Empirical results show that existing systems still lack numerical capabilities to a large extent, and that this lack cannot be eliminated in a trivial manner. Analysis of the empirical results, discussion of the existing practices, and insights for future directions of Numerical QA dataset collection and system design are also provided.

Limitations
Our pipeline has limitations in the following two aspects, which we plan to address in the future:

Dependency on ground truth equations. Currently, three of the eight DNC perturbations have a strong dependency on the ground truth solving equation, which is missing in datasets such as DROP. We hope to utilize semi-supervised approaches in the future to enlarge the coverage of the DNC perturbations.
Perturbing scalability. Currently our filters cover only a portion of the whole dataset, because DNC filters and perturbs questions based on manual rules and templates. We hope to develop more automatic filtering and perturbing in the future. Also, DNC can only apply perturbations to numbers provided by the problem, which limits its diagnosing power on questions where an unspecified number is used, e.g., when numerical commonsense knowledge is involved.

Ethical Statements
The model implementations and datasets utilized in this paper are based on publications and open-source repositories. License protocols are followed in the process of our experiments. No new datasets or NLP applications are presented in this paper, and no violation of privacy or usage of demographic information was involved in our process of interacting with the datasets. Our experiments do not involve large amounts of compute time/power, as reported in the paper.

A Formal Definition of Perturbations
We provide the formalized definition of the perturbations as follows.In all definitions, "⋆" denotes perturbed version.
Noise Perturbation. To apply the Noise Perturbation to a number n, a variable X is uniformly sampled on the interval (1, 10). Then the fractional part corresponding to X is added to the concerned number n, i.e.,

X ∼ U(1, 10),  n⋆ = n + X/10

Distribution Perturbation. The Distribution Perturbation changes the number distribution in the dataset by adding a normally distributed random variable X to the concerned number n, i.e.,

X ∼ N(µ_d, δ_d),  n⋆ = n + X

In this paper we adopt µ_d = 1000 and δ_d = 300.
Language Perturbation. The concerned number string n_s is replaced by the English words describing the same quantity, i.e.,

n⋆_s = Num2Words(n_s)

Type Perturbation. To apply the Type Perturbation, the concerned number is expected to be an integral number. The number string n_s is concatenated with an extra ".0" string to change the type of the concerned number from integer to float-point, i.e.,

n⋆_s = Concat(n_s, Stringfy(.0))

Verbosity Perturbation. The Verbosity Perturbation aims to introduce irrelevant numbers without changing the semantics of the problem. To perturb a number string n_s, we concatenate it with an irrelevant number X in parentheses, preceded by "not", i.e.,

X ∼ N(µ_v, δ_v),  n⋆_s = Concat(n_s, " (not ", X, ")")

In this paper we adopt µ_v = 100 and δ_v = 30.

Extra Perturbation. To apply the Extra Perturbation to a problem (B, P), an irrelevant sentence B_∆ containing numbers is sampled from the corpus and added to the body B, i.e.,

B_∆ = SampleOtherQs(),  B⋆ = B ⊕ B_∆

Logic Perturbation. To apply the Logic Perturbation to a problem (B, P), the prompt is altered to convert the problem logic used in the problem, i.e.,

P⋆ = ConvertLogic(P)

Order Perturbation. For the Order Perturbation, the sentence order in the problem body is manually altered in a manner that changes the order of number occurrence but not the problem logic, i.e.,

B⋆ = ChangeOrder(B)

The Filtering Conditions
The filtering conditions for the Perturbing Algorithms differ across perturbations. The perturbations can be divided into two major categories: 1) perturbations that do not change the solving equation or the final result (Language, Type, Verbosity, Extra, Order), and 2) perturbations that change the solving equation or the final result (Noise, Distribution, Operation).
For perturbations in category 1), there is no limitation on the perturbing process; thus all questions naturally pass the filtering condition.
For perturbations in category 2), the filtering conditions follow the principles of Unambiguity, Suitability, and Visibility.
Unambiguity. The filtered question should have an unambiguous mapping between the number to be perturbed and its location in the context. One example that violates this principle is when there are duplicated numbers in the problem body: it cannot then be determined which occurrence of the number affects the final result.
Suitability. The number to be perturbed should be suitable for the perturbation to be conducted. E.g., a floating-point number should not be the target of the Noise perturbation, which adds a fractional part to integer numbers. In DNC, the Noise and Type perturbations require the concerned number to be an integer, and the Operation perturbation requires the question to match a manually created template.
Visibility. The concerned number should occur in the problem text, since the perturbations can only be applied to known input numbers.
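As an illustration, the three principles can be checked with a simple filter over the surface numbers of a problem. This is a minimal sketch under our own assumptions (regex-based number extraction and a hypothetical `passes_filter` helper), not the filtering code used in the paper; it bundles the integer-only Suitability check of the Noise and Type perturbations.

```python
import re

# Matches integer and floating-point number literals in the problem text.
NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def passes_filter(body: str, target: str) -> bool:
    """Check Visibility, Unambiguity, and (Noise/Type) Suitability for a
    number `target` to be perturbed inside the problem body `body`."""
    nums = NUM_RE.findall(body)
    # Visibility: the concerned number must occur in the problem text.
    if target not in nums:
        return False
    # Unambiguity: a duplicated number makes the perturbation target ambiguous.
    if nums.count(target) > 1:
        return False
    # Suitability: Noise/Type require an integer literal (no fractional part).
    if "." in target:
        return False
    return True
```

A question passes only if every number targeted by the perturbation passes this check; otherwise it is dropped from the perturbed dataset.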

C Hyperparameters
In our exploratory experiments, it is observed that while hyperparameters such as the learning rate and batch size do affect the absolute performance of the models, they have a modest effect on the general trend of the models' strengths and weaknesses against the numerical perturbations. We hypothesize that this is because the numerical capabilities of a model are contributed mostly by the model architecture rather than the hyperparameters. For example, when the hyperparameters vary from the default setting (1e-5 for the learning rate, 32 for the batch size), the following results are observed: on Graph2Tree, the results of changing the learning rate and batch size are shown in Table 7, and the trend of the results under the varied hyperparameters aligns with the default result shown in Table 3.
Considering this observation, and the fact that the number of our experiments is large due to the combination of different models, datasets, DNC settings, and DNC perturbations, we chose one general setting to reduce the search space. We chose the setting as close as possible to the settings reported in the original papers of Graph2Tree and T5. We verified that this setting provides sufficiently good performance to demonstrate the performance gap corresponding to the perturbations, since our experiments focus on the performance of the same model checkpoint on the datasets before and after the perturbations.

D.1 Raw Performance and Relative Performance Drop
We provide the original results in Table 9 and the relative performance drop in Table 10.

D.2 Observation Calculation Details
The experiment results are presented in Table 11.
The values, the explanations of the observations, and the formulas used are provided in Table 12.

E DNC Experiments that Are Not Applicable
The following types of experiments are not applicable in the current DNC framework:

E.1 The Defense of the Logic Perturbation on ASDiv-a
The Logic perturbation requires the problem to be perturbed in a way that the logic is changed while the semantics of the problem remain cohesive. This requirement poses a challenge to the scalability of the perturbation. For the Attack setting, we utilized manually annotated labels. However, under the Defense setting the perturbations are expected to automatically augment the dataset. Thus, the Defense setting results of the Logic perturbation on ASDiv-a are not applicable.
E.2 Noise and Type Perturbations on DROP-num and TATQA-a
DROP-num and TATQA-a do not provide supervision of the operand origins; therefore a mapping from the operands in the equation to the quantities in the context cannot be built, which makes the Noise and Distribution perturbations not applicable on the DROP-num and TATQA-a datasets.
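To make the missing prerequisite concrete, the operand-origin mapping these perturbations rely on can be sketched as follows. This is a minimal sketch under our own assumptions (the `map_operands` helper is hypothetical, and numbers are compared by surface string): when an equation operand matches zero or multiple context quantities, no unambiguous mapping exists and the perturbation cannot locate the number to change.

```python
def map_operands(equation_numbers, context_numbers):
    """Map each equation operand to the index of its unique occurrence
    among the quantities in the context; return None if any operand is
    missing or ambiguous (the perturbation is then not applicable)."""
    mapping = {}
    for op in equation_numbers:
        positions = [i for i, c in enumerate(context_numbers) if c == op]
        if len(positions) != 1:
            return None  # not visible, or duplicated -> ambiguous
        mapping[op] = positions[0]
    return mapping
```

Datasets like ASDiv-a supervise this alignment directly, whereas DROP-num and TATQA-a provide only the final answer, so the mapping cannot be recovered reliably.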

E.3 Logic Perturbation on DROP-num
DROP-num does not provide ground-truth reasoning steps or logical forms; thus the Logic perturbation, which depends on such supervision, is not applicable on DROP-num.

E.4 Order Perturbation on DROP-num
DROP-num is a reasoning dataset based on real-world paragraphs that usually carry logical or temporal order information. The Order perturbation breaks the semantics of such a paragraph and would also confuse humans. Thus the Order perturbation is not valid on DROP-num and its results are not applicable.

Figure 1 :
Figure 1: Overview of the DNC Framework. The process of Numerical QA solving is divided into two logical stages. Four capabilities are required to complete the stages, each mapping to two perturbations. Perturbations can be applied to the appropriate train / validation / test splits of Numerical QA datasets under the Attack or Defense setting. Models of the NLP systems are trained and then evaluated on the perturbed datasets as a diagnosis of their numerical capabilities.
Language
Original: A mailman has to give out 192 pieces of junk mail. If he goes to 4 blocks, how many pieces of junk mail should he give each block?
Perturbed: A mailman has to give out one hundred and ninety-two pieces of junk mail. If he goes to four blocks, how many pieces of junk mail should he give each block?

Type
Original: There were 105 parents in the program and 698 pupils, too. How many people were present in the program?
Perturbed: There were 105.0 parents in the program and 698.0 pupils, too. How many people were present in the program?

Noise
Original: Tony had $20. He paid $8 for a ticket to a baseball game. At the game, he bought a hot dog for $3. What amount of money did Tony have then?
Perturbed: Tony had $20.2. He paid $8.5 for a ticket to a baseball game. At the game, he bought a hot dog for $3.5. What amount of money did Tony have then?

Verbosity
Original: The roller coaster at the state fair costs 6 tickets per ride. If 8 friends were going to ride the roller coaster, how many tickets would they need?
Perturbed: The roller coaster at the state fair costs 6 (not 30) tickets per ride. If 8 (not 119) friends were going to ride the roller coaster, how many tickets would they need?

Logic
Original: Jack received 8 emails in the morning and 2 emails in the afternoon. How many emails did Jack receive in the day?
Perturbed: Jack received 8 emails in the morning and 2 emails in the afternoon. How many more emails did Jack receive in the morning than in the afternoon?

Order
Original: A DVD book holds 126 DVDs. There are 81 DVDs already in the book. How many more DVDs can be put in the book?
Perturbed: There are 81 DVDs already in the book. A DVD book holds 126 DVDs. How many more DVDs can be put in the book?
Prediction on Original: 126 - 81 ✓ Prediction on Perturbed: 81 - 126 × Expected: 126 - 81

Table 1: Examples of DNC Perturbations and Corresponding Predictions by T5. For each perturbation, an example original and perturbed problem pair is shown. The rightmost column shows error cases where T5 generates the correct equation on the original problem but fails on the perturbed one. The ground-truth equation of the perturbed problem is provided after "Expected".

Table 2 :
The Comparison between the Two Settings in DNC. Perturbations (denoted by "⋆") are applied to different dataset splits (train / val / test) under each setting.
Between the two DNC goals, Semantic Parsing poses the more severe challenge, averaging a 19.66% absolute drop and a 31.79% relative drop.

Table 4 :
The Statistics of the Datasets Used.

Table 5 :
Comparing Models with Different Prediction Targets on ASDiv-a. For a model M, M_eq / M_ans predicts the equation / the equation and the answer, respectively. Acc_eq and Acc_ans stand for the denotation accuracy of the generated equation and the accuracy of the directly predicted answer, respectively.
Columns: Model, Perturbation, Acc_eq, Acc_ans, ∆Acc_eq, ∆Acc_ans.

Table 6 :
The Results of Tokenization and Replacement on GPT2. GPT2_token adopts the Tokenization method and GPT2_replace adopts the Replacement method.

Table 7 :
The Results of the Attack Acc_eq for Graph2Tree on ASDiv-a.

Similar behavior can also be observed on large transformer-based models such as T5, as shown in Table 8:

Table 8 :
The Results of the Attack Acc_eq for T5 on ASDiv-a.