Numerical reasoning in machine reading comprehension tasks: are we there yet?

Numerical-reasoning machine reading comprehension is a task that combines reading comprehension with numerical operations such as addition, subtraction, sorting, and counting. The DROP benchmark (Dua et al., 2019) is a recent dataset that has inspired the design of NLP models aimed at solving this task. The current standings of these models on the DROP leaderboard, measured with standard metrics, suggest that the models have achieved near-human performance. However, does this mean that these models have learned to reason? In this paper, we present a controlled study of some of the top-performing model architectures for the task of numerical reasoning. Our observations suggest that the standard metrics are incapable of measuring progress on such tasks.


Introduction
Machine reading comprehension (MRC) primarily involves building automated models that can answer arbitrary natural language questions over a given textual context, such as a paragraph. Solving this problem should in principle require a fine-grained understanding of the question and comprehension of the textual context in order to arrive at the correct answer. Designing an MRC benchmark is challenging: it is easy to inadvertently craft questions that allow models to exploit spurious cues and bypass the intended reasoning.
Recent advances in NLP, currently dominated by transformer-based pre-trained models, have produced models that, when measured with standard metrics, show human-like performance on a variety of MRC benchmark leaderboards (e.g., https://leaderboard.allenai.org/). In this paper, we focus on numerical-reasoning MRC and investigate DROP (Dua et al., 2019), a recent benchmark designed to measure complex multi-hop and discrete reasoning, including numerical reasoning. In contrast to single-span extraction tasks, DROP allows sets of spans, numbers, and dates as possible answers. We are particularly interested in numerical questions, e.g., 'How many years did it take for the population to decrease to about 1100 from 10000?', which requires extracting the years associated with the given populations from the passage, followed by computing the difference in years. The benchmark has inspired the design of specialized BERT- and embedding-based NLP models aimed at solving this task, seemingly achieving near-human performance (evaluated using F1 scores) as reported on the DROP leaderboard.
In this work, we investigate some of DROP's top-performing models in order to understand the extent to which they are capable of performing numerical reasoning, as opposed to relying on spurious cues. We probe the models with a variety of perturbation techniques to assess how well they understand the question, and to what extent they base their answers on the textual evidence. We show that the top-performing models can accurately answer a significant portion of the samples (35%-61% F1, depending on the model) even with completely garbled questions. We further observe that, for a large portion of comparison-style questions, these models can answer accurately without even having access to the relevant textual context. These observations call into question an evaluation paradigm that relies only on standard quantitative measures such as F1 scores and accuracy. Rankings on leaderboards can foster a false belief that NLP models have achieved human parity on such complex tasks. We advocate that the community move towards more comprehensive analyses, especially for leaderboards and for measuring progress.

Dataset and Models
In this section, we briefly highlight the dataset and models under consideration.
Dataset DROP (Dua et al., 2019) includes questions with various types of reasoning, of which we are interested here in numerical reasoning. We filter the provided dev set to include only questions that require numerical reasoning: we first include all answers of type number, and then augment with comparison questions, filtered heuristically based on whether the question contains a comparative adjective or a comparative adverb. This yields 5,850 number questions and 998 comparison questions. We call this set numset and use it as the basis for all experiments in this paper (all variations will be publicly released for reproducibility).
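The filtering step can be sketched as follows. This is a simplified illustration: the cue-word list and the field names (`answer_type`, `question`) are our hypothetical stand-ins, and the paper's actual heuristic tests for comparative adjectives and adverbs (e.g., via POS tags) rather than a fixed word list.

```python
# Hypothetical cue words standing in for a POS-based test for
# comparative adjectives (JJR) and adverbs (RBR).
COMPARATIVE_CUES = {
    "more", "less", "fewer", "larger", "smaller", "longer",
    "shorter", "higher", "lower", "earlier", "later",
}

def is_comparison_question(question: str) -> bool:
    """Heuristically detect comparison-style questions."""
    tokens = [t.strip("?:,.;") for t in question.lower().split()]
    return any(token in COMPARATIVE_CUES for token in tokens)

def build_numset(samples):
    """Keep number-typed answers, then add heuristically detected comparisons."""
    return [
        s for s in samples
        if s["answer_type"] == "number" or is_comparison_question(s["question"])
    ]
```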

Models
We include all publicly available models that appear in the DROP leaderboard (https://leaderboard.allenai.org/drop/submissions/public): NAQANet (Dua et al., 2019), MTMSN (Hu et al., 2019), NeRd (Chen et al., 2020), GenBERT (Geva et al., 2020), TASE (Segal et al., 2020) and NumNet+ (Ran et al., 2019), in addition to a simple logistic-regression bag-of-words model to ground our results. With the exception of NumNet+, which we trained on our local machines from the published codebase, we use the model checkpoints provided by the corresponding authors. All of the included models are based on the transformer architecture (Vaswani et al., 2017). They vary, however, in how they tackle the task: NeRd generates a program in a domain-specific language; GenBERT augments the language-model pre-training procedure with two additional stages, pre-training on numerical data and pre-training on numeric textual data; the rest rely on specialized modules for each of the different question types. The counting module frames the task as multi-class classification over the numbers 0-9, whilst the arithmetic module assigns a sign (or zero) to each number in the passage and sums the result. Finally, the models differ in their encoders: NAQANet is based on GloVe embeddings (Pennington et al., 2014); MTMSN, NeRd and GenBERT use uncased BERT (Devlin et al., 2019) (the Large variant for the first two, whereas GenBERT is only available as Base); and TASE and NumNet+ use RoBERTa-Large. We note that, while some of the models on the leaderboard obtain F1 scores at human parity, these models are not public. However, judging from their descriptions, they are markedly similar to the models evaluated in this paper, and we believe that our observations hold for them as well.
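The two specialized modules described above can be illustrated with a minimal sketch. The function names and the use of plain Python lists are our own simplification; the real modules operate on encoder representations and learned logits.

```python
def counting_module(logits):
    """Multi-class classification over the counts 0-9: pick the argmax class."""
    assert len(logits) == 10
    return max(range(10), key=lambda i: logits[i])

def arithmetic_module(passage_numbers, signs):
    """Assign each passage number a sign in {-1, 0, +1} and sum the result."""
    assert len(passage_numbers) == len(signs)
    return sum(sign * number for sign, number in zip(signs, passage_numbers))
```

For the question from the introduction, assigning +1 to 10000, -1 to 1100, and 0 to every other passage number yields the answer 8900.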
Table 1 shows the performance of the models on the dev set, on numset, and on the test set, with test scores as reported on the leaderboard. Note that the scores in the first two columns, and those reported in the rest of the paper, consider only the main annotator's answer as gold, whereas the official evaluation script considers all annotations. This is done to clearly track changes in output after input perturbations.

Evaluating question understanding
Evaluating whether a model understands questions is a non-trivial task. In this paper, we probe the models in the following two ways: evaluating them on permuted questions, and investigating their affinity to the question class.
Question permutation experiment Inspired by recent observations in Pham (2020), we perturb numset by shuffling the words of the question in each sample. We create three randomly permuted sets that differ in the n-gram permutation: 1-gram shuffle refers to shuffling all words, while 2-gram and 3-gram refer to shuffling the ordered 2-grams and 3-grams of the question. For the question mentioned in the introduction, possible shuffles for each of the {3,2,1}-grams are:
3-gram: 10000 about 1100 from did it take to decrease to How many years for the population?
2-gram: population to it take How many about 1100 from 10000 for the years did decrease to?
1-gram: 10000 for from years to decrease it population about 1100 did many to take How the?
As the random permutations may distort the semantics (and destroy the syntax) of the question, we expect the predictions of the models to be severely impacted, with results approaching random chance. However, our experiments reveal (Figure 1) that this is not the case. The general trend suggests that 3-gram permutations do not degrade performance severely. While 1-gram permutation does degrade performance, the models still predict a significant portion of questions correctly (>35% F1). Some models, such as NAQANet, are barely affected. We remark that most models based on BERT or bag-of-embeddings seem to be generally more robust to permutation than RoBERTa-based models.
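The n-gram shuffle can be reproduced in a few lines. This is a sketch under one assumption: the paper does not specify how a trailing chunk shorter than n is handled, so here it simply forms its own chunk.

```python
import random

def ngram_shuffle(question: str, n: int, seed: int = 0) -> str:
    """Split the question into consecutive n-grams and shuffle their order.
    n=1 shuffles all words; n=2 and n=3 shuffle ordered 2- and 3-grams."""
    tokens = question.split()
    chunks = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    random.Random(seed).shuffle(chunks)
    return " ".join(token for chunk in chunks for token in chunk)
```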
In Table 2 we analyze the effect of the number of numerical attributes in the passage on the models' ability to make correct predictions. We bin questions into quartiles, such that each bin contains the same number of questions: the first bin contains passages with at most 12 numerical attributes, the second between 13 and 18, the third between 19 and 23, and the last bin contains passages with more than 23 numerical attributes. We observe no clear association between the performance of the models and the number of numerical attributes in a passage. Together, these results indicate that the models are not sensitive to word order, which can potentially impact their utility.
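The binning above can be sketched directly from the reported boundaries (a simplification: in practice ties at the boundaries make the bins only approximately equal in size):

```python
def quartile_bin(num_attributes: int) -> int:
    """Map a passage's numerical-attribute count to its quartile bin,
    using the boundaries reported in the paper (12, 18, 23)."""
    if num_attributes <= 12:
        return 0
    if num_attributes <= 18:
        return 1
    if num_attributes <= 23:
        return 2
    return 3
```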
Affinity to the class of questions We further probe whether the models can answer questions by relying only on the class of the question. As questions in DROP follow certain patterns, the question type can often be inferred from the first few words. In this experiment, we make only the first few words of the question available to the models. This typically provides insufficient detail and should make it difficult for the models to arrive at the correct answer (the average question length in numset is 12 words). We evaluate three settings: passing the first two words, the first three words, or the first five words as the question. Below is an example of each for the question mentioned earlier:
2 Words: How many?
3 Words: How many years?
5 Words: How many years did it?
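Constructing the partial questions is straightforward; a minimal sketch, where we append '?' to keep the input well-formed, matching the examples above:

```python
def truncate_question(question: str, k: int) -> str:
    """Keep only the first k words of the question."""
    return " ".join(question.split()[:k]) + "?"
```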
Figure 2 shows the performance of the models on partial questions. With only five words, most models still maintain a third of their correct predictions, and with only the first trigram of the question they obtain an F1 above 11.4% (NeRd obtains 15.42%). This further demonstrates the models' affinity for answering questions by exploiting the mere presence of a few words in the question.
Unlike the permutation experiment, breaking down performance here by the number of numerical attributes in the passage shows a steady decline as the number of numbers in a passage grows (see Table 3), suggesting that an answer is more likely to be correct when the space of possible arithmetic expressions is smaller.

Evaluating passage comprehension
We now examine whether the models comprehend the passage and base their predictions on the evidence it provides. We probe with the following three settings:

Random Passage We pair each question with a randomly assigned passage from numset.
Dummy Passage We create an uninformative passage that contains no numbers, as a proxy for a blank passage (which the models are unable to process). It is the sequence: 'This is a sentence.'
Fixed Passage We pair all questions with an unseen passage from the hidden test set. This passage has similar properties to the passages in the train and dev sets, but is irrelevant to the corresponding question.
Figure 3 shows the results of these three settings (missing bars mean that the model failed to run). We observe a general trend where the models are able to correctly answer a significant portion of the questions without even having access to the relevant context (F1 between 10% and 17%). On further examination, we observed that span-type questions cover the majority of the correctly predicted answers in this setting, indicating that comparison-type questions may carry inherent biases that models exploit, such as questions of the form 'which group is larger: households or people?', or that models pick up on the structure of the questions and learn to predict one of the two entities. Table 4 shows the number of correctly predicted comparison questions in numset and the percentage of these that are still predicted correctly when the context is taken away. Most worryingly, we observe that NAQANet and NeRd maintain almost all of their predictions across the different settings and do not seem to take the textual context into account. NumNet+ is the only model whose performance degrades drastically; we hypothesize that this is due to its GNN reasoning module (with each number in the passage appearing as a node) informing its decision. GenBERT exhibits a curious behavior: a significant drop in performance in the Dummy and Fixed Passage settings, while maintaining >56% of its predictions in the Random Passage setting. We postulate that this could be due to similarities among some passages in the dev set.
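The three passage-substitution settings can be sketched as a single helper; the field names and sampling details are our assumptions.

```python
import random

DUMMY_PASSAGE = "This is a sentence."  # uninformative: contains no numbers

def substitute_passages(samples, mode, fixed_passage=None, seed=0):
    """Return a copy of the samples with each passage replaced per the mode:
    'random' draws a passage from numset, 'dummy' uses the uninformative
    sentence, and 'fixed' uses one unseen passage for every question."""
    rng = random.Random(seed)
    pool = [s["passage"] for s in samples]
    perturbed = []
    for sample in samples:
        if mode == "random":
            passage = rng.choice(pool)
        elif mode == "dummy":
            passage = DUMMY_PASSAGE
        elif mode == "fixed":
            passage = fixed_passage
        else:
            raise ValueError(f"unknown mode: {mode}")
        perturbed.append({**sample, "passage": passage})
    return perturbed
```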

Table 4: Percentage of questions correctly predicted after replacing the passage with an unrelated one, on comparison-type questions.

Related work

Recent work has shown the effects of word permutation on the performance of BERT-based models in NLU tasks (Sinha et al., 2021a,b; Gupta et al., 2021). Chiefly, one of these works analyzed the effect of word order on six binary-classification GLUE tasks and demonstrated the limitations of BERT-based models. In our work, we investigate this in the context of numerical reasoning and observe similar behavior, but more generally in transformer-based models. Investigating NLP models in the context of NLU has been the focus of several recent works. We briefly highlight the prominent ones: Jia and Liang (2017) show that span-extraction RC models can be fooled by adding an adversarial sentence to the end of the passage; McCoy et al. (2019) identify superficial heuristics that NLI models exploit instead of deeply understanding the language; and Ravichander et al. (2019) evaluate their quantitative-reasoning NLI benchmark on SoTA models and find that they perform similarly to a majority-class baseline. Rozen et al. (2019) find that a BERT-based NLI model fails to generalize to unseen number ranges in an adversarial dataset measuring numerical reasoning, suggesting an inherent model weakness.
Contrast Sets (Gardner et al., 2020) and Semantics Altering Modifications (SAMs) (Schlegel et al., 2020) are two works that introduce changes to MRC benchmarks to better understand the decision boundaries of models, and both include a subset of the DROP dataset. In the former, the benchmark's authors manually modify questions to include more compositional reasoning steps, or change their semantics, to create a Contrast Set. In the latter, the authors introduce an automatic way of generating SAMs, which alter the semantics of a sentence while preserving most of its lexical surface form. In our work, we take the inverse approach: we alter the surface form of a question so that it no longer carries the meaning of the original question.

Discussion and conclusion
In this work, we closely examined some of the top-performing models for numerical reasoning on DROP. Our study suggests that the models are not necessarily arriving at the correct answer by reasoning about the question and the content of the passage. Both the question-understanding and the passage-comprehension experiments reveal serious holes in how the models arrive at correct answers. We hypothesize that the models have picked up on spurious patterns in the benchmark, rather than solving the task. Possible sources of bias include: patterns in the format of passages and questions, where passages describe either the outcome of an American football match, the census of a certain location, or some historical event (Gardner et al., 2020), resulting in redundancies in passage structure and patterns in their content; and the answer frequency distribution, with the top 5 answers shared between the train and dev splits and covering almost 20% of the data. In fact, in the affinity-to-question-class experiments we found a vast disparity in performance between questions with the most frequently occurring answers and the rest. For NeRd, for example, the EM for questions truncated to 2 words is 17.75% when the answer is among the top-10 most frequent answers, versus 4.2% for the rest. The disparity narrows with more words: 21.5% vs. 11.90% with 3 words, and 30.81% vs. 22.23% with 5 words.
Benchmark leaderboards as they stand can be misleading, incentivizing models to improve reported scores without solving the underlying task. We strongly advocate for better methods of assessing the capability of models for numerical reasoning. One such direction is akin to Linzen (2020), who proposes a parallel evaluation paradigm that rewards models for human-like generalization capabilities, and Liu et al. (2021), who augment current leaderboards with three extra dimensions: interpretability, interactivity, and reliability. We highly recommend careful design of benchmarks and better leaderboards to correctly measure progress on such complex tasks.

A Random Baselines
To ground the models' results, we evaluate two random baselines on numset. The first randomly samples a pair of numbers from the passage and outputs their absolute difference (since subtraction is the most prevalent operation), achieving an EM of 1.8%. The second samples a final answer proportionally to the frequency of answers in the training set, achieving an EM of 2.46%.
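The two baselines can be sketched as follows; the data-access details (plain number and answer lists) are our assumption.

```python
import random

def difference_baseline(passage_numbers, rng=None):
    """Sample two passage numbers and answer with their absolute difference."""
    rng = rng or random.Random(0)
    a, b = rng.sample(passage_numbers, 2)
    return abs(a - b)

def frequency_baseline(train_answers, rng=None):
    """Sample an answer proportionally to its training-set frequency:
    drawing uniformly from the raw, non-deduplicated answer list does this."""
    rng = rng or random.Random(0)
    return rng.choice(train_answers)
```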