OPERA: Operation-Pivoted Discrete Reasoning over Text

Machine reading comprehension (MRC) that requires discrete reasoning involving symbolic operations, e.g., addition, sorting, and counting, is a challenging task. Given this nature, semantic-parsing-based methods predict interpretable but complex logical forms. However, logical form generation is nontrivial, and even a small perturbation in a logical form will lead to a wrong answer. To alleviate this issue, multi-predictor-based methods have been proposed to directly predict different types of answers, achieving improvements. However, they ignore symbolic operations and therefore lack reasoning ability and interpretability. To inherit the advantages of both types of methods, we propose OPERA, an operation-pivoted discrete reasoning framework, in which lightweight symbolic operations (compared with logical forms) are utilized as neural modules to facilitate reasoning ability and interpretability. Specifically, operations are first selected and then softly executed to simulate the answer reasoning procedure. Extensive experiments on both the DROP and RACENum datasets show the reasoning ability of OPERA, and further analysis verifies its interpretability.


Introduction
Machine reading comprehension (MRC) that requires discrete reasoning is a valuable and challenging task (Dua et al., 2019), especially when it involves symbolic operations such as addition, sorting, and counting. The example in Table 1 illustrates the task. To answer the question "Who threw the longest pass", a model needs to find all the people and their corresponding distance pairs in the passage and then choose the person with the longest pass. This task has various application scenarios in the real world, such as analyzing financial reports and sports news.

Passage: Houston would tie the game in the second quarter with kicker Kris Brown getting a 53-yard and a 24-yard field goal. Oakland would take the lead in the third quarter with wide receiver Johnnie Lee Higgins catching a 29-yard touchdown pass from Russell, followed up by an 80-yard punt return...
Question: Who threw the longest pass? Answer: Russell

Table 1: An example of a question-answer pair along with a passage from the DROP dataset (Dua et al., 2019). Question words in color indicate the potential operations for reasoning, i.e., ARGMAX and KEY_VALUE.
Existing approaches for this task can be roughly divided into two categories: semantic-parsing-based methods (Chen et al., 2020b; Gupta et al., 2020) and multi-predictor-based methods (Dua et al., 2019; Ran et al., 2019; Hu et al., 2019; Chen et al., 2020a; Zhou et al., 2021). The former maps natural language utterances into logical forms and then executes them to derive answers. For example, Chen et al. (2020b) propose NeRd, which includes a reader that encodes the passage and question, and a programmer that generates a logical form for multi-step reasoning. Intuitively, this method has an advantage in interpretability. However, semantic parsing over text is nontrivial, and even a small perturbation will result in a wrong answer, which hinders MRC performance.
To alleviate the heavy dependence on logical forms in the first category, the latter directly employs multiple predictors to derive different types of answers. For example, Dua et al. (2019) and Chen et al. (2020a) divide instances of the DROP dataset into several types and design models with multiple predictors to handle different answer types, i.e., question/passage span(s), count, and arithmetic expression. It is this capability of deriving different types of answers that improves model performance. However, such methods lack the components necessary to imitate discrete reasoning, which leads to inadequate reasoning ability and interpretability.
To alleviate the shortcomings of the above methods while preserving their advantages, we summarize reasoning steps into a set of operations and adopt them as a pivot connecting the question and the answer, which makes it possible to perform discrete reasoning. For example, answering the question in Table 1 takes two steps: (1) finding all persons and the corresponding distances of touchdown passes, and (2) choosing the one with the longest pass among them. We convert these steps into two operations, KEY_VALUE and ARGMAX, respectively, and use them to produce the answer. Specifically, we design a set of lightweight symbolic operations (compared with logical forms) that covers all of the questions in the datasets and utilize them as neural modules to facilitate reasoning ability and interpretability. We denote this method OPERA, an operation-pivoted discrete reasoning MRC framework. To utilize the operations, we propose an operation-pivoted reasoning mechanism composed of an operation selector and an operation executor. The operation selector automatically identifies relevant operations based on the input; to enhance this sub-mechanism, we further design an auxiliary task that learns the alignment from a question to operations according to a set of heuristic rules. The operation executor softly integrates the selected operations to perform discrete reasoning over text via an attention mechanism (Vaswani et al., 2017).
To verify the effectiveness of the proposed method, comprehensive experiments are conducted on both the DROP and RACENum datasets, where RACENum is a subset of the RACE dataset following Chen et al. (2020a). Experimental results indicate that our method outperforms strong baselines and achieves the state-of-the-art on both datasets under the single-model setting. We further analyze the interpretability of OPERA. Overall, this paper primarily makes the following contributions: (1) We propose OPERA, an operation-pivoted discrete reasoning MRC framework, improving both reasoning ability and interpretability.
(2) Extensive experiments on the DROP and RACENum datasets demonstrate the reasoning ability of OPERA. Moreover, statistical analysis and visualization indicate the interpretability of OPERA.
(3) We systematically design operations and heuristic rules to map questions to operations, aiming to facilitate research on symbolic reasoning.

Related Work
Recently, machine reading comprehension (MRC) methods have tended to tackle more practical problems (Yang et al., 2018; Dua et al., 2019; Zhao et al., 2021), for example, answering complex questions that require discrete reasoning (Dua et al., 2019) such as arithmetic computing, sorting, and counting. Intuitively, semantic-parsing-based methods, which are well explored for discrete reasoning in question answering over structured knowledge graphs (Bao et al., 2016) or tables, have the potential to address the discrete reasoning MRC problem. Accordingly, semantic-parsing-based methods for discrete reasoning over text were proposed that first convert the unstructured text into a table and then answer questions over the structured table with a grammar-constrained semantic parser (Krishnamurthy et al., 2017). NeRd (Chen et al., 2020b) is a generative model consisting of a reader and a programmer, responsible for encoding the context into vector representations and generating grammar-constrained logical forms, respectively. NMNs (Gupta et al., 2020) learn to parse compositional questions into executable logical forms; however, they only adapt to the limited question types matched by a few pre-defined templates.
Multi-predictor-based methods employ multiple predictors to derive different types of answers. NAQANET (Dua et al., 2019), a number-aware framework, employs multiple predictors to produce the corresponding answer types, including span, count, and arithmetic expression. Based on NAQANET, MTMSN (Hu et al., 2019) adds a negation predictor to handle negation questions and re-ranks arithmetic expression candidates. To aggregate the relative magnitude relations between pairs of numbers, NumNet (Ran et al., 2019) and NumNet+ leverage a graph convolutional network to perform multi-step reasoning over a number graph. QDGAT (Chen et al., 2020a) proposes a question-directed graph attention network for reasoning over a heterogeneous graph composed of entity and number nodes. EviDR (Zhou et al., 2021), an evidence-emphasized MRC method, performs reasoning over a multi-grained evidence graph. Compared with these existing methods, our proposed OPERA focuses on bridging the gap from questions to answers with operations and integrating them to simulate discrete reasoning.

Task and Model Overview
Given a question Q and a passage P, MRC that requires discrete reasoning aims to predict an answer Â with the maximum probability over the candidate space Ω as follows:

Â = argmax_{A ∈ Ω} p(A|Q, P),   (1)

where the answer Â in this task can be not only span(s) extracted from the context but also a number calculated from numbers in the context. Handling this task generally requires not only natural language understanding but also discrete reasoning over text, such as comparison, sorting, and arithmetic computing.
To address the aforementioned challenges, we propose OPERA, an operation-pivoted discrete reasoning MRC framework, briefly illustrated in Figure 1. In our framework, a set of operations OP, defined in Table 2, is introduced to support the modeling of the answer probability p(A|Q, P) as follows:

p(A|Q, P) = Σ_{O ∈ OP} p(A|Q, P, O) p(O|Q, P),   (2)

where O ∈ OP represents one of the operations. Concretely, we first design an operation selector p(O|Q, P) for choosing the correct question-related operations. The selected operations are then softly executed over the given context. Eventually, the answer predictor p(A|Q, P, O) utilizes the execution result to predict the final answer.
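The marginalization in Eq. 2 can be sketched in a few lines of code. This is an illustrative toy, not the OPERA implementation; the function names and the uniform selector are assumptions for the example.

```python
# Sketch of Eq. (2): the answer probability is marginalized over operations.
# p_answer_given_op and p_op stand in for the learned predictor and selector.

def p_answer(answer, question, passage, operations, p_answer_given_op, p_op):
    """p(A|Q,P) = sum over O of p(A|Q,P,O) * p(O|Q,P)."""
    return sum(
        p_answer_given_op(answer, question, passage, op) * p_op(op, question, passage)
        for op in operations
    )

# Toy usage: two operations with a uniform selector.
ops = ["ARGMAX", "KEY_VALUE"]
p_sel = lambda op, q, p: 0.5
p_ans = lambda a, q, p, op: 0.9 if op == "ARGMAX" else 0.1
print(p_answer("Russell", "Who threw the longest pass?", "...", ops, p_ans, p_sel))
# 0.5*0.9 + 0.5*0.1 = 0.5
```

In the model itself, the soft execution makes this marginalization differentiable end to end instead of enumerating discrete answers.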

Definition of Operations
To imitate discrete reasoning, we design a set of operations OP as shown in Table 2. The set contains 11 operations, each representing a reasoning unit. Specifically, for questions that need to be answered by calculation, we design the operations ADDITION/DIFF to represent addition and subtraction. For questions that need to be answered by counting or sorting, we design the operations COUNT, MAX/MIN, ARGMAX/ARGMIN, and ARGMORE/ARGLESS. The remaining operations, KEY_VALUE and SPAN, are used to extract spans from the question and the passage. To incorporate them into OPERA, each operation is denoted as a tuple. Formally, the i-th operation is OP_i = ⟨E^OP_i, f_i(·)⟩, where i ∈ {1, 2, ..., n} and n is the number of operations. E^OP_i ∈ R^{d_h} represents the learnable embedding of the i-th operation, and f_i(·) is a neural executor parameterized with trainable matrices W^OP_{q,i}, W^OP_{k,i}, and W^OP_{v,i} ∈ R^{d_h × d_h}. The neural executor f_i(·) performs the execution of OP_i on the given context: it takes the representation of the context as input and outputs the executed representation m^OP_i (§ 3.3.2).
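The tuple structure of an operation can be captured with a small registry. This is an illustrative sketch, not the authors' code: the embeddings are plain lists and the executor is a stub that a cross-attention module would replace in the real model.

```python
# Each operation is a tuple <E^OP_i, f_i(.)>: a learnable embedding plus a
# neural executor mapping the context representation to m^OP_i.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Operation:
    name: str
    embedding: List[float]   # E^OP_i in R^{d_h}
    executor: Callable       # f_i(.): context representation -> m^OP_i

OP_NAMES = ["ADDITION", "DIFF", "COUNT", "MAX", "MIN",
            "ARGMAX", "ARGMIN", "ARGMORE", "ARGLESS",
            "KEY_VALUE", "SPAN"]

d_h = 4  # toy hidden size
OPS = [Operation(n, [0.0] * d_h, lambda H: H) for n in OP_NAMES]
assert len(OPS) == 11  # the paper defines 11 operations
```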

Context Encoder
The context encoder aims to learn the contextual representation of the input. Formally, given a question Q and a passage P, we concatenate them into a sequence and feed it into a pre-trained language model (Liu et al., 2019; Clark et al., 2020; Lan et al., 2020) to obtain the whole representation H ∈ R^{l × d_h}. We then split H into the question and passage representations, denoted H^Q ∈ R^{l_q × d_h} and H^P ∈ R^{l_p × d_h}, respectively. l_q, l_p, and l are the numbers of tokens in the question, the passage, and their concatenation, and d_h is the dimension of the representations.
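The concatenate-encode-split step can be sketched as follows. The "encoder" here is a stub standing in for the pre-trained language model; names and the toy dimension are assumptions.

```python
# After encoding the concatenated [question; passage] sequence into H (a list
# of l = l_q + l_p token vectors), the first l_q rows form H^Q and the
# remaining l_p rows form H^P.

def encode_and_split(q_tokens, p_tokens, encoder):
    tokens = q_tokens + p_tokens          # concatenation of length l_q + l_p
    H = encoder(tokens)                   # H in R^{l x d_h}
    l_q = len(q_tokens)
    return H[:l_q], H[l_q:]               # H^Q, H^P

stub_encoder = lambda toks: [[float(i)] for i in range(len(toks))]  # d_h = 1
H_Q, H_P = encode_and_split(["who", "threw"], ["Houston", "would", "tie"], stub_encoder)
assert len(H_Q) == 2 and len(H_P) == 3
```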

Operation-pivoted Discrete Reasoning
The operation-pivoted reasoning module is composed of an operation selector and an operation executor. The operation selector selects operations related to the given question, and the operation executor imitates the execution of the selected operations with an attention mechanism.
Operation Selector To imitate discrete reasoning, existing methods usually adopt a logical form generated by a semantic parser. However, these methods suffer severely from cascade errors, where a small perturbation in the logical form may result in a wrong answer. Therefore, we propose to map each question into an operation set instead of a logical form; namely, we select relevant operations from OP. To this end, we adopt a bilinear function to compute the similarity between each operation and the question and normalize the scores with a softmax:

p(O|Q, P) = softmax(E^OP W h^Q),   (3)

where E^OP ∈ R^{n × d_h} is a learnable parameter denoting the operation embedding matrix, h^Q ∈ R^{d_h} is the representation of the question obtained by weighted pooling over H^Q, W ∈ R^{d_h × d_h} is a parameter matrix, and p(O|Q, P) is the distribution over operations.
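A minimal pure-Python sketch of the bilinear selector follows; the matrices and the toy dimensions are assumptions for illustration, not learned parameters.

```python
# Operation selector: bilinear score E^OP W h^Q per operation, normalized
# with a softmax into p(O|Q,P).
import math

def matvec(W, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in W]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def select_operations(E_op, W, h_q):
    """p(O|Q,P) = softmax(E^OP W h^Q); rows of E_op are operation embeddings."""
    Wh = matvec(W, h_q)                                  # W h^Q
    scores = [sum(e_j * w_j for e_j, w_j in zip(e, Wh))  # bilinear score per op
              for e in E_op]
    return softmax(scores)

# Toy example: 2 operations, d_h = 2, identity W; the first op matches h_q.
probs = select_operations(E_op=[[1.0, 0.0], [0.0, 1.0]],
                          W=[[1.0, 0.0], [0.0, 1.0]],
                          h_q=[2.0, 0.0])
assert abs(sum(probs) - 1.0) < 1e-9 and probs[0] > probs[1]
```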
Operation Executor The operation executor performs the execution of the selected operations over the given context. Inspired by previous studies (Andreas et al., 2016; Gupta et al., 2020), we implement the operation executor as a neural module network, which combines the fitting and generalization advantages of neural networks with the compositional characteristics of symbolic processing. Specifically, for each operation OP_i, i ∈ {1, ..., n}, we use a multi-head cross-attention mechanism (Vaswani et al., 2017) to implement f_i(·). In detail, we use the embedding of each operation E^OP_i as the query and the representation of the whole input sequence H as the keys and values:

m^OP_i = f_i(H) = MHA(E^OP_i W^OP_{q,i}, H W^OP_{k,i}, H W^OP_{v,i}),   (4-5)

where MHA(·) denotes multi-head attention. Finally, we softly integrate all of the execution results into the final output h^OP ∈ R^{d_h} with the distribution p(O|Q, P):

h^OP = Σ_{O_i ∈ OP} p(O_i|Q, P) m^OP_i.   (6)

The operation-aware semantic representation h^OP is further fed into the prediction module to reason the final answer (§ 3.3.3).
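The execution and soft integration can be sketched with single-head attention standing in for the multi-head executors (the projection matrices W^OP are omitted for brevity; all values are toy assumptions).

```python
# Each f_i is approximated by single-head attention with the operation
# embedding as query over token representations H; the results m^OP_i are
# then mixed with the selector distribution p(O|Q,P) to give h^OP.
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [v / z for v in e]

def attend(query, H):
    """Single-head attention: query in R^d, H a list of token vectors."""
    scores = softmax([sum(q * h for q, h in zip(query, tok)) for tok in H])
    d = len(query)
    return [sum(a * tok[j] for a, tok in zip(scores, H)) for j in range(d)]

def execute_operations(E_op, H, p_op):
    """h^OP = sum_i p(O_i|Q,P) * m^OP_i, with m^OP_i = f_i(H)."""
    d = len(E_op[0])
    m = [attend(e, H) for e in E_op]        # one soft execution per operation
    return [sum(p * m_i[j] for p, m_i in zip(p_op, m)) for j in range(d)]

H = [[1.0, 0.0], [0.0, 1.0]]                # two toy token vectors
h_op = execute_operations([[1.0, 0.0], [0.0, 1.0]], H, [0.5, 0.5])
assert len(h_op) == 2
```

Because the integration is a weighted sum rather than a hard selection, gradients flow through every operation during training.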
As described above, OPERA introduces operations that assist in understanding questions and integrates them into the model to perform discrete reasoning. It therefore gains an advantage in reasoning capability and interpretability over previous multi-predictor-based methods (Hu et al., 2019; Chen et al., 2020a; Zhou et al., 2021). Moreover, the soft execution and composition of operations in OPERA alleviate the cascade errors that semantic parsing methods (Ran et al., 2019; Chen et al., 2020b) suffer from. Further experiments and analyses of reasoning ability and interpretability are presented in § 4.4 and § 4.5.

Prediction Module
In this section, we introduce the prediction module, which derives different types of answers via multiple predictors. Each predictor first reasons out a derivation and then executes it to obtain the final answer. This answer prediction procedure is formalized in Eq. 7. As shown in Table 3, the textual answer A to the question "how many yards was the longest field goal in the game" is "80". The possible derivations D of this answer include a span ⟨Span, (100, 102)⟩ and an arithmetic expression ⟨AE, (0 × 29) + (1 × 80)⟩. Inspired by previous studies (Chen et al., 2020a; Zhou et al., 2021), the predictors are specified as follows.

Question/Passage Span The probability of a question/passage span is the product of the probabilities of its start and end indices. Following MTMSN (Hu et al., 2019), we use a question-aware decoding strategy to predict the start and end indices across the input sequence.

Count As indicated in QDGAT (Chen et al., 2020a), questions with 0-9 as answers account for 97% of all count questions. Hence, such questions are modeled as a 10-class (0-9) classification problem.

Arithmetic Expression Similar to NAQANET (Dua et al., 2019), we first assign a sign (positive, negative, or zero) to each number in the context and then compute the answer by summing them.

Multi-spans Inspired by Segal et al. (2020), a multi-span answer (a set of non-contiguous spans) is derived with a sequence labeling method, in which each token of the input is tagged with BIO labels. Each span tagged with a continuous B and I's is then taken as a candidate span.
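The execution step of the arithmetic expression predictor reduces to a signed sum; the sketch below shows that step in isolation, with the sign assignment given rather than predicted by a classifier.

```python
# Arithmetic expression execution: each number in the context gets a sign in
# {+1, -1, 0} and the answer is the signed sum of the numbers.

def evaluate_arithmetic(numbers, signs):
    """Sum numbers weighted by their predicted signs (positive/negative/zero)."""
    assert all(s in (-1, 0, 1) for s in signs)
    return sum(n * s for n, s in zip(numbers, signs))

# Toy example mirroring Table 3: signs (0, 1) over numbers (29, 80) yield 80.
print(evaluate_arithmetic([29, 80], [0, 1]))  # 80
```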

Training Instance Construction
Each training instance is originally composed of a passage P, a question Q, and answer text A. Since the derivations (i.e., labels for spans, arithmetic expressions, and counts) are not provided, weak supervision is adopted in OPERA. Specifically, for each training instance, given the gold textual answer A, we heuristically search for all the possible derivations D that can derive the correct answer A and use them as supervision signals. Table 3 shows an example of D.
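One slice of this weak-supervision search can be sketched by enumerating candidate derivations and keeping those whose execution matches the gold answer. Only sign-assignment arithmetic derivations are searched here; the real search also covers spans and counts.

```python
# Enumerate sign vectors over the context numbers and keep those whose
# signed sum equals the gold answer; each hit is a valid derivation label.
from itertools import product

def search_arithmetic_derivations(numbers, gold):
    """Return all sign vectors over `numbers` whose signed sum equals `gold`."""
    hits = []
    for signs in product((-1, 0, 1), repeat=len(numbers)):
        if sum(n * s for n, s in zip(numbers, signs)) == gold:
            hits.append(signs)
    return hits

# Toy example: the answer 80 over context numbers (29, 80).
print(search_arithmetic_derivations([29, 80], 80))  # [(0, 1)]
```

The enumeration is exponential in the number of context numbers, so practical systems bound the search (e.g., by limiting how many numbers may carry a nonzero sign).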
In addition, we propose heuristic rules to map a question to its related operations, denoted O ⊆ OP. For example, to detect the operations implied by the question Q in Table 3, we design a question template "how many yards [Slot] longest [Slot]", which maps matched questions to the operation MAX. Overall, a training instance is constructed as a tuple ⟨P, Q, A, O, D⟩. The one-shot heuristic rules for obtaining operation labels reduce the cost of human annotation. Moreover, when applying OPERA to other discrete reasoning MRC tasks, both the operations OP and the heuristic rules can be extended and adjusted if necessary. There is no need to construct strict logical forms in our architecture, only the set of lightweight operations involved in the question, which tremendously reduces the difficulty of adapting OPERA to other discrete reasoning MRC tasks.
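A rule of this kind can be sketched as a regex template mapped to an operation label. The patterns below are invented for illustration; the paper's actual rule set is given in its Appendix A.1.

```python
# Heuristic question-to-operation rules: each template that matches the
# question surface form contributes its operation label to O.
import re

RULES = [
    (re.compile(r"how many yards .* longest", re.I), "MAX"),
    (re.compile(r"how many yards .* shortest", re.I), "MIN"),
    (re.compile(r"\bwho\b .* longest", re.I), "ARGMAX"),
    (re.compile(r"how many\b", re.I), "COUNT"),
]

def map_question_to_operations(question):
    """Return every operation whose template matches the question."""
    return [op for pattern, op in RULES if pattern.search(question)]

print(map_question_to_operations("How many yards was the longest field goal?"))
# ['MAX', 'COUNT']
```

Because a question may match several templates, the rules naturally yield an operation *set* rather than a single label, matching the auxiliary supervision described above.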
Meanwhile, we analyze the distribution of operations in the training set. More details about the heuristic rules for mapping questions to operations and the operation distribution in the dataset are given in Appendix A.1 and A.2, respectively.

Joint Training
The training objective consists of two parts, including the loss for answer prediction and operation selection.
The loss for answer prediction is

L_a = -log Σ_{D ∈ D} I(g(D) = A) p(D|Q, P, O).

Note that the calculation of L_a takes into account all possible derivations that yield the correct answer A, which means that OPERA does not require answer-type labels for training. In addition, to learn a better alignment from a question to operations, we introduce auxiliary supervision for the operation selector and calculate the loss

L_op = -Σ_{O_i ∈ O} log p(O_i|Q, P),

where O indicates the operations provided by the heuristic rules. Finally, OPERA is optimized by minimizing L = L_a + λ L_op, where λ is a hyperparameter trading off the two objectives.

Dataset and Evaluation
We conduct experiments on the following two MRC datasets to examine the discrete reasoning capability of our model.We employ Exact Match (EM) and F1 score as the evaluation metrics.
DROP Question-answer pairs in the DROP dataset (Dua et al., 2019) require discrete reasoning over passages, such as addition, counting, and sorting.

Baselines MTMSN (Hu et al., 2019) adds a negation predictor to handle negation questions and re-ranks arithmetic expression candidates. NeRd (Chen et al., 2020b) is essentially a generative semantic parser that maps questions and passages into executable logical forms. ALBERT-Calc (Andor et al., 2019) was proposed for DROP by combining ALBERT with several predefined answer predictors. NumNet+ employs a pre-trained model to further boost the performance of NumNet. QDGAT (Chen et al., 2020a) builds a heterogeneous graph composed of entity and value nodes upon RoBERTa and utilizes a question-directed graph attention network to reason over the graph. EviDR (Zhou et al., 2021), an evidence-emphasized MRC model, performs reasoning over a multi-grained evidence graph based on ELECTRA.

Implementation Details
We utilize the Adam optimizer (Kingma and Ba, 2015) with a cosine warmup mechanism and set the loss weight λ = 0.3 to train the model. The hyper-parameters are listed in Table 4, where BLR, LR, BWD, WD, BS, and d_h respectively denote the learning rate of the encoder, the learning rate of the other parts of the model, the weight decay of the encoder, the weight decay of the other parts of the model, the batch size, and the hidden size of the model. Each operation is instantiated with a multi-head attention layer with n_h heads and dimension d_h.

Results on DROP and Analysis
Table 5 shows the overall results of OPERA and all baselines on the DROP dataset. OPERA achieves comparable and even higher performance than recently published methods. Specifically, OPERA(RoBERTa) surpasses QDGAT by 0.32 EM and 0.42 F1, OPERA(ELECTRA) exceeds EviDR by 0.89 EM and 0.90 F1, and OPERA(ALBERT) outperforms ALBERT-Calc by 4.84 EM and 4.24 F1. Moreover, a voting strategy is employed to ensemble 7 OPERA(ALBERT) models with different random seeds, achieving 86.26 EM and 89.12 F1. We attribute the better performance to the modeling of discrete reasoning over text via operations, which mines more semantic information from the context and explicitly integrates it into answer prediction.

Results on RACENum
To investigate the generalization of OPERA for discrete reasoning, we additionally compare OPERA with QDGAT and NumNet+ on the RACENum dataset. We evaluate the three models directly, without fine-tuning on RACENum, due to its small scale. As Table 6 shows, the scores on RACENum are generally lower than those on DROP, which is attributable to the lack of in-domain training data. Nevertheless, OPERA significantly outperforms NumNet+ and QDGAT by a large margin of more than 3.49 EM and 3.53 F1 on average, indicating that OPERA has better generalization ability.

Interpretability Analysis
Interpretability is an essential property for evaluating an MRC model. We analyze the interpretability of OPERA in two stages: (1) mapping from questions to operations, and (2) mapping from operations to answers.
Mapping from Question to Operation To explicitly show the correlations between questions and related operations, we manually evaluate the performance of operation selection on 50 samples from the development set of DROP. Specifically, precision@n (P@n) is used as the evaluation metric, i.e., judging whether the top-n predicted operations contain the correct ones for a question. OPERA achieves 0.88 P@1 and 0.98 P@2, which indicates that the operation selection module can accurately predict interpretable operations. As shown in Table 7, ablating the operation mechanism degrades performance by 0.79 F1 points and 0.85 EM / 0.75 F1 points for OPERA(RoBERTa) and OPERA(ALBERT), respectively. We also conduct the ablation study on subsets containing a specific operation. As shown in Figure 3, OPERA achieves better performance than OPERA w/o OP on the majority of subsets. Overall, this confirms that integrating the operation-pivoted discrete reasoning mechanism contributes to the reasoning ability of the model.
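The P@n metric used in this manual evaluation can be sketched as follows; the example predictions are invented for illustration.

```python
# Precision@n over operation selection: an example counts as a hit if any
# gold operation appears among the top-n predicted operations.

def precision_at_n(predictions, golds, n):
    """predictions: ranked operation lists; golds: sets of correct operations."""
    hits = sum(1 for pred, gold in zip(predictions, golds)
               if set(pred[:n]) & gold)
    return hits / len(predictions)

preds = [["ARGMAX", "KEY_VALUE"], ["COUNT", "MAX"]]
golds = [{"ARGMAX"}, {"MAX"}]
print(precision_at_n(preds, golds, 1))  # 0.5
print(precision_at_n(preds, golds, 2))  # 1.0
```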

Case Study
We show two examples from the development set of DROP in Table 8 to illustrate the effectiveness of our model by comparing the results of different models. The first example shows that operations are essential for predicting the answer type: NumNet+ and QDGAT fail because the answer type of the "how many" question is wrongly predicted as Count, whereas OPERA captures the ADDITION operation, which prompts the model to answer with the arithmetic expression predictor. The second example shows that OPERA has stronger reasoning capability: although NumNet+ and QDGAT predict the correct answer type, their final answers are wrong, while OPERA utilizes more semantic information for answer prediction with the help of the operation-pivoted discrete reasoning mechanism.

Conclusion
We propose OPERA, a novel framework for machine reading comprehension requiring discrete reasoning. We systematically design lightweight, one-shot operations and heuristic rules that map questions to operation sets. OPERA leverages the operations to enhance the model's reasoning capability and interpretability. Experiments on DROP and RACENum demonstrate that OPERA achieves remarkable performance, and further visualization and analysis verify its interpretability.

A.3 Details of Prediction Module
In this section, we present the architecture details of the prediction module, including the answer type predictor and five label predictors corresponding to the different answer types. FFN(·) denotes a feed-forward network consisting of two linear projections with a GeLU activation function (Hendrycks and Gimpel, 2016) and layer normalization (Ba et al., 2016).

Answer Type
where h^Q and h^P ∈ R^{d_h} are the representation vectors of the question and the passage, calculated by weighted pooling over H^Q and H^P, respectively, and E^OP is the embedding matrix of operations.
Question/Passage Span Following MTMSN (Hu et al., 2019), we use a question-aware decoding strategy to predict the start and end indices of the answer span. Specifically, we first compute a question representation vector g^Q via weighted pooling and then derive the probabilities of the start and end indices of the answer span, denoted p^s and p^e:

p^s, p^e ∝ FFN(M),

where M is computed from the sequence representation H produced by the context encoder, the question vector g^Q, and the operation-aware representation h^OP (derived by Eq. 6) via an element-wise product (⊙).

Figure 1: The architecture of OPERA. It consists of a context encoder, an operation-pivoted reasoning module, and a prediction module. The prediction module supports five types of answers: question span, passage span, arithmetic expression, count, and multi-spans. MHA denotes a multi-head attention mechanism.
p(A|Q, P, O) = Σ_{D ∈ D} I(g(D) = A) p(D|Q, P, O),   (7)

where I(g(D) = A) is an indicator function whose value is 1 if the answer A can be derived by the derivation executor g(·) from D, and 0 otherwise, and p(D|Q, P, O) models the derivation prediction. Specifically, a derivation D = ⟨T, L⟩ includes an answer type T and a corresponding label L (for example, see Table 3). The derivation predictor

p(D|Q, P, O) = Σ_{T ∈ T} p_T(L|Q, P, O) p(T|Q, P, O)   (8)

is decomposed into an answer type predictor p(T|Q, P, O) and corresponding label predictors p_T(L|Q, P, O), where T = {Question Span, Passage Span, Count, Arithmetic Expression, Multi-spans} includes all the answer types defined in this paper. Each label predictor takes the question-passage representation H and the operation-pivoted representation h^OP as input and calculates the probability of label L. These label predictors are specified as follows, with more details in Appendix A.3.
Table 8: The cases from the development set of DROP. The predictions of the state-of-the-art models NumNet+ and QDGAT are shown. The last column indicates our predicted answers and top-1 operations.
This work is supported by the project of the National Natural Science Foundation of China (No.U1908216) and the National Key Research and Development Program of China (No. 2020AAA0108600).

Figure 4: The distribution of operations in the training set of DROP.
The probability distribution over answer types p(T|Q, P, O) is derived by a |T|-way classifier with h^Q, h^P, and h^E as input:

h^E = Σ_{O_i ∈ OP} p(O_i|Q, P) E^OP_i,
p(T|Q, P, O) ∝ FFN([h^E; h^Q; h^P]),

Table 2: All the operations, their descriptions, and the corresponding examples.

Table 3: An example of building training instances.

Table 5: The performance of models on the DROP dataset. We compare only with QDGAT, leaving QDGAT_p aside, since we focus on the reasoning mechanism in this work, while QDGAT_p is a variant of QDGAT with data augmentation (Chen et al., 2020a).

Table 6: The performance of RoBERTa-based models on the RACENum dataset without fine-tuning.

Table 7: Ablation study on the dev set of DROP. RB and AB denote RoBERTa and ALBERT, respectively.

Passage excerpt from the second case in Table 8: Afterwards, the Falcons took the lead as quarterback Matt Ryan completed a 40-yard touchdown pass to wide receiver Roddy White and a 10-yard touchdown pass to tight end Tony Gonzalez ...