Enhancing Multiple-choice Machine Reading Comprehension by Punishing Illogical Interpretations

Machine Reading Comprehension (MRC), which requires a machine to answer questions given the relevant documents, is an important way to test machines' ability to understand human language. Multiple-choice MRC is one of the most studied MRC tasks due to the convenience of evaluation and the flexibility of the answer format. Post-hoc interpretation aims to explain a trained model and reveal how the model arrives at its predictions. One of the most important interpretation forms is to attribute model decisions to input features. Based on post-hoc interpretation methods, we assess the attributions of paragraphs in multiple-choice MRC and improve the model by punishing illogical attributions. Our method improves model performance without any external information or changes to the model structure. Furthermore, we analyze how and why such a self-training method works.


Introduction
Machine reading comprehension (MRC), which requires a machine to answer questions according to given documents, is an important way to test the ability of intelligent systems to understand human language (Hermann et al., 2015; Chen, 2018). As with other tasks in Natural Language Processing (NLP), deep models have achieved great success on MRC. At the same time, the opaqueness of deep models grows in tandem with their power (Doshi-Velez and Kim, 2017), which has motivated efforts to interpret how these black-box models work.

Post-hoc interpretation aims to explain a trained model and reveal how the model arrives at its predictions (Jacovi and Goldberg, 2020; Molnar, 2020), as shown in Figure 1. This goal is usually approached with attribution methods, which assess the attributions of inputs to model predictions (Bach et al., 2015; Sundararajan et al., 2017; Shrikumar et al., 2017). In NLP, interpretations are usually given by assessing attributions of words, phrases, sentences, and paragraphs (Ribeiro et al., 2016; Lundberg and Lee, 2017; Plumb et al., 2018; De Cao et al., 2020; Jacovi and Goldberg, 2020), in which positive attributions mean support for the prediction and negative ones mean opposition.

It is well known that the strong fitting ability of deep models can produce incredibly high performance on the training set. A correct prediction by an MRC model on the training set does not necessarily mean the model has understood the sample and used a suitable way to predict. Since post-hoc interpretations provide insights into how the model arrives at a prediction, we argue that these insights can expose problems that predictions cannot reflect and can be used to improve model performance. In this work, we interpret multiple-choice MRC models by assessing attributions of paragraphs and improve model performance by punishing the illogical parts of these attributions.
The illogical attributions here are the positive ones toward the wrong choices and the negative ones toward the right choice, reflecting a paragraph's support for the wrong choices and opposition to the right one in the model's reasoning process. In example 1 (Figure 2), the attributions show strong support of para3 for distractors A and D, which overlap words with it. These attributions reveal the model's strong dependency on the word-overlap form, which is not suitable for answering this question. We constrain the model from such dependency by punishing these attributions. Attributions in this example reflect problems that predictions fail to, and we take advantage of this to improve the model.
It is worth noting that we do not simply constrain the model from certain forms. On the contrary, we let the model learn the differences between different circumstances. For example, in example 2, the attributions show strong support of para4 for choice C, reflecting the model's dependency on the word-overlap form. However, we do not constrain the model from such dependency as in example 1, because choice C is the right choice and the attribution here is logical. This way, the model learns which circumstances are suitable for using such forms and which are not.
Compared to existing work (Niu et al., 2020; Jin et al., 2020; Zhu et al., 2020), our method does not need any external information or changes to the model structure. We simply train a new model after obtaining the attributions of the original model. We demonstrate the effectiveness of our method through experiments on three representative datasets: RACE (Lai et al., 2017), MULTIRC (Khashabi et al., 2018) and DREAM (Sun et al., 2019). The main contributions of this paper are summarized as follows:

• We innovatively explore the illogical attributions of multiple-choice MRC models and improve the models by punishing them. To the best of our knowledge, we are the first to improve MRC models by resorting to post-hoc interpretations.
• We conduct extensive experiments, and the results demonstrate that our method improves multiple-choice MRC consistently on three datasets. Our method improves both trivial and strong baselines (BERT base and ALBERT xxlarge). Furthermore, it can be applied to the most advanced models.
• We conduct an in-depth analysis of the experimental results and analyze why our method works.
Related Work

Attribution Interpretation Methods
In the post-hoc interpretation research field, methods to obtain attributions can be classified as erasure-based, gradient-based, and attention-based methods. In erasure-based methods, the attributions of inputs are measured by the change of the output when these inputs are removed (Li et al., 2016; Feng et al., 2018). In gradient-based and attention-based methods, the magnitudes of the gradients and attention weights serve as feature importance scores, respectively (Serrano and Smith, 2019; Vashishth et al., 2019; Sundararajan et al., 2017; Shrikumar et al., 2017). Erasure-based methods are model-agnostic, while gradient-based and attention-based methods are applicable to differentiable models and models with attention structures, respectively. The advantage of erasure-based methods is that they are conceptually simple and can optimize well-defined objectives (De Cao et al., 2020). The advantage of gradient-based and attention-based methods is that they are computationally efficient. However, both have received much scrutiny (Sixt et al., 2019; Nie et al., 2018; Jain and Wallace, 2019), with critics arguing that they cannot theoretically prove that the network ignores low-scored features.

Multiple-choice Machine Reading Comprehension
Multiple-choice MRC requires the machine to decide the correct choice from a set of answer choices given the relevant documents and questions. The question and choice types of multiple-choice MRC are flexible, covering arithmetic, abstraction, common sense, logical reasoning, language inference, and sentiment analysis (Lai et al., 2017; Sun et al., 2018; Jin et al., 2020). It requires many advanced reading skills for the machine to perform well on the multiple-choice MRC task.

Task Description
In multiple-choice MRC, given a relevant document $D$ containing $n$ paragraphs $\{p_1, p_2, ..., p_n\}$, a question $Q$, and a choice set with $m$ choices $C = \{c_1, c_2, ..., c_m\}$, the model should determine which choice is correct. The task can be formalized as: $\hat{c} = \arg\max_{c' \in C} P(c' \mid Q, D)$.

Method
The overview of our method is shown in Figure 3. It contains three steps: training and interpreting, processing interpretations, and retraining.

Training and Interpreting
The commonly used framework for multiple-choice MRC is shown in Figure 4: The document, question, and one of the choices are concatenated together, resulting in m sequences for one question.
The model takes these sequences as input separately and outputs logits $L = \{l_1, l_2, ..., l_m\}$ for the $m$ choices. The choice with the largest logit is the predicted choice. If the softmax function is used to normalize the logits, $P(c_i \mid Q, D) = \frac{e^{l_i}}{\sum_{j=1}^{m} e^{l_j}}$, and the corresponding cross-entropy loss is $loss_{task} = -\log P(c_r \mid Q, D)$, where $c_r$ denotes the correct choice.
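As a concrete illustration, the per-choice scoring and loss described above can be sketched in plain Python (a minimal sketch; `encoder` stands in for the pretrained model plus the [CLS] matching layer, and all names are illustrative, not from the paper's code):

```python
import math

def score_choices(encoder, document, question, choices):
    """Build one (document, question, choice) sequence per choice and
    return one logit l_j per choice."""
    return [encoder(f"{document} [SEP] {question} [SEP] {c}") for c in choices]

def softmax(logits):
    """P(c_i | Q, D) = exp(l_i) / sum_j exp(l_j)."""
    exps = [math.exp(l) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def task_loss(logits, correct_idx):
    """Cross-entropy loss: -log P(c_r | Q, D) for the correct choice c_r."""
    return -math.log(softmax(logits)[correct_idx])
```

The predicted choice is then simply `max(range(len(logits)), key=lambda j: logits[j])`, i.e. the choice with the largest logit.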
We train a multiple-choice MRC model and use an erasure-based method to obtain attributions of the trained model. Following previous work (Feng et al., 2018; Ribeiro et al., 2016; Li et al., 2016), the attribution of an input subset is obtained by calculating the output change when erasing this subset. In this work, we use the leave-one-out method (Li et al., 2016) to perform erasure and obtain attributions of paragraphs.
As shown in Figure 4, given a document $D$ containing $n$ paragraphs $\{p_1, p_2, ..., p_n\}$, we use $D_{-i}$ to represent $D$ with $p_i$ erased. For $D_{-i}$, $Q$, $C$, the model outputs logits $L_{-i} = \{l_1^{-i}, l_2^{-i}, ..., l_m^{-i}\}$, the model's outputs with $p_i$ erased. Thus, the attribution of $p_i$ to choice $c_j$ can be calculated as $a_j^i = l_j - l_j^{-i}$. Since the erasure-based method is model-agnostic, we do not need to make any changes to the structure of the MRC model.
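A leave-one-out pass over the paragraphs might look like the following sketch, under the assumption that `model` maps a document string, a question, and the choices to a list of m logits (a hypothetical interface; the paper's actual one may differ):

```python
def leave_one_out_attributions(model, paragraphs, question, choices):
    """Compute a_j^i = l_j - l_j^{-i}: the drop in choice c_j's logit
    when paragraph p_i is erased from the document."""
    base = model(" ".join(paragraphs), question, choices)
    attributions = []  # attributions[i][j] holds a_j^i
    for i in range(len(paragraphs)):
        # document with paragraph p_i removed
        reduced = " ".join(p for k, p in enumerate(paragraphs) if k != i)
        erased = model(reduced, question, choices)
        attributions.append([l - le for l, le in zip(base, erased)])
    return attributions
```

Each sample costs n additional forward passes, one per erased paragraph.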

Processing Interpretations
The illogical attributions are the positive ones toward the wrong choices and the negative ones toward the right choice, reflecting a paragraph's support for the wrong choices and opposition to the right one. Formally, $a_j^i$ is illogical if $a_j^i > 0$ and $c_j \neq c_r$, or $a_j^i < 0$ and $c_j = c_r$. The absolute value of $a_j^i$ reflects the degree of support or opposition, which can be used to measure the illogical degree. For each choice, if $a_j^i$ is illogical, we record the corresponding paragraph index $i$ and recalculate $a_j^i$ during retraining. To shorten retraining time, if we find more than one illogical attribution for one choice, we only use the one with the largest illogical degree. For each sample, we obtain a paragraph index set $I = \{i_1, i_2, ..., i_m\}$ corresponding to the $m$ choices, where $i_j$ is a paragraph index or None. (If there is no illogical attribution for $c_j$, we record $i_j = \text{None}$.)
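Selecting the recorded index set I could be sketched as follows (illustrative code; `attributions[i][j]` is a_j^i from leave-one-out, and `correct_idx` is the index of the right choice c_r):

```python
def select_illogical(attributions, correct_idx):
    """For each choice c_j, pick the paragraph index i_j whose attribution
    is illogical (positive toward a wrong choice, negative toward the right
    one) with the largest magnitude; None if no illogical attribution exists."""
    n = len(attributions)     # number of paragraphs
    m = len(attributions[0])  # number of choices
    indices = []
    for j in range(m):
        best_i, best_mag = None, 0.0
        for i in range(n):
            a = attributions[i][j]
            is_illogical = a < 0 if j == correct_idx else a > 0
            if is_illogical and abs(a) > best_mag:
                best_i, best_mag = i, abs(a)
        indices.append(best_i)
    return indices
```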

Retraining
We train a new model and punish it for generating illogical attributions corresponding to $I$ during the training process. As shown in Figure 5, we calculate the attributions $Attr_I = \{a^{i_1}_1, a^{i_2}_2, ..., a^{i_m}_m\}$ and add an extra loss:

$loss_{ex} = \sum_{j=1}^{m} |a^{i_j}_j|$,

where we set $a^{i_j}_j = 0$ if $i_j = \text{None}$, so that such choices contribute nothing to $loss_{ex}$. The extra loss punishes the model for generating illogical attributions. The total loss of retraining is the combination of the task-specific loss and the extra loss:

$loss = loss_{task} + \alpha \cdot loss_{ex}$,

where $\alpha$ is a factor balancing the two loss terms. Though we need to calculate $a^{i_j}_j$ at each step, this only requires one additional subtraction operation. The main complexity introduced is that the amount of input is doubled, so retraining takes about twice as long as training the initial model.

Figure 5: Overview of the retraining process.

Recently, a new task form has emerged in multiple-choice MRC in which each question has an uncertain number of correct choices (Khashabi et al., 2018). It requires the model to determine the correctness of each choice separately and can be formalized as a binary classification task, as shown in Figure 6. We use the same method for such tasks, treating them as single-choice tasks with two choices: right and wrong.
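One plausible form of the retraining objective, with the extra loss taken as the absolute values of the recorded attributions (our assumption for illustration; the paper's exact penalty may differ), can be sketched as:

```python
def extra_loss(attributions, indices):
    """loss_ex: sum of |a_j^{i_j}| over the recorded illogical attributions;
    i_j = None contributes nothing to the sum."""
    return sum(abs(attributions[i][j]) for j, i in enumerate(indices) if i is not None)

def total_loss(loss_task, attributions, indices, alpha):
    """Combined retraining loss: loss = loss_task + alpha * loss_ex."""
    return loss_task + alpha * extra_loss(attributions, indices)
```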

Baselines and Implementation Details
Since pre-trained language models are widely used in NLP, we choose two of them, BERT base (Devlin et al., 2018) and ALBERT xxlarge (Lan et al., 2019), as the trivial and strong baselines respectively. We use the same model architecture as that in Transformers, which is commonly used in multiple-choice MRC: a pre-trained language model as the encoder and a single-layer linear network connected to [CLS] as the matching network. In addition to the commonly used model architecture, we also experiment with DUMA (Zhu et al., 2020), which is the state-of-the-art model architecture on the DREAM leaderboard.

Our implementation is based on Transformers. We use the default model settings in Transformers and follow the basic experimental settings in the leaderboards and corresponding papers. We directly adopt the same learning rate and batch size as the baseline models for retraining. We search the coefficient α among 0.1, 0.5, and 1. For MULTIRC and DREAM, we use the original paragraph divisions of the datasets. For RACE, which has a noisy paragraph division, we limit the length of paragraphs based on the original division.

Main Results
We evaluate our method on three multiple-choice MRC datasets and adopt the metrics from the referred papers. The results are summarized in Table 1. Our method improves model performance remarkably: 1.33% and 1.63% average improvement for BERT base and ALBERT xxlarge respectively, which demonstrates that our method helps both a trivial baseline and a competitive one. Furthermore, ALBERT xxlarge + retraining produces competitive results: on the RACE leaderboard, our result lags behind only the super-large pre-trained language model Megatron-LM (Shoeybi et al., 2019); on the DREAM leaderboard, our result lags behind only DUMA (Zhu et al., 2020). Note that we only compare single-task, non-ensemble models. In addition, since our method is model-agnostic, we also experiment with DUMA as the model architecture on the DREAM dataset, which is the state-of-the-art architecture on the DREAM leaderboard.
Because Zhu et al. (2020) did not provide some important details, such as the number of DUMA attention heads and the head size, we use the settings from another re-implementation (Wan, 2020).

The Relationship Between Illogical Attributions and Model Performance
In this section, we explore the relationship between illogical attributions and model performance. We use the maximum value of the illogical attributions as the illogical score of a choice and sum the scores of all choices as the illogical score of a sample. According to the illogical scores, we sort the samples and divide them into 20 equally sized subsets. We evaluate model performance on these subsets and investigate the relationship between illogical score and model performance. We use two widely used correlation coefficients to evaluate the correlation between them: the Spearman rank-order correlation coefficient (SROCC) (Spearman, 1961) and the Pearson linear correlation coefficient (PLCC) (Benesty et al., 2009).
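The subset construction and both correlation measures are straightforward to sketch (hand-rolled here for self-containment, ignoring rank ties; `scipy.stats.pearsonr` and `scipy.stats.spearmanr` are the standard implementations):

```python
def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

def subset_accuracies(illogical_scores, correct_flags, n_subsets=20):
    """Sort samples by illogical score and report accuracy per equally
    sized subset (lowest-score subset first)."""
    order = sorted(range(len(illogical_scores)), key=lambda i: illogical_scores[i])
    size = len(order) // n_subsets
    return [
        sum(correct_flags[i] for i in order[s * size:(s + 1) * size]) / size
        for s in range(n_subsets)
    ]
```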

Test Set Results
We first experiment on the test set. As shown in Figure 7, the SROCC and PLCC values on the test set are close to -1. The results show a strong correlation between illogical score and model performance, where a higher illogical score corresponds to poorer model performance. Moreover, since MRC models' understanding ability is evaluated via test set performance, the results also demonstrate that interpretations can be used to evaluate MRC models' understanding ability from another perspective.

Training Set Results
The correlation on the training set is weaker than on the test set. This is because the model has fitted the training samples during training, so training set performance cannot reflect MRC models' understanding ability. However, we find an interesting phenomenon: the correlation is consistently stronger for the stronger model across all datasets. We hypothesize that the stronger model can fit more linguistic features of the training samples, while the trivial model needs to fit more features unique to individual training samples. Since these unique features are hard to interpret and do not generalize to test set samples, the correlation between interpretation (illogical score) and performance is weak, and the test set performance is poor.

Effectiveness of Retraining the Illogical Interpretations
In this section, we explore whether the illogical attributions are constrained after retraining. We compare the illogical scores of the retrained model to those of the original model. Figure 8 shows an example of the change in illogical scores after retraining: most samples' illogical scores are constrained close to zero. In this example, the average illogical score declines from 1.22 to 0.11 on the training set and from 1.43 to 0.37 on the test set. Table 3 shows the changes in the average illogical score after retraining. The average illogical scores decline consistently in all six experiments, which demonstrates the effectiveness of the retraining strategy.

According to the illogical scores of the original model, we divide the test set into ten equally sized subsets. Model performance on these subsets is shown in Figure 10. On subsets with high illogical scores, model performance gains remarkably after retraining. However, model performance declines after retraining on some low-score subsets. We hypothesize that the punishment of illogical interpretations affects the model's confidence in using the right reasoning form in some samples. For example, although we let the model learn the difference between the examples in Figure 2, the punishment in example 1 may affect the model's confidence in using the same form in example 2.

Using Post-hoc Interpretations to Improve NLP Models
Existing work on using post-hoc interpretations to improve NLP models has forced the model to generate the 'correct' interpretation. Although conceptually simple, 'correct' interpretations to serve as the ground truth are difficult to obtain. For example, Liu and Avci (2019) use human-selected terms as the target attributions, which is noisy and hard to generalize to other datasets. Other work performs sampling during training and resorts to mean-field approximation (Blei et al., 2017) to obtain target attributions, which leads to difficult training and unstable results. Moreover, their improvements are limited, and they all choose to experiment on simple text classification tasks, in which some words can be regarded as the decisive factors for prediction. However, for MRC tasks that often require complex reasoning, obtaining interpretations to serve as the ground truth to guide the model is more difficult and costly.

Figure 9: Two ways of using post-hoc interpretations to improve NLP models.

Different from existing methods, we focus on finding illogical parts of the interpretations of trained models instead of the ground-truth interpretations of the task. As shown in Figure 9, we punish the illogical parts and force the model to find other ways to reach the prediction by itself. Because forms found by humans are usually easy for deep models to learn, it is hard to create interpretations that are helpful to deep models. We believe analyzing models' interpretations and finding their problems is a more suitable way to improve model performance.

Figure 10: Model performance on subsets, where a larger index corresponds to a higher illogical score. Since the F1 score is affected by the ratio of positive and negative samples, we use accuracy as the metric for MULTIRC.

Guiding Strong Models by Penalizing Errors
We get similar average performance improvements for the strong baseline ALBERT xxlarge and the trivial baseline BERT base. This is contrary to many methods, which are usually effective on trivial baselines but struggle to improve strong ones. For example, similar work (Niu et al., 2020) on improving multiple-choice MRC models designs a sentence selector for learning evidence sentences. Their method gets significant improvement on BERT base but fails on the stronger baseline RoBERTa large (Liu et al., 2019). Telling a strong model which sentences are evidence sentences is more difficult because the model's strong learning ability might make this extra guidance redundant. This suggests the hypothesis that, for a strong model, penalizing errors is more effective than promoting correct answers when high-quality labels for the correct answers are not available.

Case Study
We can observe the wrong reasoning processes of deep MRC models through analysis of the illogical interpretations, some of which are interesting and unexpected. For example, in one case the model shows strong support for the wrong choice 'March 15th'. We hypothesize that the model understands 'not really' as a negation of 'March 5th'; however, it notices 'three' in para2 and believes that the answer is 3 times 5 equals 15. We observed the linguistic characteristics of examples with high illogical scores on the test set and found that they differ between BERT base and ALBERT xxlarge. For example, examples with negation and transition tend to have high illogical scores on BERT base but low illogical scores on ALBERT xxlarge. We hypothesize that this is because ALBERT xxlarge performs better than BERT base at not being distracted by these grammatical phenomena. We suggest that analyzing interpretations and finding problems can help humans gain a more comprehensive understanding of deep models.

Conclusion and Future Work
In this work, we improve multiple-choice MRC by resorting to attribution interpretations. Experimental results show that our method remarkably improves model performance on three representative datasets. We believe using post-hoc interpretations to improve NLP models is a promising research field. Future work includes two aspects: 1. We plan to experiment with our method on other tasks, such as natural language inference and sentiment analysis, and to explore methods applicable to tasks without choice options or specific classes, such as generative MRC and span-extraction MRC.
2. In addition to attributions, we plan to use other forms of post-hoc interpretations, such as feature interaction, to improve NLP models.

A Experiment Details

Getting Attributions
There is no need to carefully select the model used to generate attributions. In our experiments, we use the last checkpoint saved at the maximum training step. We use the original paragraph division of the dataset for MULTIRC and DREAM. For RACE, which has a chaotic paragraph division, we limit the maximum and minimum length of paragraphs based on the original division. Specifically, if a paragraph's length is less than 10, we combine it with the previous one. If the length exceeds 30, the next sentence boundary after position 30 starts a new paragraph. For MULTIRC, which has an uncertain number of correct choices and can be formalized as a binary classification task, we treat the task as a single-choice task with two choices: the choice is right, and the choice is wrong. Because of the opposition between these two choices, we only record the paragraph indexes of the option representing "the choice is wrong" for retraining.
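The length-based re-segmentation of RACE described above might be implemented as follows (a sketch; we assume lengths are counted in words and that sentences end at `.`, `!`, or `?`, neither of which the paper specifies):

```python
import re

def merge_short(paragraphs, min_len=10):
    """Merge any paragraph shorter than min_len words into the previous one."""
    merged = []
    for p in paragraphs:
        if merged and len(p.split()) < min_len:
            merged[-1] = merged[-1] + " " + p
        else:
            merged.append(p)
    return merged

def split_long(paragraph, max_len=30):
    """Once the cumulative word count exceeds max_len, the next sentence
    starts a new paragraph."""
    sentences = re.split(r'(?<=[.!?])\s+', paragraph)
    out, current, count = [], [], 0
    for s in sentences:
        current.append(s)
        count += len(s.split())
        if count > max_len:
            out.append(" ".join(current))
            current, count = [], 0
    if current:
        out.append(" ".join(current))
    return out
```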

Retraining
For hyperparameters, all three tasks use 512 as the maximum sequence length. We adopt the same learning rate and batch size as the baseline models for retraining. We use the default model settings in Transformers. We search the coefficient α among 0.1, 0.5, and 1. The details are shown in Table 5. We follow the experimental settings from the leaderboards and corresponding papers.
If there is no relevant information, we retrain the model three times and pick the model with the best accuracy on the dev set. We use FP16 training from Apex to accelerate the training process, and all experiments are run on two Titan RTX GPUs and two Tesla V100 GPUs. We calculate the task-specific loss and the extra loss separately during retraining because of GPU memory limitations.

B Case Study
We present the cases with the top 5 highest illogical scores on the DREAM test set: