Alignment Rationale for Natural Language Inference

Deep learning models have achieved great success on the task of Natural Language Inference (NLI), yet only a few attempts have been made to explain their behavior. Existing explanation methods usually pick prominent features such as words or phrases from the input text. For NLI, however, alignments among words or phrases are more enlightening clues for explaining the model. To this end, this paper presents AREC, a post-hoc approach to generate alignment rationale explanations for co-attention based models in NLI. The explanation is based on feature selection, which keeps few but sufficient alignments while maintaining the same prediction as the target model. Experimental results show that our method is more faithful and human-readable than many existing approaches. We further study and re-evaluate three typical models through our explanations, going beyond accuracy, and propose a simple method that greatly improves model robustness.


Introduction
Natural Language Inference (NLI) is a fundamental task in Natural Language Processing (NLP), which is to determine whether a premise entails a hypothesis. Recently, with the introduction of large-scale annotated datasets (Bowman et al., 2015; Williams et al., 2018), deep learning models have been adopted to solve the task in a supervised manner (Conneau et al., 2017; Chen et al., 2017; Devlin et al., 2019) and have achieved great success, yet the inner mechanisms of these methods remain opaque due to their high computational complexity.
Towards interpretability, explaining model behavior has gained increasing attention. Many approaches are based on feature attribution, which assigns saliency scores to input features (Bahdanau et al., 2015; Lundberg and Lee, 2017; Thorne et al., 2019; Kim et al., 2020), or on feature selection (rationale extraction), which keeps a subset of features sufficient for the prediction (Lei et al., 2016; Bastings et al., 2019; De Cao et al., 2020; DeYoung et al., 2020). Figure 1 (a) and (b) present a text attribution explanation by LIME (Ribeiro et al., 2016) and a text rationale explanation from Li et al. (2016) for an NLI sentence pair. Both explanations provide insight into which input words are responsible for the prediction. However, NLI is a cross-sentence task requiring a system to reason over alignments (MacCartney and Manning, 2009). Intuitively, it is more sensible to explain NLI systems in terms of alignments instead of isolated words/phrases. For the example in Figure 1, the contradictory phrase pair street–store is one of the key alignments responsible for the correct prediction.
To explain NLI models over alignments, the literature usually looks at co-attention weights (Parikh et al., 2016; Pang et al., 2016; Chen et al., 2017), co-attention being a dominant way to implicitly align word pairs (Wang et al., 2017; Gong et al., 2018; Devlin et al., 2019). However, attention is argued not to be as explainable as expected (Jain and Wallace, 2019; Serrano and Smith, 2019; Bastings and Filippova, 2020). Moreover, co-attention assigns scores between words, forbidding us from observing phrase-level alignments, a flaw shared by attribution explanations in general, as shown in Figure 1 (c). Other works build hard alignments by resorting to sparse attention (Yu et al., 2019; Bastings et al., 2019; Swanson et al., 2020), but their self-explanatory architectures pay for interpretability with a drop in accuracy (Molnar, 2020). Meanwhile, these techniques are unable to analyze already well-trained models.
To resolve the above problems, this paper proposes AREC, a post-hoc local approach to generate Alignment Rationale Explanations for Co-attention based models. Analogous to Lei et al. (2016), our alignment rationale is a set of text pairs from the NLI sentence pair with two requirements. First, the explanation should be faithful to the predictive model: the selected text pairs must alone suffice for the original prediction. Second, the explanation should be human-friendly or readable (Miller, 2019), meaning the pairs are few, to promote compact rationales, and extracted contiguously, to form phrase-level rationales as far as possible (Lei et al., 2016; Bastings et al., 2019). Figure 1 (d) presents an example of an AREC explanation. It shows that the model reaches the right prediction reasonably: it identifies People–Passengers, walk through–car driving and store–street to make up the alignment rationale. AREC can be flexibly applied to any co-attention architecture, allowing deep investigation of well-trained models.
With the proposed AREC, we study three typical co-attention based models, Decomposable Attention (DA) (Parikh et al., 2016), Enhanced LSTM (ESIM) (Chen et al., 2017) and BERT (Devlin et al., 2019), on four benchmarks: SNLI (Bowman et al., 2015), ESNLI (Camburu et al., 2018), BNLI (Glockner et al., 2018) and HANS (McCoy et al., 2019). Experimental results show that our method generates more faithful and readable explanations. Moreover, we employ AREC to analyze these models in depth from the aspect of alignments. Based on our explanations, we further present a simple improvement strategy that greatly increases the robustness of different models without modifying their architectures or retraining. This shows that our method factually reflects how the models work.
Our contributions are summarized as follows: 1) We propose AREC, a post-hoc local explanation method that extracts alignment rationales for co-attention based models. We compare AREC with other explanation methods, illustrating its advantages in faithfulness and readability.
2) We diagnose three typical co-attention based models using AREC by re-evaluating them at a more fine-grained alignment level beyond accuracy. The experimental results reveal potential improvements. To the best of our knowledge, we are the first to study existing models exhaustively from the perspective of alignment. Our code is available at https://github.com/changmenseng/arec.

Related Work

Natural Language Inference
Natural Language Inference has been studied for years. Although many works construct representations for the two input sentences individually (Bowman et al., 2015; Mueller and Thyagarajan, 2016; Conneau et al., 2017), the task actually requires a system to recognize alignments (MacCartney and Manning, 2009). In the early days, alignment detection was sometimes framed as an independent task (Chambers et al., 2007; MacCartney et al., 2008) or as a component of a pipeline system (MacCartney et al., 2006). Current deep learning methods seek to model alignments implicitly through the co-attention mechanism (Parikh et al., 2016; Pang et al., 2016; Chen et al., 2017; Wang et al., 2017; Gong et al., 2018; Joshi et al., 2019; Devlin et al., 2019). The technique was first proposed in machine translation (Bahdanau et al., 2015) and soon came to dominate many applications including NLI. However, why models with co-attention layers are effective still calls for an answer.

Explaining Models in NLP
Explaining model behaviors has attracted much interest. Existing studies include opening the components of models (Murdoch et al., 2018), assigning word importance scores (Ribeiro et al., 2016; Li et al., 2016; Kim et al., 2020), extracting prediction-related input pieces, referred to as sufficient input subsets (Carter et al., 2019) or rationales (Lei et al., 2016; Bastings et al., 2019), building hierarchical explanations (Chen et al., 2020; Zhang et al., 2020), and generating natural language explanations (Camburu et al., 2018; Kumar and Talukdar, 2020). However, they usually explain the model at the granularity of words/phrases. Such approaches suffice for text classification but not for NLI, since the atomic features of the task are alignments.
Co-attention itself is often viewed as an explanation. Indeed, co-attention is a key proxy for modeling alignments, and perturbing its weights has a significant impact (Vashishth et al., 2019). Yet recently, attention has been argued to be not as explainable as expected (Jain and Wallace, 2019; Serrano and Smith, 2019; Grimsley et al., 2020; Bastings and Filippova, 2020). Secondly, co-attention, like feature attribution explanations, only assigns scores between words, making it infeasible to observe phrase-level alignments. Furthermore, for models with multiple attentions (Vaswani et al., 2017), it is hard to acquire a global understanding of alignments. Other approaches include Yu et al. (2019), who adopt the generator-encoder architecture (Lei et al., 2016) to generate corresponding rationales, but their approach cannot extract more fine-grained alignments (e.g., one-to-one continuous alignments). Bastings et al. (2019) and Swanson et al. (2020) design sparse attention for hard alignments. However, these methods trade performance for interpretability and are unable to analyze well-trained models.

Method
In this section, we describe AREC in detail. As mentioned before, AREC is a post-hoc approach for explaining co-attention based models. Thus we first introduce the co-attention layer, then describe the proposed AREC.

Background: Co-Attention in NLI Models
In our notation, an instance includes a premise P = [p_1, · · · , p_{|p|}] ∈ R^{d×|p|} and a hypothesis H = [h_1, · · · , h_{|h|}] ∈ R^{d×|h|}, where |p|/|h| is the length of the premise/hypothesis, and p_i/h_j ∈ R^d denotes the corresponding word embedding (fixed or contextual). The co-attention layer accepts P and H as input and outputs alignment-enhanced word representations \tilde{P} ∈ R^{d×|p|} and \tilde{H} ∈ R^{d×|h|}. At the first step, we compute a similarity matrix S ∈ R^{|p|×|h|} with

S_{i,j} = \phi(p_i, h_j),

where φ is a similarity function, ordinarily a vector dot product (Chen et al., 2017). Then S is normalized to compute soft alignment scores for every word in a sentence w.r.t. all the words in its partner:

A^P_{i,j} = \frac{\exp(S_{i,j})}{\sum_{j'} \exp(S_{i,j'})}, \quad A^H_{i,j} = \frac{\exp(S_{i,j})}{\sum_{i'} \exp(S_{i',j})}.

Here A^P and A^H are the so-called co-attention matrices; each element inside indicates the matching degree of the corresponding word pair. Next, we obtain soft alignment features for every word in the premise/hypothesis by averaging word embeddings in the hypothesis/premise weighted by the soft alignment scores:

\tilde{p}_i = \sum_j A^P_{i,j} h_j, \quad \tilde{h}_j = \sum_i A^H_{i,j} p_i.

Now \tilde{P}/\tilde{H} is a richer representation of P/H enhanced by H/P and is fed to subsequent modules, such as a classifier that outputs probabilities of the candidate categories, i.e., entailment, contradiction and neutral in the NLI task.
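To make the computation concrete, the following is a minimal PyTorch sketch of the co-attention layer described above (not the authors' released code); it assumes a dot-product similarity for φ and the tensor shapes from our notation.

```python
# Minimal co-attention sketch; tensor names mirror the notation in the text.
import torch

def co_attention(P: torch.Tensor, H: torch.Tensor):
    """P: (d, |p|) premise embeddings, H: (d, |h|) hypothesis embeddings."""
    S = P.T @ H                        # similarity matrix, shape (|p|, |h|)
    A_P = torch.softmax(S, dim=1)      # each premise word attends over hypothesis words
    A_H = torch.softmax(S, dim=0)      # each hypothesis word attends over premise words
    P_tilde = H @ A_P.T                # (d, |p|): soft alignment features for premise words
    H_tilde = P @ A_H                  # (d, |h|): soft alignment features for hypothesis words
    return A_P, A_H, P_tilde, H_tilde
```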

Problem Formation
The proposed AREC relies on feature selection, keeping few but sufficient alignments while maintaining the original prediction. To restrict the model to consider only specific alignments, we mask the co-attention matrices A^P and A^H following Serrano and Smith (2019) and Pruthi et al. (2020). Let Z ∈ {0, 1}^{|p|×|h|} be a binary mask indicating the presence or absence of every word-pair alignment, and let M be a model with co-attention layers. The masking process is simply the Hadamard product between the mask Z and the co-attention matrices A^P and A^H. An alignment rationale is obtained by the optimization problem

\min_{Z} \; \lambda_0 L_0 + \lambda_1 L_1 + \lambda_2 L_2,

where the loss contains three terms (L_0, L_1 and L_2) to satisfy faithfulness and readability as mentioned in Section 1, and λ_0, λ_1 and λ_2 are hyper-parameters standing for loss weights. Every rectangular region in Z represents a text alignment in the alignment rationale. We now describe the loss terms. The first term L_0 concerns fidelity, requiring that the model prediction is maintained after masking (Molnar, 2020). Fidelity ensures faithfulness, making the derived explanation depict the true profile of how the model works. We choose the euclidean distance between logits as this loss term, i.e.,

L_0 = \| M_l(P, H) - M^Z_l(P, H) \|_2,   (5)

where M_l(P, H) and M^Z_l(P, H) ∈ R^3 are the original output logits and the output logits when applying the mask Z, respectively. Compared to the commonly used KL divergence (De Cao et al., 2020) or label equality (Feng et al., 2018), the euclidean distance between logits is a stricter constraint that narrows down the solution space and leads to more faithful explanations.
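As an illustration of the masking and fidelity loss just described, here is a small sketch assuming a hypothetical wrapper model_logits(P, H, mask=None) that multiplies the co-attention matrices A^P and A^H element-wise by the mask before the weighted sums; it is a sketch, not the paper's implementation.

```python
# Fidelity loss L_0: euclidean distance between original and masked logits.
import torch

def fidelity_loss(model_logits, P, H, Z):
    """Z: binary (|p|, |h|) mask applied to the co-attention matrices."""
    orig = model_logits(P, H)               # original logits, shape (3,)
    masked = model_logits(P, H, mask=Z)     # logits with A_P * Z and A_H * Z
    return torch.norm(orig - masked)        # euclidean distance between logits
```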
Secondly, an explanation ought to be readable (Molnar, 2020), which under the context of alignment explanation covers compactness and contiguity. Compactness draws on the philosophy that a good explanation should be short or selective (Miller, 2019), encouraging fewer alignments to be selected. The compactness loss is simply the L1 norm of the mask Z:

L_1 = \sum_{i,j} z_{i,j},   (6)

where z_{i,j} is an element of Z. Contiguity encourages continuous phrase-level alignments (Zenkel et al., 2020), which helps human understanding. Concretely, contiguity prefers Z with rectangular clusters. Thus, we have

L_2 = \sum_{i,j} \mathbb{1}\Big(\sum_{(i',j') \in W_{i,j}} z_{i',j'} = 3\Big),   (7)

where 1(·) is the indicator function and W_{i,j} is the 2×2 window anchored at (i, j). The loss is based on the observation that if there are three 1s in a window, there must be a non-rectangular region nearby, as marked by red boxes in Figure 2.
Figure 2: The contiguity loss L_2 encourages the algorithm to extract phrase alignments, i.e., it penalises Z with non-rectangular clusters, as marked by red boxes.
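The two readability losses have a direct tensor implementation. The sketch below assumes Z is a binary torch tensor of shape (|p|, |h|) and that the contiguity window is 2×2, consistent with the "three 1s in the window" observation above; it computes the hard (discrete) versions of L_1 and L_2.

```python
# Readability losses on a binary mask Z of shape (|p|, |h|).
import torch

def compactness_loss(Z: torch.Tensor) -> torch.Tensor:
    return Z.sum()  # L_1: number of selected word-pair alignments

def contiguity_loss(Z: torch.Tensor) -> torch.Tensor:
    # Sum of every 2x2 window; a window summing to exactly 3 is non-rectangular.
    window_sums = Z[:-1, :-1] + Z[:-1, 1:] + Z[1:, :-1] + Z[1:, 1:]
    return (window_sums == 3).float().sum()  # L_2: count of offending windows
```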

Optimization
Searching the exponentially huge (2^{|p||h|}) solution space of Z directly is impracticable. To use gradient-based methods, we relax the binary Z to a random matrix Z and optimize the expected loss over it. Specifically, we assume every element Z_{i,j} of Z is an independent random variable following the HardConcrete distribution (Louizos et al., 2018a). HardConcrete variables can take the exact discrete values 0 and 1 while having a continuous and differentiable probability density on the open interval (0, 1). Additionally, the HardConcrete distribution accommodates reparameterization, permitting us to obtain a HardConcrete sample z by transforming a parameter-free unit uniform sample u, i.e.,

z = g(u; \alpha),   (8)

where g is differentiable.
Details are shown in Appendix A. Under this setting, we turn to optimizing the expectation of the objective. For L_0, we have

\mathbb{E}[L_0] = \mathbb{E}_{U}\big[L_0(g(U; \alpha))\big] \approx \frac{1}{n}\sum_{i=1}^{n} L_0(g(U_i; \alpha)).

Here, U is a random matrix filled with i.i.d. unit uniform variables, and α ∈ R_+^{|p|×|h|} is the parameter of Z. The right-hand side is a Monte-Carlo approximation of the expectation, where n is the sample size and U_i is the i-th sample of U.
For L_1 and L_2, the expectations can be computed in closed form over ⌈Z⌉, the element-wise round-up of Z, with P(·; α) denoting the probability under the parameter α. Now all the losses are differentiable with respect to α, making gradient descent feasible. Derivation details are presented in Appendix B.
After training, we obtain the alignment rationale by discretizing the learned mask distribution.

Experiments
Our experiments include two parts. First, we quantitatively compare the proposed AREC with several typical explanation methods (Section 4.1) to prove the effectiveness of our method. Second, by means of AREC, we study and re-evaluate different models from the aspect of alignment beyond accuracy, revealing potential improvements (Section 4.2).

Datasets
We use four datasets, SNLI (Bowman et al., 2015), ESNLI (Camburu et al., 2018), BNLI (Glockner et al., 2018) and HANS (McCoy et al., 2019), as our testbeds. SNLI is a traditional NLI benchmark, while ESNLI extends it by annotating text rationales. BNLI and HANS are stress-test sets probing lexical inference and overlap heuristics respectively.

Models
We choose three typical co-attention based NLI models, DA (Parikh et al., 2016), ESIM (Chen et al., 2017) and BERT (base version) (Devlin et al., 2019), for our study. DA applies co-attention directly to word embeddings. ESIM further incorporates order information by placing two LSTMs (Hochreiter and Schmidhuber, 1997) before and after the co-attention layer to boost performance. Differently, BERT concatenates the input sentence pair with the template "[CLS] p [SEP] h [SEP]" and uses global self-attention (Vaswani et al., 2017). All models are trained on the SNLI training set and tested across datasets.

Implementation
We mask the attention matrices of DA and ESIM as described in Section 3.2, since they are directly formed by co-attention. For BERT, we use a single mask and apply it identically to the co-attention-corresponding sub-matrices of all attention matrices, regardless of layer or attention head. We consider faithfulness to have a higher priority than readability. Correspondingly, we adjust the loss weights dynamically based on the fidelity of the current mask: the readability weights λ_1 and λ_2 are scaled by SpAc, the accuracy of the currently sampled masks,

SpAc = \frac{1}{n}\sum_{i=1}^{n} \mathbb{1}\big(M^{Z_i}_y(P, H) = M_y(P, H)\big),   (12)

where M^Z_y is the model's predicted label under mask Z. Thus the terms related to readability are controlled by the explanation's faithfulness. This simple dynamic weighting strategy, similar in spirit to Platt and Barr (1988), greatly improves explanation quality and algorithm stability.
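A small sketch of the dynamic weighting could look like the following, assuming a hypothetical helper predict_label(P, H, mask) that returns the model's predicted class under a mask; scaling the base weights directly by SpAc is an assumption consistent with the description above, not the exact schedule from the paper.

```python
# Dynamic readability weights controlled by the fidelity of sampled masks.
def dynamic_weights(predict_label, P, H, sampled_masks, base_lambda1, base_lambda2):
    orig_label = predict_label(P, H, mask=None)
    # SpAc: fraction of sampled masks under which the prediction is unchanged.
    sp_acc = sum(predict_label(P, H, mask=Z) == orig_label
                 for Z in sampled_masks) / len(sampled_masks)
    # Readability terms only kick in once the masks are faithful to the model.
    return base_lambda1 * sp_acc, base_lambda2 * sp_acc
```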

Explanation Evaluation
In this section, we aim to evaluate the faithfulness and readability of different explanations.

Metrics
Inspired by DeYoung et al. (2020), we use the Area Over Reservation Curve (AORC) to evaluate faithfulness: for each explanation, we construct masks that reserve the top k% of co-attention weights according to its importance scores, compute the fidelity loss under each mask, and report the area of the resulting curve over k. We use a reservation curve rather than the Area Over Perturbation Curve (AOPC) (DeYoung et al., 2020) because our method reserves the features (i.e., alignments) that keep the prediction, so a reservation curve is a better fit. Though AREC belongs to feature selection explanations, its parameter α also provides importance scores. We also report the fidelity defined in Equation (5) as a measure of faithfulness.
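The following is a rough sketch of how AORC can be computed, assuming importance scores over the (|p|, |h|) alignment grid, a fidelity function implementing L_0 under a given mask, and an illustrative set of reservation ratios; the exact ratios and aggregation are assumptions, not the paper's specification.

```python
# AORC sketch: mean fidelity loss while reserving the top-k% alignments.
import numpy as np

def aorc(scores: np.ndarray, fidelity, ks=np.arange(0.05, 1.05, 0.05)):
    flat = scores.flatten()
    order = np.argsort(-flat)            # most important alignments first
    losses = []
    for k in ks:
        keep = order[: max(1, int(round(k * flat.size)))]
        Z = np.zeros_like(flat)
        Z[keep] = 1.0                    # reserve the top k% alignments
        losses.append(fidelity(Z.reshape(scores.shape)))
    return float(np.mean(losses))        # area over the reservation curve
```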
For readability evaluation, we report compactness and contiguity defined in Equation (6) and Equation (7) respectively. We also conduct a human evaluation on 300 randomly sampled examples from the SNLI test set to measure readability directly. We ask 2 annotators, both well-educated postgraduates majoring in computer science, to rate from 1 to 5 how easy each explanation is to read and to understand the model's decision-making process over alignments, and report the average scores.
We admit that fidelity, compactness and contiguity are exactly the metrics AREC optimizes. It is in fact hard to evaluate different explanations under a unified protocol, since their contexts and techniques usually differ completely; judged purely by their definitions, we consider these metrics reasonable. Note that these metrics are not directly applicable to feature attribution explanations. For a fair comparison, we follow Carter et al. (2019) and induce alignment rationales from feature attribution baselines by thresholding: we sequentially retain co-attention weights according to attribution scores until the fidelity loss is lower than a pre-defined threshold. The threshold is set to the L_0 of AREC plus 0.1, so that the induced rationales have similar fidelity to AREC's. We do not use a fixed-size constraint to construct rationales as done in Jain et al. (2020), because we believe the size of a rationale depends on the instance.

Results
Automatic evaluation and readability human evaluation results are shown in Table 1 and Table 2 respectively. We obtain the following findings: 1) AREC is quite faithful, achieving the lowest AORC and fidelity values in most cases. Perturbation-based methods are evenly matched with moderate performance, while gradient-based ones are the least faithful. Surprisingly, co-attention is a very strong baseline for indicating important alignments in NLI, surpassing most other baselines on AORC, especially for ESIM. This result is in accordance with Vashishth et al. (2019), who find attention more faithful in cross-sentence tasks than in single-sentence tasks.
2) AREC is quite readable, achieving the lowest compactness and contiguity values in most cases in the automatic evaluation, and it is also the most readable explanation according to the human evaluation. In contrast, feature attribution methods are unable to induce readable alignment rationales: they reserve too many co-attention weights, usually about half of them, to reach fidelity similar to AREC's rather than satisfying compactness and contiguity. Appendix E shows examples that give an intuitive sense of the readability of different explanations.
3) Compared to the rationale explanation DIFFMASK, AREC is far more promising, outperforming it by large margins on fidelity while maintaining equivalent or better compactness and contiguity. To our knowledge, DIFFMASK globally learns to explain local instances: the explainer is trained on a training set which may contain artifacts and biases (Gururangan et al., 2018; Tsuchiya, 2018; Poliak et al., 2018). This architecture therefore leverages data information, making it susceptible to over-fitting and to generating data-dependent biased explanations, which leads to poor fidelity on held-out data (BNLI and HANS), as shown in Table 1. Moreover, we believe a faithful explanation is a profile of a model; correspondingly, an explanation method should only access knowledge from the model instead of from the data. That is an appealing theoretical advantage of our method.

Beyond Accuracy

With AREC, we evaluate each model's alignment plausibility (Jacovi and Goldberg, 2020): how well its alignment rationales agree with human judgments (DeYoung et al., 2020). Since Section 4.1 establishes that our method is faithful, alignment plausibility reflects a model's power of alignment detection, i.e., whether it makes a prediction based on the right alignments. Figure 3 illustrates the evaluation process.
Firstly, let us look at Table 3, which shows the accuracy of various models across datasets. DA, ESIM and BERT all achieve high and comparable accuracy on SNLI. However, they are distinguished on lexical reasoning, where BERT surpasses the others significantly on BNLI. Additionally, none of them is robust against overlap heuristics, as their performance is extremely poor on non-entailment instances. We seek to uncover the underlying reasons (Section 4.2.2) and try to make improvements (Section 4.2.3) using AREC.

Metrics
We define different metrics to measure alignment plausibility (or, equivalently, the agreement of alignment rationales with humans) on the various datasets.
For ESNLI, since it is annotated at the text level, we simply collect the corresponding words to convert an alignment rationale into a text rationale for comparison. We adopt IOU-F1 and Token-F1 from DeYoung et al. (2020), and only use the subset of ESNLI whose instances are labeled contradiction for this evaluation.
In BNLI, each sentence pair differs by a single word or phrase. This pair naturally forms an annotation, which should be included in a gold alignment rationale. We further presume, reasonably, that this pair is the most essential alignment in its corresponding alignment rationale. Thus, three metrics are defined: 1) Max-F1: we retain the alignment with the maximum score from the alignment rationale output by AREC according to LEAVE-ONE-OUT; Max-F1 is the F1 measure comparing the retained alignments with the annotations. 2) Exact-Inc: the proportion of alignment rationales that include the annotated alignment. 3) Soft-Inc: a relaxed version of Exact-Inc, which is the average recall comparing alignment rationales with annotations. Details are shown in Figure 3.
We carry out a human evaluation on HANS because it is not annotated with any form of rationales. We ask 2 human annotators whether (yes/no) the decision process observed by AREC agrees with theirs and report the averaged agreement ratio; the evaluation is conducted on 300 randomly selected examples, 10 examples per heuristic (see Appendix D for details).

Results
Table 3 shows the alignment plausibility results, from which we obtain the following findings: 1) Across datasets, alignment plausibility is consistent with accuracy to varying degrees, especially on BNLI, where BERT surpasses the other competitors substantially on all metrics, quantitatively revealing that the ability to detect alignments is important and distinguishes NLI models. We also find that modeling order information explicitly is useful for NLI: ESIM achieves better accuracy than DA on SNLI even with poorer alignment plausibility. Combining the two factors makes BERT an effective approach for NLI. 2) Our explanation method is helpful for detecting artifacts or biases leveraged by a model. For example, although DA obtains high accuracy on HANS_E, its low alignment plausibility suggests it usually makes the right prediction with the wrong alignments (see Appendix D for examples). Further, all the models are brittle at catching reasonable alignments on non-entailment instances in HANS. As we will discuss next, they tend to do shallow, literal lexical matching, which we conjecture is the reason why they also fail on accuracy.
In summary, the ability to capture correct alignments is closely related to accuracy in NLI. This conclusion has often been discussed qualitatively in previous works, but we are the first to illustrate and support it exhaustively via quantitative evaluation.

Improving Robustness against Overlap Heuristics
With AREC, we find that all three models tend to align overlapping words between the sentence pair regardless of their syntactic or semantic roles, causing wrong predictions on HANS. Figure 4 presents an example where the model mistakenly matches identical words. However, president in the premise and doctor in the hypothesis are subjects of the same predicate advised, so they should be aligned, and likewise doctor in the premise and president in the hypothesis.
To remedy this, we turn to Semantic Role Labeling (SRL), the task of recognizing the arguments of a predicate and assigning them semantic role labels, to guide alignments for NLI models. In particular, we employ an off-the-shelf BERT-based SRL model (Shi and Lin, 2019) to extract predicates and their corresponding arguments from the premise and the hypothesis in advance. Then we limit the model to align only identical predicates and phrases with identical semantic roles by applying a corresponding co-attention mask (SRL mask), as presented in Figure 4. In this way, semantic role information is injected into the model. Note that there is no need to modify the model architecture or design a new training protocol, in contrast to Cengiz and Yuret (2020), who jointly train NLI and SRL in a multi-task learning (MTL) manner. We report model accuracy when alignments are guided by SRL masks (subscripted with SRL GUID) in Table 4. The results show that, without obvious performance drops on entailment instances, applying SRL masks yields significant improvements on non-entailment instances, especially for the lexical heuristic. Nevertheless, it does not boost model performance on the constituent heuristic. We speculate this is because constituent-heuristic instances involve restrictions such as prepositions, which cannot be handled by alignments alone. Overall, the results show that guiding alignments is a promising way to incorporate useful information. From another perspective, this also supports that our method is faithful to the models.
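A minimal sketch of constructing an SRL mask might look as follows, assuming hypothetical per-token annotations premise_roles[i] and hypothesis_roles[j] of the form (predicate lemma, semantic role) obtained from the SRL model; only tokens sharing both the predicate and the role are allowed to align.

```python
# SRL mask sketch: allow alignment only within matching predicate/role pairs.
import numpy as np

def srl_mask(premise_roles, hypothesis_roles):
    Z = np.zeros((len(premise_roles), len(hypothesis_roles)))
    for i, (p_pred, p_role) in enumerate(premise_roles):
        for j, (h_pred, h_role) in enumerate(hypothesis_roles):
            if p_pred == h_pred and p_role == h_role:
                Z[i, j] = 1.0  # alignment permitted for matching predicate and role
    return Z
```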

Conclusions
In this work, we propose AREC, a new post-hoc method to generate alignment rationales for co-attention based NLI models. Experimental results show that our explanations are faithful and readable. We study typical models using our method and shed light on potential improvements. We believe our method and findings are illuminating for NLI. For future work, we plan to explore model-agnostic alignment explanations and to analyze models in other NLP tasks.

A The HardConcrete Distribution
The HardConcrete distribution (Louizos et al., 2018b) is derived from the binary Concrete distribution (Maddison et al., 2017) using stretch and rectify operations, assigning probability densities on the closed unit interval [0, 1]. The Concrete distribution is a continuous relaxation of the Categorical distribution and is amenable to reparameterization (the Gumbel-Softmax trick) (Maddison et al., 2017). We only introduce the special binary case here for conciseness.
Figure 5: The binary HardConcrete PDF; the red and blue regions are the probability masses with which the binary HardConcrete variable equals 0 and 1 respectively.
A binary Concrete variable \hat{Z} can be sampled by first sampling U ∼ U(0, 1) and applying the transformation

\hat{Z} = \sigma\Big(\frac{\log U - \log(1 - U) + \log\alpha}{\tau}\Big),

where σ is the sigmoid function, and α and τ are the parameters of \hat{Z}; the latter is called the temperature, controlling the sharpness. In practice, log α is usually the logit output by a classifier, e.g., a neural network. The cumulative distribution function (CDF) of \hat{Z} is

F_{\hat{Z}}(\hat{z}) = \sigma\Big(\tau \log\frac{\hat{z}}{1-\hat{z}} - \log\alpha\Big),

and the probability density function (PDF) is its derivative. However, we are to generate binary masks as our rationales, indicating word-alignment presences. That is, we require Z to retain some discrete properties, allowing us to sample exact 0s and 1s. For this purpose, Louizos et al. (2018b) introduce the stretch-and-rectify strategy. As illustrated in Figure 5, the binary Concrete PDF is first stretched to the support (γ, ζ), where γ < 0 and ζ > 1, via a scaling transformation, and the densities are then rectified onto the closed unit interval:

Z = \min\big(1, \max(0, \hat{Z}(\zeta - \gamma) + \gamma)\big),

where γ, ζ and τ are hyperparameters set to -0.1, 1.1 and 0.2 respectively. The two transformations above compose g in Equation (8). Now, we have

P(Z = 0; \alpha) = F_{\hat{Z}}\Big(\frac{-\gamma}{\zeta - \gamma}\Big) = \sigma\Big(\tau \log\frac{-\gamma}{\zeta} - \log\alpha\Big)

and

P(\lceil Z \rceil = 1; \alpha) = 1 - P(Z = 0; \alpha) = \sigma\Big(\log\alpha - \tau \log\frac{-\gamma}{\zeta}\Big).
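For concreteness, here is a small PyTorch sketch of the HardConcrete sampling and of the non-zero probability used by the expected losses, with the hyperparameter values stated above; this is an illustrative reimplementation, not the authors' code.

```python
# HardConcrete sampling via stretch-and-rectify, and P(ceil(Z) = 1; alpha).
import math
import torch

GAMMA, ZETA, TAU = -0.1, 1.1, 0.2

def hard_concrete_sample(log_alpha: torch.Tensor) -> torch.Tensor:
    u = torch.rand_like(log_alpha)  # U ~ Uniform(0, 1)
    z_hat = torch.sigmoid((torch.log(u) - torch.log(1 - u) + log_alpha) / TAU)
    return torch.clamp(z_hat * (ZETA - GAMMA) + GAMMA, 0.0, 1.0)  # stretch, then rectify

def prob_nonzero(log_alpha: torch.Tensor) -> torch.Tensor:
    # P(ceil(Z) = 1; alpha) = sigma(log_alpha - tau * log(-gamma / zeta))
    return torch.sigmoid(log_alpha - TAU * math.log(-GAMMA / ZETA))
```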

B Loss Derivation
According to the above basis, for L_1 we have

\mathbb{E}[L_1] = \sum_{i,j} \mathbb{E}[Z_{i,j}] \le \sum_{i,j} P(\lceil Z_{i,j} \rceil = 1; \alpha_{i,j}),

so we optimize an upper bound of L_1 instead of L_1 itself. For L_2, since the elements of Z are independent, we have

\mathbb{E}[L_2] = \sum_{i,j} \sum_{(a,b) \in W_{i,j}} P(\lceil Z_{a,b} \rceil = 0; \alpha_{a,b}) \prod_{(c,d) \in W_{i,j} \setminus \{(a,b)\}} P(\lceil Z_{c,d} \rceil = 1; \alpha_{c,d}).   (20)

Optimizing L_1 and L_2 is direct since we do not need to sample. Now the loss functions are differentiable with respect to α, allowing gradient descent. In the implementation, we actually optimize over log α because it is a free variable.

C Alignment DIFFMASK Baseline
DIFFMASK utilizes a neural network to obtain log α from input representations, and optimizes this neural network on a training set. In the original implementation (De Cao et al., 2020), the neural network is fed with word vectors from different layers. To adapt it to the alignment level, log α is computed on alignment features:

\log\alpha_{i,j} = \mathrm{FFN}([p_i; h_j; p_i - h_j; p_i \odot h_j]),   (21)

where FFN is a feed-forward neural network with one hidden layer, ; denotes concatenation, and ⊙ denotes the element-wise product. The word representations p_i and h_j are the input word vectors. The subsequent steps are identical to AREC, except that DIFFMASK is trained on a training set, leveraging data knowledge.
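A minimal PyTorch sketch of this alignment-level explainer could be structured as follows; the hidden size and activation are illustrative assumptions rather than the original configuration.

```python
# Alignment-level DIFFMASK explainer: log_alpha for every premise/hypothesis word pair.
import torch
import torch.nn as nn

class AlignmentMasker(nn.Module):
    def __init__(self, d: int, hidden: int = 128):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(4 * d, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, P: torch.Tensor, H: torch.Tensor) -> torch.Tensor:
        """P: (|p|, d), H: (|h|, d); returns log_alpha of shape (|p|, |h|)."""
        p = P.unsqueeze(1).expand(-1, H.size(0), -1)      # (|p|, |h|, d)
        h = H.unsqueeze(0).expand(P.size(0), -1, -1)      # (|p|, |h|, d)
        feats = torch.cat([p, h, p - h, p * h], dim=-1)   # [p_i; h_j; p_i - h_j; p_i * h_j]
        return self.ffn(feats).squeeze(-1)
```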

D Alignment Plausibility Human Evaluation
The principle of the manual evaluation is that the decision process observed by AREC agrees with humans when it includes complete alignment information for the correct prediction. Thus, an alignment rationale may fail to agree with humans even if it instructs humans toward the correct prediction; this differs from Human Accuracy (Jain et al., 2020). Figure 6 presents an example. From the alignment rationale, a human is able to predict entailment from the identical nouns professor–professor and lawyer–lawyer. However, as humans, we also need to identify the predicate pair saw–saw for complete semantics. Thus, we consider alignment rationales like the one in Figure 6 not to agree with human justification.

E Visualization
We plot a few examples of AREC explanations in Figure 7. We also present examples of different alignment explanations in Figure 8. It's clear that our proposed AREC explanation is the most readable one.