Asymmetric feature interaction for interpreting model predictions

In natural language processing (NLP), deep neural networks (DNNs) can model complex interactions between context words and have achieved impressive results on a range of NLP tasks. Prior work on feature interaction attribution mainly studies symmetric interactions, which only explain the additional influence of a set of words in combination and fail to capture the asymmetric influences that contribute to model predictions. In this work, we propose an asymmetric feature interaction attribution explanation model that explores asymmetric higher-order feature interactions in the inference of deep neural NLP models. By representing our explanation as a directed interaction graph, we experimentally demonstrate the interpretability of the graph for discovering asymmetric feature interactions. Experimental results on two sentiment classification datasets show the superiority of our model over state-of-the-art feature interaction attribution methods in identifying influential features for model predictions. Our code is available at https://github.com/StillLu/ASIV.


Introduction
Deep neural networks (DNNs) have demonstrated impressive results on a range of Natural Language Processing (NLP) tasks. Unlike traditional models (e.g., CRFs and HMMs) that optimize weights over human-interpretable features, deep neural models operate like a black box, applying multiple layers of non-linear transformations to the vector representations of text. This fails to provide insights into how deep neural models reason over the features (e.g., words and phrases) involved in modeling.
Interpreting the prediction of a black-box model helps us understand model inference behavior and increases user trust when applying the model to real-world applications. Prior efforts in NLP mainly focus on quantifying the contributions of individual words or of word interactions to the prediction. Fig. 1 demonstrates word-level and pairwise word interaction explanations for a sentiment classification task, where the word "not" and the interaction between "not" and "funny" contribute positively to the prediction Negative.

Figure 2: Symmetric versus asymmetric pairwise interaction (computed by our method), where a directed edge a → b denotes how much b contributes to the model prediction in the presence of a. The presence of "very" does not influence "funny" much, while "funny" further modifies "very"; thus the interaction influence of "funny" → "very" is stronger than that of "very" → "funny".
Studying word interactions helps identify to what extent a set of words exerts influence in combination as opposed to independently. However, most interaction attribution methods assume symmetric interaction, which may fail to capture the asymmetric influences that contribute to model prediction. Fig. 2 presents symmetric pairwise interactions as a graph for the instance in Fig. 1, where words become nodes and edges between words represent interactions. In the individual-level explanation, "funny" has a negative influence, while the symmetric interaction between "funny" and "not" contributes positively to the model prediction. The influence of the presence of "not" on "funny" is therefore not the same as that of the presence of "funny" on "not": "funny" has a weak positive contribution in the presence of "not", and since "funny" further modifies "not", the interaction influence of "funny" → "not" can be stronger than that of "not" → "funny". Ideally, for asymmetric interaction, the presence of other features should not negate the positive influence of an important feature, and other features should lose their influence when important features are present. Constructing an asymmetric interaction graph for a predicted instance can thus give humans a nuanced understanding of the inference of deep NLP models.
In this paper, our work aims to provide explanations that incorporate asymmetric feature interaction1. The contributions are summarized as follows:
• We propose an asymmetric feature interaction attribution method that incorporates asymmetric higher-order feature interactions to explain the predictions of deep neural NLP models.
• We investigate three different sampling strategies in the NLP field for computing the marginal contribution in our asymmetric feature interaction attribution score, and empirically show that none is generally better than the others or more broadly applicable.
• We evaluate the proposed model on two sentiment classification datasets with BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019), and the experimental results demonstrate the faithfulness of our explanation model.

Related work
2.1 Feature attribution explanation

Most explanation methods are model-agnostic and study how to effectively measure the importance of features for the prediction. For example, LIME (Ribeiro et al., 2016) evaluates the contribution of each feature by learning a linear model locally around an instance. The Shapley value (Shapley, 1997; Lundberg and Lee, 2017) estimates the influence of a feature by averaging its marginal contribution over all permutations.
Since the computation of the Shapley value is expensive, popular variants such as Kernel SHAP (Lundberg and Lee, 2017) and quasi-random and adaptive sampling (Štrumbelj and Kononenko, 2014) have been proposed to efficiently approximate it. However, these methods do not explain how feature interactions contribute to model predictions and thus fail to address the model's ability to learn from higher-order feature interactions.

Feature interaction explanation
There is increasing research on feature interaction explanation methods. For example, the Shapley interaction index (Grabisch, 1997) and the Shapley-Taylor interaction index (Dhamdhere et al., 2019) measure the interaction between multiple players. Integrated Directional Gradients (IDG) (Sikdar et al., 2021) borrows axioms from Integrated Gradients (IG) (Sundararajan et al., 2017), whose desirable characteristics are satisfied by IG and the Shapley value in cooperative game theory. Tsang et al. (2020) proposed an efficient framework, Archipelago, that combines a feature interaction detector (ArchDetect) with a feature attribution measure (ArchAttribute). Feature interactions can also be explained in a hierarchical structure. Agglomerative contextual decomposition (ACD) (Singh et al., 2018) builds hierarchical explanations bottom-up, starting with individual features and iteratively combining them based on generalized CD scores. Jin et al. (2019) addressed the context-independent importance that ACD ignores and proposed a simple, model-agnostic Sampling and Occlusion (SOC) algorithm that incorporates conditional context information for a specified text span. HEDGE (Chen et al., 2020) designs a top-down framework that constructs hierarchical explanations by detecting the weakest interaction point and selecting the important sub-span of a given text span.
The above feature interaction explanation methods only consider symmetric interaction. We note that the concurrent work of Masoomi et al. (2021) addressed directed pairwise interaction. However, their Shapley-value-based formulation also introduces noisy interactions, as different subsets may contain several of the same elements. Moreover, their solution ignores asymmetric higher-order interaction.

Asymmetric Shapley interaction value
This section first revisits the Shapley value for explaining model predictions, then defines the asymmetric Shapley interaction value and describes its approximate computation.
The following notation is used throughout the paper. For a classification task, given a text sequence x = (x_1, ..., x_n) and a trained model f, ŷ is the predicted label and f(·) denotes the model's output probability for ŷ.

Shapley value for model interpretability
In cooperative game theory, the Shapley value measures the marginal contribution that a player makes upon joining the group, averaged over all possible permutations of the players in the group.
The Shapley value of the i-th word in x for the model prediction ŷ is a weighted sum over all possible word combinations:

φ(i) = Σ_{S ⊆ N∖{i}} |S|!(n − |S| − 1)!/n! · [v(S ∪ {i}) − v(S)],   (1)

where N = {1, ..., n} and S is a subset of the feature indices. The equivalent permutation form is

φ(i) = (1/n!) Σ_{O ∈ π(n)} [v(pre_i(O) ∪ {i}) − v(pre_i(O))],   (2)

where π(n) denotes all permutations of the word indices {1, 2, ..., n} and pre_i(O) is the set of all indices that precede i in the permutation O ∈ π(n).
v(S) is the value function that characterizes the contribution of the subset S to the prediction ŷ:

v(S) = E_{x′}[f(x_S ∪ x′_{S̄})],   (3)

where x′ denotes a text sequence with the same length as x; v(S) is the expectation of f(·) over possible x′ in which only the subset values x_S are kept unchanged.
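As a concrete illustration of Eqs. (1)–(2), the sketch below estimates Shapley values by enumerating (or Monte Carlo sampling) permutations. The value function here is an invented toy game standing in for the model-based v(S); none of the names come from the paper.

```python
import itertools
import random

def shapley_values(n, v, num_samples=None, seed=0):
    """Estimate Shapley values phi_i for n players under value function v.

    v maps a frozenset of player indices to a real number.
    With num_samples=None, all n! permutations are enumerated (Eq. 2);
    otherwise permutations are sampled uniformly (Monte Carlo).
    """
    rng = random.Random(seed)
    players = list(range(n))
    if num_samples is None:
        perms = list(itertools.permutations(players))
    else:
        perms = [rng.sample(players, n) for _ in range(num_samples)]
    phi = [0.0] * n
    for order in perms:
        pre = set()
        for i in order:
            # marginal contribution of i given the players preceding it
            phi[i] += v(frozenset(pre | {i})) - v(frozenset(pre))
            pre.add(i)
    return [p / len(perms) for p in phi]

# toy value function: additive weights, plus a bonus when 0 and 1 co-occur
weights = [1.0, 2.0, -0.5]
def v(S):
    bonus = 1.5 if {0, 1} <= S else 0.0
    return sum(weights[i] for i in S) + bonus

print(shapley_values(3, v))  # → [1.75, 2.75, -0.5]: the 1.5 bonus splits evenly
```

Full enumeration is only feasible for small n; the sampled variant is the one that scales, at the cost of approximation error.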

Definition of Asymmetric Shapley interaction value
The Shapley value above quantifies the contribution of a single word or phrase to the model prediction. The proposed asymmetric Shapley interaction value (ASIV) instead measures the asymmetric interaction between two disjoint subsets T1 and T2 that attributes to the model prediction: ASIV determines the contribution of T1 conditioned on the presence of T2 to the prediction ŷ. Treating T1 and T2 as two singletons among the players, ASIV is defined as a weighted sum, over coalitions S that exclude T1 and T2, of

δ_{T1,T2}(S) = [v(S ∪ T2 ∪ T1) − v(S ∪ T2)] − [v(S ∪ T1) − v(S)],

which computes the difference of the marginal contribution of T1 to the coalition S with and without the participation of the subset T2, aiming to capture the directional interaction influence between T1 and T2. The equivalent permutation form is

φ_{T2}(T1) = (1/|π_{T2≺T1}(n)|) Σ_{O ∈ π_{T2≺T1}(n)} [v(pre_{T1}(O) ∪ T1) − v(pre_{T1}(O))] − [v(pre_{T1}(O)∖T2 ∪ T1) − v(pre_{T1}(O)∖T2)],   (7)

where π_{T2≺T1}(n) denotes the permutations in which T2 precedes T1, pre_{T1}(O) denotes the set of all indices that precede T1, and pre_{T1}(O)∖T2 excludes T2 from this set. If both T1 and T2 contain a single element (i.e., T1 = {i} and T2 = {j}), the directed pairwise relationship φ_j(i) is obtained directly from Eq. (7).
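For singleton T1 = {i} and T2 = {j}, the permutation form can be enumerated directly for small n. The sketch below uses hypothetical value functions (an additive game and one with a pairwise interaction term) in place of the model-based v(S):

```python
import itertools

def asiv(i, j, n, v):
    """Pairwise asymmetric Shapley interaction value phi_j(i) for T1 = {i},
    T2 = {j}: average, over permutations in which j precedes i, of the change
    in i's marginal contribution caused by removing j from i's predecessors
    (a small-n enumeration sketch of the permutation form of Eq. (7))."""
    total = count = 0
    for order in itertools.permutations(range(n)):
        pos = {p: k for k, p in enumerate(order)}
        if pos[j] > pos[i]:
            continue                        # keep only permutations with j before i
        pre = frozenset(order[:pos[i]])     # indices preceding i (contains j)
        pre_wo_j = pre - {j}                # ...with j removed
        with_j = v(pre | {i}) - v(pre)
        without_j = v(pre_wo_j | {i}) - v(pre_wo_j)
        total += with_j - without_j
        count += 1
    return total / count

# hypothetical value functions standing in for the model-based v(S)
additive = lambda S: sum({0: 1.0, 1: 2.0, 2: -0.5}[k] for k in S)
interact = lambda S: (1.0 if 0 in S else 0.0) + (0.8 if {0, 1} <= S else 0.0)

print(asiv(1, 0, 3, additive))  # additive game: no interaction, 0.0
print(asiv(1, 0, 3, interact))  # "1 given 0": recovers the 0.8 interaction term
```

For an additive value function the ASIV is zero, as expected of an interaction measure, while the 0.8 pairwise interaction term is recovered when feature 0 is present before feature 1.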

Approximate computation
Computing the asymmetric Shapley interaction value requires estimating the value function v(S) over all possible permutations (or subsets). First, we investigate three different sampling strategies for computing the expectation E[f(x_S ∪ x′_{S̄})] in the value function.

Marginal Expectation (ME):
In applying the Shapley value to explain predictions of NLP models, prior research assumes that individual features are mutually independent. Then E[f(x_S ∪ x′_{S̄})] is computed as

E[f(x_S ∪ x′_{S̄})] = E_{p(x′_{S̄})}[f(x_S ∪ x′_{S̄})],

where x′_{S̄} can be randomly sampled from the training data or set to a sequence of all ⟨pad⟩ tokens.
In computing f(x_S ∪ x′_{S̄}), the combination x_S ∪ x′_{S̄} may be incompatible. For example, given the instance x "the issue of faith is not explored deeply" and the subset x_S "the issue of faith", random sampling could generate the sequence "the issue of faith time changer may not". Such incoherent sequences lie off the data manifold (Frye et al., 2020), where v(S) may fail to capture the model's dependence on the whole context.
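A minimal sketch of the marginal-expectation estimator under the independence assumption, with a hypothetical toy classifier in place of f; the ⟨pad⟩ baseline and the random-replacement variant correspond to the two choices of x′_{S̄} described above, and all names are illustrative.

```python
import random

PAD = "<pad>"

def marginal_expectation(f, x, S, train_corpus=None, m=50, seed=0):
    """Estimate v(S): keep the tokens indexed by S and replace the rest,
    either with <pad> tokens (train_corpus=None, deterministic) or with the
    corresponding positions of randomly drawn training sequences.

    f is any callable mapping a token list to the output probability of the
    predicted class (a stand-in for the real classifier).
    """
    n = len(x)
    if train_corpus is None:  # padding baseline: a single deterministic replacement
        return f([x[i] if i in S else PAD for i in range(n)])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        ref = rng.choice(train_corpus)  # random reference sequence x'
        total += f([x[i] if i in S else (ref[i] if i < len(ref) else PAD)
                    for i in range(n)])
    return total / m

# hypothetical toy classifier: probability of Negative rises with "not"+"funny"
def toy_f(tokens):
    return 0.9 if "not" in tokens and "funny" in tokens else 0.3

x = ["not", "very", "funny"]
print(marginal_expectation(toy_f, x, S={0, 2}))  # pad baseline keeps both cues: 0.9
print(marginal_expectation(toy_f, x, S={0},
                           train_corpus=[x, ["so", "so", "fun"]]))
```

The second call averages over random replacements, so its value depends on which training sequences happen to be drawn — exactly the incompatibility risk discussed above.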
Conditional Expectation (CE): The expectation of f(x_S ∪ x′_{S̄}) with respect to the distribution of x′_{S̄} conditioned on x_S is computed as

E_{p(x′_{S̄} | x_S)}[f(x_S ∪ x′_{S̄})],

where x′_{S̄} | x_S can be sampled from a pre-trained language model.
In the conditional expectation, the combined input x_S ∪ x′_{S̄} is more coherent. For example, given the subset x_S "the issue of faith", the generated complete sequence could be "the issue of faith is not very important".
However, both the marginal and conditional expectations require evaluating f on out-of-distribution data, forcing the model f to extrapolate to unseen parts of the feature space. Hooker et al. (2021) conducted extensive simulation experiments showing that permutation-based feature attributions are sensitive to these edge cases.
In-domain Expectation (IDE): To enable f to be evaluated on in-domain data, we pretrain a language model on the training data {x^1, ..., x^N} to model the underlying data distribution p(x). For a text sequence x′ with the known subset x′_S = x_S, the remaining subset x′_{S̄} is sampled from p(x).

Second, by enumerating all possible permutations, Eq. (7) can be replaced with

φ_{T2}(T1) = (1/|π_{T2≺T1}(n)|) Σ_{O ∈ π_{T2≺T1}(n)} [w(pre_{T1}(O) ∪ T1) − w(pre_{T1}(O)) − w(pre_{T1}(O)∖T2 ∪ T1) + w(pre_{T1}(O)∖T2)],   (14)

where π_{T2≺T1}(n) denotes the permutations in which T2 precedes T1 and the value-function estimate w is obtained from one of the three sampling strategies above.
Let

V_{O,w} = w(pre_{T1}(O) ∪ T1) − w(pre_{T1}(O)) − w(pre_{T1}(O)∖T2 ∪ T1) + w(pre_{T1}(O)∖T2)   (15)

for a permutation/instance pair (O, w). We adopt Monte Carlo sampling (Štrumbelj and Kononenko, 2014) to approximate φ_{T2}(T1) as

φ_{T2}(T1) ≈ (1/m) Σ_{k=1}^{m} V_{O_k, w_k},   (16)

where m permutation/instance pairs (O_k, w_k) are drawn at random.


Experiments

We evaluate explanation methods on text classification tasks with BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019) models.

Evaluation
For asymmetric interactions between features, important features are expected to have more positive incoming edges and fewer outgoing edges. Therefore, to evaluate the faithfulness of feature interaction attribution methods in deriving feature interaction relationships, in this paper we focus on pairwise interactions and apply the PageRank (Page et al., 1999) algorithm to obtain a feature importance ranking. Two evaluation metrics (Chen et al., 2020; Nguyen, 2018; Shrikumar et al., 2017) are employed to evaluate the influential features, as follows. The area over the perturbation curve (AOPC): the average change in the model's output probability on the predicted class over the test data after deleting the k% top-scored words from each text sequence:
AOPC = (1/N) Σ_{i=1}^{N} [p(ŷ | x_i) − p(ŷ | x̃_i)],

where x̃_i is obtained by dropping the k% top-scored words from x_i. The higher the AOPC, the more important the deleted words are for the model prediction.
Log-odds (LOR): the average difference of the negative logarithmic probabilities on the predicted class over the test data before and after replacing the k% top-scored words with the "<pad>" token in each text sequence:
LOR = (1/N) Σ_{i=1}^{N} log [p(ŷ | x̃_i) / p(ŷ | x_i)],

where x̃_i is obtained by masking the k% top-scored words of x_i. The lower the LOR, the more important the masked words are for the model prediction.
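Putting the evaluation protocol together: a minimal power-iteration PageRank (in place of a library implementation) ranks features from a directed matrix of pairwise interaction scores, and AOPC/LOR are computed from predicted-class probabilities before and after perturbation. The interaction scores and probabilities below are invented for illustration.

```python
import math

def pagerank(weights, d=0.85, iters=100):
    """Power-iteration PageRank over a dense matrix weights[i][j] >= 0,
    read as the strength of the directed edge i -> j; important features
    should accumulate incoming interaction mass."""
    n = len(weights)
    out = [sum(row) for row in weights]      # total outgoing weight per node
    pr = [1.0 / n] * n
    for _ in range(iters):
        pr = [(1 - d) / n + d * sum(pr[i] * weights[i][j] / out[i]
                                    for i in range(n) if out[i] > 0)
              for j in range(n)]
    return pr

def aopc(p_orig, p_pert):
    """Average drop in predicted-class probability after deleting the
    k% top-scored words (higher = more faithful ranking)."""
    return sum(po - pp for po, pp in zip(p_orig, p_pert)) / len(p_orig)

def log_odds(p_orig, p_pert):
    """Average log ratio of predicted-class probabilities after masking the
    k% top-scored words (lower = more faithful ranking)."""
    return sum(math.log(pp / po) for po, pp in zip(p_orig, p_pert)) / len(p_orig)

# hypothetical pairwise interaction scores among 3 words; row i weights edges i -> j
W = [[0.0, 0.2, 0.1],
     [0.6, 0.0, 0.4],
     [0.7, 0.3, 0.0]]
print(pagerank(W))  # word 0 receives the most interaction mass

# predicted-class probabilities before/after perturbing two test examples
print(aopc([0.9, 0.8], [0.4, 0.5]))      # average probability drop (≈ 0.4)
print(log_odds([0.9, 0.8], [0.4, 0.5]))  # negative: masked words were important
```

Since every out-normalized row is stochastic, the PageRank mass stays normalized, and the highest-scoring node is the one drawing the most incoming interaction weight, matching the intuition above about incoming edges.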

Qualitative Analysis
We first demonstrate the interpretability of the asymmetric feature interaction graph on a test example, "you might not buy the ideas", from SST-2, which is predicted as Negative by the BERT model. More examples are shown in Appendix C. Fig. 3 focuses on the two words "not" and "might", which are most consistent with the human explanation, and shows three feature interaction graphs estimated by representative feature interaction attribution methods: Bivariate Shapley value also estimates asymmetric interaction relationships, while the Shapley interaction index models symmetric feature interaction. Here ASIV is estimated with random sampling, as it performs best on the SST-2 dataset. In the directed weighted interaction graphs in Fig. 3 (a) and (b), an edge word1 → word2 with weight 0.05 denotes that, in the presence of word1, the influence score of word2 on the model prediction is 0.05. Fig. 3 (c) shows an undirected weighted interaction graph, where an edge weight is the symmetric interaction influence on the model prediction. For the Negative prediction, we take the two pairs ["not", "might"] and ["not", "buy"] as examples.
In the ASIV estimation, the interaction influence of "might" → "not" is stronger and more positive than that of "not" → "might", while in the Bivariate Shapley and Shapley interaction estimations, both the asymmetric and symmetric interaction influences are negative. Intuitively, the interaction between "not" and "might" should contribute positively to the Negative prediction. Compared with "not", "might" does not convey much information against Negative; therefore φ_not(might) < φ_might(not). Similarly, the interaction influence for the pair ["not", "buy"] is positive under ASIV, while under the Bivariate Shapley and Shapley interaction estimations it is negative. In practice, human evaluation tends to attribute more importance to the pair ["not", "buy"] for the model prediction.
"not" further modifies "buy" so in ASIV the interaction influence of "not" → "buy" could be positive and stronger than that of "buy" → "not".

Quantitative Analysis
Following prior work (Chen et al., 2020; Guerreiro and Martins, 2021), we set k to 20 for the sentiment classification task. The sampling size m is set to 500. Due to the increasing computational cost of computing each pairwise interaction (statistics of the datasets are in Appendix B), we randomly choose 1000 samples from the SST-2 dataset and 100 samples from the Yelp-2 dataset, with review length restricted to fewer than 100 words. Further, to estimate the conditional and in-domain expectations, in each permutation and for each token in x′_{S̄}, only the most likely word is sampled from the pretrained language model given x_S.

Comparison with baselines
Table 1 shows the evaluation performance for the BERT and RoBERTa models on the two datasets. The proposed ASIV consistently outperforms the compared baselines in identifying influential features for the predictions of BERT and RoBERTa. ASIV with the random sampling strategy performs better on SST-2, while ASIV with in-domain sampling demonstrates its effectiveness for Yelp-2 classification with RoBERTa.
We first analyze the baselines as follows. For the univariate feature attribution methods (i.e., HEDGE and SOC), the explanation depends heavily on their attribution estimation. For example, compared with Shapley values, SOC can assign different values to equally important features, so the corresponding estimated feature ranking may not be faithful. For the interaction attribution methods, the derived undirected interaction graphs (i.e., from Archipelago and the Shapley interaction index) do not emphasize the importance of individual nodes in a symmetric interaction. As shown in Fig. 3 (c), the importance of "not" is ignored in its interaction with "might". Bivariate Shapley defines a directed interaction graph, but its formulation introduces noisy interaction relationships, which may result in false estimations; for example, both directional interactions between "not" and "might" are negative.

For ASIV evaluated on the SST-2 dataset, the conditional-expectation-based estimation performs better than the in-domain expectation and the marginal expectation with the padding operation, while it is inferior to the marginal expectation with random sampling. Since the reviews in SST-2 are relatively short (see Appendix B), random sampling, like conditional sampling, can also generate smooth and in-domain-like text sequences. For example, given the test instance in SST-2 "the cast is uniformly excellent and relaxed" and the corresponding x_S "the [MASK] is uniformly excellent [MASK] [MASK]", conditional sampling generates the most likely sequence "the interior is uniformly excellent throughout ." and random sampling produces "the movie is uniformly excellent and predictable".
On the Yelp-2 dataset with RoBERTa classification, both ASIV-CE and ASIV-IDE perform better than ASIV with random sampling. Unlike SST-2 reviews, the reviews in Yelp-2 are quite long (see Appendix B), and random sampling is more likely to produce long, disorganized x′_{S̄} in some permutations. Compared with in-domain sampling, where the quality of the pretrained language model is limited by the small training corpus, conditional sampling can generate smoother text sequences that preserve the main idea of the review.

Sensitivity analysis
As shown in Fig. 4 and Fig. 5, we study the influence of k on the evaluation performance of BERT and RoBERTa models on SST-2 and Yelp-2 datasets.
We can see from Fig. 4 that ASIV always outperforms the baselines, and ASIV with random sampling consistently achieves the best performance on the AOPC and LOR metrics as k varies. As k increases, the curves of ASIV-IDE and ASIV-CE tend to overlap in the BERT-based classification, and ASIV-CE outperforms ASIV-IDE by a narrow margin in the RoBERTa-based classification. On the Yelp-2 dataset, as k increases, both the conditional and in-domain sampling strategies are generally more effective than random sampling and the padding operation: ASIV-IDE performs better on AOPC and LOR with the BERT model for k < 30, and ASIV-CE performs better with the RoBERTa model for k < 30.
As discussed in the previous subsection, the average review length in SST-2 is small, so random sampling can force the classification model to focus on the specific text span that is essential for predicting the short text sequence, while the smooth context produced by the conditional and in-domain sampling strategies may contain confusing information for the prediction. For predicting a long text sequence, the classification model has to rely on the whole context to capture the main idea of the text; therefore, in computing marginal contributions, both conditional and in-domain sampling can be more applicable to long text sequences.

Conclusion
In this paper we propose an asymmetric feature interaction attribution method to explain asymmetric higher-order feature interactions in the inference of deep neural NLP models. We extend the Shapley value to the asymmetric Shapley interaction value and investigate three different sampling strategies for computing the marginal contribution of the value function in the NLP field. Evaluated against five model-agnostic feature interaction attribution methods on two sentiment datasets with BERT and RoBERTa, our model achieves the best performance in identifying the influential words for model predictions. For the three sampling strategies, we also empirically show that none is generally better than the others or more broadly applicable, which can provide guidelines for selecting the reference distribution in the NLP field: the random sampling strategy is effective for short text sequences, while for long text sequences in-domain sampling produces smoother and more domain-dependent context. In the future, we will consider generating sparse, differentiable causal structures for explaining model predictions.

Limitations
The proposed asymmetric Shapley interaction value can estimate asymmetric feature interactions when explaining the predictions of deep models. There are two major concerns regarding time complexity: the estimation of marginal contributions and the construction of hypergraphs. In computing the value function, we have to consider many permutations to reduce the approximation error. Also, before estimating the contribution of an asymmetric interaction, interaction graphs of different orders must be constructed. We could resort to efficient approximation methods for computing marginal contributions and to prior knowledge for building the hypergraph.

A Implementation details of explanation methods

HEDGE (Chen et al., 2020): We follow the original implementation, which selects word-level features from the bottom of the hierarchical structure, and obtain the feature ranking based on the estimated importance.
SOC (Jin et al., 2019): We rank the importance of features in the bottom level of a hierarchical explanation.
Archipelago (Tsang et al., 2020): We use ArchAttribute to compute the pairwise interaction attributions and then apply the PageRank algorithm.
Bivariate Shapley value (Masoomi et al., 2021): We follow the original paper and use Shapley sampling approximation to compute the bivariate Shapley value. The sample size is set to 1000 for each interaction.
Shapley interaction index (Grabisch, 1997): The interaction indices are obtained by sampling over permutations, consistent with the other Shapley-based methods in our paper.

B Statistics of SST-2 and Yelp-2 datasets
Here we present the statistics of the test datasets. In the Yelp-2 test set, the average review length is 136.49 words; to reduce the computational cost of estimating pairwise interactions, we sample from the test set with a restriction on review length (i.e., review length ≤ 100).

Figure 9: Matrix of the estimated feature interaction graph for the instance "a waste of good performance" from SST-2, which is predicted as Negative by the BERT model.

Figure 1: Explanations for a negative movie review (computed by the Shapley value and the Shapley interaction index), where the color indicates the contribution of the corresponding word or pairwise word interaction to the model prediction.

Figure 3: Visualization of the feature interaction graphs estimated by ASIV, Bivariate Shapley value, and the Shapley interaction index.

Figure 4: Evaluation performance of BERT and RoBERTa on the SST-2 dataset.

Figure 5: Evaluation performance of BERT and RoBERTa on the Yelp-2 dataset.
To ensure consistency among the Shapley-based methods, we also use Shapley sampling approximation; the sample size is set to 1000 for each interaction.

Asymmetric Shapley interaction value: For classification with the BERT model, we employ the pre-trained BERT-base model to compute the conditional expectation and pretrain a BERT-base model on the training data for the in-domain expectation. For the RoBERTa classification model, we employ the pre-trained RoBERTa-base model for the conditional expectation and pretrain a RoBERTa-base model for the in-domain expectation. We do not select the Shapley-Taylor indices for the following reason: the second-order Shapley-Taylor interaction index for a pair (i, j) under a fixed permutation π is defined as

I_{ij,π} = v(S ∪ {i, j}) − v(S ∪ {i}) − v(S ∪ {j}) + v(S),   (19)

so the second-order Shapley-Taylor interaction indices would also be obtained by sampling over permutations, which is the same as the computation of the Shapley interaction index in our paper.

Figure 8: Matrix of the estimated feature interaction graph for the instance "you might not buy the ideas" from SST-2, which is predicted as Negative by the BERT model.

Table 1: Evaluation performance of feature interaction explanation methods on the SST-2 and Yelp-2 datasets.