Integrated Directional Gradients: Feature Interaction Attribution for Neural NLP Models

In this paper, we introduce Integrated Directional Gradients (IDG), a method for attributing importance scores to groups of features, indicating their relevance to the output of a neural network model for a given input. The success of Deep Neural Networks has been attributed to their ability to capture higher level feature interactions. Hence, in the last few years capturing the importance of these feature interactions has received increased prominence in ML interpretability literature. In this paper, we formally define the feature group attribution problem and outline a set of axioms that any intuitive feature group attribution method should satisfy. Earlier, cooperative game theory inspired axiomatic methods only borrowed axioms from solution concepts (such as Shapley value) for individual feature attributions and introduced their own extensions to model interactions. In contrast, our formulation is inspired by axioms satisfied by characteristic functions as well as solution concepts in cooperative game theory literature. We believe that characteristic functions are much better suited to model importance of groups compared to just solution concepts. We demonstrate that our proposed method, IDG, satisfies all the axioms. Using IDG we analyze two state-of-the-art text classifiers on three benchmark datasets for sentiment analysis. Our experiments show that IDG is able to effectively capture semantic interactions in linguistic models via negations and conjunctions.


Introduction
In the last decade Deep Neural Networks (DNN) have been immensely successful. Much of this success can be attributed to their ability to learn complex higher order interactions from raw features (Goodfellow et al., 2016). This success has led to DNNs being increasingly adopted for algorithmic decision making. This in turn has led to increasing concerns over the explainability and interpretability of these models, given the important role they are beginning to take in society (Selbst and Barocas, 2018).
* Equal contribution
One area of work that has emerged in recent years is that of black box model explanation strategies that "explain" the output of a DNN for a given input using feature attribution scores or saliency maps (Sundararajan et al., 2017; Shrikumar et al., 2017). Numerous studies have proposed different strategies to answer the question "which features in the input were most important in deciding the output of the DNN?" However, modern DNNs take raw data as input features and learn from higher order interactions of those features. Thus in the past year a number of studies have instead focused on explaining feature interactions rather than individual features (Jin et al., 2019; Sundararajan et al., 2020; Tsang et al., 2020).
One issue that remains, however, is that given two methods for attributing importance scores, it is not entirely straightforward to compare them objectively. As has been noted by earlier studies (Sundararajan et al., 2017), if the output of an attribution method seems non-intuitive, it is not easy to determine whether that is caused by (i) limitations of the attribution method, (ii) limitations of the DNN model being explained, or (iii) limitations of the data on which the DNN model was trained. Like multiple previous studies (Sundararajan et al., 2020; Tsang et al., 2020) we take an axiomatic approach to this problem, whereby we first define the set of properties/axioms that a "good" solution must satisfy, and then develop a solution that satisfies those axioms.
The method for computing feature group attribution (interchangeably referred to as feature interaction attribution) presented in this study is called Integrated Directional Gradients or IDG. Like multiple earlier methods in this area, IDG is a cooperative game theory inspired method. However, unlike earlier cooperative game theory inspired axiomatic methods, which only borrowed axioms from solution concepts (such as Shapley value) for individual feature attributions and introduced their own extensions to model interactions, our formulation is inspired by axioms satisfied by well behaved characteristic functions as well as solution concepts in cooperative game theory literature. We find that well behaved characteristic functions provide a much simpler and more intuitive framework for defining axioms for group attributions.
We apply IDG to state-of-the-art models in the NLP domain. As part of its input IDG requires a set of meaningful feature sets that have a hierarchical structure (Section 2.1). In this paper we use the parse trees of sentences to construct the meaningful feature structures. Figure 1 shows an illustrative example of the nature of explanations and attributions computed using IDG.
The major contributions of the current work are as follows:
• First, we formally define the feature group attribution problem as an extension to the feature attribution problem (Section 2.1).
• Second, we state a set of axioms that a well behaved feature group attribution method should satisfy (Section 2.2).
• Third, we present the method of Integrated Directional Gradients or IDG as a solution to the feature group attribution problem that satisfies the stated axioms (Section 2.3).
• Fourth, we propose an efficient algorithm to compute IDG for a given set of feature groups with a hierarchical structure (Section 2.4).
• Finally, we compare IDG with other recently proposed related methods for computing feature interaction attribution (Section 3).
• To facilitate reproducibility, the implementation of IDG has been made publicly available (footnote 1).

Problem Definition
Figure 1: Computation of attribution score (value function v) for an example sentence Frenetic but not really funny. Magenta and green respectively denote negative or positive contribution to the inferred class and the importance is represented by the color intensity. Constituency parse tree is used to obtain meaningful feature groups. Note that each word is further divided into tokens (owing to byte pair encodings) each of which has 768 dimensions. IDG computes importance scores in a bottom-up manner starting from the individual embedding dimensions (d_i) working its way up to tokens, words, phrases and finally the sentence.

In this section we formally state the problem of assigning attribution scores to meaningful feature groups. Let f(x) be a deep neural network function that takes as input an n-dimensional real valued vector x ∈ R^n and produces a real valued scalar output. Let A = {a_1, a_2, ..., a_n} refer to the set of features, with x_i referring to the value of feature a_i in feature vector x.
Then the feature group attribution problem is defined as follows: Given an input x, a baseline b ∈ R^n, and a family of meaningful feature subsets M ⊆ P(A), assign to every subset of features S ⊆ A a value/importance score v(S). Here, P(A) represents the power set of the feature set.
The above formulation is inspired by cooperative game theory literature. Intuitively, we think of features as players in a co-operative game trying to "help" the DNN model reach its output. The objective then is to design a "good" value/importance function (characteristic function in cooperative game theory literature) for each feature subset (coalition of players).
Note that the above formulation is very different from existing cooperative game theory inspired feature attribution methods. Most existing methods assume that the value/characteristic function exists and then compute a payoff assignment vector for individual features, typically using Shapley values.
Similar to earlier studies, in our formulation we assume that the baseline b represents the "zero" input or absence of contribution from any feature.
The "family of meaningful feature subsets" M captures the notion that not all subsets of features represent "meaningful" parts of input. Another intuitive way to think about this is that not all features can collaborate directly, but need to be part of groups that can directly collaborate.
In general we will assume that M has a hierarchical containment structure, that is, the feature groups in M can be represented as a directed acyclic graph, with a tree being a special case. Further, we will also assume that every individual feature is in M (that is, {a_i} ∈ M for i ∈ {1, 2, ..., n}) and represents a leaf node in the hierarchy, while the set of all features is also in M (that is, A ∈ M) and represents the root of the hierarchy.
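As a minimal sketch of these assumptions, the following Python snippet derives a family M from a parse tree and leaves the root and singleton checks to the caller. The nested-tuple tree encoding, the helper name `meaningful_subsets`, and the internal bracketing of "not really funny" are hypothetical choices made for illustration; words stand in for features, whereas the paper works down to embedding dimensions.

```python
from typing import FrozenSet, Set, Tuple, Union

# A tree is either a word (leaf) or a tuple of sub-trees (hypothetical encoding).
Tree = Union[str, Tuple["Tree", ...]]


def meaningful_subsets(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the set of leaves under every node of the tree.

    Assumes all words are distinct; repeated words would need positions.
    """
    family: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            group = frozenset([node])
        else:
            group = frozenset().union(*(leaves(child) for child in node))
        family.add(group)
        return group

    leaves(tree)
    return family


# Assumed bracketing of the Figure 1 sentence.
tree = (("Frenetic", "but"), ("not", ("really", "funny")))
A = frozenset(["Frenetic", "but", "not", "really", "funny"])
M = meaningful_subsets(tree)

# The two structural assumptions: singletons are leaves, A is the root.
has_all_singletons = all(frozenset([word]) in M for word in A)
has_root = A in M
```

Because every node contributes the set of leaves below it, containment between groups mirrors the parent-child relation of the tree, which is the hierarchical structure assumed for M.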

Solution Axioms
In this section we present a set of axioms that a well behaved value/importance function should satisfy. Note that the following four axioms are variants of standard axioms for characteristic functions in cooperative game theory literature.
Axiom 1 (Non-Negativity) The value of every set of features is non-negative; v(S) ≥ 0 for every S ⊆ A.
Axiom 2 (Normality) The value of the empty set of features is zero, v(∅) = 0.
Axiom 3 (Monotonicity) The value of a set of features is greater than or equal to the value of any of its subsets; if S ⊆ T, then v(S) ≤ v(T).
Axiom 4 (Superadditivity) The value of the union of two disjoint sets of features is greater than or equal to the sum of the values of the two sets; if S ∩ T = ∅, then v(S ∪ T) ≥ v(S) + v(T).
Since the value function represents the importance of a set of features, which is intuitively a directionless quantity, the Non-Negativity axiom ensures that every feature has a non-negative value/importance score. Similarly, the Normality axiom ensures that the importance score assigned to the empty set of features is zero. Since in the current framework the features in a deep neural network "collaborate", with the assumption that collaboration can only be beneficial, the axioms of Monotonicity and Superadditivity ensure that collaboration does not lead to diminished value/importance. Note that Superadditivity together with Non-Negativity implies Monotonicity.
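The four axioms can be checked mechanically on a small game. The Python sketch below assumes a value function built as a sum of non-negative dividends over a hand-picked family of subsets (the construction IDG uses later in the paper); the dividend numbers and helper names are illustrative assumptions, not values from the paper.

```python
from itertools import combinations

features = ("a1", "a2", "a3")

# Hypothetical non-negative dividends on a few "meaningful" subsets.
dividends = {
    frozenset({"a1"}): 0.10,
    frozenset({"a2"}): 0.20,
    frozenset({"a3"}): 0.10,
    frozenset({"a1", "a2"}): 0.25,
    frozenset({"a1", "a2", "a3"}): 0.35,
}


def value(subset):
    """v(S): sum of the dividends of all listed groups contained in S."""
    return sum(d for group, d in dividends.items() if group <= subset)


def powerset(items):
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


subsets = powerset(features)
eps = 1e-12  # tolerance for floating point sums

non_negative = all(value(S) >= 0 for S in subsets)
normal = value(frozenset()) == 0
monotone = all(value(S) <= value(T) + eps
               for S in subsets for T in subsets if S <= T)
superadditive = all(value(S | T) + eps >= value(S) + value(T)
                    for S in subsets for T in subsets if not (S & T))
```

All four flags come out true for any choice of non-negative dividends, since enlarging a set can only add further non-negative terms to the sum.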
In a cooperative game, players cooperate to generate the maximum value. A sometimes implicit assumption in these games is that it is always possible for a player to do nothing, in which case they generate zero value. Thus, if doing something generates negative value, a rational player will always choose to do nothing. This is the essence of Axiom 1. In axiomatic ML explanation literature, features are thought of as players cooperating to predict the output. One can also think of the value provided by a feature (the importance of the feature) as the information contained in the feature that is effectively used by the model. This view also supports the assumption of Axiom 1, since a quantity of information (entropy) is also non-negative.
Axioms 1-3 are some of the foundational axioms of cooperative game theory (Chalkiadakis et al., 2011). While much mathematical theory has been published for computing solution concepts in games where these assumptions do not hold, we argue that those games themselves can be difficult to interpret and thus are less suitable for developing interpretability/explainability methods.
The following three axioms are variations of axioms of the same name presented in Sundararajan et al. (2017). The modifications presented here are necessary to incorporate the complexities resulting from assigning attribution scores to groups of features rather than individual features.
In essence, the Sensitivity (a) axiom (Axiom 5) ensures that a feature that does affect the output of the DNN is not assigned a zero value/importance. Consequently, any feature group that includes such a feature must also be assigned a non-zero value. Conversely, the Sensitivity (b) axiom (Axiom 6) ensures that any feature that does not affect the output of the DNN is assigned a zero value, and that it does not contribute any value to any feature group that it is included in.
Axiom 7 (Symmetry Preservation) Two features a_i and a_j are said to be functionally equivalent if f(x) = f(y) for every pair of input vectors x and y such that x_i = y_j, x_j = y_i, and x_k = y_k for k ∉ {i, j}. Two features a_i and a_j are said to be structurally equivalent with respect to a family of meaningful feature subsets M if a_i ∈ S and S ≠ {a_i} implies a_j ∈ S for all feature subsets S ∈ M, and vice versa. If two features a_i and a_j are both functionally and structurally equivalent, and if the given input vector x and baseline vector b are such that x_i = x_j and b_i = b_j, then v(S ∪ {a_i}) = v(S ∪ {a_j}) for every feature subset S ⊆ A \ {a_i, a_j}.
The Symmetry Preservation axiom first defines two different types of feature equivalence: functional and structural. Two features are said to be functionally equivalent if swapping the values of those features does not affect the output of the DNN. Structural equivalence, on the other hand, refers to the features having equivalent positions in the structure imposed by the set of meaningful features M. Finally, the Symmetry Preservation axiom ensures that features that are both functionally and structurally equivalent contribute equal value/importance to all feature subsets they are included in.
Axiom 8 (Implementation Invariance) Two neural networks f′ and f′′ are functionally equivalent if f′(x) = f′′(x) for all inputs x. Functionally equivalent networks must be assigned identical values v(S) for every feature subset S.
The Implementation Invariance axiom simply ensures that different implementations of the same DNN function result in the same value/importance assignment to all feature subsets.

Our Method: Integrated Directional Gradients
In this section we present a solution to the "feature group attribution problem" that we call the Integrated Directional Gradients method or IDG. This method is inspired by the Integrated Gradients method (Sundararajan et al., 2017) and by Harsanyi dividends (Harsanyi, 1963) in cooperative game theory. The high level idea of the method is to construct the value function in terms of the "dividends" generated by each meaningful feature subset. In this formulation, each meaningful feature group contributes "additional value" to the DNN model, that we call "dividend" of the group. The dividend of a feature group S is represented by d(S) and d(S) ∈ [0, 1). The dividend of a single feature is also its value and a measure of its importance. One of the simplest measures of importance of a feature is the partial derivative of the DNN function with respect to the feature. The partial derivative also has an intuitive notion that it represents the amount of change in the output of the DNN function per unit change in the input, in the direction of the feature. However, as noted in the earlier studies (Sundararajan et al., 2017), due to effects such as gradient saturation, partial derivatives can't be directly used for measuring the importance of a feature. To alleviate this issue the authors of the Integrated Gradients method recommend taking a path integral of the partial gradient over the straight line path connecting the baseline b to the input x. For this study, we take a similar approach, and take the absolute value of the path integral of the partial gradient as the dividend of a single feature.
The dividend of a group of features is distinct from its value and is the measure of the importance of the interaction of the features in the group. For this study we consider the directional derivative of the DNN function in the direction of the given set of features to be representative of the importance of the interaction of the given set of features. Similar to the single feature case this also has the intuitive notion that it represents the amount of change in the output of DNN function per unit change in input, in the direction of the subset of features. However, as in the case with single features, issues such as gradient saturation still need to be addressed for directional gradients as well. Thus we propose to use absolute value of IDG, which is the path integral of the directional gradient over the straight line path from the baseline b to the input x as the dividend of the feature group. Further, the sign of IDG may be used to signify the nature of contribution (positive or negative) to model output.
Equations 1 to 6 describe the process of computing the value/importance v(S) of a subset of features using the IDG method. Given a feature subset S first the feature subset difference vector z s is computed from the input feature vector x and the baseline vector b. Next, IDG(S) is computed by integrating over the directional derivative, in the direction of z s over the straight line path from the baseline b to the input x. The dividend d(S) of the feature subset S is then computed by normalizing the absolute value of IDG(S) over all meaningful subsets, such that the sum of the dividends of all meaningful features subsets add up to 1. Finally the value v(S) of the given feature subset S is computed by adding up the dividends of all the meaningful subsets contained in S, including itself.
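The steps described above can be written out as follows. This is a hedged reconstruction from the prose rather than a verbatim copy of the equations: the symbols \hat{z}^S and \nabla_S f are notation assumed here, normalizing z^S to unit length is an assumption based on the usual definition of a directional derivative, and only the tags (4) and (6) are tied to equation numbers cited elsewhere in the text.

```latex
\begin{align*}
z^S_i &= \begin{cases} x_i - b_i & \text{if } a_i \in S \\ 0 & \text{otherwise} \end{cases} \\
\nabla_S f(x) &= \nabla f(x) \cdot \hat{z}^S,
  \qquad \hat{z}^S = \frac{z^S}{\lVert z^S \rVert} \\
\mathrm{IDG}(S) &= \int_{\alpha=0}^{1}
  \nabla_S f\bigl(b + \alpha\,(x - b)\bigr)\, d\alpha \\
Z &= \sum_{S \in M} \lvert \mathrm{IDG}(S) \rvert \\
d(S) &= \begin{cases} \dfrac{\lvert \mathrm{IDG}(S) \rvert}{Z} & \text{if } S \in M \\ 0 & \text{otherwise} \end{cases} \tag{4} \\
v(S) &= \sum_{T \subseteq S,\; T \in M} d(T) \tag{6}
\end{align*}
```

With this form the dividends of all meaningful subsets sum to 1 by construction, so v(A) = 1 whenever A ∈ M, and v(∅) = 0 because the sum is empty.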

Efficiently computing Integrated Directional Gradients
Similar to Sundararajan et al. (2017), we approximate the integral in IDG by summing the gradients at points occurring at small intervals along the straight line path from the baseline b to the input x. We refer to the approximated IDG(S) as AIDG(S) (Eq. 7), where m denotes the number of steps in the Riemann approximation of the integral. We now propose a polynomial time dynamic programming algorithm (Algorithm 1) for calculating the attribution score (i.e., value function v) for all the meaningful subsets in M for a given input x and a baseline b.
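One way to write the Riemann-sum approximation of IDG(S) is sketched below, where \nabla_S f denotes the directional derivative of f along the feature subset difference vector of S. The summation range k = 0, ..., m follows the statement that gradients are evaluated at m + 1 positions; the 1/m prefactor is an assumption carried over from the usual Integrated Gradients approximation.

```latex
\[
\mathrm{AIDG}(S) \;=\; \frac{1}{m} \sum_{k=0}^{m}
  \nabla_S f\!\Bigl(b + \frac{k}{m}\,(x - b)\Bigr)
\]
```

Since the dividends are later normalized by the sum of absolute scores over all meaningful subsets, any constant prefactor cancels and does not change d(S) or v(S).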
First, ∇f is calculated for each of the m + 1 intermediate positions between x and b. Next we compute AIDG(S) for all feature groups in M. This is followed by the computation of Z, which is simply the sum of the absolute AIDG(S) scores over all the meaningful subsets. Given Z and the individual scores, the dividend d(S) can easily be computed using Eq. 4. Finally, once the dividends of all meaningful subsets are known, the value function v(S) for each meaningful subset S can be computed using Eq. 6.
2 Detailed proofs are available in the Appendix.

Algorithm 1
procedure IDG(f, x, b, M, m)
    for k ∈ {0, 1, ..., m} do
        Compute ∇f(b + (k/m)(x − b))
    end for
    for S ∈ M do
        Compute AIDG(S)    ⊲ Using Eq. 7
    end for
    Z ← Σ_{S∈M} |AIDG(S)|
    for S ∈ M do
        Compute d(S)    ⊲ Using Eq. 4
    end for
    for S ∈ M do
        Compute v(S)    ⊲ Using Eq. 6
    end for
end procedure
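The procedure can be sketched end to end in Python with NumPy on a toy function whose gradient is known in closed form. Everything here is an illustrative assumption rather than the released implementation: the toy model, the names `toy_grad` and `idg_values`, the unit-length normalization of the direction vector, and the 1/m factor. For brevity the values are obtained with a direct subset scan instead of a dynamic programming pass over the hierarchy.

```python
import numpy as np


def toy_grad(x):
    """Analytic gradient of the toy model f(x) = tanh(x0 * x1 + 0.5 * x2).

    Feature 3 never influences f, so its gradient component is always zero.
    """
    slope = 1.0 - np.tanh(x[0] * x[1] + 0.5 * x[2]) ** 2
    return slope * np.array([x[1], x[0], 0.5, 0.0])


def idg_values(grad_fn, x, b, family, m=50):
    """Approximate IDG scores, dividends and values for each subset in family."""
    x = np.asarray(x, dtype=float)
    b = np.asarray(b, dtype=float)

    # Gradients at the m + 1 points on the straight line from b to x.
    grads = np.stack([grad_fn(b + (k / m) * (x - b)) for k in range(m + 1)])
    # Each subset's direction is constant along the path, so the summed
    # gradient is computed once and reused for every subset.
    grad_sum = grads.sum(axis=0)

    aidg = {}
    for subset in family:
        idx = sorted(subset)
        z = np.zeros_like(x)
        z[idx] = x[idx] - b[idx]            # feature subset difference vector
        norm = np.linalg.norm(z)
        direction = z / norm if norm > 0 else z   # assumed normalization
        aidg[subset] = float(grad_sum @ direction) / m

    total = sum(abs(score) for score in aidg.values())
    if total == 0:
        raise ValueError("all approximated IDG scores are zero")
    dividend = {s: abs(score) / total for s, score in aidg.items()}
    # v(S): dividends of all meaningful subsets contained in S, including S.
    value = {s: sum(d for t, d in dividend.items() if t <= s) for s in family}
    return aidg, dividend, value


x = [1.0, 2.0, 1.0, 3.0]
b = [0.0, 0.0, 0.0, 0.0]
family = [frozenset(g)
          for g in ([0], [1], [2], [3], [0, 1], [2, 3], [0, 1, 2, 3])]
aidg, dividend, value = idg_values(toy_grad, x, b, family)
```

In this toy setup the unused feature receives a zero dividend, the root receives value 1, and the sign of each entry in `aidg` is still available for indicating the direction of the contribution.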
We illustrate the computation of attribution scores using an example sentence Frenetic but not really funny taken from SST dataset (Figure 1). The task is sentiment classification and the inferred class for this sentence is negative. The model used for classification is XLnet-base (refer to Section 3 for details on dataset, model and training procedure). We leverage the constituency parse tree of the sentence to obtain meaningful feature groups. Note that XLnet tokenizer uses byte pair encoding. Hence the word "Frenetic" is further decomposed into "Fre", "net" and "ic". Each token is further represented by an embedding of size 768. The value function is calculated in a bottom-up manner starting from each embedding dimension of the constituent tokens (referred as d i in Figure 1). These are then combined to obtain the value function score for each token. We then follow the parse tree to calculate the score for each phrase. For example the score for phrase Frenetic but is 0.407 while that of not really funny is 0.454.
The overall time complexity of Algorithm 1 is O(m(F + B + V · |A|) + V + E), where F and B are the time complexities of a single forward and backward pass of the neural network, V and E are, respectively, the number of vertices and edges in the graph structure induced by the family of meaningful feature subsets M, |A| is the number of features, and m is the number of approximation steps used to compute AIDG(S). For more details on the complexity result, refer to the Appendix.

Comparison with existing methods
It has been noted that when a DNN explanation method returns a non-intuitive result, it is not possible to disentangle which part of the pipeline (training data, trained model, or the explanation method) is to blame for the result (Sundararajan et al., 2017). Thus many studies (Sundararajan et al., 2017; Sundararajan et al., 2020; Tsang et al., 2020) have instead taken the axiomatic strategy to compare methods qualitatively. Taking a similar approach, we present in Table 1 a qualitative comparison of recent feature interaction attribution methods most similar to our work.
We group the comparison into four major categories. First, in most cooperative game theory literature players are assumed to cooperate. It is thus intuitive that more cooperation will not lead to lesser benefit, and it is generally assumed that the grand coalition will form (Chalkiadakis et al., 2011). While there are mathematical formulations that work in absence of this assumption, we argue that they lead to non-intuitive results when applied to the task of feature interaction attribution. These assumptions are manifested by well-behavedness properties of the characteristic/value function. In Table 1 we see that existing cooperative game theory inspired methods generally ignore this aspect when computing importance attributions.
Second, to compute the effect of a model in the absence of a feature, attribution methods generally mask out the feature, typically replacing it with a ZERO or PAD token. It has been noted that this requires the DNN model to be evaluated in a region of the input space for which it has not received any training data and for which its accuracy was never evaluated (Sundararajan et al., 2017; Kumar et al., 2020). Thus the results that the model produces for these out-of-distribution inputs are questionable. In Table 1 we see that all existing methods compute their attributions by evaluating the model on these out-of-distribution inputs.
Third, in a cooperative game theoretic setting where players (here features) are assumed to cooperate, it is intuitive that as the size of the coalition grows the coalition will not become less important. This is the key intuition behind Axioms 1-4. However, in Table 1 we see that none of the existing methods ensure that their attributions adhere to this key intuition.
Finally, cooperative game theory based methods generally ensure that the axioms of Completeness (a.k.a. Efficiency), Symmetry Preservation, Linearity, and Sensitivity (a.k.a. Null/Dummy player) are satisfied by their attributions. In this paper we follow the lead of Sundararajan et al. (2017) and use the nomenclature from (Aumann and Shapley, 2015), which additionally introduces the axiom of Implementation Invariance. In Table 1, we see that for LS-Tree (Chen and Jordan, 2020), Shapley-Taylor Interaction Index (Sundararajan et al., 2020), and Archipelago (Tsang et al., 2020), which are cooperative game theory inspired methods, these assumptions hold. However, for SCD/SOC (Jin et al., 2019) and HEDGE, which are not axiomatic formulations, these assumptions do not hold. For our method, IDG, all but the axiom of Linearity hold. In Section 5.2 we argue that this is not a major limitation and refer to existing literature that even argues for doing away with the Linearity axiom.

Evaluating IDG on state-of-the-art models
We deploy our model for the task of sentiment classification across three different datasets: Stanford Sentiment Treebank (SST) (Socher et al., 2013), Yelp reviews (Zhang et al., 2015) and IMDB (Maas et al., 2011). For each dataset, we train three state-of-the-art models: XLnet-base (Yang et al., 2019), XLnet-large (Yang et al., 2019) and BERT-itpt (Sun et al., 2019). We use the same hyperparameter configuration as mentioned in the original papers; these are summarized in the Appendix as well. The performance of these models is summarized in Table 2.

Results
To precisely visualize the interactions between phrases, we search over the test examples for instances of negations, following a methodology proposed in prior work. Specifically, we look into the parse tree for each review and check if the left child contains a negation phrase (e.g., lacks, never etc.) in its first two words and the right child has a positive or a negative sentiment. Since for SST each phrase is also annotated with its corresponding sentiment label in the form of a constituency parse tree, this can be easily obtained. For Yelp and IMDB, we look for the presence of negation phrases in the reviews and then manually select 100 such examples from the filtered set. Since the parse trees for the reviews are not explicitly available for Yelp and IMDB, we deploy a state-of-the-art constituency parser (Mrini et al., 2019) to obtain them. We illustrate with one example each from the SST and Yelp datasets in Figures 2(a) and 2(b) respectively. Additional examples can be found in the Appendix.
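The negation filter described above might look roughly like the following Python sketch. The nested-tuple tree encoding, the function names, and the negation word list beyond "lacks" and "never" are assumptions made for illustration; the sentiment label of the right child is passed in by the caller, as it would come from the SST phrase annotations.

```python
# Illustrative word list; only "lacks" and "never" are named in the text.
NEGATION_WORDS = {"not", "never", "lacks", "no", "n't"}


def leaf_words(node):
    """Flatten a nested-tuple tree into its list of words, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node for word in leaf_words(child)]


def is_negation_instance(node, right_sentiment):
    """Check whether a binary node has a negation in the first two words of
    its left child and a polar (positive or negative) right child."""
    if isinstance(node, str) or len(node) != 2:
        return False
    left, _right = node
    first_two = [word.lower() for word in leaf_words(left)[:2]]
    has_negation = any(word in NEGATION_WORDS for word in first_two)
    return has_negation and right_sentiment in {"positive", "negative"}


example = ("never", ("took", "off"))
found = is_negation_instance(example, "positive")
```

A filter of this kind only narrows the candidate set; for Yelp and IMDB the text describes a further manual selection step.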

In Distribution Evaluations
For Figure 2(a) the classification model is XLnet-base and the ground truth as well as the inferred class is negative. The first part (Though everything might be literate and smart) has a positive sense. But when appended with the second part (it never took off and always seemed static), a negative sense is manifested. This is captured by the classification model as demonstrated by our framework. For the example in Figure 2(b), the classifier model is BERT-itpt and the inferred as well as the ground-truth class is negative. This example consists of two sentences. While the first one, Nice atmosphere, has a positive sense, when it is combined with the second sentence, Cheeseburger was not at all that, the overall sense turns negative. This is again reflected in the scores assigned by our framework. We also report the results on IMDB reviews (Maas et al., 2011) in the Appendix.

Quantitative Evaluations and Human Judgement Experiments
As noted by Sundararajan et al. (2017), when the results of an explanation method are non-intuitive, it is not obvious which part of the ML pipeline (the data, the model being explained, or the explanation method) is to be blamed and by how much. Due to this issue many authors (Sundararajan et al., 2017; Sundararajan et al., 2020; Tsang et al., 2020) have chosen to take the axiomatic/theoretical path, where they state the properties of the proposed method and compare explanation methods based on the axioms/properties they satisfy. Nevertheless, many recent studies (Singh et al., 2018; Jin et al., 2019) have proposed new explanation methods and provided evaluations using quantitative metrics such as AOPC (Nguyen, 2018), Log Odds (Shrikumar et al., 2017), and Cohesion Score (Jin et al., 2019).
One common strategy is to perturb the input, for example by removing the Top-K most important words/features, followed by measuring the drop in performance. We argue that these methods of evaluation have issues because they generally involve measuring model performance on out-of-distribution inputs. As stated earlier, measuring the outputs of models on out-of-distribution inputs, that is, inputs on which the model has been neither trained nor tested, is questionable.
The other strategy is to perturb the model, for example by adding noise to model weights, followed by measuring the drop in performance. Hooker et al. (2019) proposed a similar solution for the input perturbation case as well, namely retraining the model after perturbing all training samples. However, in this scenario, if two explanation methods provide different explanations/attributions for the different models, it is not obvious whether the models or the explanation methods are to blame. Similar issues exist for human judgement experiments as well. Due to the above issues, for the current work we too have chosen to take the qualitative comparison path.

Figure 2: The value function scores assigned by our framework for different coalitions (interactions) between phrases for two reviews from SST (a) and Yelp (b) respectively. Magenta and green respectively denote negative or positive contribution to the inferred class and the magnitude of importance is represented by the color intensity. Note that the interactions are correctly captured by the classifier model in both the cases as demonstrated by IDG.

Linearity and Uniqueness
One of the common axioms of solution concepts in cooperative game theory is Linearity. The axiom of Linearity (a.k.a Additivity) states that if the characteristic/value function has the form v(S) = v 1 (S) + v 2 (S) and φ 1 (S) and φ 2 (S) are the attributions due to v 1 (S) and v 2 (S) then the attribution due to v(S) should be given by φ(S) = φ 1 (S) + φ 2 (S).
During our design and experimentation we found that having the attributions normalized, that is v(∅) = 0 and v(A) = 1, provided much more intuitive results. Such normalization, however, runs counter to the possibility of an attribution method that satisfies Linearity.
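The conflict between normalization and Linearity can be seen in a two-feature toy case. The Python sketch below uses made-up raw scores and a hypothetical `normalize` helper; it only illustrates that attributions scaled to sum to 1 cannot be additive across two value functions.

```python
def normalize(scores):
    """Scale scores so that they sum to 1 (assumes a positive total)."""
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


# Hypothetical raw attributions under two value functions v1 and v2.
raw1 = {"a1": 1.0, "a2": 3.0}
raw2 = {"a1": 6.0, "a2": 2.0}
raw_sum = {name: raw1[name] + raw2[name] for name in raw1}

phi1 = normalize(raw1)
phi2 = normalize(raw2)
phi = normalize(raw_sum)   # normalized attribution for v = v1 + v2

# Linearity would require phi == phi1 + phi2, but the right-hand side
# sums to 2 while phi sums to 1.
sum_of_parts = {name: phi1[name] + phi2[name] for name in phi1}
```

Whatever the raw scores are, the normalized attribution for v always totals 1 while the sum of the two normalized parts totals 2, so the two can never coincide.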
Further, it has been argued by some game theorists that the axiom of Linearity was added as a mathematical convenience and also to constrain the attributions such that they are unique (Osborne and Rubinstein, 1994). In addition, Kumar et al. (2020) argue that enforcing such uniqueness constraints by this method limits the kind of models that can be explained by these attributions.
Thus, IDG is also not a unique solution to the feature group attribution problem, due to its sacrifice of Linearity. However, given that recent studies (Sundararajan and Najmi, 2020) have found that Shapley values can be and have been used in many different ways, each claiming uniqueness, the importance of uniqueness claims is significantly diminished.

Related work
Feature attribution based method. These methods essentially assign importance scores to individual features, thereby explaining the decisions of the classifier model. The scores are mostly calculated by either backpropagating a custom relevance score (Sixt et al., 2020) or directly using the gradients. The gradient based methods aim to calculate the sensitivity of the inference function with respect to the input features and thereby measure their importance. The method was first introduced in (Springenberg et al., 2015) and further investigated in (Selvaraju et al., 2017; Kim et al., 2019). Sundararajan et al. (2017) adopt an axiomatic approach and deem it more suitable, as feature attribution methods are hard to evaluate empirically. The other set of methods usually backpropagate their custom relevance scores down to the input to identify the relevance of an input feature (Bach et al., 2015; Shrikumar et al., 2017; Zhang et al., 2018). Unlike the gradient based methods, these are not implementation invariant (i.e., the backpropagation process is architecture specific).
Game theoretic aspect. Lundberg and Lee (2017) adopt results (Shapley values specifically) from coalition game theory to obtain feature attribution scores. The key idea is to consider the features as individual players involved in a coalition game of prediction, with the prediction considered the payout. The payout can then be fairly distributed among the players (features) to measure their importance. This has been further explored in (Lundberg et al., 2020; Ghorbani and Zou, 2020; Sundararajan and Najmi, 2020; Frye et al., 2020).
Quantifying feature interactions. The methods mentioned above fail to properly capture the importance of feature interactions. Janizek et al. (2020) propose to capture pair-wise interactions by building upon the Integrated Gradients framework. Cui et al. (2020) learn global pair-wise interactions in Bayesian neural networks. Contextual decomposition was introduced to capture interactions among words in a text for an LSTM-based classifier, and Singh et al. (2018) further extend the method to other architectures. More recent research endeavors in this direction include (Tsang et al., 2020). We elaborate more on the methods closest to our work in Section 3.

Conclusion
In this paper we investigated the problem of feature group attribution and proposed a set of axioms that any framework for feature group attribution should fulfill. We then introduced IDG, a novel method, as a solution to the problem and demonstrated that it satisfies all the axioms. Through experiments on real-world datasets with state-of-the-art DNN based classifiers we demonstrated the effectiveness of IDG in capturing the importance of feature groups as deemed by the classifier.

Appendix

Detailed proof of theorems
Given that the dividend d(S) is constructed to be non-negative, it is straightforward to show that v(S) satisfies Axioms 1 to 4, since v(S) is a sum of one or more non-negative dividends.
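The argument above can be checked numerically on a toy example. The sketch below assumes that v(S) sums the dividends d(T) of all T ⊆ S, and that the properties tested (non-negativity, a zero value for the empty coalition, monotonicity, and superadditivity over disjoint coalitions) are the kind of characteristic-function properties that Axioms 1 to 4 refer to. Both are assumptions, since the definitions are given in the main text, and the dividend values are hypothetical.

```python
from itertools import combinations


def subsets(features):
    """Yield every subset of `features` as a frozenset."""
    items = sorted(features)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def value(S, dividends):
    """Assumed form of v(S): sum of the dividends d(T) for all T contained in S."""
    return sum(d for T, d in dividends.items() if T <= S)


# Hypothetical non-negative dividends for a few feature groups.
dividends = {
    frozenset({"a"}): 0.2,
    frozenset({"b"}): 0.1,
    frozenset({"c"}): 0.05,
    frozenset({"a", "b"}): 0.3,
    frozenset({"a", "b", "c"}): 0.35,
}

all_features = frozenset({"a", "b", "c"})
coalitions = list(subsets(all_features))
tol = 1e-12  # guards against floating-point rounding

non_negative = all(value(S, dividends) >= 0 for S in coalitions)
empty_is_zero = value(frozenset(), dividends) == 0
monotone = all(
    value(S, dividends) <= value(T, dividends) + tol
    for S in coalitions for T in coalitions if S <= T
)
superadditive = all(
    value(S | T, dividends) + tol >= value(S, dividends) + value(T, dividends)
    for S in coalitions for T in coalitions if not (S & T)
)
print(non_negative, empty_is_zero, monotone, superadditive)
```

Each check holds because enlarging or merging coalitions can only add further non-negative terms to the sum.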

Lemma 1 v(S) satisfies Sensitivity (a)
Proof 1 Let there be a feature a_i such that f(x) = f(b) for a given input x and baseline b that differ only in a_i. To prove that v(S) satisfies Sensitivity (a), it is sufficient to prove that in this scenario IDG({a_i}) = 0. Then, from Eq. 2, [...]. Since, in the given case, x_i is the only feature that varies on the straight-line path connecting b and x, we can rewrite f(x) = g(x_i). Therefore [...].

Lemma 2 v(S) satisfies Sensitivity (b)
Proof 2 Let there be a feature a_i such that f(x) = f(y) for every pair of inputs x and y that differ only in a_i. To prove that v(S) satisfies Sensitivity (b), it is sufficient to prove that IDG(S) = IDG(S′) for all S such that a_i ∈ S and S′ = S \ {a_i}. The precondition of Sensitivity (b) implies that [...]. Therefore, for any S and S′ such that [...], which implies that IDG(S) = IDG(S′).

Lemma 3 v(S) satisfies Symmetry Preservation
Proof 3 To prove that v(S) satisfies Symmetry Preservation, it is sufficient to prove that for any feature subset S ⊆ A \ {a_i, a_j}, IDG(S ∪ {a_i}) = IDG(S ∪ {a_j}). The precondition of functional equivalence implies that if in a given feature vector [...]. Additionally, when considering [...]. Further, this also implies that x_i = x_j at every point on the straight line connecting b and x. The above implies that IDG(S ∪ {a_i}) = IDG(S ∪ {a_j}).

Lemma 4 v(S) satisfies Implementation-Invariance
Proof 4 v(S) satisfies Implementation Invariance since it depends only on the gradients of the neural network function and on its evaluations.
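The idea behind Lemma 4 can be illustrated with a small sketch. The code below is not IDG: it implements plain per-feature integrated gradients in the sense of (Sundararajan et al., 2017), using a midpoint Riemann sum along the straight-line path and central finite differences in place of backpropagated gradients. The functions `f_expanded` and `f_factored`, the input, and the baseline are hypothetical. They are two different implementations of the same function, so an attribution that relies only on gradients and function evaluations should agree on both.

```python
def numerical_gradient(f, x, eps=1e-5):
    """Central finite-difference gradient of f at x (stand-in for backprop)."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    attributions = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i in range(len(x)):
            attributions[i] += grad[i] * (x[i] - baseline[i]) / steps
    return attributions


def f_expanded(x):
    return x[0] * x[1] + x[0]


def f_factored(x):
    # Same function as f_expanded, written differently.
    return x[0] * (x[1] + 1.0)


x = [2.0, 3.0]
baseline = [0.0, 0.0]
ig_a = integrated_gradients(f_expanded, x, baseline)
ig_b = integrated_gradients(f_factored, x, baseline)
print([round(v, 4) for v in ig_a], [round(v, 4) for v in ig_b])
print("f(x) - f(b):", f_expanded(x) - f_expanded(baseline))
```

Both implementations receive the same attributions, and the attributions sum to f(x) - f(b), which is the completeness property of integrated gradients.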

Additional results
IMDB. The dataset (Maas et al., 2011) consists of 25K positively labeled and 25K negatively labeled reviews posted on IMDB. For evaluation, we follow the same procedure as for Yelp to obtain 100 representative examples. Two illustrative examples are provided in Figures 3 and 4.

Negative example. We consider an example from the SST dataset where the classifier model made a wrong inference. The ground truth class was negative while the inferred class was positive. The value function scores for all the valid coalitions are provided in Figure 5. The results show that although the classifier was able to distinguish between the positive sense manifested in the first part and the negative sense in the second, it made a positive inference overall. This might be due to the low confidence of the classifier in inferring the final class, as demonstrated by the probabilities: 0.44 for the negative class and 0.56 for the positive class. However, further investigations are required before stronger claims can be made.

Training models
SST. The XLnet-base model was trained with batch size 24 for 4 epochs. We use AdamW (Loshchilov and Hutter, 2018) as the optimizer with learning rate 2e-05 and weight decay 0.01. The model achieved an accuracy of 0.915 on the test set. The XLnet-large model was trained with the same batch size, for the same number of epochs, and with the same optimizer. The learning rate and weight decay were 5e-06 and 0.01 respectively. An accuracy of 0.916 was obtained on the test set for this model. BERT-itpt was trained with a batch size of 24 and optimized with AdamW with learning rate 1e-05 and weight decay 0.01. The embedding layers were not frozen during training.

Yelp. The Bert-itpt model was trained with a training batch size of 24 for 3 epochs with AdamW (learning rate 1e-05, weight decay 0.01) and achieved an accuracy of 0.947 on the test set. We further trained an XLnet model with similar training hyperparameters and achieved an accuracy of 0.983.

IMDB. The two models, Bert-itpt and XLnet-large, were both trained on 25K training examples and tested on the rest. The batch sizes were 24 and 32 respectively. AdamW was used as the optimizer for both models, with the same weight decay of 0.01 and learning rates of 2e-05 and 2e-05 for Bert-itpt and XLnet-large respectively. We obtained testing accuracies of 0.957 and 0.967 respectively for the two models.
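For convenience, the hyperparameters reported above can be collected into a small configuration sketch. The `TrainingRun` class and the `adamw_kwargs` helper are hypothetical names introduced here. Values that the text does not state (for example, the number of epochs for the IMDB models) are left as `None`, and the Yelp XLnet run is omitted because its hyperparameters are described only as similar. The keyword names `lr` and `weight_decay` follow `torch.optim.AdamW`, which is an assumption, since the text does not name the training framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingRun:
    dataset: str
    model: str
    batch_size: int
    learning_rate: float
    weight_decay: float
    epochs: Optional[int] = None            # None: not stated in the text
    test_accuracy: Optional[float] = None   # None: not stated in the text


RUNS = [
    TrainingRun("SST", "XLnet-base", 24, 2e-5, 0.01, epochs=4, test_accuracy=0.915),
    TrainingRun("SST", "XLnet-large", 24, 5e-6, 0.01, epochs=4, test_accuracy=0.916),
    TrainingRun("SST", "BERT-itpt", 24, 1e-5, 0.01),
    TrainingRun("Yelp", "Bert-itpt", 24, 1e-5, 0.01, epochs=3, test_accuracy=0.947),
    TrainingRun("IMDB", "Bert-itpt", 24, 2e-5, 0.01, test_accuracy=0.957),
    TrainingRun("IMDB", "XLnet-large", 32, 2e-5, 0.01, test_accuracy=0.967),
]


def adamw_kwargs(run):
    """Optimizer keyword arguments (names assumed to match torch.optim.AdamW)."""
    return {"lr": run.learning_rate, "weight_decay": run.weight_decay}


for run in RUNS:
    print(run.dataset, run.model, run.batch_size, adamw_kwargs(run))
```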
All these models were trained on a cluster with 2 Intel Xeon Gold 6148 CPUs, each with 20 cores, and 384 GB of DDR4 RAM. The distributed setup was connected through a Mellanox ConnectX-5 network and used the Lustre file system. The setup also utilized 4 NVIDIA Tesla V100 GPUs, each with 32 GB of memory.
Experiments with IDG were performed on a system with Intel Core i7-8550U 1.80GHz CPU with 16 GB RAM.

Adversarial attacks against explanations
In (Selbst and Barocas, 2018) the authors argue that one of the main reasons to develop explanation techniques is to enable humans to understand how automated decision systems work, which in turn enables us to debate whether the model's rules for decision making are justifiable. On the flip side, security researchers (Slack et al., 2020) have shown that such efforts can be stifled using adversarial attack techniques. In particular, (Slack et al., 2020) showed that models can be trained to deceive black-box explanation methods, such that the model provides 'unfair' results on in-distribution samples while exhibiting different behavior when explained using KernelSHAP. In a recent study, researchers have explored the creation of deceptive models that can fool gradient-based methods such as IntGrad (Sundararajan et al., 2017). In (Slack et al., 2020) the authors showed that evaluating models on out-of-distribution inputs, that is, inputs that the original model was not tested on, is a large potential attack surface for such deceptive techniques. Although IDG, unlike the methods considered in existing studies, does not evaluate out-of-distribution values, it certainly seems possible to use adversarial training methods to deceive IDG. Evaluation against adversarial attacks was out of scope for the current work, but we consider it an important future direction.

Figure 3: The value function scores assigned by our framework for different coalitions (interactions) between phrases for the review "Apart from Helena Bonham Carter, there is nothing worthy about this movie. And the surprise ending?! The thought of a sequel is even more annoying. Save your money, wait for the video and ignore that too." The inferred class is negative. IDG correctly captures the positive sense (even though the overall sense is negative) of the phrase "Apart from Helena Bonham Carter" as it contributes oppositely to the overall inference result.
Figure 4: The value function scores assigned by our framework for different coalitions (interactions) between phrases for the review "Aside for being a classic in the aspect of its cheesy lines and terrible acting, this film should never be watched unless you are looking for a good cure for your insomnia. I can't imagine anyone actually thinking this was a ''good movie''." The inferred class is negative. IDG shows how the classifier captures the positive sense (even though the overall sense is negative) of the phrase "Aside for being a classic in the aspect of cheesy lines and terrible acting" as it contributes oppositely to the overall inference result.

Figure 5: The value function scores assigned by our framework for different coalitions (interactions) between phrases for the review "Though Ganesh is successful in a midlevel sort of way, there's nothing so striking or fascinating or metaphorically significant about his career as to rate two hours of our attention." The inferred class is positive while the ground truth class is negative.