Attention Flows are Shapley Value Explanations

Shapley Values, a solution to the credit assignment problem in cooperative game theory, are a popular type of explanation in machine learning, having been used to explain the importance of features, embeddings, and even neurons. In NLP, however, leave-one-out and attention-based explanations still predominate. Can we draw a connection between these different methods? We formally prove that, save for the degenerate case, attention weights and leave-one-out values cannot be Shapley Values. Attention flow is a post-processed variant of attention weights obtained by running the max-flow algorithm on the attention graph. Perhaps surprisingly, we prove that attention flows are indeed Shapley Values, at least at the layerwise level. Given the many desirable theoretical qualities of Shapley Values, which have driven their adoption among the ML community, we argue that NLP practitioners should, when possible, adopt attention flow explanations alongside more traditional ones.


Introduction
The approaches to model interpretability taken by the ML and NLP communities overlap in some areas and diverge in others. Notably, in machine learning, model prediction has sometimes been framed as a cooperative effort between the potential subjects of an explanation (e.g., input tokens) (Lundberg and Lee, 2017). But how should we allocate the credit for a prediction, given that some subjects contribute more than others (e.g., the sentiment words in sentiment classification)? The Shapley Value is a solution to this problem that uniquely satisfies several criteria for equitable allocation (Shapley, 1953). However, while Shapley Value explanations have been widely adopted by the ML community to analyze the importance of features, neurons, and even training data (Ghorbani and Zou, 2019, 2020), they have had far less traction in NLP, where leave-one-out and attention-based explanations still predominate.
What is the connection between these different paradigms? When, if ever, are attention weights and leave-one-out values effectively Shapley Values? The adoption of Shapley Values, which have their origins in game theory (Shapley, 1953), by the ML community can be ascribed to their many desirable theoretical qualities. For example, consider a token whose masking out does not impact the model prediction in any way, regardless of how many other tokens in the sentence are also masked out. In game theory, such a token would be called a null player, whose Shapley Value is guaranteed to be zero (Myerson, 1977; Young, 1985). If we could provably identify the conditions under which attention weights and leave-one-out values are Shapley Values, we could extend such theoretical guarantees to them as well.
In this work, we first prove that, save for the degenerate case, attention weights and leave-one-out values cannot be Shapley Values. More formally, there is no set of players (i.e., possible subjects of an explanation, such as tokens) and payoff (i.e., function defining prediction quality) such that the values induced by attention or leave-one-out also satisfy the definition of a Shapley Value. We then turn to attention flow, a post-processed variant of attention weights obtained by running the max-flow algorithm on the attention graph (Abnar and Zuidema, 2020). We prove that when the players all come from the same layer (e.g., tokens in the input layer), there exists a payoff function such that attention flows are Shapley Values.
This means that under certain conditions, we can extend the theoretical guarantees associated with the Shapley Value to attention flow as well. As we show, these guarantees are axioms of faithful interpretation, and having them can increase confidence in interpretations of black-box NLP models. For this reason, we argue that whenever possible, NLP practitioners should use attention flow-based explanations alongside more traditional ones, such as gradients (Feng et al., 2018; Smilkov et al., 2017). We conclude by discussing some of the limitations in calculating Shapley Values for any arbitrary player set and payoff function in NLP.

Model Interpretation as a Game
The Shapley Value (Shapley, 1953) was proposed as a solution to a classic problem in game theory: When a group of players work together to achieve a payoff, how can we fairly allocate the payoff to each player, given that some contribute more than others? The players here are the potential subjects of the explanation (e.g., input tokens); the payoff is some quality of the model prediction (e.g., correctness). We contextualize the game theoretic terms with respect to model interpretability below.
Definition 2.1. A player is a possible subject of the explanation (e.g., character, token, embedding, neuron). N = {1, ..., n} is the set of all players.
Definition 2.2. A coalition is a subset of players $S \subseteq N$ that work together. There are $2^n$ possible coalitions. The other players $N \setminus S$ are left out by being replaced with a non-subject that cannot affect the outcome (e.g., a zeroed-out embedding or a dropped-out neuron).
Definition 2.3. The payoff reflects some quality of the model prediction (e.g., correctness, confidence, entropy) made using a given coalition. It is defined by a payoff function $v : 2^N \to \mathbb{R}$, where $v(\emptyset) = 0$. The value $\phi_i(v)$ of a player $i$ is the share of the payoff allocated to it. In other words, it is the importance accorded to subject $i$ of an explanation.
Definition 2.4. A game is defined by (N, v), a player set N and payoff function v. It is a transferable utility game (TU-game), where the payoff can be distributed among the players as desired.
In the game of model interpretation, the subjects of the explanation are framed as players working cooperatively to make the best possible prediction.
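To make Definitions 2.2 and 2.3 concrete, the following is a minimal sketch of a payoff function, not the setup used in this paper. It assumes a hypothetical toy "model" (a logistic score over summed embeddings), takes model confidence as the payoff, and leaves players out by treating their embeddings as zeroed out. The names `make_payoff`, `embeddings`, and `weights` are illustrative only.

```python
import math

def make_payoff(embeddings, weights):
    """Build a payoff function v over coalitions of embedding indices.

    Assumption: the 'model' is a toy logistic scorer over the sum of the
    embeddings that are present; absent players contribute nothing, which
    mimics replacing them with a zeroed-out embedding.
    """
    def confidence(coalition):
        logit = 0.0
        for i, emb in enumerate(embeddings):
            if i in coalition:
                logit += sum(w * x for w, x in zip(weights, emb))
        return 1.0 / (1.0 + math.exp(-logit))

    baseline = confidence(frozenset())  # confidence with every player left out

    def v(coalition):
        # Subtract the baseline so that v(empty set) = 0, as Definition 2.3 requires.
        return confidence(coalition) - baseline

    return v

# Hypothetical inputs: three players, the third with an all-zero embedding.
embeddings = [[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]]
weights = [0.5, 1.0]
v = make_payoff(embeddings, weights)
players = frozenset(range(len(embeddings)))
```

Here `v(frozenset())` is zero by construction, and the third player changes no coalition's payoff, so it plays the role of a null player in the sense discussed in the next section.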

Equitable Allocation
How can we allocate the payoff equitably, in a way that reflects the actual contribution made by each player? In other words, how can we faithfully interpret a prediction? The game theory literature proposes that any equitable payoff allocation satisfies these three conditions (Myerson, 1977; Young, 1985; Ghorbani and Zou, 2019):
Condition 1 (Null Player). A player that induces no change in the payoff from joining any coalition has zero value. Formally, if $v(S \cup \{i\}) = v(S)$ for all $S \subseteq N \setminus \{i\}$, then $\phi_i(v) = 0$.
Condition 2 (Symmetry). Two players who induce the same change in payoff upon joining every coalition (that excludes them) have the same value. Formally, if $v(S \cup \{i\}) = v(S \cup \{j\})$ for all $S \subseteq N \setminus \{i, j\}$, then $\phi_i(v) = \phi_j(v)$.
Condition 3 (Additivity). The value of a player across two different games with payoffs $v$ and $w$ should be the sum of its value in each game. Formally, $\phi_i(v + w) = \phi_i(v) + \phi_i(w)$.

The Shapley Value
The Shapley Value is a well-known solution to the problem of payoff allocation in a cooperative setting, as it uniquely satisfies the three criteria for equitable allocation in 2.1 (Shapley, 1953;Myerson, 1977;Young, 1985). It sets the value of a player to be its expected incremental contribution to a coalition, over all possible coalitions.
Definition 2.5. Where $R$ is one of $n!$ possible permutations of the player set $N$, let $P^R_{[:i]}$ be the subset of players that precede player $i$ in the permutation. Then, for a given payoff function $v$, the Shapley Value of player $i$ is
$$\phi_i(v) = \frac{1}{n!} \sum_{R} \left[ v(P^R_{[:i]} \cup \{i\}) - v(P^R_{[:i]}) \right] \tag{1}$$
There are other equivalent ways of expressing the Shapley Value, including as a sum over the $2^n$ possible coalitions.
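The permutation form of Definition 2.5 can be sketched directly in code. This is an illustrative brute-force version that enumerates all n! orderings, so it is only usable for a handful of players. The function name `shapley_values` and the toy payoff are assumptions made for the example, not part of the paper.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley Values: average marginal contribution over all n! orderings."""
    totals = {i: 0.0 for i in players}
    for order in permutations(players):
        preceding = frozenset()  # players that come before i in this ordering
        for i in order:
            totals[i] += v(preceding | {i}) - v(preceding)
            preceding = preceding | {i}
    n_fact = factorial(len(players))
    return {i: total / n_fact for i, total in totals.items()}

# Hypothetical toy payoff: players 0 and 1 only help when both are present;
# player 2 never changes the payoff, so it is a null player.
def toy_payoff(coalition):
    return 1.0 if {0, 1} <= coalition else 0.0

phi = shapley_values([0, 1, 2], toy_payoff)
```

In this toy game, players 0 and 1 are interchangeable and receive the same value (symmetry), player 2 receives zero (null player), and the values sum to the payoff of the full coalition.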
In addition to satisfying our three criteria of equitable allocation (2.1), a Shapley Value distribution always exists and is unique for a TU-game $(N, v)$. Unlike with attention weights, which have been criticized for allowing counterfactual explanations (Jain and Wallace, 2019; Serrano and Smith, 2019), there can thus be no counterfactual Shapley Value distribution for a given input and payoff function $v$. The distribution is also said to be efficient, since it allocates all of the payoff: $v(N) = \sum_{i \in N} \phi_i(v)$ (Myerson, 1977; Young, 1985).
The Shapley Value can, in theory, be computed for any player set and payoff function. In practice, however, there are typically too many players to calculate this combinatorial expression exactly. Generally, estimates are taken by uniformly sampling $m$ random permutations $R$ (Ghorbani and Zou, 2019):
$$\phi_i(v) \approx \frac{1}{m} \sum_{R} \left[ v(P^R_{[:i]} \cup \{i\}) - v(P^R_{[:i]}) \right] \tag{2}$$
In the rest of this paper, we ask: is there some TU-game $(N, v)$ [...]
[...] (Myerson, 1977; Young, 1985), so for attention weights to be efficient, the only applicable payoff function would be the sum of attention weights. Since each player only has one Shapley Value for a given $v$, if it is attended to multiple times, its value must be the total attention paid to it: $\phi_i(v) = \sum_{j \in N} a_{j,i}$, where $a_{j,i}$ denotes the attention $j$ pays to $i$. Note that the payoff for a coalition $S$ is within some constant of its cardinality, since for a player $j$, the weights $a_{j,\cdot}$ of the players that it attends to sum to 1 (Bahdanau et al., 2015). We consider two cases.
Case 1. For a player $j$ that attends to some other player, its contribution to the payoff of every $S \subseteq N \setminus \{j\}$ is $\sum_{i} a_{j,i} = 1$, implying $\phi_j(v) = 1$ by the Shapley Value definition (1). If some player (that pays attention) is more or less attended to than another, which is the point of using attention, this results in a contradiction. Thus $\phi_j$ cannot be the total attention paid to $j$.
Case 2. For a player $i$ that doesn't attend to any other player, its contribution to the payoff of every $S \subseteq N \setminus \{i\}$ is 0, since the attention paid to $i$ is redistributed among other players when it is absent. This implies $\phi_i(v) = 0$ by (1). However, all input embeddings fall under this case, and we know at least one will be attended to; its attention weights will be non-zero, making this a contradiction. Thus $\phi_i$ cannot be the total attention paid to $i$.

Attention Flows
What if we restricted the players to those from the same layer of a model? The remaining players would still affect the prediction but could not have any of the payoff allocated to them. In this case, attention weights still cannot be Shapley Values. However, attention weights can be post-processed. Abnar and Zuidema (2020) proposed treating the self-attention graph as a flow network, where the attention weights are capacities, and then applying a max-flow algorithm (Ford and Fulkerson, 1956) to this network to calculate the maximum flow on each edge. We prove (by construction) that these attention flows are Shapley Values when the players are restricted to those from the same layer and the payoff is the total flow, as visualized in Figure 1.
Proof. Blocking the flow through a player $i \in S$ decreases $v(S)$ by that player's outflow $|f_o(i)|$, since the attention flow is only calculated once, with the entire graph, and not for each possible subgraph. Since the players are all disjoint and have no connections, blocking the flow through one player does not affect the outflow of any of the other players. This would not be the case, for example, if the players were in different layers, in which case changes in flow upstream would cause changes in flow downstream. Then for any coalition $S \subseteq N$ and player $i \notin S$, $v(S \cup \{i\}) = v(S) + |f_o(i)|$. We can rewrite the total outflow for player $i$ as
$$|f_o(i)| = \frac{1}{n!} \sum_{R} |f_o(i)| = \frac{1}{n!} \sum_{R} \left[ v(P^R_{[:i]} \cup \{i\}) - v(P^R_{[:i]}) \right] = \phi_i(v),$$
which matches the Shapley Value definition (1).
Attention Rollout. Abnar and Zuidema (2020) also proposed another post-processed variant of attention called attention rollout, in which the attention weight matrices from each layer are multiplied with those before it to get aggregated attention values. Attention rollout values cannot be Shapley Values, however; this can be shown with a trivial extension of the proof to Proposition 1.
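The construction in the attention flow proof can be checked numerically on a toy graph. The sketch below is illustrative and rests on several assumptions: the graph, its capacities, and the node names are hypothetical; each player's outflow is taken to be the max-flow value from that player to a single output node, computed once on the full graph; and the max-flow routine is a plain Edmonds-Karp implementation rather than whatever solver the original attention flow work used.

```python
from collections import deque
from itertools import permutations
from math import factorial

def max_flow_value(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along shortest residual paths."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for w in nbrs:
            residual.setdefault(w, {}).setdefault(u, 0.0)  # reverse edges
    total = 0.0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for w, cap in residual[u].items():
                if cap > 1e-12 and w not in parent:
                    parent[w] = u
                    queue.append(w)
        if sink not in parent:
            return total  # no augmenting path left
        path, node = [], sink
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        bottleneck = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= bottleneck
            residual[w][u] += bottleneck
        total += bottleneck

def shapley_values(players, v):
    """Exact Shapley Values via all n! orderings (Definition 2.5)."""
    totals = {i: 0.0 for i in players}
    for order in permutations(players):
        preceding = frozenset()
        for i in order:
            totals[i] += v(preceding | {i}) - v(preceding)
            preceding = preceding | {i}
    return {i: t / factorial(len(players)) for i, t in totals.items()}

# Hypothetical attention graph: capacities are attention weights.
capacity = {
    "t1": {"h1": 0.5, "h2": 0.25},
    "t2": {"h1": 0.25, "h2": 0.25},
    "t3": {"h1": 0.25, "h2": 0.5},
    "h1": {"out": 0.5},
    "h2": {"out": 0.5},
}
players = ["t1", "t2", "t3"]  # all from the same layer

# Flows are computed once, on the full graph, not per coalition.
outflow = {i: max_flow_value(capacity, i, "out") for i in players}

def total_flow(coalition):
    # Blocking a player only removes that player's own outflow.
    return sum(outflow[i] for i in coalition)

phi = shapley_values(players, total_flow)
```

Because the payoff is additive over same-layer players, every marginal contribution of a player equals its outflow, so `phi` coincides with `outflow`. The same check would be expected to fail if the players were drawn from different layers and flows were recomputed per coalition, since upstream blocking would change downstream flow.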

Leave-One-Out
Erasure describes a class of interpretability methods that aim to understand the importance of a representation, token, or neuron by erasing it and recording the resulting effect on model prediction (Li et al., 2016; Arras et al., 2017; Feng et al., 2018; Serrano and Smith, 2019). Although the Shapley Value technically falls under this class, most erasure-based methods only remove one entity, the one whose importance they want to estimate, and this only takes two forward passes, compared to $O(2^n)$ passes for the Shapley Value. Since only one entity is erased, this simpler group of erasure-based methods is called leave-one-out (Jain and Wallace, 2019; Abnar and Zuidema, 2020). We show in this section that leave-one-out values are not Shapley Values, except in the degenerate case.
Proposition 3. If ∃ i ∈ N such that player i is not a null player even when excluding the coalition N \{i}, then there is no TU-game (N, v) for which leave-one-out values are Shapley Values.
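Proof. Let the leave-one-out value of player $i$ be denoted by $\mathrm{LOO}_i(v) = v(N) - v(N \setminus \{i\})$. Let $R$ denote any permutation of $N$ where $P^R_{[:i]} = N \setminus \{i\}$, and let $R'$ denote any other permutation. By definition (1),
$$\phi_i(v) = \frac{1}{n!} \sum_{R'} \left[ v(P^{R'}_{[:i]} \cup \{i\}) - v(P^{R'}_{[:i]}) \right] + \frac{1}{n!} \sum_{R} \mathrm{LOO}_i(v)$$
By our assumption, the first term is non-zero, so there is no equivalence with $\mathrm{LOO}_i(v)$. In practice, this assumption is almost always satisfied.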
Note that leave-one-out tells us very little about player importance for discrete payoff functions. For example, if the payoff were the correctness (i.e., 1 if correct and 0 otherwise), then the importance of a player would be binary: it would either be critically important to prediction or totally irrelevant. This provides an incomplete picture: while there is enough redundancy in BERT-based models to tolerate some missing embeddings, this does not mean those embeddings are of no importance (Kovaleva et al., 2019; Ethayarajh, 2019; Michel et al., 2019). For example, if two representations played a critical and identical role in a prediction, but only one was necessary, then leave-one-out would assign each a value of zero, despite both being important. In contrast, the Shapley Value of both players would be non-zero and identical.
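The redundancy example above can be reproduced with a small sketch. The payoff below is a hypothetical stand-in for correctness: the prediction is correct whenever at least one of two interchangeable players (0 and 1) is present, and player 2 is irrelevant. The helper names are illustrative.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley Values via all n! orderings (Definition 2.5)."""
    totals = {i: 0.0 for i in players}
    for order in permutations(players):
        preceding = frozenset()
        for i in order:
            totals[i] += v(preceding | {i}) - v(preceding)
            preceding = preceding | {i}
    return {i: t / factorial(len(players)) for i, t in totals.items()}

def leave_one_out(players, v):
    """Leave-one-out value: drop in payoff when only player i is removed."""
    full = frozenset(players)
    return {i: v(full) - v(full - {i}) for i in players}

def redundant_correctness(coalition):
    # Hypothetical discrete payoff: correct (1) if player 0 or player 1 is present.
    return 1.0 if coalition & {0, 1} else 0.0

players = [0, 1, 2]
loo = leave_one_out(players, redundant_correctness)
phi = shapley_values(players, redundant_correctness)
```

Leave-one-out assigns zero to all three players, because removing either redundant player alone never flips the prediction. The Shapley Values instead split the credit evenly between players 0 and 1 and give zero only to the genuinely irrelevant player 2.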

Applications
Because Shapley Values have many useful applications, attention flows (and any other score that meets the criteria for a Shapley Value) have many useful applications as well:
• For one, using the various properties of the Shapley Value, we can provide more specific interpretations of model behavior than is currently the case, backed by theoretical guarantees. For example, if a token has zero attention flow in layer k but non-zero flow in layer k−1, then we can conclude that all the information it contains about the label (e.g., sentiment) was extracted by the model prior to the kth layer; this derives from the "null player" property of the Shapley Value. The same could not be said if the token only had a leave-one-out value of zero, since leave-one-out values are not Shapley Values.
• Interpretability in NLP often takes a single token or embedding to be the unit of analysis (i.e., a "player" in game theoretic terms). However, what if we wanted to understand the role of entire groups of tokens rather than individual ones? For most interpretability methods, there is no canonical way to aggregate scores across multiple units: we cannot necessarily add the raw attention scores of two tokens, since the usefulness of one may depend on the other. If we used a method that provided Shapley Values, we could easily redefine a "player" to be a group of tokens, such that all tokens in the same player group would simultaneously be included or excluded from a coalition.
• Recent work has used the Data Shapley, an extension of the Shapley Value, to estimate the contribution of each example in the training data to a model's decision boundary (Ghorbani and Zou, 2019). If we're finetuning BERT for sentiment classification, for example, we might want to know which sentence is more helpful: "This movie was great!" or "This was better than I expected." We can answer such questions by using the Data Shapley. To our knowledge, this has been done in computer vision but not in NLP.
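The idea of redefining a player as a group of tokens, from the second point above, can be sketched as a wrapper around a token-level payoff. Everything here is hypothetical: the group names, the token indices, and the toy payoff in which tokens 0 and 1 (say, the two words of a negated phrase) only matter together.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, v):
    """Exact Shapley Values via all n! orderings (Definition 2.5)."""
    totals = {i: 0.0 for i in players}
    for order in permutations(players):
        preceding = frozenset()
        for i in order:
            totals[i] += v(preceding | {i}) - v(preceding)
            preceding = preceding | {i}
    return {i: t / factorial(len(players)) for i, t in totals.items()}

def group_payoff(groups, token_payoff):
    """Lift a token-level payoff to groups: a group's tokens enter or leave together."""
    def v(group_coalition):
        tokens = frozenset().union(*(groups[g] for g in group_coalition))
        return token_payoff(tokens)
    return v

def token_payoff(tokens):
    # Hypothetical payoff: tokens 0 and 1 are only useful jointly; token 2 is irrelevant.
    return 1.0 if {0, 1} <= tokens else 0.0

groups = {"phrase": {0, 1}, "rest": {2}}  # illustrative grouping of token indices

token_phi = shapley_values([0, 1, 2], token_payoff)
group_phi = shapley_values(list(groups), group_payoff(groups, token_payoff))
```

At the token level the credit is split between tokens 0 and 1, while at the group level the whole payoff goes to the group containing both. The only change needed was the definition of a player, not the attribution method.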

Limitations and Future Work
Because Shapley Values (and by extension, attention flows) have many theoretical guarantees that are axioms of faithful interpretation, we encourage NLP practitioners to provide attention flow-based explanations alongside more traditional ones. This is not without limitations, however. As proven in Proposition 2, this equivalence only holds for a specific payoff function, the total flow through a layer, which is reflective of model confidence but not of the prediction correctness. But why do we need attention flows at all if, in theory, Shapley Values can be calculated for any arbitrary player set and payoff function? Because of the combinatorial calculation (1), exact computation is intractable in most cases. While it is possible to take a Monte Carlo estimate (2), in practice the bounds can be quite loose (Maleki et al., 2013). Finding TU-games for which the Shapley Value can be calculated exactly in polynomial time, as with attention flow, is an important line of future work. These explanations may come with trade-offs: for example, SHAP is a kind of Shapley Value that assumes contributions are linear (i.e., a coalition can't be greater than the sum of its parts), which makes it much faster to calculate but restricts the set of possible payoff functions (Lundberg and Lee, 2017). Still, such methods will be critical to providing explanations that are both fast and faithful.
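For reference, the Monte Carlo estimate mentioned above, which averages marginal contributions over m uniformly sampled permutations, can be sketched as follows. This is an illustrative version under simple assumptions: the function name, the fixed seed, and the toy payoff are not from the paper, and no error bound is computed.

```python
import random

def shapley_monte_carlo(players, v, m, seed=0):
    """Estimate Shapley Values from m uniformly sampled permutations."""
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    totals = {i: 0.0 for i in players}
    order = list(players)
    for _ in range(m):
        rng.shuffle(order)  # uniform random permutation of the players
        preceding = frozenset()
        for i in order:
            totals[i] += v(preceding | {i}) - v(preceding)
            preceding = preceding | {i}
    return {i: total / m for i, total in totals.items()}

# Hypothetical toy payoff: players 0 and 1 only help together; player 2 is null.
def toy_payoff(coalition):
    return 1.0 if {0, 1} <= coalition else 0.0

estimate = shapley_monte_carlo([0, 1, 2], toy_payoff, m=2000)
```

Each sampled permutation allocates the full payoff across players, so the estimates always sum to v(N) and a null player is always assigned exactly zero. What sampling leaves uncertain is how the payoff is split among the non-null players, which is where loose bounds become a concern.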