Conditional probing: measuring usable information beyond a baseline

Probing experiments investigate the extent to which neural representations make properties—like part-of-speech—predictable. One suggests that a representation encodes a property if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. Instead of using baselines as a point of comparison, we’re interested in measuring information that is contained in the representation but not in the baseline. For example, current methods can detect when a representation is more useful than the word identity (a baseline) for predicting part-of-speech; however, they cannot detect when the representation is predictive of just the aspects of part-of-speech not explainable by the word identity. In this work, we extend a theory of usable information called V-information and propose conditional probing, which explicitly conditions on the information in the baseline. In a case study, we find that after conditioning on non-contextual word embeddings, properties like part-of-speech are accessible at deeper layers of a network than previously thought.


Introduction
Neural language models have become the foundation for modern NLP systems (Devlin et al., 2019; Radford et al., 2018), but what they understand about language, and how they represent that knowledge, is still poorly understood (Belinkov and Glass, 2019; Rogers et al., 2020). The probing methodology grapples with these questions by relating neural representations to well-understood properties. Probing analyzes a representation by using it as input to a supervised classifier, which is trained to predict a property, such as part-of-speech (Shi et al., 2016; Ettinger et al., 2016; Alain and Bengio, 2016; Adi et al., 2017; Belinkov, 2021).
One suggests that a representation encodes a property of interest if probing that representation produces higher accuracy than probing a baseline representation like non-contextual word embeddings. However, consider a representation that encodes only the part-of-speech tags that aren't determined by the word identity. Probing would report that this representation encodes less about part-of-speech than the non-contextual word baseline, since ambiguity is relatively rare. Yet, this representation clearly encodes interesting aspects of part-of-speech. How can we capture this?
In this work, we present a simple probing method to explicitly condition on a baseline. For a representation and a baseline, our method trains two probes: (1) on just the baseline, and (2) on the concatenation of the baseline and the representation. The performance of probe (1) is then subtracted from that of probe (2). We call this process conditional probing. Intuitively, the representation is not penalized for lacking aspects of the property accessible in the baseline.
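As a concrete sketch of the bookkeeping (array names here are illustrative, not from the paper's codebase; the probe itself can be any classifier), the two probe inputs can be built by concatenation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10, 4
B = rng.standard_normal((n, d))    # baseline, e.g., non-contextual embeddings
phi = rng.standard_normal((n, d))  # representation under study

# Probe (2) sees the baseline alongside the representation; probe (1) sees
# the baseline alone, zero-padded so both probes share one input shape.
probe2_inputs = np.concatenate([B, phi], axis=1)                 # [B; phi(X)]
probe1_inputs = np.concatenate([B, np.zeros_like(phi)], axis=1)  # [B; 0]
```

The conditional probing score is then the performance of the probe trained on `probe2_inputs` minus the performance of the probe trained on `probe1_inputs`.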
We then theoretically ground our probing methodology in V-information, a theory of usable information introduced by Xu et al. (2020) that we additionally extend to multiple predictive variables. We use V-information instead of mutual information (Shannon, 1948; Pimentel et al., 2020b) because any injective deterministic transformation of the input has the same mutual information as the input. For example, a representation that maps each unique sentence to a unique integer must have the same mutual information with any property as does BERT's representation of that sentence, yet the latter is more useful. In contrast, V-information is defined with respect to a family of functions V that map one random variable to (a probability distribution over) another. V-information can be constructed by deterministic transformations that make a property more accessible to the functions in the family. We show that conditional probing provides an estimate of conditional V-information I_V(repr → property | baseline).
In a case study, we answer an open question posed by Hewitt and Liang (2019): how are the aspects of linguistic properties that aren't explainable by the input layer accessible across the rest of the layers of the network? We find that the part-of-speech information not attributable to the input layer remains accessible much deeper into the layers of ELMo (Peters et al., 2018a) and RoBERTa (Liu et al., 2019) than the overall property, a fact previously obscured by the gradual loss across layers of the aspects attributable to the input layer. For the other properties, conditioning on the input layer does not change the trends across layers.

Conditional V-information Probing
In this section, we describe probing methods and introduce conditional probing. We then review V-information and use it to ground probing.

Probing setup
We start with some notation. Let X ∈ X be a random variable taking the value of a sequence of tokens. Let φ(X) be a representation resulting from a deterministic function of X; for example, the representation of a single token from the sequence in a layer of BERT (Devlin et al., 2019). Let Y ∈ Y be a property (e.g., part-of-speech of a particular token), and V a probe family, that is, a set of functions {f_θ : θ ∈ R^p}, where f_θ : z → P(Y) maps inputs z to probability distributions over the space of the label. The input z ∈ R^m may be in the space of φ(X), that is, R^d, or another space, e.g., if the probe takes the concatenation of two representations. In each experiment, a training dataset D_tr = {(x_i, y_i)}_i is used to estimate θ, and the probe and representation are evaluated on a separate dataset D_te = {(x_i, y_i)}_i. We refer to the result of this evaluation on some representation R as Perf(R).

Baselined probing
Let B ∈ R^d be a random variable representing a baseline (e.g., the non-contextual word embedding of a particular token). A common strategy in probing is to take the difference between probe performance on the representation and on the baseline (Zhang and Bowman, 2018); we call this the baselined probing performance:

Baselined(φ(X)) = Perf(φ(X)) − Perf(B).   (1)

This difference in performances estimates how much more accessible Y is in φ(X) than in the baseline B, under probe family V. (We discuss mild constraints on the form that V can take in the Appendix; common probe families, including linear models and feed-forward networks, meet the constraints.)
But what if B and φ(X) capture distinct aspects of Y? For example, consider if φ(X) captures parts-of-speech that aren't the most common label for a given word identity, while B captures parts-of-speech that are the most common for the word identity. Baselined probing will indicate that φ(X) explains less about Y than the baseline, a "negative" probing result. But clearly φ(X) captures an interesting aspect of Y; we aim to design a method that measures just what φ(X) contributes beyond B in predicting Y, not what B has and φ(X) lacks.

Our proposal: conditional probing
In our proposed method, we again train two probes; each takes the concatenation of two representations of size d, so we let z ∈ R^{2d}. The first probe takes as input [B; φ(X)], that is, the concatenation of B and the representation φ(X) that we're studying. The second probe takes as input [B; 0], that is, the concatenation of B and the 0 vector. The conditional probing method takes the difference of the two probe performances, which we call the conditional probing performance:

Conditional(φ(X)) = Perf([B; φ(X)]) − Perf([B; 0]).   (2)

Including B in the probe with φ(X) means that φ(X) only needs to contribute what is missing from B. In the second probe, the 0 vector is used as a placeholder, representing the lack of knowledge of φ(X); its performance is subtracted so that φ(X) isn't given credit for what's explainable by B.
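The full procedure can be sketched end-to-end with a toy affine-softmax probe trained by gradient descent. This is a simplified sketch on synthetic data (performance is measured on the training set, and all names are illustrative, not from the paper's codebase): the baseline B is uninformative noise, while phi carries the label, so the conditional score comes out positive.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_probe(X, y, n_classes, steps=500, lr=0.5):
    """Fit an affine-softmax probe; return Perf = negative cross-entropy (nats)."""
    n, m = X.shape
    W, b = np.zeros((m, n_classes)), np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(steps):
        G = (softmax(X @ W + b) - onehot) / n  # gradient of mean cross-entropy
        W -= lr * X.T @ G
        b -= lr * G.sum(axis=0)
    P = softmax(X @ W + b)
    return np.log(P[np.arange(n), y] + 1e-12).mean()

# Toy data: B is noise; phi's first coordinate carries the binary label.
n, d = 2000, 8
y = rng.integers(0, 2, size=n)
B = 0.1 * rng.standard_normal((n, d))
phi = 0.1 * rng.standard_normal((n, d))
phi[:, 0] += 2.0 * (2 * y - 1)

perf_joint = train_probe(np.concatenate([B, phi], axis=1), y, 2)               # [B; phi(X)]
perf_base = train_probe(np.concatenate([B, np.zeros_like(phi)], axis=1), y, 2) # [B; 0]
conditional = perf_joint - perf_base  # > 0: phi adds usable information beyond B
```

In a real experiment, the probes would be trained on D_tr and Perf would be measured on the held-out D_te.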

V-information
V-information is a theory of usable information, that is, how much knowledge of a random variable Y can be extracted from a r.v. R when using functions in V, called a predictive family (Xu et al., 2020). Intuitively, by explicitly considering computational constraints, V-information can be constructed by computation, in particular when said computation makes a variable easier to predict. If V is the set of all functions from the space of R to the set of probability distributions over the space of Y, then V-information is mutual information (Xu et al., 2020). However, if the predictive family is the set of all functions, then no representation is more useful than another provided they are related by a bijection. By specifying a V, one makes a hypothesis about the functional form of the relationship between the random variables R and Y. One could let V be, for example, the set of log-linear models. Using this predictive family V, one can define the uncertainty we have in Y after observing R as the V-entropy:

H_V(Y | R) = inf_{f ∈ V} E_{r,y ∼ R,Y} [ −log f[r](y) ],   (3)

where f[r] produces a probability distribution over the labels. Information terms like I_V(R → Y) are defined analogously to Shannon information, that is, I_V(R → Y) = H_V(Y) − H_V(Y | R). For brevity, we leave a full formal description, as well as our extension of V-information to multiple predictive variables, to the appendix.

Probing estimates V-information
With a particular performance metric, baselined probing estimates a difference of V-information quantities. Intuitively, probing specifies a function family V, training data is used to find the f ∈ V that best predicts Y from φ(X) (the infimum in Equation 3), and we then evaluate how well Y is predicted. If we use the negative cross-entropy loss as the Perf function, then baselined probing estimates the difference of two V-information quantities. This theory provides methodological best practices as well: the form of the family V should be chosen for theory-external reasons, and since the probe training process is approximating the infimum in Equation 3, we're not concerned with sample efficiency.
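A minimal sketch of this estimate, on a discrete toy problem where V is taken to be the family of all lookup tables, so the infimum is attained by the empirical conditional frequencies and the V-entropy estimate coincides with Shannon conditional entropy (a simplification: we fit and evaluate on the same sample):

```python
import numpy as np

rng = np.random.default_rng(1)

# Discrete toy: Y copies X with probability 0.9
n = 100_000
X = rng.integers(0, 2, size=n)
flip = rng.random(n) < 0.1
Y = np.where(flip, 1 - X, X)

def v_entropy(X, Y):
    """V-entropy with V = all lookup tables: empirical E[-log f[x](y)]."""
    H = 0.0
    for x in np.unique(X):
        ys = Y[X == x]
        p = np.bincount(ys, minlength=2) / len(ys)  # best-in-family predictor
        H += (len(ys) / len(Y)) * -(p * np.log(p + 1e-12)).sum()
    return H

est = v_entropy(X, Y)                               # estimate of H_V(Y | X), in nats
shannon = -(0.9 * np.log(0.9) + 0.1 * np.log(0.1))  # true H(Y | X)
```

With a restricted family (e.g., log-linear models), the same recipe applies, but the infimum is approximated by gradient descent rather than computed in closed form.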
Baselined probing appears in existing information-theoretic probing work: Pimentel et al. (2020b) define conditional mutual information quantities wherein a lossy transformation c(·) is performed on the sentence (like choosing a single word), and an estimate of the gain from knowing the rest of the sentence is provided: I(φ(X); Y | c(φ(X))) = I(X; Y | c(X)). Methodologically, despite being a conditional information, this is identical to baselined probing, training one probe on just φ(X) and another on just c(φ(X)).

Estimating conditional information
Inspired by the transparent connections between V-information and probes, we ask what the V-information analogue is of conditioning on a variable in a mutual information, that is, I(X; Y | B). To do this, we extend V-information to multiple predictive variables, and design conditional probing (as presented) to estimate I_V(φ(X) → Y | B), thus giving it the interpretation of probing what φ(X) explains about Y apart from what's already explained by B (as can be accessed by functions in V). Methodologically, the innovation is in providing B to the probe on φ(X), so that the information accessible in B need not be accessible in φ(X).

Related Work
Probing, which is mechanically simple but philosophically hard to interpret (Belinkov, 2021), has led to a number of information-theoretic interpretations. Pimentel et al. (2020b) claimed that probing should be seen as estimating the mutual information I(φ(X); Y) between representations and labels. This raises an issue, which Pimentel et al. (2020b) note: due to the data processing inequality, the MI between the representation of a sentence (from, e.g., BERT) and a label is upper-bounded by the MI between the sentence itself and the label. Both an encrypted document X and an unencrypted version φ(X) provide the same mutual information with the topic of the document Y. This is because MI allows unbounded work in using X to predict Y, including the enormous amount of work (likely) required to decrypt it without the secret key. Intuitively, we understand that φ(X) is more useful than X, and that this is because the function φ performs useful "work" for us. Likewise, BERT can perform useful work to make interesting properties more accessible. While Pimentel et al. (2020b) conclude from the data processing inequality that probing is not meaningful, we conclude that estimating mutual information is not the goal of probing.
Voita and Titov (2020) propose a new probing-like methodology, minimum description length (MDL) probing, to measure the number of bits required to transmit both the specification of the probe and the specification of labels. Intuitively, a representation that allows for more efficient communication of labels (and probes used to help perform that communication) has done useful "work" for us. Voita and Titov (2020) found that by using their methods, probing practitioners could pay less attention to the exact functional form of the probe. V-information and MDL probing complement each other; V-information does not measure sample efficiency of learning a mapping from φ(X) to Y, instead focusing solely on how well any function from a specific family (like linear models) allows one to predict Y from φ(X). Further, in practice, one must choose a family to optimize over even in MDL probing; the complexity penalty of communicating the member of the family is analogous to choosing V. Finally, our contribution of conditional probing is orthogonal to the choice of probing methodology; it could be used with MDL probing as well.
V-information places the functional form of the probe front-and-center as a hypothesis about how structure is encoded. This intuition is already popular in probing. For example, Hewitt and Manning (2019) proposed that syntax trees may emerge as squared Euclidean distance under a linear transformation. Further work refined this, showing that a better structural hypothesis may be hyperbolic (Chen et al., 2021), axis-aligned after scaling (Limisiewicz and Mareček, 2021), or an attention-inspired kernel space (White et al., 2021).
In this work, we intentionally avoid claims as to the "correct" functional family V to be used in conditional probing. While some work has argued for simple probe families and other work has argued against restricting probe capacity, we see V-information as most useful in discovering emergent structure, that is, parsimonious and surprisingly simple relationships between neural representations and complex properties.

Experiments
In our experiments, we aim for a case study in understanding how conditioning on the non-contextual embeddings changes trends in the accessibility of linguistic properties across the layers of deep networks.

Tasks, models, and data
Tasks. We train probes to predict five linguistic properties Y, roughly arranged in order from lower-level, more concrete properties to higher-level, more abstract properties: (i) upos: coarse-grained (17-tag) part-of-speech tags (Nivre et al., 2020), (ii) xpos: fine-grained English-specific part-of-speech tags, (iii) dep rel: the label on the Universal Dependencies edge that governs the word, (iv) ner: named entities, and (v) sst2: sentiment.
Data. All of our datasets are composed of English text. For all tasks except sentiment, we use the Ontonotes v5 corpus (Weischedel et al., 2013), recreating the splits used in the CoNLL 2012 shared task, as verified against the split statistics provided by Strubell et al. (2017). Since Ontonotes is annotated with constituency parses, not Universal Dependencies, we use the converter provided in CoreNLP (Schuster and Manning, 2016).

Probe families. For all of our experiments, we choose V to be the set of affine functions followed by a softmax. For word-level tasks, the probe applies an affine map and softmax to the representation φ_i(x_j), where i indexes the layer in the network and j indexes the word in the sentence. For the sentence-level sentiment task, we average over the word-level representations before applying the probe.
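A sketch of this probe family; the dimensions and the exact input construction (concatenating the layer-0 and layer-i representations, as in conditional probing) are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

d, n_tags = 6, 17  # e.g., 17 coarse-grained upos tags
rng = np.random.default_rng(0)
W, c = rng.standard_normal((2 * d, n_tags)), np.zeros(n_tags)

def word_probe(phi0, phi_i, j):
    """p(tag | word j): affine map + softmax on [phi_0(x_j); phi_i(x_j)]."""
    return softmax(np.concatenate([phi0[j], phi_i[j]]) @ W + c)

def sentence_probe(phi0, phi_i):
    """Sentence-level (sst2): average word representations, then probe."""
    return softmax(np.concatenate([phi0.mean(0), phi_i.mean(0)]) @ W + c)

# A 5-word "sentence" with random layer-0 and layer-i representations
sent0, senti = rng.standard_normal((5, d)), rng.standard_normal((5, d))
p = word_probe(sent0, senti, j=2)
```

For baselined (single-layer) probes, the second slot of the concatenation would be zeroed out, matching Equation (2)'s [B; 0] construction.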

Results
Results on ELMo. ELMo has a non-contextual embedding layer φ_0 and two contextual layers φ_1 and φ_2, the outputs of two bidirectional LSTMs (Hochreiter and Schmidhuber, 1997). Previous work has found that φ_1 contains more syntactic information than φ_2 (Peters et al., 2018b); Hewitt and Liang (2019) conjecture that this may be due to the accessibility of information from φ_0. Conditional probing shows that when measuring only the information not available in φ_0, there is still more syntactic information in φ_1 than in φ_2, but the difference is much smaller. (All probes were trained with the Adam optimizer (Kingma and Ba, 2014) with starting learning rate 0.001, multiplying the learning rate by 0.5 after each epoch in which a new lowest validation loss was not achieved.)
Results on RoBERTa. RoBERTa-base is a pretrained Transformer consisting of a word-level embedding layer φ_0 and twelve contextual layers φ_i, each the output of a Transformer encoder block (Vaswani et al., 2017). We compare baselined probing performance to conditional probing performance for each layer. In Figure 1, baselined probing indicates that part-of-speech information decays in later layers. However, conditional probing shows that information not available in φ_0 is maintained into deeper layers in RoBERTa, and only the information already available in φ_0 decays. In contrast, for dependency labels, we find that the difference between layers is lessened after conditioning on φ_0, and for NER and sentiment, conditioning on φ_0 does not change the results.

Conclusion
In this work, we proposed conditional probing, a simple method for conditioning on baselines in probing studies, and grounded the method theoretically in V-information. In a case study, we found that after conditioning on the input layer, usable part-of-speech information remains much deeper into the layers of ELMo and RoBERTa than previously thought, answering an open question from Hewitt and Liang (2019). Conditional probing is a tool that practitioners can easily use to gain additional insight into representations. An executable version of the experiments in this paper is on CodaLab, at this link: https://worksheets.codalab.org/worksheets/0x46190ef741004a43a2676a3b46ea0c76.

A Multivariable V-information
In this section we introduce multivariable V-information. V-information as introduced by Xu et al. (2020) was defined in terms of a single predictive variable X, and is unwieldy to extend to multiple variables due to its use of a "null" input outside the sample space of X (Section D.3). Our multivariable V-information removes the use of null variables and naturally captures the multivariable case. Consider an agent attempting to predict Y ∈ Y from some information sources X_1, ..., X_n, where X_i ∈ X_i. Let P(Y) be the set of all probability distributions over Y. At a given time, the agent may only have access to a subset of the information sources. Let the known set C ∈ C and the unknown set C̄ ∈ C̄ be a binary partition of X_1, ..., X_n. The agent isn't given the true value of C̄ when predicting Y; instead, it is provided with a constant value ā ∈ C̄, which does not vary with Y. We first specify constraints on the set of functions that the agent has at its disposal for predicting Y:

Definition 1 (Multivariable predictive family). Let Ω be the set of all functions f : X_1 × ... × X_n → P(Y). We say V ⊆ Ω is a multivariable predictive family if, for every f ∈ V, every binary partition (C, C̄) of X_1, ..., X_n, and every constant ā ∈ C̄, there exists f′ ∈ V such that f′(c, c̄) = f(c, ā) for all c ∈ C and all c̄ ∈ C̄.
Intuitively, the constraint on V states that for any binary partition of X_1, ..., X_n into known and unknown sets, if a function is expressible given some constant assignment to the unknown variables, the same function is expressible if the unknown variables are allowed to vary arbitrarily. In practice, this means one can assign zero weight to those variables, so their values don't matter. This constraint, which we refer to as multivariable optional ignorance in reference to Xu et al. (2020), will be used to ensure non-negativity of information; when some X is moved from C̄ to C as a new predictive variable for the agent to use, optional ignorance ensures the agent can still act as if that variable were held constant.
Given the predictive family of functions the agent has access to, we define the multivariable V-information analogue of entropy:

Definition 2 (Multivariable predictive V-entropy). Let X_1, ..., X_n ∈ X_1, ..., X_n. Let C ∈ C and C̄ ∈ C̄ form a binary partition of X_1, ..., X_n, and let ā ∈ C̄. Then the V-entropy of Y conditioned on C is defined as

H_V(Y | C) = inf_{f ∈ V} E_{c,y ∼ C,Y} [ −log f(c, ā)[y] ].

Note that ā does not vary with y; thus it is 'informationless'. The notation f(c, ā) takes the known value of C ⊆ {X_1, ..., X_n} and the constant value ā, and produces a distribution over Y; f(c, ā)[y] evaluates the density at y.
If we let V = Ω, the set of all functions from the X_i to distributions over Y, then V-entropy becomes exactly Shannon entropy (Xu et al., 2020). And just as for Shannon information, the multivariable V-information from some variable X to Y is defined as the reduction in entropy when its value becomes known:

I_V(X → Y | C) = H_V(Y | C) − H_V(Y | C ∪ {X}).

In our notation, this means X moves from C̄ (the unknown variables) to C (the known variables), so this definition encompasses the notion of conditional mutual information if C is non-empty to start.
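Two special cases may help orient the reader (restated in LaTeX; these follow directly from the definition above):

```latex
% Single predictive variable, nothing known to start (C = \emptyset):
% recovers the unconditional V-information of Xu et al. (2020)
I_{\mathcal{V}}(X \to Y) = H_{\mathcal{V}}(Y) - H_{\mathcal{V}}(Y \mid X)

% Baseline B known throughout, X becomes known (C = \{B\}):
% the conditional quantity that conditional probing estimates
I_{\mathcal{V}}(X \to Y \mid B) = H_{\mathcal{V}}(Y \mid B) - H_{\mathcal{V}}(Y \mid B, X)
```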

A.1 Properties of multivariable V-information
The crucial property of multivariable V-information as a descriptor of probing is that it can be constructed through computation. In the example of an agent attempting to predict the sentiment (Y) of an encrypted message (X), if the agent has V equal to the set of linear functions, then I_V(X → Y) is small. A function φ that decrypts the message constructs V-information about Y, since I_V(φ(X) → Y) is larger. In probing, φ is interpreted to be the contextual representation learner, which is interpreted as constructing V-information about linguistic properties. V-information also has some desirable elementary properties, preserving some of the properties of mutual information, like non-negativity (knowing some X should not reduce the agent's ability to predict Y).
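The encryption intuition can be made concrete with a toy "cipher": XOR-ed bits carry essentially no usable information for a logistic (linear) family, but a decrypting transformation φ constructs it. This is a sketch under simplified assumptions (performance measured on the training sample; the family and data are our own illustration, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4000
x = rng.integers(0, 2, size=(n, 2))
y = x[:, 0] ^ x[:, 1]  # XOR "encryption": invisible to any linear probe

def neg_xent_linear(X, y, steps=800, lr=0.5):
    """Best negative cross-entropy (nats) of a logistic probe, via grad descent."""
    X = np.column_stack([X, np.ones(len(X))])  # append bias feature
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    p = 1 / (1 + np.exp(-X @ w))
    return np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

h_y = np.log(2)                                           # H_V(Y): y is a fair coin
i_raw = h_y + neg_xent_linear(x, y)                       # I_V(X -> Y): near 0
i_phi = h_y + neg_xent_linear(x[:, :1] ^ x[:, 1:], y)     # phi "decrypts": near log 2
```

The transformation φ(x) = x_1 XOR x_2 is deterministic, so it cannot increase mutual information, yet it constructs nearly log 2 nats of V-information for the linear family.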
Proposition 1. Let X_1, ..., X_n ∈ X_1, ..., X_n and Y ∈ Y be random variables, and V and U be predictive families. Let C, C̄ be a binary partition of X_1, ..., X_n, and let X′ ∈ C̄. Then:

1. (Monotonicity) If U ⊆ V, then H_V(Y | C) ≤ H_U(Y | C).
2. (Non-negativity) I_V(X′ → Y | C) ≥ 0.
3. (Independence) If X′ is independent of both Y and C, then I_V(X′ → Y | C) = 0.

B Probing as Multivariable V-information Estimation
We've described the V-information framework, and discussed how it captures the intuition that usable information about linguistic properties is constructed through contextualization. In this section, we demonstrate how a small step from existing probing methodology leads to probing estimating V-information quantities.

B.1 Estimating V-entropy
In probing, gradient descent is used to pick the function in V that minimizes the cross-entropy loss on the training set, where θ are the trainable parameters of functions in V. Recalling the definition of V-entropy, this minimization performed through gradient descent is approximating the infimum in that definition. To summarize, the supervision used in probe training can be interpreted as approximating the inf in the definition of V-entropy. In traditional probing, the performance of the probe is measured on the test set D_te using the traditional metric of the task, like accuracy or F1 score. In V-information probing, we use D_te to approximate the expectation in the definition of V-entropy. Thus, the performance of a single probe on representation R, where the performance metric is the cross-entropy loss, is an estimate of H_V(Y | R). This brings us to our framing of a probing experiment as estimating a V-information quantity.
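In symbols (a restatement under the paper's definitions, with f_θ ranging over V and f[r](y) denoting the density the probe assigns to label y):

```latex
% Training: gradient descent approximates the infimum over \mathcal{V}
\hat{\theta} \approx \arg\min_{\theta} \;
  \frac{1}{|D_{tr}|} \sum_{(x_i, y_i) \in D_{tr}} -\log f_{\theta}[\phi(x_i)](y_i)

% Evaluation: the test set approximates the expectation in the V-entropy
\hat{H}_{\mathcal{V}}(Y \mid \phi(X)) =
  \frac{1}{|D_{te}|} \sum_{(x_i, y_i) \in D_{te}} -\log f_{\hat{\theta}}[\phi(x_i)](y_i)
```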

B.2 Baselined probing
Let baselined probing be defined as in the main paper. Then if the performance metric is defined as the negative cross-entropy loss, we have that Perf(B) estimates −H_V(Y | B) and Perf(φ(X)) estimates −H_V(Y | φ(X)), and so the baselined probing performance is an estimate of H_V(Y | B) − H_V(Y | φ(X)).

B.3 Conditional probing
Let conditional probing be defined as in the main paper. Then if the performance metric is defined as the negative cross-entropy loss, Perf([B; 0]) estimates −H_V(Y | B) and Perf([B; φ_i(X)]) estimates −H_V(Y | B, φ_i(X)), and so the conditional probing performance is an estimate of

I_V(φ_i(X) → Y | B) = H_V(Y | B) − H_V(Y | B, φ_i(X)).

The first term is estimated with a probe on just B; under the definition of predictive family, this means providing the agent with the real values of the baseline, and some constant value like the zero vector instead of φ_i(X). That is, holding ā constant in place of φ_i(X) and sampling b, y ∼ B, Y, the probability assigned to y is f(b, ā)[y] for f ∈ V. The second term is estimated with a probe on both B and φ_i(X). So, sampling b, x, y ∼ B, X, Y, the probability assigned to y is f(b, φ_i(x))[y] for f ∈ V. Intuitively, conditional probing measures the new information in φ_i(X) because in both probes the agent has access to B, so no benefit is gained from φ_i(X) supplying the same information.
C Proof of Proposition 1

Monotonicity. This holds because we are taking the infimum over V, and if f ∈ U then f ∈ V; the infimum over the larger family V is at least as small. Letting ā and ā′ denote the constant values of the unknown set with and without X′, the proof of the remaining properties is as follows:

Non-Negativity
By definition, I_V(X → Y | C) = H_V(Y | C) − H_V(Y | C ∪ {X}). In the second line, we break down the expectation based on conditional independence. Then we apply Jensen's inequality and optional ignorance to remove the expectation with respect to x.

Independence. Since V_C̄ ⊂ V, the former's infimum is at least as large as the latter's, so I_V(X → Y | C) ≤ 0. Combined with non-negativity (i.e., I_V(X → Y | C) ≥ 0), we have inequality in both directions, so I_V(X → Y | C) = 0.

D Equivalence of Xu et al. (2020) and our V-information
In order to define conditional probing, we needed a theory of V-information that considers arbitrarily many predictive variables X_1, ..., X_n. V-information as presented by Xu et al. (2020) considers only a single predictive variable X. It becomes extremely cumbersome, due to the use of null variables in their presentation, to extend this to more, let alone arbitrarily many, variables. So, we redefined and extended V-information to more naturally capture the case with an arbitrary (finite) number of variables. In this section, we show that, in the single-predictive-variable case considered by Xu et al. (2020), our V-information definition is equivalent to theirs. For the sake of this section, we'll call the V-information of Xu et al. (2020) Xu-V-information, and ours V-information.
In particular, we show that there is a transformation from any predictive family of Xu-V-information to a predictive family for V-information under which the predictive V-entropies are the same (and likewise in the opposite direction).

D.1 From Xu et al. (2020) to ours
We recreate the definition of predictive family from Xu et al. (2020) here. Now, we construct one of our predictive families from the Xu predictive family. Let U ⊆ Υ be a Xu predictive family. We now construct a predictive family under our framework, V ⊆ Ω. For each f ∈ U, f : X ∪ {∅} → P(Y), construct the following two functions: first, g, which recreates the behavior of f on the domain X, and second, g′, which recreates the behavior of f on ∅, given any input from X. The resulting information quantities are the same. This shows that the predictive family we constructed in our theory is equivalent to the predictive family from Xu et al. (2020) that we started with.
D.2 From our V-information to that of Xu et al. (2020)

Now we construct a predictive family U under the framework of Xu et al. (2020) from an arbitrary predictive family V under our framework. For each function f ∈ V, we have from the definition that there exists f′ ∈ V such that for all x ∈ X, f′(x) = P for some P ∈ P(Y). We then define the function g, where ā ∈ X is an arbitrary element of X, along with the set G of constant-valued functions on z ∈ X ∪ {∅}, and let U contain them. The set U is a predictive family under Xu-V-information because for any f ∈ U, f is either a g or a g′ in our construction, and so optional ignorance is maintained by the set G that was either constructed for g or that g′ was a part of. That is, from the construction, G contains a function for each element in the range of g (or g′) that maps all x ∈ X, as well as ∅, to that element, and U contains all elements of G. Now we show that the predictive V-entropies of U (from this construction) under Xu et al. (2020) are the same as for V under our framework.
First we want to show H_U(Y | X) = H_V(Y | X). For the g that achieves the inf over U, there exists f ∈ V such that g(x) = f(x) for x ∈ X, so H_V(Y | X) ≤ H_U(Y | X). The same is true in the other direction; the f ∈ V that achieves the inf in the V-entropy similarly corresponds to some g ∈ U, implying H_U(Y | X) ≤ H_V(Y | X), and thus their equality. Now we want to show H_U(Y | ∅) = H_V(Y). For the g′ ∈ U that achieves its inf, we have by construction that there is an f ∈ V such that for any ā ∈ X, it holds that g′(∅) = f(ā). So, H_V(Y) ≤ H_U(Y | ∅). In the other direction, for the f ∈ V that achieves its inf given an arbitrary ā ∈ X, there is the f′ ∈ V from our construction of U such that f′(ā) = f′(x) = g′(∅) for all x ∈ X. This implies H_U(Y | ∅) ≤ H_V(Y), and thus their equality.

D.3 Remarks on the relationship between our V-information and that of Xu et al. (2020)
The difference between our V-information and that of Xu et al. (2020) is in how the requirement of optional ignorance is encoded into the formalism. This is an important yet technical requirement: if a predictive agent has access to the value of a random variable X, it is allowed to disregard that value if doing so would lead to a lower entropy. An example of a subset of Ω for which this doesn't hold in the multivariable case is the set of multi-layer perceptrons with a frozen (and, say, randomly sampled) first linear transformation. The information of, say, X_1 and X_2 is mixed by this frozen linear transformation, and so X_1 cannot be ignored in favor of just looking at X_2. However, if the first linear transformation is trainable, then it can simply assign 0 weights to the rows corresponding to X_1 and thus ignore it. The V-information of Xu et al. (2020) ensures this option by introducing a null variable ∅, which is used to represent the lack of knowledge about their variable X; for any probability distribution in the range of some f ∈ U under their theory, there must be some function f′ that produces the same probability distribution when given any value of X or ∅. This is somewhat unsatisfying because f should really be a function from X → P(Y), but this implementation of optional ignorance changes the domain to X ∪ {∅}. When attempting to extend this to the multivariable case, the definition of optional ignorance becomes very cumbersome. With two variables, the domain of functions in a predictive family must be (X_1 ∪ {∅}) × (X_2 ∪ {∅}). Because the definition of V-entropy under Xu et al. (2020) treats using X separately from using ∅, one must define optional ignorance constraints separately for each subset of variables to be ignored, the number of which grows exponentially with the number of variables.
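A tiny sketch of the MLP intuition above (illustrative shapes): if the first linear layer is trainable, setting the rows acting on X_1 to zero makes the layer's output invariant to X_1, i.e., X_1 can be "optionally ignored":

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
W = rng.standard_normal((2 * d, 8))
W[:d, :] = 0.0  # zero the rows acting on x1: the layer now ignores x1

def first_layer(x1, x2):
    # ReLU(W^T [x1; x2]) -- a one-layer slice of an MLP
    return np.maximum(np.concatenate([x1, x2]) @ W, 0.0)

x2 = rng.standard_normal(d)
out_a = first_layer(rng.standard_normal(d), x2)
out_b = first_layer(rng.standard_normal(d), x2)  # different x1, same x2
```

With a frozen, randomly sampled W, no such zeroing is possible, and the family violates optional ignorance.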
Our re-definition of V-information gets around this issue by defining the optional ignorance constraint in a novel way, eschewing the ∅ and instead encoding it as the intuitive implementation described in the MLP example: that for any function in the family and any fixed value for some subset of the inputs (which will be the unknown subset), there is a function in the family that behaves identically even if that subset of values is allowed to take any value (intuitively, by, e.g., having it be possible that the weights for those inputs are 0 at the first layer).

E Full Results
In this section, we report all individual probing experiments: single-layer probes' V-entropies in Table 2, single-layer probes' task-specific metrics in Table 3, two-layer probes' V-entropies in Table 4, and two-layer probes' task-specific metrics in Table 5. In Figure 2, we report the xpos figure for RoBERTa corresponding to the other four figures in the main paper. We see that it shows roughly the same trend as the upos figure from the main paper.

Table 3: Task-specific metric results on probes taking in one layer, for each layer of the network. For upos, xpos, dep, and sst2, the metric is accuracy. For NER, it is span-level F1 as computed by the Stanza library (Qi et al., 2020). For all metrics, higher is better.
Table 4: RoBERTa two-layer V-entropy results.

Table 5: Task-specific metric results on probes taking in two layers: layer 0 and each other layer of the network. For upos, xpos, dep, and sst2, the metric is accuracy. For NER, it is span-level F1 as computed by the Stanza library. For all metrics, higher is better.