Bird’s Eye: Probing for Linguistic Graph Structures with a Simple Information-Theoretic Approach

NLP has a rich history of representing our prior understanding of language in the form of graphs. Recent work on analyzing contextualized text representations has focused on hand-designed probe models to understand how, and to what extent, these representations encode a particular linguistic phenomenon. However, due to the interdependence of various phenomena and the randomness of training probe models, detecting how these representations encode the rich information in these linguistic graphs remains a challenging problem. In this paper, we propose a new information-theoretic probe, Bird’s Eye, a simple method for detecting if and how these representations encode the information in these linguistic graphs. Instead of using model performance, our probe takes an information-theoretic view of probing and estimates the mutual information between the linguistic graph embedded in a continuous space and the contextualized word representations. Furthermore, we propose an approach that uses our probe to investigate localized linguistic information in the linguistic graphs through perturbation analysis. We call this probing setup Worm’s Eye. Using these probes, we analyze BERT models on their ability to encode a syntactic and a semantic graph structure, and find that these models encode both syntactic and semantic information to some degree, with syntactic information encoded to a greater extent.

With the growing popularity of pretrained language models that build contextualized text representations (Reid et al., 2020; Devlin et al., 2019, inter alia), various probing models have been introduced to understand if and how our linguistic intuitions are encoded in these representations. These probes train supervised models to predict pieces of linguistic information such as POS (part-of-speech), morphology, syntactic and semantic relations, and other local or long-range phenomena in language (Belinkov et al., 2017; Conneau et al., 2018; Hewitt and Manning, 2019; Tenney et al., 2019b; Jawahar et al., 2019). However, it is still an open question whether these representations encode entire linguistic graph structures such as dependency and constituency parse trees, or graph-structured meaning representations such as AMR (Abstract Meaning Representation), UCCA (Universal Conceptual Cognitive Annotation), etc.
A popular recent work, the structural probe (Hewitt and Manning, 2019), has investigated how contextualized representations encode syntax trees. They tested whether a linear transformation of the network's word representation space can predict particular features of the syntax tree, namely, the distance between words and the depth of words in the tree. Thus, the structural probe cannot by itself answer the question of whether these representations encode entire linguistic graph structures. Moreover, the structural probe is only designed for tree structures and cannot be extended to general graphs.
In this work, we introduce a new probing approach, Bird's Eye, which can be used to detect if contextualized text representations encode entire linguistic graphs. Bird's Eye is a simple information-theoretic probe (Pimentel et al., 2020b) which first encodes the linguistic graph into a continuous representation using graph embedding approaches (Cai et al., 2018) and then estimates the mutual information between the linguistic graph representation space and the contextualized word representation space. An illustration of the probe approach is given in Figure 1. The information-theoretic approach is more reliable than training a probe and using accuracy for probing, as it is debatable whether a classifier-based probe is probing or trying to solve the task (Hewitt and Liang, 2019; Pimentel et al., 2020b). We further extend Bird's Eye to probe for localized linguistic information in the linguistic graphs, such as POS or dependency arc labels in dependency parses. We call this probe Worm's Eye.
Figure 1: Methodology of Bird's Eye. To probe pretrained language models, linguistic graphs are embedded in a continuous space and the mutual information between graph embeddings and word representations is calculated.
In our experiments, we first illustrate the reliability of our probe methods and show the randomness of previous probe methods that use accuracy. Then, we use Bird's Eye to detect syntactic and semantic structures in BERT, showing that much syntactic and some semantic structure is encoded in BERT. In addition, we use Worm's Eye to probe for specific linguistic information in syntactic trees and semantic graphs, respectively, to see which kinds of localized linguistic information are encoded in BERT. Our probing results are consistent with previous probe methods (Hewitt and Manning, 2019; Reif et al., 2019; Liu et al., 2019; Tenney et al., 2019a,b; Wu et al., 2021). We also discuss limitations of our probe and how future work can build upon our foundation.

Bird's Eye Probe
In this section, we introduce our information-theoretic approach for probing linguistic graph structures in word representations. The mutual information (MI) estimate is used to understand how much of the information in the linguistic graph structure has been learnt by the pretrained models.
Let $X = \{x_1, \ldots, x_T\}$ denote an input sentence (each $x_i$ is the contextual embedding of a token in the given vocabulary $\mathcal{V}$) and $G$ denote the corresponding linguistic graph. Furthermore, let $\mathcal{X}$ denote a random variable that takes values ranging over all possible token sequences in $\mathcal{V}$. Correspondingly, let $\mathcal{G}$ denote a random variable that ranges over all possible corresponding linguistic graphs. We use $I(\mathcal{X}; \mathcal{G})$ to denote the linguistic structure information that is included in the given word representations. Note that the MI value $I(\mathcal{X}; \mathcal{G})$ is always non-negative, and a larger MI implies that more of the structure information is encoded in the word representations. In order to make the MI computation easier, we additionally assume alignments between the nodes $V$ in the graph $G$ and the words in $X$. This alignment is one-to-one in dependency parsing, for example (Marcus et al., 1993), but an aligner might be needed in some cases (Banarescu et al., 2013).
There are three main challenges in estimating MI in our setting. First, estimating MI between discrete graphs and continuous features has been an elusive problem (Ross, 2014; Kraskov et al., 2004; Escolano et al., 2017), since there is no widely accepted definition of mutual information in this setting. Second, the dimensionality of the contextualized word representations is very high. Traditional methods (Moon et al., 1995; Steuer et al., 2002; Paninski, 2003) for MI estimation do not scale well with large sample size or dimension (Gao et al., 2015), so accurate estimates of mutual information are hard to obtain in high dimensions. Third, graphs across different linguistic formalisms could have different entropy values, and thus the MI value $I(\mathcal{X}; \mathcal{G})$ may not be comparable across formalisms. For example, if syntactic trees $\mathcal{G}$ and semantic graphs $\mathcal{G}'$ have the same MI value with $\mathcal{X}$, i.e., $I(\mathcal{X}; \mathcal{G}) = I(\mathcal{X}; \mathcal{G}')$, while their entropy values differ substantially, i.e., $I(\mathcal{X}; \mathcal{G}) \approx H(\mathcal{G}) \ll H(\mathcal{G}')$, it is not proper to conclude that $\mathcal{X}$ contains the same amount of information from structures $\mathcal{G}$ and $\mathcal{G}'$, since the two MI values correspond to different percentages of the total uncertainty. Thus, the MI values must be interpreted carefully.
Bird's Eye tackles the aforementioned difficulties by transforming the linguistic graphs into a continuous space using a graph embedding approach. Then the MI between graph embeddings and word representations is estimated using a recently proposed method (Belghazi et al., 2018) which performs well even in high dimensions. Finally, we also estimate a lower and upper bound of the MI, which is used to interpret the MI value. We describe various stages of Bird's Eye below:

Graph Embedding
The provided linguistic graphs can typically be represented as an adjacency matrix. Directly calculating MI with the adjacency matrix is not useful due to the sparsity and discreteness of the adjacency matrix representation. Thus, we transform the graphs into a continuous space where each node is represented by a continuous representation of the same dimensionality.
Theoretically, if the graph embedding approach is perfect, we can use the invariance property of mutual information (Kraskov et al., 2004). This property states that, under some fairly strong conditions, there exists an invertible function $f(\cdot)$ that satisfies $G = f^{-1}(f(G))$, where the graph embeddings are $Z = f(G)$. Thus, we can transform $\mathcal{G}$ into graph embeddings $\mathcal{Z}$, and:

$$I(\mathcal{X}; \mathcal{G}) = I(\mathcal{X}; \mathcal{Z}) \tag{1}$$

In this paper, we use DeepWalk (Perozzi et al., 2014), which is based on the skip-gram model (Huang et al., 1993; Mikolov et al., 2013), for graph embeddings. Specifically, given a node $v \in V$ encoded as the one-hot vector $\mathbb{1}_v$, the model tries to predict its neighbor's vector $\mathbb{1}_{v'}$, where $v' \in N_v$. The graph $G = \{V, E\}$ is first sampled to generate a set of random walks. Then the graph neighborhood relationship is represented by the co-occurrence of nodes in the walk paths. Finally, for all the walks, Word2vec (Mikolov et al., 2013) with skip-gram is used to maximize the co-occurrence likelihood of each node and its neighbors. Let $Z = \oplus\{z_v \mid v \in V\}$ denote the learnt graph embedding, where $z_v$ is the embedding of node $v$.
Here ⊕ denotes the concatenation operation.
In our experiments, we also explore to what extent the original linguistic graphs can be restored from the graph embeddings. This tests the extent to which eq. 1 holds and whether we can use I(X ; Z) instead of I(X ; G) to estimate MI. More details can be found in Appendix A.

Mutual Information Estimation
To estimate $I(\mathcal{X}; \mathcal{Z})$ in high dimensions, we maximize the compression lemma lower bound (Banerjee, 2006), following Belghazi et al. (2018). Specifically, for a pair of random variables $\mathcal{X}$ and $\mathcal{Z}$, the mutual information is equivalent to the Kullback-Leibler (KL) divergence between the joint distribution $P_{\mathcal{X}\mathcal{Z}}$ and the product of the marginal distributions $P_{\mathcal{X}} \otimes P_{\mathcal{Z}}$:

$$I(\mathcal{X}; \mathcal{Z}) = D_{KL}(P_{\mathcal{X}\mathcal{Z}} \,\|\, P_{\mathcal{X}} \otimes P_{\mathcal{Z}}) \tag{3}$$

From the compression lemma lower bound (Banerjee, 2006), the KL divergence $D_{KL}(P \,\|\, Q)$ can be bounded as:

$$D_{KL}(P \,\|\, Q) \ge \sup_{T \in \mathcal{F}} \; \mathbb{E}_{P}[T] - \log \mathbb{E}_{Q}[e^{T}] \tag{4}$$

where $\mathcal{F}$ can be any class of functions $T: \Omega \to \mathbb{R}$ satisfying certain integrability constraints. Thus, in inequality 4, the lower bound can be obtained by finding a function in the set $\mathcal{F}$:

$$I(\mathcal{X}; \mathcal{Z}) \ge \sup_{T \in \mathcal{F}} \; \mathbb{E}_{P_{\mathcal{X}\mathcal{Z}}}[T] - \log \mathbb{E}_{P_{\mathcal{X}} \otimes P_{\mathcal{Z}}}[e^{T}]$$

To get a tight estimate of $I(\mathcal{X}; \mathcal{Z})$, we need the lower bound to be as high as possible. Thus, the MI estimation problem turns into an optimization problem that maximizes the compression lemma lower bound. To that end, similar to Belghazi et al. (2018), we let $\mathcal{F} = \{T_\theta\}_{\theta \in \Theta}$ be the set of functions parametrized by a neural network, and optimize the neural network using stochastic gradient descent. Formally, the objective function is:

$$\max_{\theta \in \Theta} \; \mathbb{E}_{P^{(n)}_{\mathcal{X}\mathcal{Z}}}[T_\theta] - \log \mathbb{E}_{P^{(n)}_{\mathcal{X}} \otimes P^{(n)}_{\mathcal{Z}}}[e^{T_\theta}] \tag{5}$$

Here, $P^{(n)}_{\mathcal{X}\mathcal{Z}}$ and $P^{(n)}_{\mathcal{X}} \otimes P^{(n)}_{\mathcal{Z}}$ are the empirical joint and marginal distributions over a sample of $n$ (sentence, graph) pairs.
We calculate graph embeddings for each sentence independently, and treat each sentence as a mini-batch when iteratively optimizing the neural network for MI estimation. Note that, unlike existing probe models, the objective of our neural network is to find an optimal function in $\mathcal{F}$ and estimate MI, rather than to maximize prediction accuracy. In addition, the neural network is very simple (an MLP). Therefore, there is no need to split the dataset into training and test sets to test generalization in MI estimation. The negative of the training loss, i.e., the value of eq. 5, can be taken directly as the MI estimate (Belghazi et al., 2018; Cristiani et al., 2020). In our experiments, we verify the effectiveness of the MI estimation method to show that the probe is stable. More technical details of the MI estimation model and how it is trained are given in Appendix B.
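As a rough illustration of how the two terms of eq. 5 are evaluated on one mini-batch, the following sketch computes the lower bound from the outputs of a statistic function. It is a minimal sketch, not the implementation used in the paper: the names `mi_lower_bound` and `toy_statistic` are hypothetical, the bilinear `toy_statistic` merely stands in for the trained network $T_\theta$, and the inputs are random placeholders. Samples from the product of marginals are obtained by shuffling the word representations, as described in Appendix B.

```python
import numpy as np

def mi_lower_bound(t_joint, t_marginal):
    """Lower bound E_joint[T] - log E_marginal[exp(T)] from statistic outputs.

    t_joint:    T(x, z) on aligned (word representation, graph embedding) pairs.
    t_marginal: T(x_shuffled, z) on pairs whose dependency was broken by shuffling.
    """
    t_joint = np.asarray(t_joint, dtype=float)
    t_marginal = np.asarray(t_marginal, dtype=float)
    # log-mean-exp, computed stably by subtracting the maximum first
    m = t_marginal.max()
    log_mean_exp = m + np.log(np.mean(np.exp(t_marginal - m)))
    return float(t_joint.mean() - log_mean_exp)

def toy_statistic(x, z, w):
    """Hypothetical stand-in for T_theta: a bilinear score x^T W z per row."""
    return np.einsum("nd,de,ne->n", x, w, z)

rng = np.random.default_rng(0)
n, d_x, d_z = 10, 8, 4               # e.g. one sentence with 10 aligned tokens
x = rng.normal(size=(n, d_x))        # placeholder word representations
z = rng.normal(size=(n, d_z))        # placeholder graph embeddings
w = rng.normal(size=(d_x, d_z)) * 0.1

x_shuffled = x[rng.permutation(n)]   # shuffle rows to mimic the product of marginals
bound = mi_lower_bound(toy_statistic(x, z, w), toy_statistic(x_shuffled, z, w))
```

In an actual probe, the parameters of the statistic function would be updated by gradient ascent on this quantity, and the largest value reached would serve as the MI estimate.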

Control Bounds
Next, we introduce two control bounds to interpret the MI value, whose function is similar to that of the control task introduced by Hewitt and Liang (2019). As mentioned, comparing MI alone across different types of structures is not useful, since the entropy values of graph embeddings can also be different. Thus, we calculate an upper and a lower bound of the MI value based on the graph structures. Instead of using the MI value alone, we interpret it by its relative value in terms of the two control bounds. Formally, for the MI between graph embeddings and word representations, we have:

$$I(\mathcal{R}; \mathcal{Z}) \le I(\mathcal{X}; \mathcal{Z}) \le I(\mathcal{Z}; \mathcal{Z}) \tag{6}$$

The lower bound is the MI between a truly random variable $\mathcal{R}$ (i.e., independent of the graph) and the graph embedding $\mathcal{Z}$. Thus, $I(\mathcal{R}; \mathcal{Z}) = 0$. The upper bound reduces to the graph structure's self-entropy. Using these two control bounds, we interpret the structure information by the relative MI value:

$$\mathrm{MIG}(G) = \frac{\hat{I}(\mathcal{X}; \mathcal{Z}) - \hat{I}(\mathcal{R}; \mathcal{Z})}{\hat{I}(\mathcal{Z}; \mathcal{Z}) - \hat{I}(\mathcal{R}; \mathcal{Z})} \tag{7}$$

The MI estimates $\hat{I}(\mathcal{Z}; \mathcal{Z})$ and $\hat{I}(\mathcal{R}; \mathcal{Z})$ can be obtained in the same way as $\hat{I}(\mathcal{X}; \mathcal{Z})$, using the MI estimation method described above. MIG (eq. 7) scales the MI value for graph embeddings with different self-entropy values into the same range: $\mathrm{MIG}(G) \in [0, 1]$. Intuitively, MIG captures what percentage of the structure information is encoded in the word representations. Since $\mathrm{MIG}(G)$ is scaled using $\hat{I}(\mathcal{R}; \mathcal{Z})$, it also helps reduce the error in MI estimation. As mentioned, we maximize the compression lemma lower bound (eq. 5) to obtain the MI estimate. However, there could be a gap between this estimate and the ground-truth MI value. Since the ground-truth $I(\mathcal{R}; \mathcal{Z}) = 0$, the gap $I(\mathcal{R}; \mathcal{Z}) - \hat{I}(\mathcal{R}; \mathcal{Z})$ is equal to $-\hat{I}(\mathcal{R}; \mathcal{Z})$. In MIG (eq. 7), this gap is added to both the numerator and the denominator, which reduces the error introduced by MI estimation.
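To make the normalization concrete, the following sketch computes the relative MI value from three MI estimates, assuming eq. 7 has the form (estimate minus random-control estimate) divided by (self-MI estimate minus random-control estimate), as the surrounding text describes. The function name `mig` and all numbers are hypothetical and chosen only for illustration; they are not results from our experiments.

```python
def mig(i_xz, i_zz, i_rz):
    """Relative MI value: rescale the estimate I(X;Z) between the two control bounds.

    i_xz: estimate of I(X;Z), word representations vs. graph embeddings
    i_zz: estimate of I(Z;Z), the upper control bound
    i_rz: estimate of I(R;Z), the lower control bound (ground truth is 0)
    """
    denom = i_zz - i_rz
    if denom <= 0:
        raise ValueError("upper-bound estimate must exceed the lower-bound estimate")
    return (i_xz - i_rz) / denom

# Made-up estimates: two formalisms with the same MI but different self-entropy.
syntax = mig(i_xz=1.2, i_zz=1.5, i_rz=0.1)
semantic = mig(i_xz=1.2, i_zz=3.1, i_rz=0.1)
```

With these made-up inputs, the same raw MI estimate corresponds to a larger share of the structure for the formalism with the smaller self-entropy, which is the situation the control bounds are meant to expose.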

Worm's Eye Probe for Localized Linguistic Structure
Bird's Eye allows us to probe for entire linguistic structures. However, for a complete understanding, we might also want to probe for localized information in the linguistic graphs. For example, we may want to know if BERT knows about POS tags or certain dependency relations in the syntax parse. We formulate probing for localized linguistic information as probing for a subgraph of the linguistic graph and reuse our Bird's Eye probe for it. We call this setting Worm's Eye, as we are now analyzing whether these representations capture local sub-structures.
To probe localized linguistic information $G_s = \{V_s, E_s\}$, we use perturbation of the original structure for analysis. Specifically, we add a perturbation to the original graph embedding $\mathcal{Z}$ based on the subgraph $G_s$. For all the nodes in $V_s$, or nodes connected by edges in $E_s$, we add noise to their corresponding node representations in $\mathcal{Z}$. Let $\mathcal{Z}'$ denote the corrupted graph embedding. Then, we define the following:

$$\mathrm{MIL}(G_s) = 1 - \frac{I(\mathcal{X}; \mathcal{Z}')}{I(\mathcal{X}; \mathcal{Z})} \tag{8}$$

MIL describes how much MI is contributed by the local structure $G_s$. When the local structure is the whole graph, $\mathcal{Z}'$ is completely noisy and $\mathrm{MIL}(G_s)$ equals 1, which means the entire MI value $I(\mathcal{X}; \mathcal{Z})$ is contributed by the local structure. If the local structure is an empty set, we have $\mathcal{Z}' = \mathcal{Z}$. Then $\mathrm{MIL}(G_s) = 0$, meaning that the local structure does not contribute anything to the MI value. If we control the perturbation of different types of local structures at the same level, we can compare how well they are captured by the word representations relative to each other using eq. 8. Specifically, for relations with labels, e.g., types of dependency relations in syntax trees, we apply the same perturbation to the graph embeddings. Then, we test and compare $\mathrm{MIL}(G_s)$ for different types of relations. A larger $\mathrm{MIL}(G_s)$ for a particular relation type implies that more information about this relation type is encoded in the word representations.
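The perturbation step can be sketched as follows. This is a minimal sketch under stated assumptions: the text only says that noise is added to the selected node representations, so the Gaussian noise, its scale, and the names `perturb` and `mil` are hypothetical choices made for illustration. The ratio in `mil` is likewise only one form consistent with the two boundary cases described above (1 when the whole graph is perturbed, 0 when nothing is).

```python
import numpy as np

def perturb(z, node_ids, scale=1.0, seed=0):
    """Return a copy of z with Gaussian noise added to the rows listed in node_ids."""
    rng = np.random.default_rng(seed)
    z_noisy = np.array(z, dtype=float, copy=True)
    idx = sorted(set(node_ids))
    if idx:  # nothing to do for an empty local structure
        z_noisy[idx] += rng.normal(scale=scale, size=(len(idx), z_noisy.shape[1]))
    return z_noisy

def mil(i_xz_perturbed, i_xz):
    """Share of the MI that is lost when the local structure is perturbed."""
    return 1.0 - i_xz_perturbed / i_xz

z = np.zeros((5, 3))                    # placeholder embeddings: 5 nodes, 3 dimensions
z_prime = perturb(z, node_ids=[1, 3])   # e.g. the nodes carrying one POS tag
```

To compare local structures fairly, the same noise scale and the same number of perturbed nodes would be used for every structure type, in line with the requirement that perturbations be controlled at the same level.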

Probing for Syntactic and Semantic Graph Structures
We use our Bird's Eye probe to detect two linguistic structures in the pretrained models, namely, dependency syntax (Marcus et al., 1993; and a more semantic formalism, AMR (Banarescu et al., 2013).
We first use our model to probe for Stanford dependencies (de . For a sentence $X$ with tokens $\{x_1, x_2, \ldots, x_T\}$, the syntax tree defines a directed labelled tree where tokens $x_i$ are represented as nodes and relations among them as labeled edges. We ignore the edge direction and labels for simplicity in our work 7 . Future work can consider incorporating edge direction and labels. We embed the given syntax tree into a continuous space as described before. Then, we calculate the MIG (eq. 7) to determine how much syntax information is captured in the given contextualized representations.
Next, we test if contextualized representations capture a semantic graph representation, the Abstract Meaning Representation (AMR) (Banarescu et al., 2013). Unlike syntactic trees, semantic graphs are not tree structured, and there can be loops or reentrancies. In the AMR annotation, plurality, articles and tenses are dropped, and thus there is no one-to-one correspondence between words in the sentence and nodes in the AMR graph. We therefore use an off-the-shelf aligner (Pourdamghani et al., 2014) and calculate MI between the AMR graph embedding and the representations of those words that are aligned with a node in the AMR graph. For simplicity, edge directions and labels are also ignored in this setting.
7 The Stanford dependency tree also contains one empty root node, which is also ignored.

Experiments
Our experiments mainly comprise two parts: 1. Verification of the probe: The first part verifies the probing methodology and ensures that the graph embeddings retain information about the linguistic graphs, i.e., that eq. 1 holds. We do this by testing if the graph embeddings can be used to restore the original graph. 2. Probing for graph structures: The second part uses the probe to detect syntactic and semantic graph structures in BERT. Importantly, we probe whether pretrained BERT captures entire graph structures as well as specific relational information in these linguistic graphs. To contrast with previous accuracy- and training-based probes, we also train a group of simple MLP models with different numbers of hidden layers and use accuracy for probing. We show that designing and training a model to probe entire or localized linguistic structures is not as reliable as our information-theoretic approach.
We use gold annotations from the Penn Treebank and the AMR Bank for all our experiments. For the contextualized word representations, we select pretrained BERT models, specifically BERT-base (uncased) and BERT-large (uncased). Since BERT generates word-piece embeddings, to align them with gold word-level tokens, we represent each token as the average of its word-piece embeddings as in Hewitt and Manning (2019). We also use two non-contextual word embeddings as baselines: GloVe embeddings (Pennington et al., 2014) and ELMo-0, character-level word embeddings with no contextual information generated by pretrained ELMo (Reid et al., 2020).
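The word-piece averaging step can be sketched as below. This is a minimal sketch rather than the code used in our experiments: the function name `average_word_pieces`, the `word_ids` mapping from each word piece to its gold token, and the toy vectors are all hypothetical, and the sketch assumes every token has at least one word piece.

```python
import numpy as np

def average_word_pieces(piece_vectors, word_ids):
    """Average the word-piece vectors that belong to the same gold token.

    piece_vectors: array of shape (num_pieces, dim).
    word_ids:      for each piece, the index of the token it came from.
    """
    piece_vectors = np.asarray(piece_vectors, dtype=float)
    word_ids = np.asarray(word_ids)
    num_words = int(word_ids.max()) + 1
    sums = np.zeros((num_words, piece_vectors.shape[1]))
    np.add.at(sums, word_ids, piece_vectors)        # accumulate pieces per token
    counts = np.bincount(word_ids, minlength=num_words)
    return sums / counts[:, None]

# Toy case: a first token split into three pieces, a second token kept whole.
pieces = np.array([[1.0, 0.0], [3.0, 0.0], [5.0, 3.0], [2.0, 2.0]])
tokens = average_word_pieces(pieces, word_ids=[0, 0, 0, 1])
```

The result has one row per gold token, so it can be paired row by row with the node embeddings of the aligned graph.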

Evaluation of Graph Embeddings
We first evaluate how well the graph embeddings can capture the linguistic graph structures by predicting the original graphs with them. We use simple MLPs of 6 different settings with varying number of hidden layers. More details can be found in Appendix C. We use AUC score as the metric to evaluate the graph prediction performance, which is a common metric in link prediction that computes area under the ROC curve (Fawcett, 2006).
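For readers unfamiliar with the metric, the AUC score can be computed directly from its pairwise interpretation: it is the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge. The sketch below is only illustrative; the function name `roc_auc` and the scores are hypothetical and do not correspond to Table 1.

```python
def roc_auc(scores, labels):
    """Probability that a random true edge outscores a random non-edge (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative pair")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores for four candidate node pairs; 1 marks a gold edge.
auc = roc_auc(scores=[0.1, 0.4, 0.35, 0.8], labels=[0, 0, 1, 1])
```

A score of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of edges above non-edges, which makes the metric insensitive to the strong imbalance between edges and non-edges in sparse linguistic graphs.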
The results are presented in Table 1. We can see that for both syntax trees and semantic graphs, MLPs achieve high AUC scores in restoring the original graph from the graph embeddings. Thus, we can be confident that eq. 1 holds, and we can calculate MI based on the graph embeddings. Future work can explore better graph embedding approaches. We also evaluate our probe by adding noisy representations to the graph embeddings to show that it is capable of teasing out different levels of dependencies. Details are given in Appendix D.

Probing Entire Structures
We first use the Bird's Eye probe to detect if entire linguistic structures are encoded in hidden representations of BERT. We also include two non-contextual word representations, GloVe and ELMo-0, as baselines. We report MIG as the result of our probe on the two graph structures in Figures 2(a) and 2(b). The MIG estimates for syntactic structure probing of both BERT-base and BERT-large are quite high, which implies that BERT encodes much syntactic information. However, for the semantic structure, the MIG scores of BERT models are lower, suggesting that BERT does not encode the semantic structures as well. These two conclusions are consistent with previous works (Liu et al., 2019; Tenney et al., 2019b; Wu et al., 2021), which have found that, unlike syntax, semantics is not captured well by the pretrained models.
We also observe an interesting trend when comparing MIG across layers. We find that for syntax, MIG starts to decrease in the upper layers, especially for BERT-large. This is consistent with previous works which report that BERT models syntax more in the lower and middle layers (Tenney et al., 2019a). For semantic graphs, MIG is steady across all layers, which suggests that semantic information is spread across the entire model. The results are consistent with existing work (Rogers et al., 2020). For the two non-contextual baselines, GloVe and ELMo-0, the MIG scores are lower than those of the contextualized representations, especially for syntax. Previous work (Hewitt and Manning, 2019) has drawn similar conclusions. For the semantic graphs, however, the gap is not significant.

Probing Localized Information
In this section, we show how the Worm's Eye probe can be used to understand whether the contextualized representations capture localized linguistic information in the dependency parses, such as POS information or dependency relation information. As described before, we design various perturbation experiments using our Worm's Eye probe. For probing POS information or a dependency relation type, we add noise to the graph embeddings of the corresponding node(s). After that, we calculate the MIL ratio (eq. 8) to show how much of the particular linguistic information (POS or relation type) is contained in the word representations. We repeat the experiment 20 times and use boxplots to present all the results.
First, we use Worm's Eye to test for POS information, which is tagged as node labels in the dependency tree. We select 5 POS tags: IN, NNP, DT, JJ, and NNS, which have high and roughly equal frequencies in the Penn Treebank dataset. Complete statistics about the POS tag frequencies can be found in Appendix E. We ensure that the amount of perturbation of the graph embeddings is the same for each type. Figure 3 presents the results. We find that NNP achieves the highest MIL score, while NNS achieves the lowest. This implies that BERT encodes syntactic information for singular proper nouns (NNP) and adjectives (JJ) more than for plural nouns (NNS).
Next, we probe 5 types of universal dependency relations in the Penn Treebank dataset (PTB): prep, pobj, det, nn and nsubj. These 5 relations also occur roughly the same number of times in PTB. Complete statistics about the number of occurrences of these relation types can be found in Appendix E. Similarly, for each type of relation, we add the same amount of perturbation to the graph embeddings of nodes connected by the specific relation. Figure 4 shows the results, where nsubj relations have the lowest MIL score compared with the other 4 types. This means that BERT encodes more syntactic structure for prepositional modifiers (prep), objects of a preposition (pobj), and noun compound modifiers (nn) than for nominal subjects (nsubj). Reif et al. (2019) have drawn similar conclusions while probing for dependency arc labels. A similar experiment for semantic structure can be found in Appendix F.

On Accuracy-Based Probing
In contrast to our information-theoretic approach to probing, we train a group of MLP models to probe entire and local structures in BERT-base. We show that these probe results mainly depend on the model complexity rather than the structure itself.
Probing entire graph structures. A group of MLPs is trained to predict entire syntactic and semantic structures from word representations. Figure 5 and Figure 6 show the results. Their trends are similar: shallow MLPs perform the worst and deep ones perform much better. Previous work on structural probing (Hewitt and Manning, 2019) argues that powerful models could parse the word representations, and thus a simple model should be designed. However, in Table 1, we find that even a linear model cannot restore the graph from its own embeddings. Its performance therefore cannot indicate how much structure information is included in the graph embeddings. Thus, there is no reasonable principle for deciding the complexity of the probe model. Given this, designing and training a model is not suitable for probing entire structures. A similar argument has been made in previous works (Pimentel et al., 2020a,b; Lovering et al., 2021).
Probing localized graph structures. To show that the accuracies of probe models for localized structure also mainly depend on the model's complexity rather than the local structure, we train the group of MLPs to predict the entire syntactic structure from word representations, and calculate the AUC scores for each type of relation in the test set as probing results. Table 2 shows the AUC scores for predicting specific types of relations. For syntactic structure, the same 5 types of relations are selected, and for semantic graphs, we select 3 groups of relations to probe: arg, general and op. Complete statistics of the AMR Bank are in Appendix E. From the results, we find that for MLP models with different numbers of hidden layers, the rankings of the AUC scores across relation types are quite different. For both syntax trees and semantic graphs, there is no consistent interpretation of the results from which to conclude which types of relations are encoded in BERT. We also run the experiment in the perturbation setting; see Appendix G.
Combining the results of probing with accuracy in Figure 5, Figure 6, and Table 2, we find that the prediction decisions are not based purely on the structure but rather on spurious heuristics. This has also been concluded and discussed in some recent works (Hewitt and Liang, 2019; Lovering et al., 2021). Thus, training models is not a reliable way to probe structures. For our probe methods, model-dependent variation such as complexity is not an issue, since the model with the highest estimate should be selected to obtain a tighter compression lemma bound (eq. 4), as discussed by Pimentel et al. (2020b).

Hyperparameter and Efficiency
Information-theoretic approaches sacrifice simplicity and efficiency to achieve reliable probing results compared to accuracy-based probes. Even though our probes are quite simple, there are more hyperparameters that need to be selected by users compared to accuracy-based probes. To help users implement our methods in their setting, we briefly describe some guiding principles to help them select hyperparameters, and point out several potential ways to make our probing approach more efficient.
Our probes are composed of two steps: (a) computation of the graph embedding, and (b) estimation of the mutual information. The guiding principle in the graph embedding step is to retain as much linguistic graph information as possible. In our experiments, we used the default hyperparameters in DeepWalk (Perozzi et al., 2014) for simplicity. Details can be found in Appendix A. However, users may also use other graph embedding approaches that incorporate edge labels, etc., to improve our model. As the mutual information estimation procedure estimates a lower bound on the true mutual information, the guiding principle for hyperparameter selection in this step should be to let the MI estimate be as large as possible. In particular, model size is worth noting. Deeper models can achieve a tighter lower bound. However, they are less efficient than shallow ones. Thus, the choice of the MI estimator's complexity is a tradeoff. In our experience, a two-layer MLP is a relatively good choice. More details can be found in Appendix D. Note that it might also be harder to achieve convergence with deeper models, as training MI estimators is notoriously difficult. We leave a better exploration of this to future work.
Potential users might also resort to other solutions to make the probes more efficient. If the bottleneck is in the graph embedding step, some fast approaches (Hamilton et al., 2017; Tang et al., 2015) can be chosen instead. If the mutual information estimation step is the bottleneck, some sampling strategies can be used. A simple way is to sample a subset of the dataset and optimize eq. 5 based on that subset. Alternatively, potential users can use more sophisticated sampling strategies in training, as in Recht and Ré (2012). These approaches achieve a much better convergence rate for MI estimation.

Related Work
Syntax and Semantics Probing. Many existing works probe language models directly or indirectly, showing how much syntactic and semantic information is encoded in them. Belinkov et al. (2017) tested NMT models and found that higher layers encode semantic information while lower layers perform better at POS tagging. Similarly, Jawahar et al. (2019) tested various BERT layers and found that it encodes a rich hierarchy of linguistic information in the intermediate layers. Tenney et al. (2019b) and Wu et al. (2021) compared the syntactic and semantic information in BERT and its variants, and found that more syntactic information is encoded than semantic information. Conneau et al. (2018) focused on probing various linguistic features with 10 different designed tasks. Hewitt and Manning (2019) designed a tree distance and depth prediction task to probe syntax tree structures.
Information Theoretic Probe. With the popularity of probe methods, limitations of previous methods have also been found. Information theoretic methods have been proposed as an alternative. To avoid the randomness of performance brought by the varying sizes of the probe models, Pimentel et al. (2020b) proposed an information-theoretic probe with control functions, which used mutual information instead of model performance for probing. Voita and Titov (2020) restricted the probe model size by Minimum Description Length. Training a model is recast as teaching it to effectively transmit the data. Lovering et al. (2021) pointed out that if we train a model to probe, the decisions are often not based on information itself, but rather on spurious heuristics specific to the training set.
Mutual Information Estimation. Mutual information estimation is a well-known difficult problem, especially when the feature vectors are in a high-dimensional space (Chow and Huang, 2005; Peng et al., 2005). There are many traditional ways to estimate MI, such as the well-known histogram approach (Steuer et al., 2002; Paninski, 2003), density estimation using a kernel (Moon et al., 1995), and nearest-neighbor distance (Kraskov et al., 2004). Belghazi et al. (2018) recently proposed a way to estimate MI using neural networks, which showed marked improvement over previous methods for feature vectors in high-dimensional space.

Limitations and Future Work
In this paper, we propose a general information-theoretic probe method, which is capable of probing for linguistic graph structures and avoids the randomness of training a model. In the experiments, we use our probe method to show the extent to which syntax trees and semantic graphs are encoded in pretrained BERT models. Further, we perform a simple perturbation analysis to show that, with small modifications, the probe can also be used to probe for specific linguistic sub-structures. There are some limitations of our probe. First, a graph embedding is used, and some structure information could be lost in this process. We provide simple ways to test this. Second, training an MI estimation model is difficult. Future work can consider building on our framework by exploring better graph embedding and MI estimation techniques.

Broader Impact and Discussion of Ethics
In recent years, deep learning approaches have been the main models for state-of-the-art systems in natural language processing. However, understanding the decision making of these systems has been hard, which poses challenges when they are used in human contexts. Probing helps us gain interpretability and hence is useful when deploying these black-box models. Our work introduces a simple and general way of understanding how linguistic properties represented as graph structures are encoded in large pretrained language models, which are being applied to a wide range of structures in NLP. The methodology and probing results can be helpful for the development of future NLP models.
While our model is not tuned for any specific real-world application domain, our methods could be used in sensitive contexts such as legal or healthcare settings, and it is essential that any work using our probe method undertake extensive quality-assurance and robustness testing before using it in their setting. The datasets used in our work do not contain any sensitive information to the best of our knowledge.

A Details of Graph Embedding
In this section, we present the technical details of the graph embedding approach, as well as its parameters. Given a graph such as a syntax tree or a semantic graph, we first run a random walk algorithm on it to sample walk paths. The random walk strategy is simple: at each step, a neighbor of the current node is selected uniformly at random from its neighbor set. Each walk path has length 10, and the walk is repeated 100 times per node, so each node has 100 walk paths of length 10. We then feed these paths into a Word2vec model (Mikolov et al., 2013) with a window size of 2, since we only want the graph embeddings to capture one-hop neighborhood relationships. The hidden states, in other words the graph embeddings, are vectors with 128 dimensions.
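The walk sampling described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function name `sample_walks`, the adjacency-list input format, the toy tree, and the reading of "length 10" as 10 nodes per walk are all assumptions.

```python
import random

def sample_walks(adjacency, walk_length=10, walks_per_node=100, seed=0):
    """Sample uniform random walks starting from every node of a graph.

    `adjacency` maps each node to a list of its neighbors (assumed format).
    """
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:  # isolated node: the walk cannot continue
                    break
                # Pick the next node uniformly from the current neighbor set.
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks

# Hypothetical toy tree for "the cat sat", treated as undirected.
toy_tree = {"sat": ["cat"], "cat": ["sat", "the"], "the": ["cat"]}
walks = sample_walks(toy_tree)
```

Each sampled walk can then be treated as a "sentence" of node tokens and passed to a Word2vec implementation configured with a window size of 2 and 128-dimensional vectors.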

B Details of MI Estimation
In this section we present the technical details of MI estimation, such as the neural network design and parameters. There are two terms in the objective function (5): one over the joint distribution and one over the marginal distribution. For the joint term, we first concatenate the graph embeddings Z and the word representations X, then pass them through our neural network to compute a scalar, and take the average of this scalar. For the marginal term, we randomly shuffle the representations X, after which there is no longer any dependency between X and Z. We then pass the concatenation of the shuffled representations and the graph embeddings through the neural network to get another scalar, take the exponential of that scalar, and average it over the whole dataset.

As mentioned before, the choice of model size is a tradeoff. In our experiments, we design an MLP with two layers for MI estimation. The first layer is linear, without a nonlinear activation function, and encodes the graph embeddings and word representations into the same space, with 64 dimensions. We then concatenate the two hidden states and pass them through a nonlinear layer to get a scalar.

For example, suppose we have one sentence with 10 words. Assume we get graph embeddings of size 10 × 128 and BERT word representations of size 10 × 768. We use a linear function to map each of them into a hidden space, say with 32 dimensions. We then concatenate the two embeddings into a 10 × 64 matrix and use one extra linear function with a nonlinear activation to map it to a 10 × 1 matrix. The compression lemma lower bound is then obtained as the mean value of the 10 × 1 matrix, which is the mutual information estimate that we want.
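The shapes in this example can be traced with a small, untrained sketch. It assumes the Donsker-Varadhan form of the bound used by Belghazi et al. (2018), that is, the mean score on paired samples minus the log of the mean exponentiated score on shuffled samples; the logarithm, the tanh activation, the random weights, and all function names are assumptions made for illustration only.

```python
import math
import random

def rand_matrix(rng, n_rows, n_cols, scale=0.05):
    """Random matrix stored as n_rows lists of n_cols floats."""
    return [[rng.gauss(0.0, scale) for _ in range(n_cols)] for _ in range(n_rows)]

def linear(rows, weight):
    """Map each row vector through a weight matrix of shape (in_dim, out_dim)."""
    columns = list(zip(*weight))
    return [[sum(x * w for x, w in zip(row, col)) for col in columns] for row in rows]

def score_pairs(z, x, params):
    """Return one scalar per (graph embedding, word representation) pair."""
    hidden_z = linear(z, params["w_z"])                       # n x 128 -> n x 32
    hidden_x = linear(x, params["w_x"])                       # n x 768 -> n x 32
    hidden = [hz + hx for hz, hx in zip(hidden_z, hidden_x)]  # concatenate: n x 64
    out = linear(hidden, params["w_out"])                     # n x 64 -> n x 1
    return [math.tanh(row[0]) for row in out]                 # tanh is an assumed activation

def lower_bound(z, x, params, rng):
    """Assumed Donsker-Varadhan bound: E_joint[T] - log E_marginal[exp(T)]."""
    joint = score_pairs(z, x, params)
    x_shuffled = list(x)
    rng.shuffle(x_shuffled)  # shuffling breaks the pairing between X and Z
    marginal = score_pairs(z, x_shuffled, params)
    joint_term = sum(joint) / len(joint)
    marginal_term = math.log(sum(math.exp(t) for t in marginal) / len(marginal))
    return joint_term - marginal_term

rng = random.Random(0)
n_words = 10
z = rand_matrix(rng, n_words, 128, scale=1.0)  # stand-in graph embeddings
x = rand_matrix(rng, n_words, 768, scale=1.0)  # stand-in BERT representations
params = {
    "w_z": rand_matrix(rng, 128, 32),
    "w_x": rand_matrix(rng, 768, 32),
    "w_out": rand_matrix(rng, 64, 1),
}
scores = score_pairs(z, x, params)
bound = lower_bound(z, x, params, rng)
```

With random weights the value of `bound` is not a meaningful estimate; in an actual probe the weights would be trained to maximize it.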
The loss is defined directly as the negative of the objective function. With stochastic gradient descent, we can maximize the lower bound to obtain the estimate. As for mini-batches, since a document contains many sentences, we select one sentence as one mini-batch to optimize the neural network.
We treat one sentence as one mini-batch because this is how we obtain the BERT word representations and the graph embeddings. One sentence has a complete syntax tree structure, and feeding BERT a single sentence ensures that attention is computed within that sentence. Another reason is that using two sentences as input may sometimes exceed the maximum BERT input size of 512 tokens. However, estimating the mutual information with mini-batches introduces errors: to estimate the mutual information between X and Z, the expectation should be taken over all the data points that we know, whereas here the expectation is computed over one batch only. To alleviate this error, as introduced in Belghazi et al. (2018), we select a small learning rate to keep the error small.

C Details of MLP Models
This section describes how we use MLPs for the link prediction task, as well as the details of the MLPs. For one sentence, given its graph embeddings, we simply use an MLP to calculate a score for every node pair, and then compare the predicted distribution vector with the ground-truth graph. The AUC score is computed from the distribution vector and the ground-truth vector. Note that the graph is very sparse, which makes the task very difficult. In general, the task can be regarded as a binary classification task with an extremely unbalanced data distribution.
As for the details, the linear MLP simply takes the concatenation of two input vectors and decides whether there is an edge between them.
For MLPs with hidden layers, the concatenated vectors first pass through non-linear layers, and the final output layer is linear. The dimension of all hidden states is 128, and the learning rate is $10^{-4}$.
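The evaluation step can be illustrated with a short sketch that builds the ground-truth vector over all node pairs and computes AUC as the probability that a true edge receives a higher score than a non-edge. The helper names, the toy graph, and the treatment of edges as undirected are assumptions for illustration; the scores would come from the MLP in practice.

```python
from itertools import combinations

def pair_labels(nodes, edges):
    """Label every unordered node pair: 1 if it is an edge, otherwise 0."""
    edge_set = {frozenset(edge) for edge in edges}
    pairs = list(combinations(nodes, 2))
    labels = [1 if frozenset(pair) in edge_set else 0 for pair in pairs]
    return pairs, labels

def auc_score(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative pair")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical four-word sentence with three tree edges.
nodes = ["the", "cat", "sat", "down"]
edges = [("cat", "the"), ("sat", "cat"), ("sat", "down")]
pairs, labels = pair_labels(nodes, edges)
```

Because AUC depends only on how edges rank relative to non-edges, it remains informative on sparse graphs, where most node pairs are negatives.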

D Reliability of MI Estimation
MI estimation for features in a high-dimensional space is difficult. To check that our estimated MI values are reasonably accurate, we test the MI estimation method (Belghazi et al., 2018) on sets of graph embeddings with different levels of dependency.
To do so, we add noise to the graph embeddings $Z$ to obtain $Z'$. $Z'$ and $Z$ can have different degrees of dependency depending on the noise rate. Noise vectors are sampled independently from a standard Gaussian. We test the estimation for various levels of noise, from the original graph embeddings $Z$ to the case where 100% of the signal is noise $\sigma$. For example, 40% means that for each graph embedding, $Z' = 60\% \times Z + 40\% \times \sigma$. Then, we calculate $\hat{I}(Z'; Z)$ to see whether the value becomes small when large noise is added. Table 3 presents the exact values. To make them more readable, we report the MI percentage $\hat{I}(Z'; Z)/\hat{I}(Z; Z)$. From the results, we find that the values for the two structures are not very similar, but the general tendencies are consistent: weaker dependencies caused by larger noise yield smaller MI values. Note that when the noise rate is 100%, $\hat{I}(Z'; Z)$ degenerates to the lower bound $\hat{I}(R, Z)$. As mentioned before, its absolute value represents the gap between the estimated MI and the ground-truth MI, which is very small (less than $10^{-3} \times \hat{I}(Z; Z)$). This also suggests that our MI estimates are reliable.
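The noise mixing can be sketched as below, as a minimal illustration of the formula with a noise rate in [0, 1]. The function names `add_noise` and `mi_percentage` and the list-of-lists embedding format are assumptions, and the MI values passed to `mi_percentage` would come from the trained estimator.

```python
import random

def add_noise(embeddings, noise_rate, seed=0):
    """Return Z' = (1 - noise_rate) * Z + noise_rate * sigma, with sigma ~ N(0, 1)."""
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must lie in [0, 1]")
    rng = random.Random(seed)
    return [
        [(1.0 - noise_rate) * value + noise_rate * rng.gauss(0.0, 1.0) for value in row]
        for row in embeddings
    ]

def mi_percentage(mi_noisy, mi_clean):
    """Express the MI between Z' and Z as a percentage of the MI between Z and Z."""
    return 100.0 * mi_noisy / mi_clean

# Hypothetical 3-node graph with 4-dimensional embeddings.
z = [[0.5, -1.0, 2.0, 0.0], [1.5, 0.25, -0.75, 1.0], [0.0, 0.0, 3.0, -2.0]]
z_40 = add_noise(z, 0.4)    # 60% signal, 40% noise
z_100 = add_noise(z, 1.0)   # pure noise, independent of Z
```

At a noise rate of 0 the embeddings are returned unchanged, and at a rate of 1 the output no longer depends on Z, which corresponds to the two ends of the range described above.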

E Statistics of Penn Treebank and AMR Bank
We provide the number of relations and the number of connected words for the Penn Treebank dataset and AMR Bank in this section. The number of words for each POS tag of Penn Treebank is also provided. The statistics are for the whole dataset. We only report de-  statistics, we also only present tags with word number ranked top 10. There are still 28 types of POS tags that are categorized into one type, # others.
For the description, arg represents frame arguments, following PropBank conventions. general is composed of a set of general semantic relations. op denotes the relations for lists. Similarly, quantities are the relations for quantities, and date contains the relations for date-entities.

For the semantic structure, we run the perturbation experiment in a similar way. Unlike for syntax trees, the relations of AMR graphs can be grouped into 6 types, and their number distribution is not very even. Thus, we corrupt the graph embeddings with 50% noise. Other settings are similar to those for syntactic structures. From the results, we find that BERT encodes more structural information about the arg and general relations, while the op relations, which represent the relations for lists, are not well encoded.

G Probing Localized Information with Accuracy
Similar to our localized probing experiments, we add perturbations to the word representations. Specifically, we corrupt the word representations equally, with the same number corrupted for each relation type. We then train an MLP with 5 hidden layers to predict entire structures from the corrupted word representations, and calculate the AUC scores of all relation types. As in Worm's Eye, we only report 5 types of relations for syntactic structures and 3 types for semantic structures. Results are shown in Figure 8(a) and Figure 8(b). We find that the probe accuracies behave unexpectedly. First, when one type of relation is corrupted, the accuracies for other types of relations change significantly. Moreover, an MLP trained with one relation corrupted, e.g., nsubj, may predict a different relation, e.g., prep, with the worst AUC score. The results also support the point that prediction decisions are not based purely on the structure but rather on spurious heuristics (Lovering et al., 2021).