Linguistic Dependencies and Statistical Dependence

Are pairs of words that tend to occur together also likely to stand in a linguistic dependency? This empirical question is motivated by a long history of literature in cognitive science, psycholinguistics, and NLP. In this work we contribute an extensive analysis of the relationship between linguistic dependencies and statistical dependence between words. Improving on previous work, we introduce the use of large pretrained language models to compute contextualized estimates of the pointwise mutual information between words (CPMI). For multiple models and languages, we extract dependency trees which maximize CPMI, and compare to gold standard linguistic dependencies. Overall, we find that CPMI dependencies achieve an unlabelled undirected attachment score of at most approximately 0.5. While far above chance, and consistently above a non-contextualized PMI baseline, this score is generally comparable to a simple baseline formed by connecting adjacent words. We analyze which kinds of linguistic dependencies are best captured in CPMI dependencies, and also find marked differences between the estimates of the large pretrained language models, illustrating how their different training schemes affect the type of dependencies they capture.


Introduction
A fundamental aspect of natural language structure is the set of dependency relations which hold between pairs of words in a sentence. Such dependencies indicate how the sentence is to be interpreted and mediate other aspects of its structure, such as agreement. Consider the sentence: Several ravens flew out of their nests to confront the invading mongoose. In this example, there is a dependency between the verb flew and its subject ravens, capturing the role this subject plays in the flying event, and how it controls number agreement. All modern linguistic theories recognize the centrality of such word-word relationships, despite considerable differences in detail in how they are treated (for a review of the linguistic dependency grammar literature, see de Marneffe and Nivre, 2019).

Figure 1: We use models pretrained on masked language modelling objectives to extract trees which maximize contextualized pointwise mutual information (CPMI) between words, to examine how linguistic dependencies relate to statistical dependence.

In addition to linguistic dependencies between words, there are also clear and robust statistical relationships. A noun like ravens is likely to occur with a verb like flew. In short, the presence or absence of certain words in certain positions in a sentence is informative about the presence or absence of certain other words in other positions. This raises the question: do words that are strongly statistically dependent tend to be those related by linguistic dependency (and vice versa)? In everyday language, a sentence like the example above is probably more likely than Several pigs flew out of their nests to confront the invading shrubbery, despite this second example being syntactically identical to the first.
The long tradition of both supervised and unsupervised learning of grammars and parsers in computational linguistics suggests a strong link between dependency structure and statistical dependence. Works such as Magerman and Marcus (1990) and de Paiva Alves (1996) introduced the use of pointwise mutual information (PMI) as a measure of the strength of statistical dependence between words, for the purpose of inferring linguistic structures from corpus statistics. The link between PMI and linguistic dependency has been studied and affirmed in Futrell et al. (2019). They show that for words linked by linguistic dependencies, the estimated mutual information between POS tags (and distributional clusters) is higher than that between non-dependent word pairs, matched for linear distance.
In this work, we dig further into the question of the correspondence between statistical and linguistic dependencies, using modern pretrained language models (LMs) to compute estimates of conditional PMI between words given context, which we term contextualized pointwise mutual information (CPMI). For each sentence we extract a CPMI dependency tree, the spanning tree with maximum total CPMI, and compare these trees with gold standard linguistic dependency trees.1 We find that CPMI trees correspond better to gold standard trees than non-context-dependent PMI trees. However, our analysis shows that CPMI dependencies and linguistic dependencies correspond only roughly 50% of the time, even when we introduce a number of strong controls. Notably, we do not see better correspondence when we examine CPMI trees inferred by models that are explicitly trained to recover syntactic structure during training. Likewise, we see no increase in correspondence when we calculate CPMI over part-of-speech (POS) tags, a control designed to examine a less fine-grained statistical dependency than that between actual word forms. In fact, CPMI arcs broadly correspond to linguistic dependencies slightly less often than a simple baseline that just connects all and only adjacent words. We see similar overall unlabeled undirected attachment scores (UUAS) when evaluating across a variety of pretrained models and different languages. However, a close analysis shows noteworthy differences between the different LMs, in particular revealing that BERT-based models are markedly more sensitive to adjacent words than XLNet. These differences yield insights about how different LM pretraining regimes result in differences in how the models allocate statistical dependencies between words in a sentence.

Background
Pointwise mutual information (PMI; Fano, 1961) is commonly used as a measure of the strength of statistical dependence between two words. Formally, PMI is a symmetric function of the probabilities of the outcomes x, y of two random variables X, Y, which quantifies the amount of information about one outcome that is gained by learning the other:

pmi(x; y) := log [ p(x, y) / (p(x) p(y)) ] = log [ p(x | y) / p(x) ].
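The definition above can be sketched with corpus counts; the toy (subject, verb) co-occurrence counts below are invented for illustration, echoing the paper's running ravens/pigs example:

```python
import math

def pmi(pairs, x, y):
    """pmi(x; y) = log p(x, y) / (p(x) p(y)), estimated from raw pair counts."""
    n = len(pairs)
    p_xy = sum(1 for p in pairs if p == (x, y)) / n
    p_x = sum(1 for p in pairs if p[0] == x) / n
    p_y = sum(1 for p in pairs if p[1] == y) / n
    return math.log(p_xy / (p_x * p_y))

# Toy corpus of (subject, verb) co-occurrences (invented for illustration):
pairs = ([("ravens", "flew")] * 8 + [("ravens", "sang")] * 2
         + [("pigs", "flew")] * 1 + [("pigs", "oinked")] * 9)

print(pmi(pairs, "ravens", "flew"))  # positive: co-occur more than chance
print(pmi(pairs, "pigs", "flew"))    # negative: co-occur less than chance
```

Positive PMI indicates the pair co-occurs more often than the independence baseline p(x)p(y) predicts; negative PMI, less often.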
In our case, the observations are two words in a sentence (drawn from discrete random variables indexed by position in the sentence, ranging over the vocabulary). PMI has been used in computational linguistic studies as a measure of how words inform each other's probabilities since Church and Hanks (1989).2 Much earlier work on unsupervised dependency parsing (e.g., Van Der Mude and Walker, 1978; Magerman and Marcus, 1990; Carroll and Charniak, 1992; Yuret, 1998; Paskin, 2001) used techniques involving maximizing estimates of total pointwise mutual information between heads and dependents, or maximizing the conditional probability of dependents given heads (these two objectives can be shown to be equivalent under certain assumptions; see §C). While such PMI-induced dependencies proved useful for certain tasks (such as identifying the correct modifier for a word among a selection of possible choices; de Paiva Alves, 1996), purely PMI-based dependency parsers did not perform well at the general task of recovering linguistic structures overall (see discussion in Klein and Manning, 2004).
The recent advent of pretrained contextualized LMs (such as BERT and XLNet; Devlin et al., 2019; Yang et al., 2019) provides an opportunity to revisit the relationship between PMI-induced dependencies and linguistic dependencies. These networks are pretrained on very large amounts of natural language text using masked language modelling objectives to be accurate estimators of conditional probabilities of words given context, and thus are natural tools for investigating the statistical relationships between words.

Figure 2: Diagram illustrating the use of BERT to compute the probability of realistic with and without masking theory, to obtain a CPMI score between those two words in the sentence s = That theory is realistic: CPMI(realistic; theory | s) = log p(realistic | theory, c) / p(realistic | c).

Contextualized PMI dependencies
Linguistic dependencies are highly sensitive to context. For example, consider the following two sentences: I see that the crows retreated, and The mongoose pursued by crows retreated. In the first there is a dependency between retreated and crows, and in the second there is not. However, PMI between two words in a sentence is strictly independent of the other words in that sentence.
Here we define contextualized pointwise mutual information (CPMI) as the conditional PMI given context, which we estimate using pretrained contextualized LMs. A contextualized LM M provides an estimate for the probability of words given context, which we use to define CPMI_M between two words w_i and w_j in a sentence W as

CPMI_M(w_i; w_j | W) := log [ p_M(w_i | W−i) / p_M(w_i | W−i,j) ],

where W−i is the sentence with word w_i masked, and W−i,j is the sentence with words w_i and w_j masked. To demonstrate the computation of this quantity, Figure 2 illustrates how BERT is used to obtain a CPMI score between the words theory and realistic in the sentence That theory is realistic.
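The computation can be sketched as follows. The `toy_lm` below, and its probabilities, are invented stand-ins for a masked LM; a real implementation would query BERT's masked-token distribution at position i instead:

```python
import math

def cpmi(p_masked, sent, i, j):
    """CPMI_M(w_i; w_j | W) = log p(w_i | W-i) - log p(w_i | W-i,j),
    where p_masked(ctx, i, w) returns the model's probability of word w
    at position i given context ctx (None marks a masked position)."""
    w_i = sent[i]
    ctx_i = list(sent)
    ctx_i[i] = None            # W-i: only w_i masked
    ctx_ij = list(ctx_i)
    ctx_ij[j] = None           # W-i,j: both w_i and w_j masked
    return math.log(p_masked(ctx_i, i, w_i)) - math.log(p_masked(ctx_ij, i, w_i))

# Invented toy "masked LM": seeing "theory" raises the probability of "realistic".
def toy_lm(ctx, i, w):
    if w == "realistic":
        return 0.30 if "theory" in ctx else 0.05
    return 0.10

sent = ["That", "theory", "is", "realistic", "."]
print(cpmi(toy_lm, sent, 3, 1))  # log(0.30 / 0.05), approximately 1.79
```

A positive score means masking w_j lowers the model's estimate of w_i, i.e. w_j was informative about w_i in this sentence.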

Dependency tree induction
Given a sentence, we compute a matrix consisting of the CPMI between each pair of words. We then symmetrize this matrix by summing across the diagonal, so that we have a single score for each pair of words (omitting this step led to extremely similar results).3 We then extract tree structures which maximize total CPMI. Since natural language dependencies are overwhelmingly projective (see Kuhlmann, 2010), we extract maximum projective spanning trees using the dynamic programming algorithm from Eisner (1996, 1997).4 Results for dependency trees alternatively extracted without the projectivity constraint, using Prim's maximum spanning tree (MST) algorithm (Prim, 1957), are similar; results using both algorithms are provided in §D for comparison. For further details on the extraction of CPMI dependencies, see §A.3.
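The nonprojective variant of this extraction can be sketched as a plain Prim's maximum spanning tree over the symmetrized matrix (the projective Eisner algorithm used for the main results is omitted here for brevity; the score matrix is invented):

```python
import numpy as np

def cpmi_tree(cpmi):
    """Prim's algorithm for an undirected maximum spanning tree over a CPMI
    matrix, symmetrized by summing across the diagonal. Returns a set of
    undirected edges (i, j) with i < j."""
    sym = cpmi + cpmi.T
    n = sym.shape[0]
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        # greedily add the highest-scoring edge leaving the current tree
        u, v = max(((u, v) for u in in_tree
                    for v in range(n) if v not in in_tree),
                   key=lambda e: sym[e])
        edges.add((min(u, v), max(u, v)))
        in_tree.add(v)
    return edges

scores = np.array([[0, 5, 1, 1],
                   [0, 0, 4, 1],
                   [0, 0, 0, 3],
                   [0, 0, 0, 0]], dtype=float)
print(cpmi_tree(scores))  # {(0, 1), (1, 2), (2, 3)}
```

On this toy matrix the maximum spanning tree happens to be the linear chain, which is also what the connect-adjacent baseline discussed later would produce.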

Evaluating CPMI dependencies
In this section, we analyze the degree to which CPMI-inferred dependencies from pretrained LMs resemble linguistic dependencies.

Method
We use gold dependencies for sentences from the Wall Street Journal (WSJ), from the Penn Treebank (PTB) corpus of English text hand-annotated for syntactic constituency parses (Marcus et al., 1994), converted into Stanford Dependencies (de Marneffe et al., 2006; de Marneffe and Manning, 2008b).5 We evaluate all extracted dependency trees on the full development split (WSJ section 22, consisting of 1700 sentences). For comparison with other work in unsupervised grammar induction, we also report results on the WSJ10 (all 389 sentences of length ≤ 10 from section 23, the test split), as used in e.g. Yang et al. (2020).

Figure 3: Top: CPMI matrices for an example sentence, from BERT, DistilBERT, and XLNet. Gold dependencies are marked with a dot. Bottom: resulting projective MST parses for the three models (DistilBERT 7/13 = 54%, XLNet-base 4/13 = 31%). Gold dependency parse above in black, CPMI dependencies below, blue where they agree and red where they do not. The unlabeled undirected attachment score (UUAS) is given at right. Further examples are provided in the appendix, Figure 14.

… et al., 2020; Sanh et al., 2019). For other languages (and English) we use pretrained multilingual BERT base; see §D.2 for details. All pretrained contextualized LMs we use are provided by Hugging Face transformers (Wolf et al., 2020).
Syntactically aware models We likewise compute CPMI estimates using models explicitly designed to have a linguistically-oriented inductive bias, taking syntax into account in their training objectives and architecture. Following Du et al. (2020), we include two pretrained versions of an ordered-neuron LSTM (Shen et al., 2019), a language model designed to have a hierarchical structural bias. The first (ONLSTM) is pretrained on raw text data; the second (ONLSTM-SYD) is pretrained on the same data but with an additional auxiliary objective to reconstruct PTB syntax trees. As a control, we also include a vanilla LSTM model. All three models are trained on the PTB training split. Example parses extracted from these models are given in the appendix (Figure 16). We extract CPMI estimates from these models similarly to the above, but we condition only on preceding material, since these LSTM-based models operate left-to-right. See §A.2 for details.6

6 Note that results of the (ON)LSTM models are not directly comparable to the transformer-based models, as these models are trained on much less data.

Noncontextualized PMI control We also compute a non-contextualized PMI estimate using a pretrained global word embedding model (Word2Vec; Mikolov et al., 2013), to capture word-to-word statistical relationships present in global distributional information, not sensitive to the context of particular sentences. This control is calculated as the inner product of Word2Vec's target and context embeddings, pmi_w2v(w_i; w_j) := w_i^T c_j, since its training objective is optimized when this quantity equals the PMI plus a global constant (as explained in Levy and Goldberg, 2014; Allen and Hospedales, 2019). Details are given in §A.1.
Baselines A random baseline is obtained by extracting a parse for each sentence from a random matrix (so each pair of words is equally likely to be connected). We also include a 'connect-adjacent' baseline: degenerate trees formed by simply connecting the words in linear order, a simple, strong, and linguistically plausible baseline for English.
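The connect-adjacent baseline and the UUAS metric it is scored with can be sketched as follows (the word indices and gold arcs below are invented for illustration):

```python
def uuas(pred_edges, gold_edges):
    """Unlabelled undirected attachment score: the fraction of gold arcs
    recovered, ignoring arc direction and labels."""
    norm = lambda E: {(min(a, b), max(a, b)) for a, b in E}
    pred, gold = norm(pred_edges), norm(gold_edges)
    return len(pred & gold) / len(gold)

def connect_adjacent(n):
    """The 'connect-adjacent' baseline: link each word to the next one."""
    return [(i, i + 1) for i in range(n - 1)]

# Invented gold parse for a 5-word sentence: three of the four arcs
# happen to connect adjacent words, so the baseline scores 3/4.
gold = [(1, 0), (1, 2), (1, 4), (4, 3)]
print(uuas(connect_adjacent(5), gold))  # 0.75
```

Because roughly half of gold arcs in English have length 1 (as noted in §6), this baseline is hard to beat without capturing longer dependencies.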
In addition to these baselines, we compare unlabelled undirected attachment score (UUAS) with that reported for the Dependency Model with Valence (DMV; Klein and Manning, 2004), a classic dependency parsing model. Note, importantly, that these models are trained on much less data.

Results
Example CPMI dependencies and extracted projective trees are given in Figure 3, with gold dependencies for comparison. Table 1 gives the UUAS results.7 Overall UUAS is given in the first column. The remaining columns give the UUAS for the subset of edges of length 1 and longer, in terms of precision and recall respectively.8 Table 2 gives overall UUAS from multilingual BERT for a selection of languages from the PUD treebanks (for full results see Table 12, Figure 13). The overall results show broadly that CPMI dependencies correspond to linguistic dependencies better than the noncontextual PMI dependencies estimated from Word2Vec. However, across the models, and across languages, UUAS is generally in the range 40-50%. Degenerate trees formed by connecting words in linear order (the connect-adjacent baseline) achieve similar UUAS. Additionally, for the ONLSTM models, which have a hierarchical bias in their design, we see that the accuracy of the CPMI-induced dependencies is essentially the same with or without the auxiliary syntactic objective. Overall accuracy for both syntactically aware models is the same as for the vanilla LSTM. Further analysis of these results is in §6.

Delexicalized POS-CPMI dependencies
In this second experiment we estimate CPMI dependencies over part-of-speech (POS) tags, rather than words. In the unsupervised dependency parsing literature there is a long history of approaches making use of gold POS tags (see e.g., Bod, 2006; Cramer, 2007; Klein and Manning, 2004). Additionally, a traditional objection to the idea of deducing dependency structures directly from cooccurrence statistics, beyond data sparsity issues, is the possibility that "actual lexical items are too semantically charged to represent workable units of syntactic structure" (as phrased by Klein and Manning, 2004, p.3). That is, perhaps words' patterns of co-occurrence contain simply too much information about factors irrelevant to dependency parsing, so as to drown out the information that would be useful for recovering dependency structure. According to this line of thinking, we might expect linguistic dependency structure to be better related to the statistical dependencies between the categories of words, rather than lexical items themselves. Thus a version of CPMI calculated over POS tags would be predicted to achieve higher accuracy than the CPMI calculated over lexical item probabilities above.
A straightforward but infeasible way to investigate this idea would be to obtain contextualized POS embeddings by re-training all the LMs from scratch on large delexicalized corpora consisting only of POS tags. Instead, for efficiency, we follow the LM probing literature (Hewitt and Manning, 2019) and train a small POS probe on top of a pretrained LM, which estimates the probability of the POS tag at a given position in a sentence. After training this probe, we can extract a POS-based CPMI score between words. We define this POS-CPMI analogously to CPMI, but using conditional probabilities of POS tags rather than word tokens:

POS-CPMI_M(w_i; w_j | W) := log [ p_{M_POS}(π_i | W−i) / p_{M_POS}(π_i | W−i,j) ],

where π_i, π_j are the gold POS tags of w_i, w_j in sentence W, and M_POS is the contextualized LM M with a pretrained POS embedding network on top. This is illustrated in Figure 4. We then extract POS-CPMI dependencies to compare to gold dependencies.

Method
We implement a POS probe as a linear transformation on top of the final hidden layer of a fixed pretrained LM. We train two versions of this probe: one trained simply to minimize cross-entropy loss (the simple POS probe), the other trained using the information bottleneck technique (the IB POS probe); both are trained to predict gold POS tags. All eight probes achieve between 92% and 98% training accuracy. We extract parses from POS-CPMI matrices just as for CPMI (described above in §4). Below, we refer to the estimates extracted using the simple POS probe as simple-POS-CPMI, and those extracted using the IB POS probe as IB-POS-CPMI.

Table 3: Total UUAS for POS-CPMI using the simple POS probe and IB POS probe, from BERT and XLNet models. Overall results are in the first column; the remaining columns break down results by arc length, recall, and precision as in Table 1.
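A minimal sketch of the simple POS probe, assuming frozen hidden states are available as a matrix: the toy two-dimensional "hidden states" below stand in for real LM representations, and the plain-numpy cross-entropy training loop is our own simplification of a linear probe:

```python
import numpy as np

def train_pos_probe(H, y, n_tags, lr=0.5, steps=500):
    """One linear layer trained with cross-entropy on frozen hidden
    states H (n x d), as in the 'simple POS probe'."""
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(H.shape[1], n_tags))
    b = np.zeros(n_tags)
    for _ in range(steps):
        logits = H @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)
        grad = p.copy()
        grad[np.arange(len(y)), y] -= 1.0             # dL/dlogits for softmax + CE
        W -= lr * H.T @ grad / len(y)
        b -= lr * grad.mean(axis=0)
    return W, b

# Invented toy "hidden states" with two easily separable POS classes.
H = np.array([[1.0, 0.0], [0.9, 0.2], [0.0, 1.0], [0.2, 0.9]])
y = np.array([0, 0, 1, 1])
W, b = train_pos_probe(H, y, n_tags=2)
pred = (H @ W + b).argmax(axis=1)
print(pred)  # [0 0 1 1]
```

The probe's softmax output over tags is what the POS-CPMI computation queries at each masked position, in place of the word-level distribution.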

Results
Using the POS-CPMI dependencies does not result in higher accuracy. This provides evidence that the correlation between linguistic dependencies and CPMI dependencies is not merely artificially low due to distracting lexical information. Table 3 shows the UUAS of the simple-POS-CPMI and IB-POS-CPMI trees. Compared to the lexicalized CPMI trees discussed in the previous section, for BERT models, the simple-POS-CPMI dependencies have rather comparable overall UUAS, while for XLNet it is markedly lower. For both models, IB-POS-CPMI dependencies have lower UUAS. While these results are somewhat mixed, it is clear that, in our experimental setting, POS-CPMI dependencies correspond to gold dependencies no more than the CPMI dependencies do, performing at best roughly as well as the connect-adjacent baseline.

Analysis
In this section we outline the main takeaways from a more detailed examination of the results from §§4-5, including additional analysis in §A.4.

UUAS is higher for length-1 arcs Breaking down the results by dependency length, Figure 8 (in the appendix) shows the recall accuracy of CPMI dependencies, grouped by length of gold arc. Length-1 arcs have the highest accuracy, and longer dependencies have lower accuracy. This trend holds for CPMI from all LMs. For BERT large in particular, arcs of length 1 have recall accuracy of 80%, while longer arcs are near random. For XLNet, this trend is less pronounced.
No relation label has high UUAS In Figure 5, recall accuracy is plotted against gold dependency arc label.9 When examining all lengths of dependency together (left), recall accuracy would seem to be correlated with mean arc length. But, filtering out all the gold arcs of length 1 (49% of arcs), we see that there is not a strong overall effect of arc length on mean accuracy for lengths > 1. For most dependency labels, CPMI accuracy from each of the models is above the random baseline, but at or below the connect-adjacent baseline. Exceptions to this trend include the dependency labels dobj (direct object) and xcomp (which connects a verb or adjective to the root of its clausal complement). For word pairs in these relations, CPMI estimates (XLNet's in particular) achieve higher accuracy than the baselines. However, even in these cases, CPMI dependencies do not perform at a level that could be considered successful for an unsupervised parser. This is contrary to what would be expected if CPMI dependencies were in a strong correspondence with linguistic dependencies, even if this correspondence only held for certain types of linguistic dependency.
When considering arcs of length > 1, there is no dependency arc label which has UUAS above 0.5 from any of the models. More complete results including the other models not shown in Figure 5 are given in Table 5 (in appendix).
UUAS is not correlated with LM performance Figure 6 shows per-sentence UUAS plotted against log pseudo-perplexity (PPL) for BERT and XLNet models (results are similar for other models; see §A.4.3, Figure 9). These results show that the correspondence between CPMI dependencies and linguistic dependencies is not higher on sentences on which the models are more confident.
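Pseudo-perplexity for a masked LM can be sketched as the exponentiated average negative log probability of each token when it alone is masked. The flat toy model below is invented; a real implementation would query the masked LM once per leave-one-out mask:

```python
import math

def pseudo_perplexity(p_masked, sent):
    """exp of the average negative log probability of each token when it
    alone is masked; p_masked(ctx, i, w) gives the model's probability of
    word w at masked position i (None marks the mask)."""
    total = 0.0
    for i, w in enumerate(sent):
        ctx = list(sent)
        ctx[i] = None
        total += math.log(p_masked(ctx, i, w))
    return math.exp(-total / len(sent))

# A fictitious model assigning every masked token probability 0.5 has
# pseudo-perplexity exactly 2 on any sentence:
flat_lm = lambda ctx, i, w: 0.5
print(pseudo_perplexity(flat_lm, ["That", "theory", "is", "realistic", "."]))  # 2.0
```

Lower pseudo-perplexity means the model is more confident; the finding above is that this confidence does not predict per-sentence UUAS.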
We also examined the accuracy of CPMI dependencies during training of BERT (base uncased) from scratch. Figure 11 (in appendix) shows the average perplexity of this model at checkpoints during training, along with average UUAS of induced CPMI structures. UUAS reaches its highest value before perplexity plateaus. We should also stress that, throughout this paper, UUAS is not a measure of LM quality. Rather, it simply measures how well patterns of statistical dependence captured by the LM align with linguistic dependencies. Better alignment may not be related to better language modelling.
Dependencies differ between LMs Dependency structures extracted from the different pretrained LMs show roughly similar overall UUAS, though the models agree with each other on only 25-48% of edges. They agree with the noncontextualized word embedding model Word2Vec at just slightly lower rates (21-27%), while agreeing with the linear baseline at higher rates (34-57%). See §A.4.1 for details.
In particular, CPMI dependencies from all the models connect adjacent words more often than the gold dependencies do, but this effect is much more pronounced for BERT models than for the XLM and XLNet models (Figure 7). A possible reason for this difference lies in the way these models are trained. XLNet is trained to predict words according to randomly sampled chain-rule decompositions, enforcing a bias to be able to predict words in any order, including across longer dependencies. XLNet's probability estimates for words may therefore be sensitive to a larger set of words, rather than mostly the adjacent ones. BERT, by contrast, trained with a less constrained masked LM objective, has probability estimates that are evidently more sensitive to adjacent words.

Related work
Probing pretrained embeddings In the past few years, a substantial literature has emerged on probing pretrained language models (in the sense of e.g. Conneau et al., 2018; Manning et al., 2020), wherein a presumably weak network (a probe) is trained to extract linguistic information (in particular, dependency information, in e.g. Hewitt and Manning, 2019; Clark et al., 2019) from pretrained embeddings. Extracting CPMI dependencies differs from training a dependency probe in that it is entirely unsupervised, and is motivated by a specific hypothesis about the relationship linguistic dependencies have with statistical dependence.
Nonparametric probing A number of other recent works have taken an unsupervised approach to investigating syntactic structure encoded by pretrained LMs, largely focusing on self-attention weights (e.g. Mareček and Rosa, 2018, 2019; Kim et al., 2020a,b; Htut et al., 2019). Very recently, Zhang and Hashimoto (2021, concurrent with this paper) examined conditional dependencies implied by masked language modelling using a nonparametric method similar to our CPMI, using BERT to estimate conditional PMI (and conditional MI) between words. They extract maximum spanning trees, and report UUAS on WSJ dependency data. Their results are similar to those reported here: namely, scores are much higher than a chance baseline, but close to a connect-adjacent baseline. While their numerical results are similar, their interpretation differs somewhat. Given our analysis, we find less reason for optimism about the prospects of unsupervised dependency parsing directly from probability estimates by pretrained LMs.

Perturbation impact
The experiments in the current paper extracting CPMI can be seen as an application of the token perturbation approach of Wu et al. (2020).10 They describe a general nonparametric method to examine the impact, f(w_i, w_j), of a word w_j on another word w_i in the sentence, where f is some difference function between the embedding of w_i (masked in the input) with and without the word w_j also being masked.
In their experiments, they use two examples of impact-measuring functions (see Wu et al., 2020, §2.2). The first, the Dist metric, is simply Euclidean distance between embeddings. The second, the Prob metric, is defined as f(w_i, w_j) = p(w_i | W−i) − p(w_i | W−i,j), using the masked LM's probability estimates (notation as defined in §3). The latter impact metric is quite similar to CPMI, the difference being only that Prob impact is the difference in probabilities, while CPMI is the difference in log probabilities. Table 4 compares the reported UUAS of maximum projective spanning trees from CPMI matrices to those from Dist impact matrices on the English PUD data set. They do not report UUAS for the Prob metric or release code for it, but mention that it is significantly outperformed by the Dist method. Wu et al. (2020, p.1) note that their "best performing method does not go much beyond the strong right-chain baseline". While it may be seen as an application of the perturbed masking technique, CPMI is motivated as a method to test a specific hypothesis about the relationship between linguistic and statistical dependence. Extracting matrices using another impact metric (such as Euclidean distance between embeddings, Dist) may indeed achieve higher attachment scores, as Wu et al. (2020) demonstrate, but this does not bear on the hypothesis we focus on in this paper.

Discussion
In this paper we explored the connection between linguistic dependency and statistical dependence. We contribute a method to use modern pretrained language models to compute CPMI, a contextdependent estimate of PMI, and infer maximum CPMI dependency trees over sentences.
We find that these trees correlate with linguistic dependencies better than trees extracted from a noncontextual PMI estimate trained on similar data. However, we do not see evidence of a systematic correspondence between dependency arc label and the accuracy of CPMI arcs, nor do we see evidence that the correspondence increases when using models explicitly designed to encode linguistically-motivated inductive biases, nor when CPMI is estimated between POS embeddings instead of word forms. Overall, CPMI-inferred dependencies correspond to gold dependencies no better than a simple baseline connecting adjacent words. This is our first main takeaway: statistical dependence (as modelled by these pretrained LMs) is not a good predictor of linguistic dependencies. Second, our analysis shows that CPMI trees extracted from different LMs differ to an extent that is perhaps surprising, given the similarity in spirit of their training regimes. The difference in accuracy when broken down with respect to linear distance between words offers information about the ways in which these models' inductive and structural biases inform the way they perform the task of prediction. BERT aligns better overall, but this is driven by its being more like the linear baseline. For longer arcs, XLNet aligns somewhat better with linguistic structure. Compared to BERT, XLNet can be seen as imposing a constraint on the language modelling objective by forcing the model to make accurate predictions under different permutation masks. Generalizing this observation, we ask whether linguistic dependencies would correspond better to the patterns of statistical dependence in a model trained with a language modelling loss while concurrently minimizing the amount of contextual information used to perform predictions.

10 We thank an anonymous reviewer for alerting us to this work.
Finding ways of expressing such constraints on the amount of information used during prediction, and verifying the ways in which this can affect our results and LM pretraining in general constitutes material for future work.

A CPMI-dependency implementation details

A.1 Word2Vec as noncontextual PMI control
We use Word2Vec (Mikolov et al., 2013) to obtain a non-contextual PMI measure as a control/baseline. In contrast with the CPMI values extracted from contextual language models, this estimate does not take into account the positions of the words in a particular sentence, but it otherwise reflects global distributional information similar to that available to the contextualized models. Word2Vec should therefore function as a control with which to compare the PMI estimates derived from the contextualized models.
Word2Vec maps each word w_i in the vocabulary to a 'target' embedding vector w_i, as well as a 'context' embedding vector c_i (used during training). As demonstrated by Levy and Goldberg (2014) and Allen and Hospedales (2019), Word2Vec's training objective is optimized when the inner product of the target and context embeddings equals the PMI, shifted by a global constant (determined by k, the number of negative samples): w_i^T c_j = pmi(w_i; w_j) − log k. This type of embedding model thus provides a non-contextual PMI estimator. A global shift will not change the resulting PMI dependency trees, so we simply take pmi_w2v(w_i; w_j) := w_i^T c_j, with embeddings calculated using a Word2Vec model trained on the same data as BERT.11 Note: since we are ignoring the global shift of − log k, an absolute-valued version of this PMI estimate would not be meaningful, and for this reason we only ever extract dependencies from the Word2Vec PMI estimate without taking the absolute value.
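This estimator can be sketched as an inner-product score matrix. The random toy embeddings are invented, and the check at the end illustrates the point above: the ignored global shift − log k moves every score equally, so the relative ranking of candidate arcs is unchanged:

```python
import numpy as np

def pmi_w2v_matrix(target, context):
    """pmi_w2v(w_i; w_j) := w_i^T c_j for every pair: the matrix of inner
    products of target and context embeddings."""
    return target @ context.T

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))    # toy target embeddings, one row per word
context = rng.normal(size=(4, 8))   # toy context embeddings
scores = pmi_w2v_matrix(target, context)

# A global additive shift (e.g. -log k with k = 5 negative samples)
# changes every score equally, so score order is preserved:
shifted = scores - np.log(5)
same_order = np.array_equal(np.argsort(scores, axis=None),
                            np.argsort(shifted, axis=None))
print(same_order)  # True
```

Since maximum spanning tree extraction depends only on score comparisons, any tree extracted from `scores` and from `shifted` is identical.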

A.2 LtoR-CPMI for one-directional models
Our CPMI measure as defined above requires a bidirectional model (to calculate probabilities of words given both preceding and following context). The LSTM models we test in this study are left-to-right, so we define a slightly modified version of CPMI, which we call LtoR-CPMI, for use with such unidirectional language models.

This decomposition allows us to estimate conditional pointwise information between words made up of multiple subtokens, at the expense of specifying a left-to-right order within those words.

A.3.2 Symmetrizing matrices
PMI is a symmetric function, but the estimated CPMI scores are not guaranteed to be symmetric, since nothing in the models' training explicitly forces their conditional probability estimates of words given context to respect the identity p(x | y) p(y) = p(y | x) p(x). For this reason, we have a choice when assigning a score to a pair of words v, w: whether we use the model's estimate of CPMI_M(v; w), which compares the probability of v with the conditioner w masked and unmasked, or of CPMI_M(w; v). In our implementation of CPMI we calculate scores in both directions and use their sum (as mentioned in the main text, §3.1), though experiments using one direction or the other (just the upper or lower triangle of the matrix), or the max (equivalent to extracting a tree from the unsymmetrized matrix), led to very similar overall results. The same holds for the Word2Vec PMI estimate and the POS-CPMI estimates.
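The symmetrization options discussed here can be sketched as follows (the `how` option names are ours):

```python
import numpy as np

def symmetrize(cpmi, how="sum"):
    """Combine CPMI_M(v; w) and CPMI_M(w; v) into a single score per pair."""
    if how == "sum":                              # default: sum across the diagonal
        return cpmi + cpmi.T
    if how == "upper":                            # use one triangle only
        return np.triu(cpmi) + np.triu(cpmi, 1).T
    if how == "max":                              # elementwise max of both directions
        return np.maximum(cpmi, cpmi.T)
    raise ValueError(how)

m = np.array([[0.0, 2.0],
              [1.0, 0.0]])
print(symmetrize(m, "sum"))   # [[0. 3.] [3. 0.]]
print(symmetrize(m, "max"))   # [[0. 2.] [2. 0.]]
```

Note that extracting a tree from the elementwise max is equivalent to extracting it from the unsymmetrized matrix, since an undirected spanning-tree algorithm effectively picks the better-scoring direction for each pair anyway.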

A.3.3 Negative PMI values
PMI may be positive or negative. Results in the main text are all computed for CPMI dependencies extracted from signed matrices (so arcs with large negative CPMI will be rarely included). However, there is some discussion of interpreting the magnitude of PMI as indicating dependency, independent of sign (see Salle and Villavicencio, 2019). The choice to use an absolute-valued version of CPMI might be justified by arguing that words which influence each other's distribution should be connected, whether this influence is positive or negative.
In §D.1 we include full results both with and without taking the absolute value of the CPMI matrices before extracting trees. In general, the absolute-valued CPMI dependencies show a modest increase in UUAS over those extracted from the corresponding signed matrices. But it is not clear that the choice to use absolute-valued CPMI is justified conceptually. Contrary to the conceptual motivation for CPMI dependencies, in which words which often occur together should be linked, an absolute-valued version also links words which are highly informative of each other's absence. For this reason we do not use an absolute-valued version of CPMI by default, but report those results for comparison, note that the UUAS is in fact higher with the absolute value, and refrain from further speculation.

A.4 Additional analysis of CPMI dependencies
A.4.1 Similarity between models
Figure 10 shows the similarity of the CPMI dependency structures extracted from the different contextual embedding models. We measure the similarity of two dependency structures with the Jaccard index of their sets of predicted edges. The Jaccard index measures the similarity of two sets A, B and is defined as J(A, B) = |A ∩ B| / |A ∪ B|. The contextualized models agree with each other on around 30-50% of the edges, and agree with the noncontextual baseline W2V slightly less. In general, they agree with the linear baseline at somewhat higher rates.
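For concreteness, a sketch of this measure over two hypothetical edge sets:

```python
def jaccard(a, b):
    """Jaccard index J(A, B) = |A n B| / |A u B| of two edge sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Toy example: undirected edges stored as frozensets so (2, 3) == (3, 2).
edges_m1 = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
edges_m2 = {frozenset(e) for e in [(0, 1), (1, 3), (2, 3)]}
print(jaccard(edges_m1, edges_m2))  # → 0.5  (2 shared edges, 4 in the union)
```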

A.4.2 Accuracy versus arc length
Breaking down the results by dependency length, Figure 8 shows the recall accuracy of CPMI dependencies, grouped by the length of the gold arc. In general, length-1 arcs have the highest accuracy, and longer dependencies have lower accuracy. CPMI dependencies from BERT (large) have 81% recall accuracy on length-1 arcs, with arcs longer than 1 having much lower recall (13% overall), near random (10%). In other models, XLNet in particular, the distinction is less binary, but the trend is still lower recall on longer arcs.

A.4.3 Accuracy versus perplexity
Here we investigate the correlation between language model performance and CPMI-dependency accuracy. If models' confidence in prediction were tied to accuracy, it would be hard to argue that the relatively low accuracy scores we see are due to a lack of connection between syntactic dependency and statistical dependence, rather than to the models' struggling to recover such a structure. We measure model confidence by obtaining a perplexity score for each sentence, calculated as the negative mean of the pseudo log-likelihood; that is, for a sentence w of length n, −(1/n) ∑_{I=1}^{n} log p(w_I | w_{−I}). Figure 9 shows that accuracy is not correlated with sentence-level perplexity for any of the models (fitting a linear regression, R² < 0.05 for each model). That is, the accuracy of CPMI-dependency structures is roughly the same on the sentences which the model predicts confidently (lower perplexity) as on the sentences which it predicts less confidently (higher perplexity).
(Figure 8 caption: the distinction is mostly between arcs of length 1 vs. longer arcs; note that the relatively higher overall accuracy of BERT (large)'s estimates is driven by its very large proportion of length-1 arcs.)
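A toy sketch of the sentence score, with a hypothetical vector of per-token masked-LM probabilities standing in for a real model's estimates:

```python
import math

# Hypothetical masked-LM probabilities p(w_I | w_{-I}) for a 4-token sentence.
token_probs = [0.25, 0.10, 0.50, 0.05]

# Negative mean pseudo log-likelihood, and the corresponding pseudo-perplexity.
n = len(token_probs)
neg_mean_pll = -sum(math.log(p) for p in token_probs) / n
pseudo_ppl = math.exp(neg_mean_pll)
```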

A.4.4 UUAS during training
We examined the accuracy of CPMI dependencies during training of BERT (base uncased) from scratch. Figure 11 shows the average perplexity of this model, along with the sentence-wise average accuracy of CPMI structures at selected checkpoints during training. After about one million training steps the model reaches a plateau in performance (perplexity stops decreasing), and the UUAS has also plateaued by that point, though it in fact reached its highest value after only one hundred thousand training steps.
A.4.5 UUAS by dependency label
Table 5 gives per-dependency-label recall accuracy of CPMI dependencies, for the subset of dependency labels on which XLNet (base) achieves accuracy higher than both the linear and the random (projective) baselines.

B Information Bottleneck for POS probe
The simple POS probe is a d-by-h matrix, where the input dimension h is the contextual embedding network's hidden-layer dimension, and the output dimension d is the number of distinct POS tags in the tagset. Interpreting the output as an unnormalized probability distribution over POS tags, we train the layer to minimize the cross-entropy loss between the predicted and observed POS tags (using the labels from the Treebank). Training a simple linear probe is a rough way to get compressed representations from contextual embeddings, but it has limitations (Hewitt and Liang, 2019).
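The following is a minimal numpy sketch of such a probe (random arrays stand in for the embeddings and tags; a real implementation would train with an optimizer over the treebank): one softmax cross-entropy evaluation and a single hand-derived gradient step.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, d = 32, 16, 5             # hypothetical sizes: tokens, hidden dim, tagset size
H = rng.normal(size=(n, h))     # stand-in contextual embeddings
y = rng.integers(0, d, size=n)  # stand-in gold POS tags

P = np.zeros((d, h))            # the probe: a single d-by-h matrix

def loss_and_grad(P):
    logits = H @ P.T                                 # (n, d) unnormalized tag scores
    logits -= logits.max(axis=1, keepdims=True)      # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.mean(np.log(probs[np.arange(n), y]))  # cross-entropy
    err = probs.copy()
    err[np.arange(n), y] -= 1.0                      # softmax minus one-hot
    return loss, (err.T @ H) / n                     # gradient w.r.t. P

loss0, g = loss_and_grad(P)
P -= 0.1 * g                    # one gradient-descent step
loss1, _ = loss_and_grad(P)
assert loss1 < loss0            # the probe is learning
```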
A more correct way of extracting these representations is by a variational information bottleneck technique (Tishby et al., 2000). We implement this technique (roughly following Li and Eisner, 2019) as follows. The objective is to minimize L_IB = −I[Y; Z] + β I[H; Z], where H is the input embedding, Z the latent representation, and Y the true label. This technique trains two sets of parameters: the decoder, a linear model just as in the simple linear POS probe, and the encoder, another linear model, whose output in our case is interpreted as the means and log-variances of a multivariate Gaussian (a simplifying assumption). Minimizing this loss maximizes the information the compressed representations carry about the output labels, given a constraint on the amount of information they carry about the original embeddings.
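A rough numpy sketch of the two terms of this loss, under the Gaussian-encoder simplification (all sizes and parameters here are hypothetical stand-ins): the decoder's cross-entropy is a variational bound related to −I[Y; Z], and the closed-form Gaussian KL term bounds I[H; Z].

```python
import numpy as np

rng = np.random.default_rng(0)
b, h, z_dim, d = 8, 16, 4, 5       # hypothetical batch, hidden, latent, tagset sizes
H = rng.normal(size=(b, h))        # stand-in contextual embeddings
y = rng.integers(0, d, size=b)     # stand-in gold POS tags

# Encoder: a linear map whose output is split into means and log-variances.
enc = rng.normal(scale=0.1, size=(2 * z_dim, h))
mu, logvar = np.split(H @ enc.T, 2, axis=1)

# Reparameterized sample z ~ N(mu, exp(logvar)).
z = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

# Decoder: a linear probe from z to tag logits, as in the simple probe.
dec = rng.normal(scale=0.1, size=(d, z_dim))
logits = z @ dec.T
logits -= logits.max(axis=1, keepdims=True)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
ce = -np.mean(np.log(probs[np.arange(b), y]))   # decoder cross-entropy term

# Closed-form KL( N(mu, sigma^2) || N(0, I) ), averaged over the batch.
kl = 0.5 * np.mean(np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1))

beta = 0.1
loss_ib = ce + beta * kl
```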

C Equivalence of max pmi and max conditional probability objectives
Mareček (2012) describes the equivalence of optimizing for trees with maximum conditional probability of dependents given heads and optimizing for the maximum PMI between dependents and heads. This equivalence relies on an assumption that the marginal probability of words is independent of the parse tree.
For a corpus C, a dependency structure t can be described as a function which maps the index of a word to the index of its head. If the net mutual information between dependents and heads according to dependency structure t is pmi(t) := ∑_i pmi(w_i; w_{t(i)}), and the log conditional probability of dependents given heads is ℓ_cond(t) := ∑_i log p(w_i | w_{t(i)}), the optimum is the same:
argmax_t pmi(t) = argmax_t ∑_i [log p(w_i | w_{t(i)}) − log p(w_i)]
= argmax_t [ℓ_cond(t) − ∑_i log p(w_i)]
= argmax_t ℓ_cond(t). (3)
The step taken in (3), dropping the term ∑_i log p(w_i), follows only under the assumption that the marginal probability of dependent words is independent of the structure t. That is, that "probabilities of the dependent words . . . are the same for all possible trees corresponding to a given sentence" (Mareček, 2012, §5.1.2). This must be stipulated as an assumption in a probabilistic model for the above derivation to hold.
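This equivalence can be checked numerically on a toy example (all probabilities below are made up, and the enumeration over two candidate attachments is a simplification of a full tree search):

```python
import math

# Toy vocabulary with (tree-independent) marginals and conditionals p(dep | head).
words = ["ravens", "flew", "nests"]
p_marginal = {"ravens": 0.2, "flew": 0.3, "nests": 0.5}
p_cond = {("ravens", "flew"): 0.6, ("nests", "flew"): 0.3,
          ("nests", "ravens"): 0.1, ("flew", "ravens"): 0.4}

# Candidate structures: word 1 ("flew") is the root; t maps each other word's
# index to its head's index.  "nests" attaches either to "flew" or to "ravens".
candidates = [{0: 1, 2: 1}, {0: 1, 2: 0}]

def ell_cond(t):
    return sum(math.log(p_cond[(words[i], words[h])]) for i, h in t.items())

def pmi_score(t):
    return sum(math.log(p_cond[(words[i], words[h])]) - math.log(p_marginal[words[i]])
               for i, h in t.items())

best_by_pmi = max(candidates, key=pmi_score)
best_by_cond = max(candidates, key=ell_cond)
assert best_by_pmi == best_by_cond  # same optimum when marginals ignore the tree
```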

D Augmented tables of results
We give results in further detail for the CPMI dependencies on the English PTB Wall Street Journal (WSJ) corpus and on the multilingual PUD treebanks. The tables described below follow this appendix.

D.1 Results on WSJ data
Results presented in this section repeat those given in the main text, with two independent additional parameters: projectivity and absolute value.
Projectivity As described in §3.1, in the main text we report results for projective CPMI dependency trees extracted from CPMI matrices using Eisner's algorithm (Eisner, 1996, 1997). These results are also repeated below, but we additionally present UUAS results for maximum spanning trees (MSTs) extracted from CPMI matrices using Prim's algorithm (Prim, 1957), following Hewitt and Manning (2019).
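A compact sketch of the MST extraction step, using a hand-rolled Prim's algorithm over a hypothetical symmetrized score matrix (stated in maximizing rather than the usual minimizing form):

```python
import numpy as np

def max_spanning_edges(S):
    """Prim's algorithm on a symmetric score matrix S: greedily grow a tree,
    at each step adding the highest-scoring edge to an unvisited word."""
    n = len(S)
    in_tree, edges = {0}, []
    while len(in_tree) < n:
        best = max(((S[i, j], i, j) for i in in_tree for j in range(n)
                    if j not in in_tree), key=lambda x: x[0])
        _, i, j = best
        edges.append((i, j))
        in_tree.add(j)
    return edges

# Hypothetical symmetrized CPMI scores for a 4-word sentence.
S = np.array([[0.0, 2.0, 0.1, 0.3],
              [2.0, 0.0, 1.5, 0.2],
              [0.1, 1.5, 0.0, 1.0],
              [0.3, 0.2, 1.0, 0.0]])
print(max_spanning_edges(S))  # → [(0, 1), (1, 2), (2, 3)]
```

Note that a plain MST, unlike Eisner's decoder, does not enforce projectivity.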
Absolute value In the main text we consider dependencies extracted from signed CPMI matrices. As described in §A.3.3, we also compute UUAS from absolute-valued matrices, and report them here.
• Table 6 is an augmented version of Table 1 from the main text, containing results for CPMI-dependencies both with and without the projectivity constraint.
• Table 7 is as the previous, but using an absolute-valued version of CPMI.
• Table 10 is likewise an augmented version of Table 3 from the main text, containing results for POS-CPMI-dependencies both with and without the projectivity constraint.
• Table 11 is as the previous, but using an absolute-valued version of POS-CPMI.
In these tables, we also include the UUAS of a randomized 'length-matched' control. For each sentence, this control consists of a randomized tree whose distribution of arc lengths is identical to that of the gold tree (obtained by rejection sampling). Tables 8 and 9 give augmented UUAS results as in Tables 6 and 7, respectively, but for only the sentences of length ≤ 10 from the test split (section 23) of the WSJ corpus (WSJ10). We include these results for comparison with much of the unsupervised dependency parsing literature following Klein and Manning (2004), which reports results on that subset. Note that the UUAS is naturally higher across the board on this corpus of shorter sentences.

D.2 Results on multilingual PUD data
Table 12 gives results on the 20 languages of the Parallel Universal Dependencies (PUD) treebanks. These parallel treebanks were included in the CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies. The PUD treebank for each language consists of 1000 sentences annotated for Universal Dependencies. The sentences are translated into each of the languages, with the majority (750) being originally in English. We compute CPMI for these sentences using the multilingual pretrained BERT-base model made available by Hugging Face Transformers (Wolf et al., 2020). 12 This model was trained using masked language modelling and next sentence prediction on the 104 languages with the largest Wikipedias, including all 20 in the PUD. UUAS for CPMI dependency trees for all languages is plotted in Figure 13.

[Table 12 column residue: for each language, mean sent. length; connect-adjacent and random baselines; CPMI and CPMI (abs), for both MSTs and projective MSTs.]

Figure 12: UUAS for the multilingual Parallel UD dataset, for CPMI dependencies extracted from BERT-base multilingual. Note that while the dataset consists of the same 1000 sentences translated into the 20 languages, there is some variation across languages in mean sentence length. Projective (signed) UUAS are plotted below in Figure 13 with random and connect-adjacent baselines.

Figure 14: Additional examples of projective parses from Bart, BERT, DistilBERT, XLM, XLNet, and the noncontextual baseline Word2Vec. Gold-standard dependency parse above in black, CPMI dependencies below, blue where they agree with the gold dependencies and red where they do not. Accuracy scores (UUAS) are given for each sentence. [Parse diagrams not reproduced; surviving per-sentence scores: "It is impossible to know whether that theory is realistic." — BERT-base 3/9 = 33%, 4/9 = 44%, 5/9 = 56%, XLNet-base 7/9 = 78%, W2V-signed 2/9 = 22%; "Gold was nowhere the spectacular performer it was two years ago on Black Monday." — BERT-base 9/13 = 69%, 7/13 = 54%, 6/13 = 46%, XLNet-base 4/13 = 31%, W2V-signed 6/13 = 46%; "We have sufficient cash flow to handle that," he said." — W2V-signed 5/9 = 56%.]