Reconstruction Probing

We propose reconstruction probing, a new analysis method for contextualized representations based on reconstruction probabilities in masked language models (MLMs). This method relies on comparing the reconstruction probabilities of tokens in a given sequence when conditioned on the representation of a single token that has been fully contextualized and when conditioned on only the decontextualized lexical prior of the model. This comparison can be understood as quantifying the contribution of contextualization towards reconstruction -- the difference in the reconstruction probabilities can only be attributed to the representational change of the single token induced by contextualization. We apply this analysis to three MLMs and find that contextualization boosts reconstructability of tokens that are close to the token being reconstructed in terms of linear and syntactic distance. Furthermore, we extend our analysis to finer-grained decomposition of contextualized representations, and we find that these boosts are largely attributable to static and positional embeddings at the input layer.


Introduction
Model building in contemporary Natural Language Processing usually starts with a neural network pretrained on the objective of context reconstruction ("language modeling"). Contextualized representations of complex linguistic expressions from such models have been shown to encode rich lexical and structural information (Tenney et al., 2019b; Rogers et al., 2020), making these models an effective starting point for downstream applications.
Probing pretrained language models aims to understand the linguistic information they encode, and how well it aligns with our understanding of human language (see Belinkov 2022 for a review). The methodologies employed include supervised classifiers targeting specific linguistic properties of interest (Ettinger et al. 2016; Giulianelli et al. 2018; Tenney et al. 2019a; Conia and Navigli 2022), similarity-based analyses (Garí Soler and Apidianaki, 2021; Lepori and McCoy, 2020), cloze-type tests (Goldberg, 2019; Pandit and Hou, 2021), and causal intervention-based methods (Vig et al., 2020; Elazar et al., 2021; Geiger et al., 2021). This methodological diversity is beneficial given the high variability of conclusions that can be drawn from a study using a single method (Warstadt et al., 2019); converging evidence is necessary for a more general picture.
We contribute to this line of research with a new analysis method that we name reconstruction probing, which relies on token probabilities obtained from context reconstruction and is applicable to models pretrained on objectives of this kind.1 Our method is characterized by two core properties. First, it is causal: rather than asking "what features can we extract from the contextualized representations?", we ask "what effect does contextual information have on the model predictions?" through intervention at the input level. Second, our method is behavioral: it relies on the context reconstruction objective that the model was trained on. This obviates the need to train specialized probes, which can be difficult to interpret due to the added confound of task-specific supervision.
Our method aims to probe how much information the contextualized representation of a single token contains about the other tokens that co-occur with it in a given sequence in masked language models. Our approach is to measure the difference between the reconstruction probability of a co-occurring token in the sequence given the full contextualized representation being probed, and the reconstruction probability of the same co-occurring token given only the lexical priors of the model. This method can be generalized to compare two arbitrary representations where one representation is expected to contain strictly more features than the other (e.g., a static embedding of a token vs. an embedding of the same token created by summing the static embedding and its positional embedding in context). Any difference between the reconstruction probabilities can be attributed to the presence/absence of those features. Using this method, we find that the contextualized representation of a token contains more information about tokens that are closer in terms of linear and syntactic distance, but does not necessarily encode the identities of those tokens. A follow-up analysis that decomposes contextualized representations furthermore shows that the gains in reconstructability we find are largely attributable to static and positional embeddings at the input layer.

Proposed Approach
Pretrained Transformer models such as BERT (Devlin et al., 2019) learn to construct contextual representations through context reconstruction objectives like masked language modeling (MLM; e.g., predicting the token in place of [MASK] in The [MASK] sat on the mat). Often, the models are also trained to reconstruct a randomly substituted token (e.g., predicting the token in place of door in The cat sat door the mat, created by randomly substituting a word in The cat sat on the mat). The classifier that makes these predictions can only make use of a single token representation from the final layer, meaning these representations are optimized to contain information about other tokens of the sequence and the position of the token itself insofar as this information can help to resolve the identity of the token. Our approach aims to quantify how much the contextualization of these tokens contributes to changing the MLM predictions.
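As a concrete illustration, the corruption scheme just described (masking plus occasional random substitution) can be sketched as follows; the rates, placeholder vocabulary, and function name are illustrative assumptions, not the exact BERT recipe.

```python
import random

def corrupt(tokens, mask_token="[MASK]", vocab=None,
            p_mask=0.12, p_substitute=0.015, seed=0):
    """Toy MLM-style corruption: replace some tokens with [MASK],
    occasionally with a random vocabulary token. Corrupted positions
    keep their original token as the prediction target."""
    rng = random.Random(seed)
    vocab = vocab or ["the", "cat", "sat", "on", "mat", "door"]
    corrupted, targets = [], []
    for tok in tokens:
        r = rng.random()
        if r < p_mask:                      # mask this position
            corrupted.append(mask_token)
            targets.append(tok)
        elif r < p_mask + p_substitute:     # random substitution
            corrupted.append(rng.choice(vocab))
            targets.append(tok)
        else:                               # leave untouched
            corrupted.append(tok)
            targets.append(None)
    return corrupted, targets
```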

Metric
We operationalize the contextual informativeness of a token representation as its contribution to predicting other tokens in the same sequence, i.e., the contribution to the MLM probability, or reconstruction probability. We quantify the contribution of a more informative token representation j++ towards reconstructing a different token i by comparing the reconstruction probability P(i|j++) to the reconstruction probability of i given a less informative token representation j, P(i|j).
We express this contribution in terms of the log odds ratio given the base reconstruction probability q (predicting from less context) and the contextualized reconstruction probability p (predicting from more context):

LOR(p, q) = log(p / (1 - p)) - log(q / (1 - q))   (Eq. 1)

The probabilities p and q are defined with respect to SOURCE and RECONSTRUCTION (shortened as RECON) tokens. SOURCE tokens refer to tokens that are revealed to the model at prediction time (e.g., Buddy in the running example). RECON tokens are tokens in the original sequence the model is asked to predict (e.g., chased in the running example). In obtaining probabilities p and q, the RECON tokens are replaced with [MASK] tokens, leaving only the SOURCE token revealed to the model (a more detailed description is given in Section 2.2). The MLM probability of the token in the original sequence is computed for each [MASK] token in the probe input; for instance, for Buddy [MASK] [MASK], we compute the probability of chased at position 1 given this sequence, and Cookie at position 2 given this sequence. We compute Eq. 1 for every pair of tokens (t_i, t_j) in a given sequence, where t_i is SOURCE and t_j is RECON. This value represents the degree of change in the probability of the reconstruction token t_j induced by the contextualization of the source token t_i.
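Assuming Eq. 1 is the standard log odds ratio between p and q, the metric can be sketched as:

```python
import math

def log_odds_ratio(p: float, q: float) -> float:
    """Log odds ratio between the reconstruction probability p
    (predicting from more context) and the base probability q
    (predicting from less context). Positive values mean the
    added context boosted reconstruction of the token."""
    return math.log(p / (1 - p)) - math.log(q / (1 - q))

# A RECON token whose probability rises from 0.01 (lexical prior)
# to 0.30 (contextualized SOURCE) receives a positive boost.
boost = log_odds_ratio(0.30, 0.01)
```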

Obtaining the Reconstruction Probabilities
We use the metric proposed above to gauge the contribution of a contextualized representation of a single token in reconstructing its context, over and above the lexical prior (i.e., completely context-independent) of the model, as illustrated in Figure 1. We describe below how the reconstruction probabilities from a fully contextualized representation and from the lexical prior of the model are obtained.
Fully Contextualized To obtain a fully contextualized representation of a token in a particular sequence (e.g., Buddy chased Cookie), we first pass the original, unmasked sequence of tokens through a masked language model. Here, we save each contextualized token representation at each layer of the model (e.g., Buddy_L1, Buddy_L2, ..., Buddy_Lm, where m is the number of layers). Then, we create n (n = |seq|) versions of the input sequence where only a single token is revealed (e.g., Buddy [MASK] [MASK]). We pass each sequence through the same masked language model, but at each layer, we replace the representation of the unmasked token with the stored contextualized representation of that token (see Figure 2 for an illustration).
Then, in order for the masked language modeling head to predict each [MASK] token in the sequence, it can only rely on the information from the representation of the single unmasked token (SOURCE), where the SOURCE token representation is contextualized with respect to the original, unmasked sequence.

Lexical Prior Only Baseline We pass through a fully masked version of the input sequence as above, but do not add the positional embeddings at the input layer. The reconstruction probability that we obtain here corresponds to the probability of predicting the token in the original sequence in the absence of any lexical information and positional information. We expect this probability to reflect a general prior of the model over the vocabulary, for instance based on frequency in the training corpus.
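A schematic sketch of constructing the single-token-revealed probe inputs follows; the function name is ours, and the model-internal step of transplanting the stored layerwise representations is omitted here.

```python
def make_probe_inputs(tokens, mask_token="[MASK]"):
    """Build the n single-token-revealed probe inputs for an
    n-token sequence: probe k reveals only token k (the SOURCE)
    and masks every other position (the RECON positions)."""
    return [[tok if i == src else mask_token
             for i, tok in enumerate(tokens)]
            for src in range(len(tokens))]

probes = make_probe_inputs(["Buddy", "chased", "Cookie"])
# probes[0] == ["Buddy", "[MASK]", "[MASK]"]
# probes[1] == ["[MASK]", "chased", "[MASK]"]
```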

Models
We analyzed three Transformer-based masked language models widely used for obtaining contextualized representations: BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019) and DistilBERT (Sanh et al., 2019). BERT and RoBERTa were both pretrained using the masked language modeling objective (BERT also on Next Sentence Prediction), and DistilBERT is a more compact version of BERT obtained through knowledge distillation. DistilBERT has been claimed to retain much of the downstream task performance of BERT despite being substantially smaller (Sanh et al., 2019), and has been shown to be highly similar to BERT in terms of constituency trees that can be reconstructed from linear probes (Arps et al., 2022).

Data
We used sentences from the Multi-Genre Natural Language Inference (MNLI; Williams et al. 2018) dataset for this analysis. We selected MNLI because it contains sentences of varying lengths from a range of domains, and is not a part of the pretraining data of the models we are probing. We then sampled 10K premise sentences from the non-spoken genres of the dataset (i.e., excluding TELEPHONE and FACE-TO-FACE). We excluded spoken data as it is less typical of the data domain the models were trained on, and we excluded hypothesis sentences because they were generated by crowdworkers given the naturally occurring premises.

Procedure
For each of the 10K sentences, we created two different sets of probe inputs as illustrated in Figure 1. We passed the probe inputs to the models to obtain the two different reconstruction probabilities (from lexical prior only vs. from a fully contextualized source token) of each of the tokens in the input, as described in Section 2.2. Finally, we computed the log odds ratios between the two reconstruction probabilities using Eq. 1 to quantify the contribution of contextualization for all possible (SOURCE, RECON) token pairs in the original sentence.

Is Token Identity Exactly Recoverable from Contextualized Representations?
The RECON token is among the top 10 MLM predictions of the model only a small percentage of the time (BERT: 22.1%, RoBERTa: 7.9%, DistilBERT: 8.2%), even though the SOURCE token provided to the model has been contextualized with all co-occurring tokens revealed. This observation suggests that the information encoded in the contextualized representations is a degree more abstract than directly encoding the identities of co-occurring tokens in the same sequence. This is in line with Klafka and Ettinger's (2020) finding that the features of co-occurring tokens, rather than their identities, are often more recoverable from the contextual representations.
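The top-10 check described above amounts to asking whether the RECON token's id is among the k highest-probability entries of the MLM distribution at its masked position; a minimal sketch (function name is ours):

```python
def recon_in_top_k(probs, gold_id, k=10):
    """Return True if the original (RECON) token id is among the
    k highest-probability entries of the MLM distribution `probs`
    at a masked position."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i],
                    reverse=True)
    return gold_id in ranked[:k]

# Toy 4-type vocabulary: the gold token (id 2) has probability 0.30
# and is the model's second-best guess, so it is in the top 2.
assert recon_in_top_k([0.05, 0.60, 0.30, 0.05], gold_id=2, k=2)
```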

Is Reconstructability Greater when Tokens are in a Syntactic Relation?
We hypothesize that the contextual information in an embedding should disproportionately reflect the syntactic neighbors of the word. To test this hypothesis, we partition reconstructability scores based on the syntactic relation between the SOURCE and RECON tokens as follows: (1) SOURCE/RECON is head: Cases where there is a single dependency arc between the two tokens, the closest dependency relation possible with the exception of subword tokens.
Reconstructing cat from chased in Figure 5 would be a case of SOURCE is head, and chased from cat would be RECON is head.
(2) SOURCE/RECON is ancestor: Cases where there is more than one dependency arc connecting the two tokens. Reconstructing the from chased would be a case of SOURCE is ancestor, and chased from the would be RECON is ancestor.
(3) subword: SOURCE/RECON tokens are subwords of the same lexical item. Bud and ##dy is an example. (4) No relation: None of the above relations holds. For example, the tokens Bud and the are not in a dependency relation. (For these and subsequent analyses, we parse the sentences using the spaCy dependency parser: https://spacy.io/models/en#en_core_web_trf.)

Our results in Figure 3 confirm our hypothesis. In general, we find that the degree to which contextual information improves reconstruction depends on the existence of a syntactic relation between the SOURCE and RECON, as expected. In all models, tokens in a subword or head-dependent relation are more reconstructable from each other compared to tokens with no relation. Furthermore, among tokens that are in a dependency relation, the closer the relation, the higher the reconstruction boost: the boost is greatest for tokens in a subword relation, then for tokens in a head-dependent relation, and then for tokens in an ancestor-descendant relation. These trends were consistent across all models we evaluate, with the exception of DistilBERT, where the reconstruction boost when SOURCE is head was greater than for tokens in a subword relation. The models showed more variation in whether ancestor relations boosted reconstructability significantly. While tokens in an ancestor-descendant relation (excluding direct dependents) were more reconstructable than tokens not in a dependency relation in BERT, this was not the case for RoBERTa and DistilBERT. We also did not find a large or consistent effect of whether the SOURCE token or the RECON token is the ancestor (including direct head-dependent relations). Thus we cannot conclude that ancestors tend to contain more information about descendants than vice versa.

Finer-Grained Syntactic Properties
In the next set of analyses, we study how fine-grained syntactic properties of the words affect reconstructability, focusing on cases where there is a syntactic relation between SOURCE and RECON.
Dependency Relations One natural way to break down the results is by the label of the dependency relation that holds between SOURCE and RECON when such a relation exists. However, we did not find overarching trends; results were generally idiosyncratic, although the boost for token pairs in ROOT and PRT (particle) relations was high across all models. See Appendix A for full results.
Functional Relations Next, we zoom in on relations between functional heads and their content-word dependents (Figure 4). Table 1 lists all the dependency arcs we use to identify functional heads. First, we find that reconstructability is generally high for these pairs. Second, auxiliary-verb relations are associated with particularly high reconstructability for all models. One possible explanation for this finding is the fact that there is always morphological agreement between auxiliaries and verbs, unlike in most other functional relations. Third, among functional relations, reconstructability is always lowest for complementizer-verb relations (labeled mark). We speculate that the complementizer might encode contextual information about the entire complement clause, which often includes many more content words than just the head verb.
We hypothesized that functional heads encode more information about their dependents in context than vice versa, because function words carry less information than content words while their contextual representations are equal in size, leaving more space for information about the rest of the sentence. Results from BERT support the hypothesis for all relations. On the other hand, no consistent asymmetry was observed for RoBERTa, and for DistilBERT, the observed pattern mostly contradicts our hypothesis. The large difference between the BERT and DistilBERT results goes against prior results suggesting that the syntactic trees recoverable from these two models are highly similar (Arps et al., 2022).

Linear and Structural Distance
We also hypothesized that the distance between two tokens (both in linear and structural terms) would affect reconstruction. Linear distance is the difference between the linear indices of SOURCE and RECON: if they are the i-th and j-th tokens respectively, their linear distance is |i - j|. Structural distance is the number of arcs in the directed path between the SOURCE and RECON tokens (if there is a path). For example, in Figure 5 the structural distance between the and chased is 2.

Linear Distance Predictably, we find that the information encoded in contextualized representations is biased towards nearby tokens in linear space (Figure 6, row 1). In other words, we find that reconstructability generally decreases with increasing linear distance. For all models, the sharpest decrease is observed between 1- and 2-token distances. Beyond this, reconstructability decreases approximately linearly in BERT, and more gradually in RoBERTa and DistilBERT.
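The two distance measures can be sketched as follows. The head-array encoding of the dependency tree is our assumption, and the search runs over the undirected tree, which yields the same count as the directed path whenever one exists.

```python
from collections import deque

def linear_distance(i, j):
    """Linear distance between the i-th and j-th tokens."""
    return abs(i - j)

def structural_distance(heads, i, j):
    """Number of dependency arcs on the path between tokens i and j.
    The tree is given as a head array: heads[k] is the index of
    token k's head, with the root pointing to itself."""
    n = len(heads)
    adj = {k: set() for k in range(n)}
    for k, h in enumerate(heads):
        if h != k:
            adj[k].add(h)
            adj[h].add(k)
    seen, frontier = {i}, deque([(i, 0)])
    while frontier:             # BFS over the undirected tree
        node, dist = frontier.popleft()
        if node == j:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None                 # tokens not connected

# Buddy chased the cat: heads = [1, 1, 3, 1] (chased is the root).
# the (2) -> cat (3) -> chased (1): structural distance 2.
assert structural_distance([1, 1, 3, 1], 2, 1) == 2
```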

Structural Distance
The second row of Figure 6 shows the decline in reconstructability as the number of intervening nodes in the dependency path between the tokens increases. This trend is strictly monotonic in BERT, but there is a small increase starting from dependency depth 7 in RoBERTa and DistilBERT. Due to the high variance in the deeper cases, it is unclear whether this is a genuine effect of contextualization.

Decomposing Contextualization
While we examined the effect of contextualization compared to the lexical prior only baseline, our method allows for a finer-grained decomposition of the components of contextualization. In pretrained Transformer models, the input representation of a token is a function of the static lexical embedding and a (context-specific) positional embedding.
Using our method, we can study the individual influence of the lexical embedding, the positional embedding, and the remaining sequence-specific contextualization (i.e., everything that happens beyond the input layer, full contextualization henceforth). We create various ablated versions of a fully contextualized sequence, as shown in the Ablated sequence column of Table 2. The reconstruction probabilities from these ablated sequences allow us to probe the contribution of the various components of contextualized language models. Fully contextualized and All mask (-position) in Table 2 correspond to the reconstruction probabilities described and compared in Section 2.2, and the rest are intermediate ablations.
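Under our reading of Table 2, the ablation conditions can be enumerated schematically as follows. The condition names are ours, and the layerwise representation transplant for the fully contextualized condition is not modeled here.

```python
def ablated_inputs(tokens, src, mask_token="[MASK]"):
    """Enumerate the ablation conditions as
    (input sequence, use positional embeddings) pairs."""
    one_revealed = [tok if i == src else mask_token
                    for i, tok in enumerate(tokens)]
    all_masked = [mask_token] * len(tokens)
    return {
        # + transplant of stored layerwise SOURCE representations
        "fully_contextualized":    (one_revealed, True),
        "static_plus_position":    (one_revealed, True),
        "static_minus_position":   (one_revealed, False),
        "all_mask_plus_position":  (all_masked, True),
        # lexical prior only baseline
        "all_mask_minus_position": (all_masked, False),
    }
```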

Results
Surprisingly, we find that there is often no clear reconstruction benefit from providing the model with the contextualized embeddings at each layer, over just providing the input embedding (lexical + positional embeddings) of the source token (Figure 7, bottom). While BERT does gain reconstructability from full contextualization for subwords and when SOURCE is a head/ancestor, contextualization is generally harmful, or at least not helpful, to reconstruction for RoBERTa and DistilBERT. This indicates that the positive reconstruction boost observed in Figure 3 must be driven by static lexical and positional embeddings. Indeed, there are generally positive gains in reconstructability when models are provided with the lexical embeddings of the SOURCE tokens compared to models given only [MASK] tokens (Figure 7, top), and also when models are provided with positional embeddings on top of lexical embeddings (Figure 9, middle column; Appendix B.3). We provide full comparisons between ablations and their interpretation in Appendix B.

When is full contextualization helpful/harmful?
To better understand the effect of full contextualization, we manually examined token pairs with the greatest differences in reconstruction probabilities between the static lexical + positional and fully contextualized SOURCE tokens. In BERT and DistilBERT, the majority (52% and 80%) of the 100 most helpful cases of full contextualization involved reconstruction of an apostrophe in a contraction from single-character or bi-character tokens (e.g., m, t, re). As the source token is highly ambiguous on its own, contextualization seems to provide additional information that these (bi)character tokens are a part of a contraction (e.g., I'm, wasn't, we're). In RoBERTa, we found no interpretable pattern. Cases where full contextualization negatively affected reconstruction often involved SOURCE and RECON forming a common bigram (e.g., (prix, grand), (according, to), (##ritan, pu), (United, States)). Since the RECON token is predictable from SOURCE alone, full contextualization seems to only dilute the signal.
Although we found that reconstruction is often better given only input embeddings (i.e., static + positional embeddings) than fully contextualized embeddings, we take caution with the interpretation that full layerwise contextualization is in general harmful to the models, especially given prior evidence (Tenney et al., 2019a) that transformations across layers yield meaningful changes.One possible interpretation is that the idiosyncrasy of the procedure for transferring the contextualized source token falls outside the setting in which these models were trained, adding noise to the process.

Related Work
Our research question is similar to that of Klafka and Ettinger (2020), who use supervised classifiers to investigate how much information about other tokens in context is contained in the contextualized representation of a token. Our approach addresses a similar question through reconstruction probabilities given more or less informative token representations. Our findings about better reconstructability between tokens in a syntactic dependency relation echo prior work showing the sensitivity of MLMs to part-of-speech and other syntactic relations (Tenney et al., 2019b; Goldberg, 2019; Htut et al., 2019; Kim and Smolensky, 2021). A novel finding is that some of the syntactic dependency between tokens can be traced back to information in the input embeddings, complementing the dynamic layerwise analysis in work such as Tenney et al. (2019a) and Jawahar et al. (2019). This result aligns with Futrell et al.'s (2019) observation that syntactic dependency is reflected in the corpus distribution as encoded in static embeddings. Existing work that analyzes static embeddings from contextualized models (Bommasani et al., 2020; Chronis and Erk, 2020; Sajjad et al., 2022) mostly concerns the distillation of static embeddings rather than isolating the contribution of static embeddings in contextualized prediction, as in our work. More broadly, our work shares goals with intervention-based methods such as Geiger et al. (2021) and Wu et al. (2020), but we examine the effect of our intervention on masked language modeling probabilities rather than on separate downstream tasks. Karidi et al. (2021) employ the methodology most similar to ours, in their use of predictions from the masked language modeling objective directly for probing. However, their primary analysis concerns the role of contextualization in word sense disambiguation.

Conclusion
We proposed reconstruction probing, a novel method that compares reconstruction probabilities of tokens in the original sequence given different amounts of contextual information. Overall, reconstruction probing yields many intuitive results. We find that the information encoded in these representations tends to be a degree more abstract than the token identities of the neighboring tokens; often, the exact identities of co-occurring tokens are not recoverable from the contextualized representations. Instead, reconstructability is correlated with the closeness of the syntactic relation, the linear distance, and the type of syntactic relation between the SOURCE and RECON tokens. These findings add converging evidence to previous probing studies about the implicit syntactic information of contextual embeddings (Tenney et al. 2019b). Furthermore, our method is generalizable to comparing reconstruction probabilities from any pair of representations that differ in their degree of informativeness. Using this method, we extended our analysis to a finer-grained decomposition of the components that constitute contextualized representations, finding that most of the reconstruction gains we saw were attributable to information contained in static lexical and positional embeddings at the input layer. This calls for deeper investigation into the role of token representations at the input layer, complementing a large body of existing work on layerwise analysis of contextualized language models.

Limitations
As we discussed in Section 5.1, further work is needed to investigate whether the negative effect of full contextualization beyond static + positional embeddings at the input layer is an idiosyncrasy of the embedding transfer procedure, or a true effect. In future work, an experimental setup that is closer to the training setup, such as masking only the RECON token instead of all tokens and transferring the SOURCE, could be adopted in order to reduce the noise potentially introduced by the distributional change in the inputs. Regardless, we believe that our findings regarding the information content of the representation at the input layer (static + positional embeddings) are novel and meaningful, and that the quantification method we propose for comparing two representations in terms of their predictive utility is a generalizable methodological contribution.
We furthermore note that our attempts to conduct this evaluation on newer masked language models were made challenging by several technical issues in the library (e.g., masked language modeling being unavailable for DeBERTa (He et al., 2021)).

A Dependency Relations
Figure 8 shows the full reconstructability boost results for all dependency arc labels in our dataset.

B Detailed Decomposition Analysis
B.1 Creating Ablated Sequences

Fully contextualized See Section 2.2.
Static embedding (+position) We pass through the masked language model the n versions of the input sequence described above, each of which has a single token revealed, at the input layer only. Again, for each [MASK] token in the input sequence, we take the probability of the token in the same position in the original sequence as the reconstruction probability. This value corresponds to the probability of predicting the token in the original sequence given only the static lexical information of the source token and the positional information of the source and recon tokens.
Static embedding (-position) We pass through the n single-token-revealed versions of the input sequence as described above, but at the input layer, we do not add the positional embeddings. The reconstruction probability obtained, then, corresponds to the probability of predicting the token in the original sequence given only the static lexical information of the source token and no positional information of any of the tokens.
All mask (+position) We pass through a fully masked version of the input sequence that consists of the same number of [MASK] tokens and obtain the reconstruction probability of the tokens in the original sequence. Hence, in this scenario, there is no source. The value obtained through this input corresponds to the probability of predicting the token in the original sequence in the absence of any lexical information. Note that the model still has access to the positional embeddings of the recon token, which may still be weakly informative for token prediction.

B.2 Representations Compared
By comparing the reconstruction probabilities described above using Eq. 1, we can gauge the effect of the additional contextual information on performing masked language modeling. For example,

Figure 1 :
Figure 1: (Left) How the probability of chased from only the lexical priors of the model is obtained. The input to the model is a sequence of masked tokens of the same length as the original sentence, without any positional embeddings. (Right) How the probability of chased given a fully contextualized representation of the token Buddy is computed (see Figure 2 for more details). The reconstruction probabilities from (Left) and (Right) are compared using the log odds ratio (LOR; Eq. 1).

Figure 3 :
Figure 3: Reconstructability boost by syntactic relation, measured by log odds ratio.

Figure 4 :
Figure 4: Reconstructability boost (log odds ratio with vs. without source) broken down by the functional relation between a functional head and a content-word dependent.

Figure 5 :
Figure 5: The dependency parse of the sentence Buddy chased the cat.

Figure 6 :
Figure 6: Reconstructability boost (log odds ratio) broken down by linear distance (top) and structural distance (bottom) between SOURCE and RECON.

Table 2 :
Ablated sequence and an example of an input passed through the model to obtain the output representations when SOURCE is 'Buddy'. {} denotes an unordered set (i.e., no positional information).
Gabriella Chronis and Katrin Erk. 2020. When is a bishop not like a rook? When it's like a rabbi! Multi-prototype BERT embeddings for estimating semantic relationships. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 227-244, Online. Association for Computational Linguistics.

Simone Conia and Roberto Navigli. 2022. Probing for predicate argument structures in pretrained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4622-4632, Dublin, Ireland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for