Probing for Constituency Structure in Neural Language Models

In this paper, we investigate to what extent contextual neural language models (LMs) implicitly learn syntactic structure. More concretely, we focus on constituent structure as represented in the Penn Treebank (PTB). Using standard probing techniques based on diagnostic classifiers, we assess the accuracy with which constituents of different categories are represented in the neuron activations of a LM such as RoBERTa. To make sure that our probe focuses on syntactic knowledge and not on implicit semantic generalizations, we also experiment on a PTB version obtained by randomly replacing constituents with each other while keeping the syntactic structure, i.e., a semantically ill-formed but syntactically well-formed version of the PTB. We find that four pretrained transformer LMs obtain high performance on our probing tasks even on manipulated data, suggesting that semantic and syntactic knowledge in their representations can be separated and that constituency information is in fact learned by the LM. Moreover, we show that a complete constituency tree can be linearly separated from LM representations.


Introduction
Over the last years, neural language models (LMs), such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), RoBERTa (Liu et al., 2019b), and DistilBERT (Sanh et al., 2020), have delivered unmatched results on multiple key Natural Language Processing (NLP) benchmarks (Wang et al., 2018). Despite this impressive performance, the black-box nature of these models makes it difficult to ascertain whether they implicitly learn to encode linguistic structures, such as constituency or dependency trees.
There has been a considerable amount of research on which types of linguistic structure are learned by LMs (Tenney et al., 2019b; Conneau et al., 2018; Liu et al., 2019a). The motivation behind this question is two-fold. On the one hand, we want to better understand how pre-trained LMs solve certain NLP tasks, i.e., how their input features and neuron activations contribute to a specific classification success. A second motivation is an interest in distributional evidence for linguistic theory. That is, we are interested in assessing which types of linguistic categories emerge when training a contextual language model, i.e., when training a model only on unlabeled text. The research in this paper is primarily motivated by this second aspect, focusing on syntactic structure, more concretely on constituency structure. We investigate, for instance, whether for a pair of tokens in a sentence a LM implicitly learns which constituent (NP, VP, ...) is their lowest common ancestor (LCA). We use English Penn Treebank data (PTB; Marcus et al., 1993) to conduct our experiments.
A number of studies have probed LMs for dependency structure (Hewitt and Manning, 2019; Chen et al., 2021) and constituency structure (Tenney et al., 2019b). We probe constituency structure for the following reasons. In contrast to dependency structure, it can be richer in the abstract syntactic information it represents, since it directly assigns categories to groups of tokens. On the other hand, not all dependency labels are represented in a standard constituency structure, but they can be incorporated as extensions of the corresponding non-terminal nodes (see, e.g., the PTB label NP-SBJ in App. A.1). To quantify the gain from probing constituency rather than dependency trees, we compare the unlabeled bracketings of the syntactic trees in both formalisms on the PTB (Marcus et al., 1993; de Marneffe et al., 2006), where an unlabeled bracketing is the yield of a subtree. We find that while 97% of the bracketings in a dependency tree are also present in the corresponding constituency tree, only 54% of the bracketings in the constituency tree are present in the dependency tree. This shows that constituency trees can contain much more fine-grained hierarchical information than dependency trees. A further reason for focusing on constituency structure is that it is the type of structure most linguistic theories use.
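The bracketing comparison above can be illustrated with a small sketch (our own toy code and hand-built toy trees, not the paper's PTB script): spans of multi-word constituents are collected from a bracketed tree, spans of contiguous dependency subtrees from a head array, and the two sets are intersected.

```python
# Toy comparison of unlabeled bracketings in constituency vs. dependency trees.

def constituent_spans(tree):
    """Collect (start, end) spans of all multi-word constituents.
    A tree is a (label, children) tuple; a leaf is a token string."""
    spans = []
    def walk(node, pos):
        if isinstance(node, str):          # leaf: one token
            return pos + 1
        label, children = node
        begin = pos
        for child in children:
            pos = walk(child, pos)
        if pos - begin > 1:                # ignore single-token spans
            spans.append((begin, pos))
        return pos
    walk(tree, 0)
    return set(spans)

def dependency_spans(heads):
    """Collect spans of dependency subtrees (a token plus all descendants),
    given a 0-indexed head array with -1 for the root."""
    n = len(heads)
    desc = [{i} for i in range(n)]
    for _ in range(n):                     # propagate descendants upward
        for i, h in enumerate(heads):
            if h >= 0:
                desc[h] |= desc[i]
    spans = set()
    for d in desc:
        lo, hi = min(d), max(d) + 1
        if hi - lo > 1 and len(d) == hi - lo:   # contiguous multi-word subtree
            spans.add((lo, hi))
    return spans

# "the luxury auto maker sold cars"
ctree = ("S", [("NP", ["the", ("NML", ["luxury", "auto"]), "maker"]),
               ("VP", ["sold", ("NP", ["cars"])])])
heads = [3, 3, 3, 4, -1, 4]   # maker -> sold (root), cars -> sold, etc.

c_spans, d_spans = constituent_spans(ctree), dependency_spans(heads)
overlap = len(c_spans & d_spans) / len(c_spans)
```

On this toy pair, every dependency bracketing is also a constituency bracketing, but only half of the constituency bracketings appear in the dependency tree, mirroring the asymmetry reported above.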
We use diagnostic classifiers (Hupkes et al., 2018) and perform model-level, layer-level and neuron-level analyses. Most work on diagnostic classifiers applies mean pooling over representations when probing for a relationship between two words (Durrani et al., 2020). We empirically show that mean pooling results in lossy representations, and we recommend concatenation of representations as a better way to probe for relations between words.
A difficulty when probing a LM for whether certain categories are learned is that we cannot be sure that the LM does not instead learn a different category that is also predictive of the category we are interested in. More concretely, when probing for syntax, one should make sure that it is not semantics that one finds and mistakes for syntax (since semantic relations influence syntactic structure). This point was also observed by Gulordava et al. (2018) and, more recently, Hall Maudslay and Cotterell (2021). Therefore, before probing the LM for syntactic relations, we manipulate our data by replacing a subset of tokens with other tokens that appear in similar syntactic contexts, thereby obtaining nonsensical text that still has a reasonable syntactic structure. We then conduct a series of experiments showing that even for these nonsensical sentences, contextual LMs implicitly represent constituency structure. Lastly, we ask whether a full syntactic tree can be reconstructed using the linear probe. We achieve a labeled F1 score of 82.6% for RoBERTa when probing on the non-manipulated dataset, compared to 51.4% for a random representation baseline.
The contributions of our work are as follows:
• We find that constituency structure is linearly separable at various granularity levels: At the model level, four different LMs achieve similar overall performance on our syntactic probing tasks, but make slightly different predictions. At the layer level, the middle layers achieve the best results. At the neuron level, syntax is heavily distributed across neurons.
• We use perturbed data to separate the effect of semantics when probing for syntax, and we find that different sets of neurons capture syntactic and semantic information.
• We show that a simple linear probe is effective in analyzing representations for syntactic properties and we show that a full constituency tree can be linearly separated from LM representations.
The rest of the paper is structured as follows. The next section introduces related work. We define our linguistic probing tasks in Sec. 3. Sec. 4 introduces our experimental methodology. Secs. 5, 6, and 7 discuss our experiments and their results, and Sec. 8 concludes.
Related Work

An ample body of research exists on probing the sub-sentential structure of contextualized word embeddings. Peters et al. (2018) probed neural networks to see to what extent span representations capture phrasal syntax. Tenney et al. (2019b) devised a set of edge probing tasks to gain new insights into what is encoded by contextualized word embeddings, focusing on the relationship between spans rather than individual words. This enables them to go beyond sequence labeling problems to syntactic constituency, dependencies, entity labels, and semantic role labeling. Their results on syntactic constituency are in line with our findings. The major difference is that we employ simpler probes while achieving similar results. Moreover, we separate the effect of semantics using corrupted data, and we reconstruct full constituency trees using our probing setup. More recently, Wu et al. (2020) proposed a parameter-free probing technique to analyze LMs via perturbed masking. Their approach is based on assessing the impact that one word has on the prediction of another word within a sequence in the masked language modeling task. They also showed that LMs can capture syntactic information, with their self-attention layers being capable of surprisingly effective learning.
Hewitt and Manning (2019) demonstrated, using a structural probe, that it is possible to find a linear transformation of the space of the LM's activation vectors under which the distance between contextualized word vectors corresponds to the distance between the respective words in the dependency tree. In a similar vein, Chen et al. (2021) introduced another structural probe, the Poincaré probe, and showed that syntactic trees can be better reconstructed from the intermediate representations of BERT in a hyperbolic subspace. Gulordava et al. (2018) and Hall Maudslay and Cotterell (2021) recently argued that the work on probing syntax does not fully separate the effect of semantics. Both modify datasets such that the sentences become semantically nonsensical while remaining syntactically well-formed, in order to assess, based on this data, whether a LM represents syntactic information. Gulordava et al. (2018) modify treebanks in four languages by replacing content words with other content words that have matching POS and morphological features. They focus on the question of whether agreement information in the nonce sentences can be recovered from RNN language models trained on regular data. Hall Maudslay and Cotterell (2021) replaced words with pseudowords in a dependency treebank and quantified how much the pseudowords affect the performance of syntactic dependency probes. We follow a similar setup to separate the effect of syntax from semantics. In contrast to Hall Maudslay and Cotterell (2021), we replace words with other words (not pseudowords) that occur in a similar syntactic context but differ semantically. This way, the LM has seen most or all words of the semantically nonsensical sentences in pretraining and has learned their syntactic properties.

Syntactic and semantic knowledge
Fine-grained LM analysis Durrani et al. (2020) used a unified diagnostic classifier approach to perform analyses at various granularity levels, extending Dalvi et al. (2019a). We follow their approach and perform model-level, layer-level and neuron-level analyses (Sajjad et al., 2022a). We additionally extend their approach by proposing an improved way to probe representations of two words. Previous work has mainly used a bilinear probe to investigate syntax. We select a linear model for our experiments. Selecting a weak model ensures that high probe performance reflects a linguistic property encoded in the representations, rather than the strength of the classification model itself.

Diagnostic Tasks and Constituency trees
In this section, we define three classification tasks that are aimed at making different properties of constituency structure explicit. More specifically, the goal of these tasks is to make explicit if and how the LMs encode syntactic categories, such as S, NP, VP, and PP. The first task, lowest common ancestor prediction, focuses on constituents that span large portions of the sentence. The second task, chunking, focuses on constituents with smaller spans. The third task focuses on complete syntactic trees.
Lowest common ancestor (LCA) prediction Let s = w_0, ..., w_n be a sentence. Given combined representations of two tokens w_i, w_j with j ≥ i, predict the label of their lowest common ancestor in the constituency tree. LCA prediction is a multiclass classification task with 28 target classes (for the PTB). In the example in Fig. 1, luxury and maker have the LCA NP: the lowest node dominating both words has label NP (ignoring the function tag SBJ). The task also covers the LCA of two identical tokens; in this case, the lowest phrasal node above the token is the target label (for example, VP is the target label for sold).

Chunking For each token w_i, predict whether it is the beginning of a phrase (B), inside a phrase (I), the end of a phrase (E), or whether the token constitutes a single-token phrase (S). A token can be part of more than one phrase, and in this case we consider the shortest possible phrase only. For example, 1,214 in Fig. 1 has label B because it marks the beginning of a noun phrase. We also propose a version of this task with finer labels that combine B, I, E, S with the different phrase labels. In the detailed tagset, 1,214 receives the label B-NP.
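Gold labels for the LCA task can be read off a bracketed tree as in this minimal sketch (our own toy code and toy tree; the actual labels come from the PTB):

```python
# Extract the gold LCA label for a token pair from a bracketed tree.

def lca_label(tree, i, j):
    """Label of the lowest node dominating tokens i and j (i <= j).
    A tree is (label, children); leaves are token strings."""
    def collect(node, pos, out):
        # record (begin, end, label) for every constituent, in post-order
        if isinstance(node, str):
            return pos + 1
        label, children = node
        begin = pos
        for child in children:
            pos = collect(child, pos, out)
        out.append((begin, pos, label))
        return pos
    nodes = []
    collect(tree, 0, nodes)
    # post-order lists descendants before ancestors, so the first node
    # covering both tokens is the lowest common ancestor
    for begin, end, label in nodes:
        if begin <= i and j < end:
            return label
    return None

# "the luxury auto maker sold cars"
tree = ("S", [("NP", ["the", "luxury", "auto", "maker"]),
              ("VP", ["sold", ("NP", ["cars"])])])
```

For identical tokens (i == j), the same function returns the lowest phrasal node above the token, matching the convention described above (e.g., VP for sold).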
Reconstructing full constituency trees Vilares et al. (2020) consider constituency parsing as a multilabel sequence labeling problem. For each token w_i, three labels are predicted: first, the label of the LCA of the token pair (w_i, w_{i+1}); second, the depth of the LCA of (w_i, w_{i+1}) in the tree, relative to the depth of the LCA of (w_{i-1}, w_i); third, whether the token is a single-word constituent and, if so, the label of the internal tree node directly above w_i (tokens in multiword constituents make up > 90% of the data and receive a negative label). For the first two classifications, see Fig. 1. We build separate linear classifiers for each of these tasks and use their predictions to reconstruct full constituency trees.
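The first two label sequences of this encoding can be computed from a toy tree as follows (a sketch over our own simplified tree representation; the relative-depth scheme follows the description of Vilares et al. (2020) above):

```python
# Compute, for each adjacent token pair, the LCA label and the LCA depth
# relative to the previous pair's LCA depth.

def lca_labels_and_rel_depths(tree):
    """tree is (label, children); leaves are token strings."""
    nodes = []
    def walk(node, pos, depth):
        if isinstance(node, str):
            return pos + 1
        label, children = node
        begin = pos
        for child in children:
            pos = walk(child, pos, depth + 1)
        nodes.append((begin, pos, depth, label))
        return pos
    n = walk(tree, 0, 0)
    pairs = []
    for i in range(n - 1):
        # the lowest node covering i and i+1 is the deepest covering node
        _, _, d, lab = max((t for t in nodes if t[0] <= i and i + 1 < t[1]),
                           key=lambda t: t[2])
        pairs.append((d, lab))
    labels = [lab for _, lab in pairs]
    rel = [pairs[0][0]] + [pairs[k][0] - pairs[k - 1][0]
                           for k in range(1, len(pairs))]
    return labels, rel

# "the luxury auto maker sold cars"
toy_tree = ("S", [("NP", ["the", "luxury", "auto", "maker"]),
                  ("VP", ["sold", ("NP", ["cars"])])])
labels, rel = lca_labels_and_rel_depths(toy_tree)
```

On the toy sentence, the pair (maker, sold) has LCA S, one level shallower than the preceding pair, which yields a relative depth of -1 at that position.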

Methods
Diagnostic classification A common method to reveal linguistic representations learned in contextualized embeddings is to train a classifier, a probe, using the activations of the trained LM as features. The classifier performance provides insights into the strength of the linguistic property encoded in the contextualized word representations. For all our experimental setups, we employ the NeuroX toolkit (Dalvi et al., 2019b) for diagnostic classification, as it provides mechanisms to probe neural models at the level of both neurons and layers.

Layer-level probing We probe the activations of individual layers with linear classifiers to measure the linear separability of the syntactic categories at each layer. The performance at each layer serves as a proxy for how much information it encodes with respect to a given syntactic property.

Neuron-level probing Layer-wise probing cannot account for all syntactic abstractions encoded by individual neurons in deep networks. Some groups of neurons that are spread across many layers might robustly respond to a given linguistic property without being exclusively specialized for its detection. By operating also at the level of neurons, we aim at identifying the most salient neurons across the network that learn a given linguistic property.
We conduct a linguistic correlation analysis, as proposed by Dalvi et al. (2019a). This consists in augmenting the linear classifier with elastic net regularization (Zou and Hastie, 2005). The classifier is then trained by minimizing the loss function in Eq. 1:

L(θ) = -∑_i log P_θ(l_i | x_i) + λ_1 ∥θ∥_1 + λ_2 ∥θ∥_2²   (1)

where θ are the trained weights of the classifier and λ_1 ∥θ∥_1 and λ_2 ∥θ∥_2² correspond to L_1 and L_2 regularization. Elastic net regularization strikes a balance between selecting very focused features (neurons) (L_1) versus distributed features (L_2) shared across many properties. The input neurons to the linear classifier are ranked by saliency with respect to the classification task.

Input representation We combine the representation vectors x_i, x_j ∈ R^r of two tokens in LCA prediction and parse tree reconstruction via concatenation (concat(x_i, x_j) ∈ R^{2r}). In all experiments, concat produced significantly better results than elementwise averaging or a variant of the maximum. We cover the latter methods in Apps. A.3 and A.6.
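A minimal numpy sketch of this objective (our own illustration; NeuroX's actual implementation and optimizer details may differ): multinomial logistic loss on a linear classifier plus the two elastic-net penalties.

```python
import numpy as np

def elastic_net_loss(theta, X, y, lam1=0.001, lam2=0.001):
    """Mean cross-entropy of a linear softmax classifier plus
    L1 and squared-L2 penalties on the weights (cf. Eq. 1).
    theta: (dim, n_classes); X: (n, dim); y: (n,) integer labels."""
    logits = X @ theta
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    nll = -logp[np.arange(len(y)), y].mean()
    return nll + lam1 * np.abs(theta).sum() + lam2 * (theta ** 2).sum()

# Sanity check: with all-zero weights both penalties vanish and the
# cross-entropy equals log(n_classes).
X = np.random.randn(8, 4)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])
theta0 = np.zeros((4, 2))
```

Increasing lam1 pushes individual neuron weights to exactly zero (focused features), while lam2 shrinks all weights smoothly (distributed features), which is the trade-off described above.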
For chunking, the label distribution is relatively balanced. For LCA prediction, we remove all function tags to keep the number of target labels small. Most token pairs have a relatively large distance in the constituency tree, and their LCA is a node very high in the tree (typically with a label for some kind of sentence, such as S or SBAR). In addition, some phrase labels are less frequent than others (see App. A.4). We train and evaluate on the standard PTB training/development split.

Syntactic and semantic knowledge
To ensure that the probing classifier captures syntactic and not semantic properties, we use the original PTB data as well as two modified versions of the PTB with semantically nonsensical sentences that have the same syntactic structure as the original data. The modified versions of the PTB are obtained by making use of the dependency PTB (de Marneffe et al., 2006):
1. Record the dependency context of each token in the dataset. The dependency context consists of (i) the POS tag, (ii) the dependency relation of the token to its head, and (iii) the list of dependency relations of the token to its dependents.
2. Replace a fraction of the tokens with other tokens that appear in the dataset in the same dependency context.
Two versions are created, replacing either a third or two thirds of the tokens; see Table 1 for two examples. When creating the manipulated datasets, we separate the training and evaluation data. To create manipulated training data, we look for token replacements in the training split of the PTB (sections 0-18). For manipulated evaluation data, we look for token replacements in the development and test splits (sections 19-24). This ensures that the training and evaluation data do not mix, and at the same time, the meanings of the newly created sentences are as diverse as possible.
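The two steps above can be sketched as follows (toy data and our own simplified context representation; the actual experiments use the full dependency PTB):

```python
import random
from collections import defaultdict

def build_contexts(sentences):
    """Step 1: group word forms by dependency context.
    sentences: lists of (form, pos, deprel, child_deprels) tuples."""
    ctx2forms = defaultdict(set)
    for sent in sentences:
        for form, pos, deprel, kids in sent:
            ctx2forms[(pos, deprel, tuple(sorted(kids)))].add(form)
    return ctx2forms

def manipulate(sentence, ctx2forms, fraction=1/3, rng=None):
    """Step 2: replace roughly `fraction` of the tokens by other tokens
    that occur in the same dependency context."""
    rng = rng or random.Random(0)
    out = []
    for form, pos, deprel, kids in sentence:
        ctx = (pos, deprel, tuple(sorted(kids)))
        others = sorted(ctx2forms[ctx] - {form})
        if others and rng.random() < fraction:
            form = rng.choice(others)
        out.append(form)
    return out

# Toy stand-in for the dependency PTB: two determiner+noun sentences.
sents = [
    [("the", "DT", "det", []), ("board", "NN", "obj", ["det"])],
    [("a", "DT", "det", []), ("group", "NN", "obj", ["det"])],
]
ctx = build_contexts(sents)
nonsense = manipulate(sents[0], ctx, fraction=1.0)
```

Because replacements preserve POS, the relation to the head, and the relations to dependents, the manipulated sentence keeps its syntactic structure even when its meaning becomes nonsensical.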

Probing Classifier Settings
We use linear classifiers trained for 10 epochs with Adam optimization, an initial learning rate of 0.001 and regularization parameters λ 1 = λ 2 = 0.001.
The contextualized representation for an input token is created by averaging the representations of the subword tokens produced by the LM tokenizer.
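A sketch of this pooling step (our own code; in practice the subword-to-word alignment comes from the tokenizer, here it is given directly):

```python
import numpy as np

def pool_subwords(subword_vecs, word_ids):
    """Average subword vectors per word.
    subword_vecs: (n_subwords, dim); word_ids[k] is the index of the
    word that subword k belongs to."""
    n_words = max(word_ids) + 1
    out = np.zeros((n_words, subword_vecs.shape[1]))
    counts = np.zeros(n_words)
    for vec, w in zip(subword_vecs, word_ids):
        out[w] += vec
        counts[w] += 1
    return out / counts[:, None]

# e.g. "nonexecutive director" -> subwords "non", "##executive", "director"
vecs = np.array([[1.0, 3.0], [3.0, 5.0], [2.0, 2.0]])
words = pool_subwords(vecs, [0, 0, 1])   # two subwords map to word 0
```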

Baselines
We use three baselines to put the results of our probes into context.

Random BERT The first baseline is used in all experiments. It evaluates how much information about linguistic context is accumulated in the LM during pretraining (Belinkov, 2022). The model for this baseline has the same neural architecture, vocabulary and (static) input embeddings as BERT base, but all transformer weights are randomized.
Selectivity To evaluate whether the probe makes linguistic information explicit or just memorizes the task, we use the control task proposed by Hewitt and Liang (2019), described in App. A.2. The difference between linguistic task performance and control task performance is called selectivity. The higher the selectivity, the more one can be sure that the classifier makes linguistic structure inside the representations explicit and does not memorize the task. This baseline is used for the chunking experiments and, in a modified version, for the LCA experiments.
Individual tokens This baseline evaluates how much the representation of each individual token, in contrast to the token pair, contributes to the overall performance. It is used only for the LCA experiments. We train two classifiers using the representation of either the first or the second token in the pair and evaluate their performance on the diagnostic tasks. The trees in the PTB are right-branching; the left token is in most cases closer to the LCA node than the right token, so we expect the classifier trained on only the left token to have better overall performance.
Results for LCA prediction and chunking

We experimented with four pretrained models. Due to limited space and the consistency of results, we report the analysis for the RoBERTa model only in most cases. The complete results of all models are shown in appendix sections A.5 and A.6. In the following, we assess the overall performance of the probing classifiers on both linguistic tasks. Then, we evaluate how changing the semantic structure of the data influences the probing classifiers. Lastly, we give some insights into the layer-level and neuron-level experiments.

Overall performance
Tab. 2 shows the performance of the classifiers trained on non-manipulated data using all neurons of the network (orig./orig.). We observe high performance for each diagnostic task. The differences to the baselines show that knowledge about the task is indeed learned in the representation.

LCA prediction
The best results are achieved when concatenating token representations (82.8% acc.). For other representation methods, see App. A.6. We additionally consider single-word representations from the word pair as input. The left token representations are better predictors (66.5% acc. on orig./orig.) than those of the right token (40.8%). The large differences between concat and the baselines show that the probe is not memorizing the task, and that information relevant for predicting the LCA is acquired during pretraining.
Chunking Chunking with the detailed tagset (91.2% acc.) is a harder task than simple chunking (96.0%). Although the classifier for the detailed tagset shows relatively low selectivity in comparison to simple chunking, the overall selectivity is high enough to claim that the knowledge about these probing tasks is learned in the representation. The difference to the random BERT model is higher for detailed than for simple chunking, which shows that fine-grained syntactic knowledge is indeed learnt during pretraining.

Does the probe learn syntax or semantics?
The high performance of the classifiers serves as a proxy for the amount of syntactic knowledge learned in the representations. But Hall Maudslay and Cotterell (2021) argued that due to the presence of semantic cues in the data, high performance of a syntactic probe may not truly reflect the learning of syntax in the model. To investigate this, we manipulated our diagnostic task data (Sec. 5.1) to separate syntax from semantics, and then trained the probing classifiers on the manipulated data.
The second column in Table 2 shows variations of the manipulated data. The classification performance drops only slightly on the diagnostic tasks for 0.33/orig. Moreover, the classifiers perform slightly better when evaluated on original data (*/orig.) compared to manipulated data (such as .33/.33). There are two possible reasons for this: First, the probing classifiers may still rely on semantic knowledge, even when trained on the manipulated data; second, it is possible that the manipulated data contains syntactically ill-formed sentences. Nonetheless, performance and differences to the baselines are reasonably high and give good reason to believe that the classifiers are able to extract syntactic knowledge even from semantically nonsensical data. We now proceed with summarizing what our experiments tell us about syntactic knowledge in specific layers and neurons of the LM.

Layer-wise results
Fig. 2 shows how syntactic knowledge is distributed across layers. The embedding layer performs worst, while the middle layers show the best results, i.e., syntactic information is best represented in the middle layers. The highest layers are more heavily influenced by the pretraining objective, which explains the consistent performance drop across models and tasks in the last layers.
Comparing layer-wise performance with the overall performance, none of the individual layers outperforms the classifier trained on all layers for chunking. In the case of LCA prediction, the performance of layers 4-5 in RoBERTa (6-8 in BERT) is better than the overall performance on all layers. Comparing models, we observe that RoBERTa learns syntactic knowledge much earlier in the network than BERT (see the relatively sharp rise of performance in the lower layers of RoBERTa).

Neuron-level results
In this section we carry out a more fine-grained neuron-level analysis of the representations. Linguistic correlation analysis (Dalvi et al., 2019a) provides a ranking of neurons with respect to the diagnostic task.

Minimum subset of neurons We evaluate the neuron ranking by training classifiers using the top/bottom/random N% of neurons. Fig. 3 shows the accuracy curves for the chunking task. The performance margin between the different neuron selections is very low. This shows that syntactic information can be extracted from any relatively small subset of neurons, i.e., 20-30% of the neurons suffice for a probing classifier to perform with the same accuracy as when trained on the full representations. Neuron ranking on combined representations does not work well: In some cases, performance on a fraction of randomly selected neurons is worse than performance on the same fraction of neurons ranked as important (see App. A.7).
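The selection of the top N% of neurons can be sketched as ranking neurons by the magnitude of the probe's weights (a simplification of the ranking used by Dalvi et al. (2019a); threshold and scoring are illustrative):

```python
import numpy as np

def top_neurons(theta, pct=0.2):
    """theta: (n_neurons, n_classes) weight matrix of the linear probe.
    Score each neuron by its total absolute weight across classes and
    return the indices of the top pct fraction."""
    saliency = np.abs(theta).sum(axis=1)
    k = max(1, int(round(len(saliency) * pct)))
    return sorted(np.argsort(-saliency)[:k].tolist())

theta = np.array([[0.0, 0.1],    # neuron 0: low saliency
                  [2.0, -1.2],   # neuron 1: high
                  [0.5, 0.5],    # neuron 2: mid
                  [-3.0, 0.0]])  # neuron 3: high
```

The elastic-net penalty discussed earlier is what makes such weight-based rankings meaningful: L1 drives the weights of uninformative neurons toward zero.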

Distribution of Neurons for LCA prediction
Training on subsets of the neurons for LCA prediction is problematic because the neuron ranking list contains neurons from both token representations. Even so, the distribution of salient neurons across layers yields interesting insights. Fig. 4 presents the spread of top selected neurons for S. As in Sec. 6.3, we find again that the top neurons learning syntactic properties come from the middle layers. For the 12-layer LMs, we see a trend that neurons from the positional encoding in the embedding layer are utilized to identify distant tokens with LCA S. When comparing the salient neurons selected from each layer, we observe that for identifying S, neurons from the highest layers are less relevant than when identifying NPs (Fig. 5). This might be due to the comparatively high structural diversity we find in NPs.²

Neurons learning syntax vs. semantics Comparing the neuron rankings of chunking classifiers trained on the different datasets shows that there is relatively little overlap between the different groups of highest-ranking neurons (see App. A.8). This means that the probing classifiers focus on different neurons when trained on manipulated data compared to the original data. Presumably, the probe focuses more on syntactic and less on semantic information when trained on manipulated data.

Reconstructing full Parse Trees
With the insights gained in the previous section, we test whether full constituency trees can be linearly separated from LM representations. For this, we train three linear classifiers that take as input the concatenated representations of two adjacent tokens and predict the three labels described in Sec. 3. The classifiers for this task take as input not the full LM representation from all layers, but the concatenation of every third layer for the 12-layer LMs, and of every second layer for the 6-layer LM. This way, the input dimensionality of the classifier is restricted, but the probe can use information from different parts of the LM. The probe is trained (evaluated) on all 38k (5.5k) sentences of the training (development) split of the PTB. We find that the constituency trees reconstructed from the different LMs are of high quality (Tab. 3, App. A.9). We achieve a labeled F1 score of 82.6 on the non-manipulated dataset for RoBERTa (80.5 for XLNet, 80.4 for BERT), which is 31 points better than the random BERT baseline. This outperforms the result of Vilares et al. (2020) for BERT by 2.2 points. They also use a linear classifier, but their classifier receives as input only the final-layer representation of BERT for the first token in the token pair.
When comparing trees reconstructed from different LMs against each other, we find, however, that they are quite different. For example, comparing the sentence-level F scores for trees reconstructed from XLNet to those from RoBERTa yields a Pearson correlation of only 0.52 (compared to 0.64 for DistilBERT and BERT; see App. A.10 for the full comparison). This shows that different syntactic properties are linearly separable from the representations of the different LMs.

² All models are more accurate in LCA prediction when the two tokens are more distant, see App. A.6. A large distance between tokens correlates with LCA nodes close to the root of the syntactic tree, where the LCA often has label S.

Conclusions
Our experiments have shown that different pretrained LMs encode fine-grained linguistic information that is also present in constituency trees. More specifically, LMs are able to identify properties of different types of constituents, such as S, NP, VP, and PP. Good results on the chunking task show that the classifiers are able to combine knowledge about the kind of constituents a token is part of with knowledge about the position of the token within the constituent. Using the sequence labeling tasks presented in Vilares et al. (2020), we have shown that full constituency trees are linearly separable from four different pretrained LMs with high quality, even for semantically nonsensical data. In line with Gulordava et al. (2018), we observe a moderate performance drop between the original and the nonce dataset. The performance drop is smaller than in Hall Maudslay and Cotterell (2021), who use English pseudowords which the LM has probably never encountered. We use English words whose syntactic and semantic properties are already well-established inside the LM.
In future work, we plan to extend this syntactic probing approach to other languages and other syntactic annotation schemes (for instance, Hockenmaier and Steedman, 2007; Evang et al., 2021).

Limitations
Our work investigates the question of whether syntactic structure is linearly separable from LM representations. However, we make no claim about whether the syntactic concepts we probe for are actually relevant for LM predictions.
We demonstrate the effectiveness of our method for one high-resource language, namely English. While our methodology is in principle language-agnostic, our study requires high-performing LMs as well as large amounts of annotated data. Both are available for only a relatively small set of languages. More specifically, we found in pilot experiments that supervised probing in general, and separating syntactic and semantic knowledge in particular, is very data-hungry. While high probe performance on the original data required less than 10k sentences of training data, the performance difference between original and semantically manipulated data shrank when increasing the size of the training data from 10k to 38k sentences. Consequently, in order to obtain reliable findings, our experiments require large datasets, which translates into a need for sufficient computational resources. For a general discussion of the limitations of supervised probing classifiers, we refer to Belinkov (2022).

Figure 1: Example tree from the PTB. The line below the text shows gold labels for the simple chunking task. The bottom line shows label pairs from which the complete tree can be reconstructed.

Figure 4: Spread of neurons relevant for recognizing S in LCA prediction, across layers.

Figure 5: Accuracy per category for the most frequent labels in the detailed tagset. All beginnings of frequent constituents are recognized with high accuracy. The classifier is also able to distinguish between different kinds of NPs, such as NP without further specification, subject NPs (NP-SBJ), or temporal and local NPs (NP-TMP, NP-LOC). For LCA prediction, all models are more accurate when the distance between the two tokens is higher.

Figure 6: Results for LCA prediction.

Table 1: Examples of the manipulated data. Replaced words are printed in boldface.

orig. Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
.33   Pierre Berry, 5,400 years old, shall join the board as a nonexecutive director Nov. 29.
.67   Mesnil Vitulli, 9.76 beers state-owned, ca succeed either board as a cash-rich director October 213,000.
orig. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
.33   Mr. Vinken is growth without Elsevier Hills, each Dutch publishing group.
.67   Tata Helpern s chairman plus Elsevier Ohls, a Dutch snaking group.

Table 2: Results on the different datasets. For each task, there are different setups where the model is trained and evaluated on the unchanged treebank (orig.) or on the two versions with either a third (0.33) or two thirds (0.67) of the tokens replaced. 'task' shows the performance on the test set, 'sel.' shows the selectivity (difference to the control task), and ∆Random shows the performance difference to the random BERT model. B: beginning, I: inside, E: end, S: single, PCT: punctuation; punctuation is not considered for evaluation.

Table 4: Label distributions for the different probing tasks.

Table 5: Results for chunking.

A.6 LCA prediction results

LCA prediction results. The performance gains of concat w.r.t. avg and max are not matched by higher performance in the control task for concat. Thus concat shows not only the best task performance but also the highest selectivity for LCA prediction.

Confusion matrix for LCA prediction for the most frequent constituent labels for RoBERTa when trained and evaluated on non-manipulated data. The columns represent predicted values, the rows actual values. Some categories are better represented in the probing classifiers than others. For example, prepositional phrases are recognized quite reliably, but adjectival phrases are confused with VPs and NPs in a number of cases. NPs are frequently confused with all other categories. The reason might be that a variety of different phenomena are collected under NP, such as appositions and relative clauses.