Probing LLMs for Joint Encoding of Linguistic Categories

Large Language Models (LLMs) exhibit impressive performance on a range of NLP tasks, due to the general-purpose linguistic knowledge acquired during pretraining. Existing model interpretability research (Tenney et al., 2019) suggests that a linguistic hierarchy emerges in the LLM layers, with lower layers better suited to solving syntactic tasks and higher layers employed for semantic processing. Yet, little is known about how encodings of different linguistic phenomena interact within the models and to what extent processing of linguistically-related categories relies on the same, shared model representations. In this paper, we propose a framework for testing the joint encoding of linguistic categories in LLMs. Focusing on syntax, we find evidence of joint encoding both at the same (related part-of-speech (POS) classes) and different (POS classes and related syntactic dependency relations) levels of the linguistic hierarchy. Our cross-lingual experiments show that the same patterns hold across languages in multilingual LLMs.


Introduction
Recent advancements in natural language processing (NLP) can be attributed to the development and pretraining of large language models (LLMs) such as BERT, GPT-3, and many others (Devlin et al., 2019; Brown et al., 2020; Touvron et al., 2023). For their intended use of providing general-purpose language representations suitable for many NLP tasks, these models must efficiently capture a wide range of linguistic features within their finite capacity. Despite their success, little is known about the way in which different types of linguistic information are organized in these models. Systematically understanding how these models represent linguistic phenomena and their interaction is crucial for the development of more effective NLP methods.
Existing research probed LLMs for their encoding of various linguistic properties such as agreement (Jawahar et al., 2019), word order and sentence structure (Tenney et al., 2018; Hewitt and Manning, 2019), co-reference (Tenney et al., 2019), semantics (Ettinger, 2020) and multilinguality (Ravishankar et al., 2019; Libovický et al., 2020). Taking a step further, Tenney et al. (2019) and Clark et al. (2019) studied where linguistic information is encoded in LLMs by probing different layers. Their results demonstrated that a linguistic hierarchy emerges in BERT representations, with lower layers capturing local syntax and higher layers being employed in higher-level semantic and discourse tasks. However, we do not yet understand how encodings of different linguistic phenomena interact within the models and to what extent processing of linguistically-related categories relies on the same, shared model representations.
There are many dependencies between the processing of different linguistic phenomena: for instance, information about a word's part of speech is likely to be employed when disambiguating its word sense. Likewise, lower-level syntax is an important first step for semantic composition and natural language understanding tasks. In this work, we investigate how the (hierarchical) dependencies between different linguistic categories are encoded in LLMs, focusing on syntax. We ask a set of novel questions: (1) how related syntactic categories (e.g. different parts of speech (POS), such as noun or verb) are encoded within the models; (2) how syntactic categories at different levels of the linguistic hierarchy (e.g. POS classes and syntactic dependency relations) interact within the model; and (3) whether the observed patterns hold across languages.
To answer these questions, we propose a framework for testing the joint encoding of linguistic categories in LLMs. Specifically, we investigate to what extent the information about distinct linguistic categories is shared in the parameters of the model. Our approach (see Fig. 1) is inspired by the work of Choenni and Shutova (2022), who studied how LLMs share information across languages. We employ their cross-neutralization method, but extend it to study how information is encoded across linguistic categories in two syntactic tasks: part-of-speech (POS) tagging and dependency (DEP) labeling. In short, we test whether removing information on one syntactic category results in a failure to process another, related one. For instance, would removing a representation of the verb class hurt the model's ability to identify the verbs' syntactic dependencies, while still succeeding in this task on other categories, e.g. nouns? This provides insight into whether these categories are jointly encoded.

[Figure 1 here. Step 1: Representation classification — extract representations from the n-th layer of a pretrained encoder (e.g. for "The quick brown fox …") and classify them with the probing network. Step 2: Centroid estimation — group the representations and aggregate them into class-representative centroids. Step 3: Cross-neutralization — subtract a "neutralizer" centroid from incoming representations before re-classification; drops in performance on a particular target class suggest that the neutralizer and target are jointly encoded.]
We study joint encoding patterns within both a monolingual and multilingual LLM, namely RoBERTa (Liu et al., 2019b) and XLM-R (Conneau et al., 2020). For the latter, we focus our analysis on English, Greek, and Italian, investigating to what extent multilingual models encode information for POS tags and DEP relations in a language-agnostic manner. We find that POS tags that are linguistically related are indeed jointly encoded by both the monolingual and multilingual models, and observe similar joint encoding patterns across all three languages. Moreover, we obtain further evidence of both language-agnostic and language-specific encoding within multilingual models, given that representations specific to POS tags and DEP relations can be approximately transferred across languages without a substantial impact on performance. Lastly, we find evidence of joint encoding between related POS tags and DEP relations, suggesting information sharing across tasks at different levels of the linguistic hierarchy.

Background and related work
A wide range of methods have emerged to study the inner workings of neural networks (Belinkov and Glass, 2019; Madsen et al., 2022). Our approach is situated within the field of probing (Conneau et al., 2018), which typically involves the use of a simple auxiliary "probing" network trained to perform a specific task on the representations from a pretrained LLM. By keeping the probing network shallow and the pretrained model frozen, the predictions can be used to identify which information was already captured by the pretrained LLM. Shi et al. (2016) were the first to introduce the idea of probing, training a logistic regression classifier on top of the embeddings from two neural machine translation encoders to study the syntactic information learned by them. Similarly, Adi et al. (2017) probe sentence representations for sentence length and word order by training auxiliary classifiers to predict these exact properties from the representations. Building on this work, Conneau et al. (2018) introduce a suite of probing tasks. Tenney et al. (2019) apply these to BERT's hidden states, quantifying where linguistic information is captured within the network. They find that the model follows the hierarchy of the traditional NLP pipeline, with lower-level, syntactic features appearing earlier than more complex semantic roles and structures. Ravishankar et al. (2019) bring the same analysis to the multilingual setting. Dalvi et al. (2021) find more supporting evidence of this through an unsupervised approach. Manning et al. (2020) probe the attention mechanism in BERT for correspondence with linguistic phenomena such as syntactic dependencies and coreference. They also make use of structural probes (Hewitt and Manning, 2019), finding that the representations from BERT capture parse tree structure.

Previous work on probing
Syntactic knowledge in BERT Various approaches have been used to study what knowledge LLMs capture about syntax (Rogers et al., 2021). Previous probing studies have shown that BERT embeddings encode information about parts of speech, syntactic chunks, constituent and dependency labeling (Tenney et al., 2018; Liu et al., 2019a), as well as a broader set of syntactic features, such as tree depth and tense (Conneau et al., 2018). Yet, while it has become evident that syntax is captured to some extent, less is known about where and how this information is acquired. Some works suggest that syntactic information is encoded in the token representations, as these can be used to successfully reconstruct syntactic trees (Vilares et al., 2020; Kim et al., 2019; Hewitt and Manning, 2019). Others have instead studied syntactic knowledge at the level of attention heads, and show that particular heads specialize in specific aspects of syntax (Htut et al., 2019; Clark et al., 2019). However, Htut et al. (2019) find that these heads cannot recover syntactic trees, suggesting that attention heads do not reflect the full extent of syntactic knowledge that these models learn.
Information sharing in LLMs Multiple works study information sharing in LLMs (Blevins et al., 2018; Şahin et al., 2020), with most focusing on cross-lingual sharing (Chi et al., 2020; Shapiro et al., 2021; Stanczak et al., 2022). Libovický et al. (2020) propose a simple method that removes language-specific information from model representations by capturing it through the mean of a set of representations from the respective language. Ravfogel et al. (2020) iteratively remove gender information from word embeddings through projection onto the nullspace (INLP) in order to mitigate bias in biography classification (De-Arteaga et al., 2019). Elazar et al. (2021) build on this with amnesic probing, which uses INLP to remove a linguistic property from the representations and measures the effect of its removal on the model's behavior.

POS tagging and dependency labeling
We study the joint encoding of POS categories and syntactic dependencies. Given a sentence, POS tagging is the task of mapping each word to the appropriate part of speech. For instance, in "The sailor dogs the hatch.", "dogs" is a verb, while in "He chases the dogs.", "dogs" is a noun. Dependency labeling is the higher-level task of labeling the dependency relation between a "head" word and a "dependent" (or "parent" and "child"). For instance, in "That is a black car.", "black" is an adjectival modifier of "car".

Probing
To study the representations from a given encoder network, we train a shallow classifier, or "probing classifier" on a probing task, using the representations from the encoder as input.By keeping the encoder frozen, we can ascribe all the learning to the relatively inexpressive probing classifier, allowing us to probe whether the representations from the encoder contain the information necessary to solve the task at hand.
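To make this concrete, the extraction step can be sketched as follows, assuming the Hugging Face transformers API; the model and layer choice here are illustrative rather than the exact experimental configuration:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Load a pretrained encoder and freeze it: all learning is left to the probe.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base", output_hidden_states=True)
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

inputs = tokenizer("The quick brown fox jumps.", return_tensors="pt")
with torch.no_grad():
    out = encoder(**inputs)

# out.hidden_states[l] holds the token representations from layer l
# (index 0 is the embedding layer); these become the probe's inputs.
layer = 6  # illustrative; the layer is selected per task in Section 3.5
token_reprs = out.hidden_states[layer].squeeze(0)  # (num_tokens, 768)
```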

Probing tasks
We focus on POS tagging and DEP labeling as our probing tasks. In both cases, our encoders take sentences as input, producing token-level embeddings at each layer. We extract the token-level embeddings from a given layer and pool them into word embeddings (detailed in Section 3.5). For POS tagging, the probing classifier receives the word embeddings and is trained to classify them across 17 categories. For DEP labeling, we aim to label the dependency between child and parent. Thus, we pair each word in the sentence with its head as labeled in the dataset and concatenate their representations¹. The probing classifier takes this concatenation as input and is trained to classify the dependency relation between the corresponding words across 36 categories.
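The input construction for both tasks can be sketched as below; the helper names are ours, and we assume token representations and word-to-token alignments are already available:

```python
import torch

def pool_word_embeddings(token_reprs, word_to_tokens, method="mean"):
    """Pool sub-word token representations into one vector per word."""
    words = []
    for token_ids in word_to_tokens:  # e.g. [[0], [1, 2], [3], ...]
        span = token_reprs[token_ids]
        if method == "first":
            words.append(span[0])
        elif method == "max":
            words.append(span.max(dim=0).values)
        else:  # "mean"
            words.append(span.mean(dim=0))
    return torch.stack(words)  # (num_words, hidden)

def make_dep_inputs(word_reprs, heads):
    """Pair each dependent with its annotated head and concatenate the two
    word representations, forming the DEP probe's inputs ([head ; dep])."""
    pairs = [torch.cat([word_reprs[head], word_reprs[child]])
             for child, head in enumerate(heads)]
    return torch.stack(pairs)  # (num_words, 2 * hidden)
```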

Datasets
We use the Universal Dependencies treebanks (Nivre et al., 2020), manually annotated for POS tagging and DEP labeling. We choose the GUM (Zeldes, 2017), VIT (Delmonte et al., 2017), and GDT (Prokopidis and Papageorgiou, 2017) datasets for English, Italian, and Greek, respectively. All datasets contain word-level annotations with a total of 17 POS tags and 36 DEP relations shared across languages².

Models
Encoders We use RoBERTa (Liu et al., 2019b) and XLM-R (Conneau et al., 2020) as the encoders we probe. RoBERTa is an optimized version of the encoder-only transformer BERT (Devlin et al., 2019), and XLM-R is its multilingual variant, supporting 100 languages. Unlike BERT, both models are trained exclusively on the masked language modeling (MLM) objective. We use the "base" version of each model, comprising 12 encoder layers, 12 attention heads, and an embedding size of 768, with a total parameter count of ∼125 million³.

¹ Different methods, such as adding the mean vector or the absolute difference of the pair, resulted in similar performance.
² We summarize the split sizes and an overview of the preprocessing pipeline for all three languages in Appendix A.
Probing classifier For our probing classifier, we use a multilayer perceptron (MLP) consisting of two linear layers with a tanh activation function in between. We train our probing classifier using the AdamW optimizer (Loshchilov and Hutter, 2019) with a learning rate of 10⁻³ and weight decay of 10⁻², and employ an early stopping criterion based on the validation set. We report the classification accuracy for the best configurations in Table 1.
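A sketch of this probe and its optimizer, as described above; the hidden width is our assumption, since it is not specified here:

```python
import torch
import torch.nn as nn

class ProbingClassifier(nn.Module):
    """Two linear layers with a tanh activation in between."""
    def __init__(self, input_dim, hidden_dim, num_classes):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, x):
        return self.net(x)

# POS probe over 768-d word embeddings (17 classes); for DEP labeling the
# input would be the 1536-d concatenation and there would be 36 classes.
probe = ProbingClassifier(input_dim=768, hidden_dim=256, num_classes=17)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3, weight_decay=1e-2)
```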

Probing Classifier Selectivity Baseline
A sufficiently expressive probing classifier may be capable of learning any task given representations as input. This would render its probing functionality obsolete, since the information localized by the probing classifier could no longer be attributed solely to the input representations. To ensure that our probing classifier is not overly expressive, we conduct a baseline check as outlined by Hewitt and Liang (2019). Here, we construct an analogous control task for POS tagging, where we randomly assign each word in the training set to one of N arbitrary labels, where N is the number of POS tag labels (i.e. 17). A good probe should have high selectivity, computed as the difference between the accuracy on the probing task and the accuracy on the control task. The more selective the probe, the more likely it is that the information it accesses is specific to the input representations. If the probe performs equally well on the control task, it suggests that the probe may be leveraging some inherent properties of the model architecture or the data distribution, rather than specifically extracting useful information from the representations. We train a new checkpoint of our probing classifier on the control task, ending training after the same number of update steps performed when training on the POS tagging task, and report the outcome in Section 4.
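A sketch of the control-task construction; we assume, as in Hewitt and Liang (2019), that each word type is mapped consistently to one random label:

```python
import random

def make_control_labels(train_words, num_labels=17, seed=0):
    """Assign each word type a fixed, randomly chosen label, yielding a
    control task with the same label space as POS tagging."""
    rng = random.Random(seed)
    label_of = {}
    control_labels = []
    for word in train_words:
        if word not in label_of:
            label_of[word] = rng.randrange(num_labels)
        control_labels.append(label_of[word])
    return control_labels

# Selectivity = accuracy on the probing task - accuracy on the control task.
```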

Centroid estimation
To study the joint encoding of linguistic categories (e.g. nouns, verbs, etc.) in our encoders, we use the probing classifier to localize the subspace of the encoder that corresponds to each of these categories by obtaining their mean vector representation, or "centroid", similarly to the language centroids of Libovický et al. (2020). The intuition is that representations that repeatedly result in predictions of a particular class will encode information specific to that class, which can be captured through aggregation such as averaging. For POS tagging, the centroid $u_t$ of each target POS class $t$ is defined as

$$u_t = \frac{1}{|V_t^l|} \sum_{v \in V_t^l} v,$$

where $V_t^l$ is the set of the representations of the words in our validation set that were predicted as tag $t$ when probing the $l$-th layer of our encoder. For DEP labeling, each prediction depends on both the representation of the head $h$ and the dependent $d$ of the DEP relation. The centroid of each target DEP relation $t$ is computed as

$$u_t = \frac{1}{|P_t^l|} \sum_{(h, d) \in P_t^l} [h \,;\, d],$$

where $P_t^l$ is the set of (head, dependent) representation pairs that were predicted as DEP relation $t$ when probing the $l$-th layer of our encoder, and $[h \,;\, d]$ is the concatenation of the representations.⁴
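In code, centroid estimation reduces to a masked average over predicted classes; a short sketch with illustrative names:

```python
import torch

def estimate_centroids(reprs, predictions, num_classes):
    """Average all representations predicted as class t (on the validation
    set) to obtain its centroid u_t."""
    centroids = {}
    for t in range(num_classes):
        mask = predictions == t
        if mask.any():
            centroids[t] = reprs[mask].mean(dim=0)
    return centroids

# For DEP labeling, `reprs` holds the [head ; dependent] concatenations,
# so the centroids live in the concatenated space.
```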

Cross-neutralization
To study whether different linguistic categories are jointly encoded within LLMs, we use a cross-neutralization method. We first evaluate the class-specific accuracy of our probing classifier on the original representations of our encoder. We then take a neutralizer centroid estimated as outlined in Section 3.3, and subtract it from the encoder representations corresponding to some target category. For instance, taking verbs as neutralizer and nouns as target, we subtract all verb information from the noun representations. We then repeat probing classification on these "neutralized" encoder representations. The intuition behind this is that if the encoders were to represent linguistic categories in independent ways, we expect the performance to deteriorate only for the linguistic category that we use for computing the centroids, i.e. removing verb information only negatively affects the representations of verbs. However, in the case of joint encoding, we expect to see substantial performance drops for other target categories as well, suggesting that some features of the encoder play a role in encoding both categories. As a baseline, we additionally experiment with subtracting random vectors (rather than centroids) from the encoder representations. If the centroids are indeed responsible for the drop in performance, then we should not observe similar performance drops when subtracting random vectors. We report the outcome of this experiment in Section 4.
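A sketch of the neutralization step and the random-vector baseline; matching the random vector's norm to the centroid's is our assumption, made so that the comparison is on equal footing:

```python
import torch

@torch.no_grad()
def class_accuracy_after_neutralization(probe, reprs, labels,
                                        neutralizer, target_class):
    """Subtract a neutralizer centroid from the representations of a
    target class and measure the probe's accuracy on that class."""
    mask = labels == target_class
    preds = probe(reprs[mask] - neutralizer).argmax(dim=-1)
    return (preds == labels[mask]).float().mean().item()

def random_baseline_vector(centroid):
    """Random vector scaled to the centroid's norm (our assumption)."""
    v = torch.randn_like(centroid)
    return v * (centroid.norm() / v.norm())

# Reported quantity: relative change in class accuracy,
# (acc_neutralized - acc_original) / acc_original.
```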

Choosing a configuration for probing
Understanding where to best localize the task-specific information within the models is not trivial. For instance, while Tenney et al. (2019) show that syntax information is more localized in lower layers, de Vries et al. (2020) demonstrate that such findings do not automatically port to multilingual models. Moreover, Del and Fishel (2022) show that for multilingual models, the pooling method used can have important effects on the knowledge that is captured. Thus, we first study which layer l and pooling function maximize the amount of information that is captured in our centroids. To this end, we employ the notion of self-neutralization, the case in which the same linguistic category is both the target and the neutralizer.
We hypothesize that a large drop in accuracy after self-neutralization is indicative of a high amount of relevant information being captured by the centroid. Such a drop is only meaningful relative to a high original accuracy. We thus select the top quartile of our configurations in terms of original-encoder accuracy, and from this subset use the configuration with the highest relative drop due to self-neutralization. We consider representations from layers l ∈ {1, 3, 6, 9, 12} and pool subwords by either taking the first token in a word, or by max-pooling or mean-pooling over the token representations. For each encoder, language, and task, we separately find the optimal probing configuration and report them in Table 2. We make use of the validation set for this portion of the methodology.
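The selection rule can be sketched as follows; the configurations and accuracy values are invented for illustration:

```python
def select_configuration(results):
    """Keep the top quartile of configurations by original accuracy, then
    pick the one with the largest relative self-neutralization drop.
    `results` maps config -> (original_acc, self_neutralized_acc)."""
    ranked = sorted(results.items(), key=lambda kv: kv[1][0], reverse=True)
    top_quartile = ranked[: max(1, len(ranked) // 4)]

    def relative_drop(item):
        orig, neut = item[1]
        return (orig - neut) / orig

    return max(top_quartile, key=relative_drop)[0]

# (layer, pooling) -> (original accuracy, accuracy after self-neutralization)
configs = {(6, "mean"): (0.96, 0.40), (12, "first"): (0.93, 0.55),
           (9, "mean"): (0.95, 0.42), (3, "max"): (0.90, 0.70)}
print(select_configuration(configs))  # -> (6, 'mean')
```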

Joint encoding of POS tags
We begin by examining whether representations of different POS tags share information. For each POS tag, we compute its centroid and re-classify representations neutralized by the centroid. With regard to our baselines, we find that subtracting random vectors leaves the performance relatively unchanged, with no evidence of systematic drops. For our selectivity baseline, we find that our probing classifier achieves an average accuracy of 96% on the POS tagging task and 63% on the control task, giving a selectivity value of 33%. This is in line with probes for similar tasks accepted for their validity in the literature (Hewitt and Liang, 2019).
To facilitate visualization and discussion, we select an illustrative subset⁵ of POS tags that covers both open- and closed-class words, as we expect potentially different patterns between these two groups. For instance, we expect neutralization to be symmetric (neutralizer and target can be swapped for similar results) in the case of open-class words, as we hypothesize that joint encoding here is dictated by co-occurrence patterns. Conversely, we expect asymmetric neutralization in the case of closed-class neutralizers, supposing that joint encoding here may be dictated by functional dependencies, which may not necessarily be reciprocal.

⁵ See Appendix F for results between all classes.

Results
We find that linguistically related categories generally tend to be jointly encoded in RoBERTa (see Fig. 2). For instance, the relative decrease of VERB classification accuracy by 53% when neutralizing using auxiliaries (AUX) suggests information sharing. This may be explained by the fact that auxiliaries themselves are verbs (e.g. "has done"), and that they functionally modify verbs. We note that the information flow in this pairing is asymmetric, as evidenced by the much smaller 11% drop in the reciprocal case. This aligns with our hypothesis that joint encoding for closed-class neutralizers may be explained by their functional role, which is often not reciprocal (verbs will generally not modify auxiliaries).
While we observe further evidence of linguistically related categories being jointly encoded (e.g. VERB and ADV), this is not ubiquitous. For instance, joint encoding is not found when neutralizing nouns (NOUN) with determiners (DET). Here, the relative percentage decrease is only 2%, despite determiners acting as modifiers for nouns. In general, we do not find distinct patterns of joint encoding unique to closed-class and open-class words. Our hypothesis of symmetry due to co-occurrence is at best supported by the ADV-ADJ pairing, with relative drops of -21% and -29%. In most other pairings information sharing appears to be asymmetric, suggesting functional dependence or another root cause as explanation.
Moreover, we note that open-class words tend to be "adept" at cross-neutralizing, sharing information as neutralizers with many other tags, as shown by the NOUN, ADJ, and VERB neutralizer rows successfully neutralizing several columns. However, we note similar patterns when neutralizing with the closed-class DET. Overall, we find evidence of joint encoding of POS tags in RoBERTa, but further work is necessary to establish which mechanisms lead to specific pairs sharing information or not.

Joint encoding in multilingual models
We now investigate how our findings transfer to other languages. We examine to what extent multilingual models demonstrate language-specific or language-agnostic information-sharing behavior. Additionally, we verify the consistency of our method across models, as we can directly compare the results between RoBERTa and XLM-R in English, and also across languages, as we can compare the effects of cross-neutralization on English, Italian, and Greek. Our setup is the same as in Section 4, replacing RoBERTa with XLM-R and repeating the experiment in the three languages.
Monolingual vs. multilingual model When comparing RoBERTa and XLM-R on the English dataset, we observe similar patterns across POS tags (see Fig. 2 and the left-most column of Fig. 3). For instance, for both models, we see that nouns have a neutralization effect on adjectives, numerals, and adverbs. This suggests that, given a language, multilingual models jointly encode POS tags similarly to their monolingual counterparts.
Joint encoding within languages We find that XLM-R exhibits similar sharing patterns within different languages (see Fig. 3). For instance, we see that adjectives consistently neutralize adverbs across all languages (-31%, -44% and -40% in English, Italian and Greek). This suggests that the representations responsible for predicting these tags contain some language-agnostic information. However, we also observe some language-specific behavior. For instance, VERB categories cross-neutralize adjectives (ADJ) more prominently in Italian (-63%) than in English (-26%) and Greek (-18%). This may be explained by the postnominal use of adjectives in Italian, which may overlap more closely with the positioning of verbs, particularly in past participle forms (Cinque, 2010), in contrast to the strictly prenominal use in the other languages. The results suggest that XLM-R discovers and leverages language-agnostic information when possible, while also learning language-specific information when necessary.

Information sharing across languages
Our findings for individual languages on XLM-R raise the question of whether information about linguistic categories from two different languages can also be jointly encoded. To further probe for language-agnostic joint encoding of linguistic categories, we extend our experiment to test whether information between two linguistic categories from two different languages is jointly encoded. We posit that if categories share information across languages, this is evidence that the models learn at least partially language-independent representations for these categories. We test this hypothesis by cross-neutralizing every linguistic category in a language A with another category from a language B, e.g. neutralizing all Italian POS tags with English nouns. Aside from probing for language-agnosticism, this allows for further exploration of joint encoding of linguistic categories. To enable consistent neutralization, we use the same layer and pooling function configuration (see Section 3.5) for both neutralizer and target: in cases where their optimal configurations differ, we use the neutralizer language's configuration for both.
Results Fig. 5 shows the results of cross-neutralizing Italian POS tags using the corresponding English centroids from each linguistic category. On the diagonal, neutralizer and target correspond to the same linguistic category, but in different languages. Accuracy drops substantially here (∼70% on avg.). This is evidence of language-agnostic encoding of these linguistic categories, as centroids in one language neutralizing targets in another language points to information in the representations being shared across both languages.
We also observe joint encoding of different linguistic categories, for instance with English NOUNs neutralizing Italian numerals (NUM) (-48%), or English adjectives (ADJ) neutralizing Italian adverbs (ADV) (-47%). An alternative explanation to language-agnostic representation learning in XLM-R may be that instances of code-switching or language borrowing in the pretraining data may have encouraged XLM-R to jointly encode different linguistic categories across languages. Future work may investigate why joint encoding occurs across different categories of different languages. We repeat the experiment using English neutralizers on the Greek corpus, and observe similar but milder trends. This may be due to English being phylogenetically closer to Italian than to Greek (Chang et al., 2015), leading to a higher degree of information sharing between the former two languages, but further work in determining which language pairs share more information is needed to draw solid conclusions. For more detailed plots on cross-lingual cross-neutralization, we refer the reader to Appendix F, where we present results from all combinations of neutralizer and target across the three languages.

Information sharing across tasks
To study whether information is shared across the linguistic hierarchy, we cross-neutralize between tasks, using POS neutralizers and dependency relation targets. Since a given parent may have several dependents while a child will only have one parent, we posit that most of the information for the child-parent dependency is captured by the child. We therefore subtract the POS centroids from the child representations in the child-parent concatenations for dependency labeling. We also complete the complementary experiment of subtracting the child portion of the dependency relation centroids from the POS embeddings for POS tagging, to test whether joint encoding happens in both directions.
As in Section 5, we use the neutralizer configuration for both neutralizer and target for consistency.
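A sketch of the cross-task step; since the text describes the pairs both as (head, dependent) and as child-parent concatenations, the position of the child half is left as a parameter:

```python
import torch

def neutralize_dep_child(dep_pairs, pos_centroid, child_first=False):
    """Subtract a POS centroid from the child half of each concatenated
    dependency pair used for DEP labeling."""
    hidden = dep_pairs.shape[-1] // 2
    neutralized = dep_pairs.clone()
    if child_first:
        neutralized[:, :hidden] -= pos_centroid
    else:  # [head ; dependent] ordering, as in Section 3.3
        neutralized[:, hidden:] -= pos_centroid
    return neutralized
```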
Results Fig. 4 (top) shows the results when neutralizing RoBERTa representations for DEP labels using POS centroids. We find further evidence of linguistically related units being jointly encoded. For instance, adverb POS tags (ADV) neutralize adverbial modifier (ADVMOD) dependency labels, as can be seen by the 97% relative drop in accuracy.

To study whether our findings generalize beyond English and RoBERTa, we repeat the experiment on XLM-R (see Fig. 6). Large drops in accuracy on the diagonal suggest that many information-sharing patterns hold across languages and models, consolidating our findings from Sections 4.1 and 5. We also find language-specific results: pronouns (PRON) neutralize nominal subjects (NSUBJ) only in English (-56%), having surprisingly little effect on the other languages (-5% and -2%). On occasion, we find evidence of bidirectional sharing between different levels of the linguistic hierarchy. In Fig. 4 (bottom), we note that certain pairs highlighted by the diagonals, such as adpositions (POS: ADP) and case relations (DEP: CASE), present evidence of joint encoding both when neutralizing DEP with POS centroids (-55%) and when neutralizing POS with DEP centroids (-84%). This suggests that the hierarchical nature of the representations learned by these LLMs is not necessarily unidirectional: information appears to be shared both upwards and downwards in the linguistic hierarchy.

Conclusion
We study information sharing between linguistic categories in LLMs, finding evidence of joint encoding between pairs of related POS tag classes. By applying our method to XLM-R, we find evidence of joint encoding across languages, showing that our findings hold in both the monolingual (RoBERTa) and multilingual (XLM-R) case. Lastly, we cross-neutralize between POS tagging and DEP labeling, and find evidence of information sharing across the linguistic hierarchy. We test specifically for joint encoding of different syntactic categories that rely on the same linguistic concept, such as an "amod" dependency implying some relationship between the adjective and noun POS tags. However, our method could be extended to test for joint encoding of categories in other tasks that can be expected to share information. This in turn could be informative on whether an LLM indeed captures this shared information between different tasks. A more complete map of this knowledge could help develop better models in many application scenarios: for instance, in a multitask learning setting, where negative interference between parameter updates from different tasks is known to hamper performance (Zhao et al., 2018), or in a lifelong learning setting, where learning a new task often leads to (catastrophic) forgetting of the previously learned ones (Biesialska et al., 2020). Future work may additionally seek to explore more languages, particularly low-resource and typologically distant languages, or consider including higher-level semantic tasks from the classical NLP pipeline, such as semantic role labeling, word sense disambiguation and coreference resolution. Further work is also necessary for understanding why certain representations share information while others do not.

Limitations
In this work, we limit our experiments to an encoder-only architecture. Further research could be carried out on encoder-decoder or decoder-only architectures.
Additionally, our experiments only examine three languages. Future work may therefore aim to extend the experiments to include more languages, particularly low-resource languages. Similarly, our discussion of the linguistic hierarchy is limited to two tasks from the early stages of the classical NLP pipeline. As mentioned, further work may extend the experiments to higher-level semantic tasks.
Lastly, the lack of cross-neutralization effects between some linguistic categories does not necessarily indicate the absence of joint encoding, but merely that our representations were not sufficient to prove its existence.

A Dataset pre-processing
Table 3 presents the train, validation and test split sizes (in terms of number of sentences) for each language considered in our work.
The sentences in the corpora from the Universal Dependencies framework are already tokenized to the word level and stored as lists of words in the tokens field. However, since we use sub-word tokenizers, namely the Byte Pair Encoding (Sennrich et al., 2016) and SentencePiece (Kudo and Richardson, 2018) tokenizers, we further split the words into their sub-word tokens. Depending on the task, we also include either the upos field, which is a list of integers corresponding to one of the 17 universal POS tags available, or the head and deprel fields, which contain the head and one of the 36 dependency relations for dependency labeling⁷. It should be noted that we only keep the language-independent relations, as some of them appear only with a language-specific modifier, and including them would make comparisons across languages less straightforward⁸. Furthermore, upon inspecting the datasets we observed that the annotators had split contractions into their parts and included them next to the original contraction for the Italian and Greek corpora. However, ground-truth labels were only provided for the sub-words, with the compound words annotated as a special class "_". Hence, we filtered out the compound words from these datasets and retained their sub-parts. In addition, for dependency labeling, we ignored words with the root dependency label, as they have no head and their prediction is trivial.
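A sketch of this filtering, under one plausible reading of the steps above (in particular, treating relations carrying a language-specific subtype, e.g. "obl:agent", as excluded):

```python
def keep_ud_word(upos, deprel):
    """Return True if a UD word should be kept for DEP labeling."""
    if upos == "_":          # contraction placeholder; keep only sub-parts
        return False
    if deprel == "root":     # no head; prediction is trivial
        return False
    if ":" in deprel:        # language-specific subtype
        return False
    return True

def filter_ud_sentence(tokens, upos_tags, heads, deprels):
    """Keep aligned (token, upos, head, deprel) tuples that pass the filter."""
    return [(t, u, h, d)
            for t, u, h, d in zip(tokens, upos_tags, heads, deprels)
            if keep_ud_word(u, d)]
```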

B Scaling to larger models
We repeat the experiments from Sections 4 & 5 to verify whether they hold across different model sizes. More specifically, we scale the English POS tag cross-neutralization from RoBERTa-base to RoBERTa-large (Fig. 10a), and from XLM-R-base to XLM-R-large (Fig. 10b) and XLM-R-XL (Fig. 10c). We notice that similar patterns occur across all tested model sizes (e.g. NOUN neutralizing ADV). The effect appears to be generally less pronounced for larger models, although we also found instances where the drop in accuracy is the same or higher.

C Golden and Predicted Centroid Similarity
We present the cosine similarity between golden and predicted centroids for the POS tagging setup in Table 4, as mentioned in Section 3.3.

D Self-neutralization visualization
Figs. 8 and 7 show the decrease in accuracy when self-neutralizing on the POS tagging and dependency labeling tasks for RoBERTa and XLM-R, respectively.

E Engineering Logistics
We rely on PyTorch Lightning (Falcon and The PyTorch Lightning team, 2019), Hugging Face Datasets (Lhoest et al., 2021), and Transformers (Wolf et al., 2020) for our implementation. We run our experiments on an NVIDIA A100 GPU, with each training run taking approximately three minutes. Our code is available at github.com/thesofakillers/infoshare.

Figure 17: Relative change in accuracy when cross-neutralizing XLM-R POS embeddings using XLM-R DEP centroids, in English, Italian, and Greek. Some languages have fewer columns or rows than others because the particular category for the missing row/column is not in their dataset.

Figure 18: Relative change in accuracy when cross-neutralizing XLM-R DEP embeddings using XLM-R POS centroids, in English, Italian, and Greek. Some languages have fewer columns or rows than others because the particular category for the missing row/column is not in their dataset.

Figure 1: A diagram illustrating the three steps of our method: 1) representation classification; 2) centroid estimation; 3) cross-neutralization. The complete methodology is outlined in Section 3.

Figure 3: Relative change in accuracy when cross-neutralizing POS tags using embeddings from XLM-R in (left to right) English, Italian, and Greek, labeled by their Universal Dependencies treebank.

Figure 4: Relative change in accuracy when cross-neutralizing RoBERTa DEP representations using RoBERTa POS centroids (top) and vice versa (bottom)⁶.

Figure 5: Relative change in accuracy when cross-neutralizing XLM-R embeddings in Italian using centroids from English.

Figure 6: Relative change in accuracy for a sample of dependency relations when cross-neutralizing XLM-R DEP embeddings using XLM-R POS centroids in (left to right) English, Italian, and Greek.
[Panel sub-captions: POS tagging accuracy decrease, and dependency labeling accuracy decrease, using different WordPiece pooling methods ("aggregation" in the plots).]

Figure 7: Decrease in performance for XLM-R when self-neutralizing in the POS tagging (left) and dependency labeling (right) tasks, using embeddings extracted from different layers and setup configurations, for each of the English (en_gum), Italian (it_vit), and Greek (el_gdt) treebanks. For dependency labeling, we use the best child-head concatenation mode (ONLY) based on the results we acquired with RoBERTa, as shown in Fig. 8b.
[Panel sub-caption: accuracy decrease for dependency labeling when using different WordPiece pooling methods ("aggregation" in the plot) and child-head concatenation configurations.]

Figure 8: Decrease in performance for RoBERTa when self-neutralizing in the POS tagging (top) and dependency labeling (bottom) tasks, using embeddings extracted from different layers and setup configurations.

Figure 9: Relative change in accuracy when cross-neutralizing POS tags using centroids from RoBERTa in English.

Figure 10: Relative change in accuracy when cross-neutralizing English POS tags at larger model sizes: (a) RoBERTa-large, (b) XLM-R-large, and (c) XLM-R-XL (see Appendix B).

Table 4: Cosine similarities between golden and predicted centroids for the RoBERTa POS tagging setup.