Same Neurons, Different Languages: Probing Morphosyntax in Multilingual Pre-trained Models

The success of multilingual pre-trained models is underpinned by their ability to learn representations shared by multiple languages even in the absence of any explicit supervision. However, it remains unclear how these models learn to generalise across languages. In this work, we conjecture that multilingual pre-trained models can derive language-universal abstractions about grammar. In particular, we investigate whether morphosyntactic information is encoded in the same subset of neurons in different languages. We conduct the first large-scale empirical study over 43 languages and 14 morphosyntactic categories with a state-of-the-art neuron-level probe. Our findings show that the cross-lingual overlap between neurons is significant, but its extent may vary across categories and depends on language proximity and pre-training data size.


Introduction
Massively multilingual pre-trained models (Devlin et al., 2019; Conneau et al., 2020; Liu et al., 2020; Xue et al., 2021, inter alia) display an impressive ability to transfer knowledge between languages as well as to perform zero-shot learning (Pires et al., 2019; Wu and Dredze, 2019; Nooralahzadeh et al., 2020; Hardalov et al., 2022, inter alia). Nevertheless, it remains unclear how pre-trained models actually manage to learn multilingual representations despite the lack of an explicit signal through parallel texts. Hitherto, many have speculated that the overlap of sub-words between cognates in related languages plays a key role in the process of multilingual generalisation (Wu and Dredze, 2019; Cao et al., 2020; Pires et al., 2019; Abend et al., 2015; Vulić et al., 2020). In this work, we offer a complementary hypothesis to explain the multilingual abilities of various pre-trained models; namely, that they implicitly align morphosyntactic markers that fulfil a similar grammatical function across languages, even in the absence of any lexical overlap. More concretely, we conjecture that they employ the same subset of neurons to encode the same morphosyntactic information (such as gender for nouns and mood for verbs). To test this hypothesis, we employ Stańczak et al.'s (2022) latent variable probe to identify the relevant subset of neurons in each language and then measure their cross-lingual overlap.
We experiment with two multilingual pre-trained models, m-BERT (Devlin et al., 2019) and XLM-R (Conneau et al., 2020), probing them for morphosyntactic information in 43 languages from Universal Dependencies (Nivre et al., 2017). Based on our results, we argue that pre-trained models do indeed develop a cross-lingually entangled representation of morphosyntax. We further note that, as the number of values of a morphosyntactic category increases, cross-lingual alignment decreases. Finally, we find that language pairs with high proximity (in the same genus or with similar typological features) and with vast amounts of pre-training data tend to exhibit more overlap between neurons. The same factors are known to also affect the empirical performance of zero-shot cross-lingual transfer (Wu and Dredze, 2019), which suggests a connection between neuron overlap and transfer abilities.

Intrinsic Probing
Intrinsic probing aims to determine exactly which dimensions in a representation, e.g., those given by m-BERT, encode a particular linguistic property (Dalvi et al., 2019; Torroba Hennigen et al., 2020). Formally, let Π be the inventory of values that some morphosyntactic category can take in a particular language, for example Π = {FEM, MSC, NEU} for grammatical gender in Russian. Moreover, let D = {(π^(n), h^(n))}_{n=1}^N be a dataset of labelled embeddings such that π^(n) ∈ Π and h^(n) ∈ ℝ^d, where d is the dimensionality of the representation being considered, e.g., d = 768 for m-BERT. Our goal is to find the subset C ⊆ {1, . . . , d} of k neurons that maximises some informativeness measure.
In this paper, we make use of a latent-variable model recently proposed by Stańczak et al. (2022) for intrinsic probing. The idea is to train a probe with a latent variable C indexing the subset of the dimensions of the representation h that should be used to predict the property π:

p_θ(π | h) = Σ_{C ⊆ {1, . . . , d}} p_θ(π | h, C) p(C)   (1)

where we opt for a uniform prior p(C) and θ are the parameters of the probe. Our goal is to learn the parameters θ. However, since the computation of Eq. (1) requires us to marginalise over all subsets C of {1, . . . , d}, which is intractable, we optimise a variational lower bound to the log-likelihood:

Σ_{n=1}^N log p_θ(π^(n) | h^(n)) ≥ E_{C ∼ q_φ} [ Σ_{n=1}^N log p_θ(π^(n) | h^(n), C) ] + H(q_φ)   (2)

where H(·) stands for the entropy of a distribution, and q_φ(C) is a variational distribution over subsets C. For this paper, we chose q_φ(·) to correspond to a Poisson sampling scheme (Lohr, 2019), which models a subset as being sampled by subjecting each dimension to an independent Bernoulli trial, where φ_i parameterises the probability of sampling any given dimension.

Having trained the probe, all that remains is using it to identify the subset of dimensions that is most informative about the morphosyntactic category we are probing for. We do so by finding the subset C_k of k neurons maximising the posterior which, under the uniform prior p(C), is equivalent to maximising the log-likelihood:

C_k = argmax_{|C| = k} Σ_{n=1}^N log p_θ(π^(n) | h^(n), C)   (3)

In practice, this combinatorial optimisation problem is intractable. Hence, we solve it using greedy search.
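The Poisson sampling scheme and the greedy subset search can be sketched as follows. This is a simplified illustration rather than the authors' implementation: `sample_subset` and `greedy_select` are hypothetical names, and the `log_likelihood` callable stands in for the trained probe's scoring function.

```python
import numpy as np

def sample_subset(phi, rng):
    """Poisson sampling: include each dimension i via an independent
    Bernoulli trial with probability phi[i]."""
    return np.flatnonzero(rng.random(len(phi)) < phi)

def greedy_select(log_likelihood, d, k):
    """Greedily grow a subset of k dimensions out of d, at each step
    adding the dimension that most increases the probe's score.

    `log_likelihood(dims)` is any callable scoring a candidate subset;
    in the paper's setting this would be the probe's log-likelihood."""
    chosen = []
    for _ in range(k):
        remaining = (i for i in range(d) if i not in chosen)
        best = max(remaining, key=lambda i: log_likelihood(chosen + [i]))
        chosen.append(best)
    return chosen
```

With an additive stand-in score, the greedy search recovers the top-k dimensions exactly; with a real probe, a dimension's contribution depends on which others are already selected, which is why greedy search is only an approximation to Eq. (3).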

Experimental Setup
We now describe the experimental methodology of the paper, including the data, training procedure, and statistical testing.
Data. We select 43 treebanks from Universal Dependencies 2.1 (UD; Nivre et al., 2017), which contain sentences annotated with morphosyntactic information in a wide array of languages. Afterwards, we compute contextual representations for every individual word in the treebanks using multilingual BERT (m-BERT-base) and the base and large versions of XLM-RoBERTa (XLM-R-base and XLM-R-large). We then associate each word with its part of speech and morphosyntactic features, which are mapped to the UniMorph schema (Kirov et al., 2018). The selected treebanks include all languages supported by both m-BERT and XLM-R which are available in UD. Rather than adopting the default UD splits, we re-split word representations based on lemmata, ending up with disjoint vocabularies for the train, development, and test sets. This prevents a probe from achieving high performance through sheer memorisation. Moreover, for every category-language pair (e.g., mood-Czech), we discard any lemma with fewer than 20 tokens in its split.

Figure 2: The percentage overlap between the top-50 most informative dimensions in m-BERT for number (top) and XLM-R-large for case (bottom). Statistically significant overlap after Holm-Bonferroni family-wise error correction (Holm, 1979), with α = 0.05, is marked with an orange square.
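The lemma-disjoint re-splitting described above can be sketched as follows. This is a minimal illustration under an assumed input format of (lemma, embedding, label) triples; `lemma_disjoint_split` and the split ratios are our own, not the paper's released preprocessing code.

```python
import random
from collections import defaultdict

def lemma_disjoint_split(examples, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Split (lemma, embedding, label) examples so that the train, dev,
    and test sets share no lemmata, guarding the probe against
    memorising individual words."""
    by_lemma = defaultdict(list)
    for ex in examples:
        by_lemma[ex[0]].append(ex)
    lemmata = sorted(by_lemma)
    random.Random(seed).shuffle(lemmata)
    n = len(lemmata)
    cut1 = int(ratios[0] * n)
    cut2 = int((ratios[0] + ratios[1]) * n)
    parts = (lemmata[:cut1], lemmata[cut1:cut2], lemmata[cut2:])
    return [[ex for lem in part for ex in by_lemma[lem]] for part in parts]
```

Splitting over lemmata rather than tokens means every inflected form of a word lands in the same set, so the probe can only succeed by reading the morphosyntactic signal out of the representation.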
Training. We first train a probe for each morphosyntactic category-language combination with the objective in Eq. (2). In line with established practices in probing, we parameterise p_θ(·) as a linear layer followed by a softmax. Afterwards, we identify the top-k most informative neurons in the last layer of m-BERT, XLM-R-base, and XLM-R-large. Specifically, following Torroba Hennigen et al. (2020), we use the log-likelihood of the probe on the test set as our greedy selection criterion. We single out 50 dimensions for each combination of morphosyntactic category and language, a number chosen as a trade-off between the size of a probe and a tight estimate of the mutual information, based on the results presented in Stańczak et al. (2022). Next, we measure the pairwise overlap in the top-k most informative dimensions between all pairs of languages where a morphosyntactic category is expressed. This results in matrices such as Fig. 2, where the pair-wise percentages of overlapping dimensions are visualised as a heat map.
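Given each language's selected top-k dimensions, the pairwise percentage overlap behind these heat maps reduces to set intersection. A toy sketch (the input format, a mapping from language code to a set of dimension indices, is our assumption):

```python
def overlap_matrix(top_dims, k=50):
    """Pairwise percentage overlap between languages' top-k most
    informative dimension sets."""
    langs = sorted(top_dims)
    return {
        (a, b): 100.0 * len(top_dims[a] & top_dims[b]) / k
        for a in langs for b in langs
    }
```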
Statistical Significance. Suppose that two languages have m ∈ {1, . . . , k} overlapping neurons when considering the top-k selected neurons for each of them. To determine whether such overlap is statistically significant, we compute the probability of an overlap of at least m neurons under the null hypothesis that the sets of neurons are sampled independently at random. We estimate these probabilities with a permutation test. In this paper, we set a threshold of α = 0.05 for significance.
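Under this null hypothesis, both sets are independent uniform random k-subsets of the d dimensions, so the p-value can be estimated by repeated sampling. A sketch of such a test (our own Monte Carlo stand-in; note the null distribution of the overlap is hypergeometric, so it could also be computed in closed form):

```python
import random

def overlap_p_value(m, k, d, trials=10_000, seed=0):
    """Estimate P(|A ∩ B| >= m) where A and B are independent uniform
    random k-subsets of {0, ..., d-1}: the null hypothesis that the
    two languages' neuron sets are unrelated."""
    rng = random.Random(seed)
    dims = list(range(d))
    hits = 0
    for _ in range(trials):
        a = set(rng.sample(dims, k))
        b = set(rng.sample(dims, k))
        if len(a & b) >= m:
            hits += 1
    return hits / trials
```

For k = 50 and d = 768, two random sets already share about three neurons in expectation, which is why a significance test is needed before reading anything into a raw overlap count.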
Family-wise Error Correction. Finally, we use Holm-Bonferroni (Holm, 1979) family-wise error correction. Hence, our threshold is appropriately adjusted for multiple comparisons, which makes incorrectly rejecting the null hypothesis less likely.
In particular, the individual permutation tests are ordered by ascending p-value. The smallest p-value is compared against the Bonferroni-corrected threshold for the full family of t tests, i.e., α/t, where t denotes the number of conducted tests. If even this first test is not significant, the procedure stops; otherwise, the second smallest p-value is compared against the threshold for a family of t − 1 tests, i.e., α/(t − 1), and so on. The procedure stops either at the first non-significant test or after iterating through all p-values. This sequential approach guarantees that the probability of incorrectly rejecting one or more of the null hypotheses is at most α.
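The step-down procedure described above can be written compactly. A minimal sketch (our own helper, not taken from the paper's code):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down correction: return a list of booleans
    marking which hypotheses are rejected at family-wise level alpha."""
    t = len(p_values)
    order = sorted(range(t), key=lambda i: p_values[i])
    rejected = [False] * t
    for rank, i in enumerate(order):
        # rank-th smallest p-value is tested against alpha / (t - rank)
        if p_values[i] <= alpha / (t - rank):
            rejected[i] = True
        else:
            break  # stop at the first non-significant test
    return rejected
```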

Results
We first consider whether multilingual pre-trained models develop a cross-lingually entangled notion of morphosyntax: for this purpose, we measure the overlap between subsets of neurons encoding similar morphosyntactic categories across languages. Further, we debate whether the observed patterns are dependent on various factors, such as morphosyntactic category, language proximity, pretrained model, and pre-training data size.
Neuron Overlap. The matrices of pairwise overlaps for each of the 14 categories, such as Fig. 2 for number and case, are reported in App. B. We expand upon these results in two ways. First, we report the cross-lingual distribution for each category in Fig. 1 for m-BERT and XLM-R-base, and in an equivalent plot comparing XLM-R-base and XLM-R-large in Fig. 3. Second, we calculate how many overlaps are statistically significant out of the total number of pairwise comparisons in Tab. 1. From the above results, it emerges that ≈ 20% of neurons among the top-50 most informative ones overlap on average, but this number may vary dramatically across categories.
Morphosyntactic Categories. Based on Tab. 1, significant overlap is particularly accentuated in specific categories, such as comparison, polarity, and number. However, neurons for other categories such as mood, aspect, and case are shared by only a handful of language pairs despite the high number of comparisons. This finding may be partially explained by the different number of values each category can take. Hence, in Fig. 5a, we test whether there is a correlation between this number and the average cross-lingual overlap. As expected, we generally find negative correlation coefficients, with prominent exceptions being number and person. As the inventory of values of a category grows, cross-lingual alignment becomes harder.
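As a concrete stand-in for this analysis: given each category's number of possible values and its mean cross-lingual overlap, the correlation can be computed as below. The numbers here are toy values, and the use of Pearson correlation is our assumption rather than a detail confirmed by the paper.

```python
import numpy as np

def category_correlation(values_per_cat, mean_overlap_per_cat):
    """Correlate a category's inventory size |Π| with its mean
    cross-lingual neuron overlap across language pairs."""
    cats = sorted(values_per_cat)
    x = np.array([values_per_cat[c] for c in cats], dtype=float)
    y = np.array([mean_overlap_per_cat[c] for c in cats], dtype=float)
    return float(np.corrcoef(x, y)[0, 1])
```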
Language Proximity. Moreover, we investigate whether language proximity, in terms of both language family and typological features, bears any relationship to the neuron overlap for any particular pair. In Fig. 4, we plot pairwise similarities with languages within the same genus (e.g., Baltic) against those outside it. From the distribution of the dots, we can infer that sharing of neurons is more likely to occur between languages in the same genus. This is further corroborated by the language groupings emerging in the matrices of App. B. In Fig. 5b, we also measure the correlation between neuron overlap and the similarity of syntactic typological features based on Littell et al. (2017). While correlation coefficients are mostly positive (with the exception of polarity), we remark that the patterns are strongly influenced by whether a category is typical for a specific genus. For instance, correlation is highest for animacy, a category almost exclusive to Slavic languages in our sample.
Pre-trained Models. Afterwards, we determine whether the three models under consideration reveal different patterns. Comparing m-BERT and XLM-R-base in Fig. 1, we find that, on average, XLM-R-base tends to share more neurons when encoding particular morphosyntactic attributes. Moreover, comparing XLM-R-base to XLM-R-large in Fig. 3 suggests that more neurons are shared in the former than in the latter. Altogether, these results seem to suggest that the presence of additional training data engenders cross-lingual entanglement, whereas increasing model size incentivises morphosyntactic information to be allocated to different subsets of neurons. We conjecture that this may be best viewed through the lens of compression: if model size is a bottleneck, then, to attain good performance across many languages, a model is forced to learn cross-lingual abstractions that can be reused.
Pre-training Data Size. Finally, we assess the effect of pre-training data size on neuron overlap; for data sizes, we rely on the CC-100 statistics reported by Conneau et al. (2020) for XLM-R and on the size of the Wikipedia dataset in TensorFlow Datasets (Abadi et al., 2015) for m-BERT. According to Fig. 5c, their correlation is very high. We explain this phenomenon by the fact that more data yields higher-quality (and, as a consequence, more entangled) multilingual representations.


Conclusions
In this paper, we hypothesise that the ability of multilingual models to generalise across languages results from cross-lingually entangled representations, where the same subsets of neurons encode universal morphosyntactic information. We validate this claim with a large-scale empirical study on 43 languages and 3 models: m-BERT, XLM-R-base, and XLM-R-large. We conclude that the overlap is statistically significant for a notable number of language pairs across the considered categories. However, the extent of the overlap varies across morphosyntactic categories and tends to be lower for categories with large inventories of possible values. Moreover, we find that neuron subsets are shared mostly between languages in the same genus or with similar typological features. Finally, we discover that the overlap for each language grows with its pre-training data size, but decreases in larger model architectures.
Given that this implicit morphosyntactic alignment may affect the transfer capabilities of pre-trained models, we speculate that, in future work, artificially encouraging a tighter neuron overlap might facilitate zero-shot cross-lingual inference for low-resource and typologically distant languages (Zhao et al., 2021).

Ethics Statement
The authors foresee no ethical concerns with the work presented in this paper.