Examining Cross-lingual Contextual Embeddings with Orthogonal Structural Probes

State-of-the-art contextual embeddings are obtained from large language models available only for a few languages. For others, we need to learn representations using a multilingual model. There is an ongoing debate on whether multilingual embeddings can be aligned in a space shared across many languages. The novel Orthogonal Structural Probe (Limisiewicz and Mareček, 2021) allows us to answer this question for specific linguistic features and to learn a projection based only on monolingual annotated datasets. We evaluate syntactic (UD) and lexical (WordNet) structural information encoded in mBERT's contextual representations for nine diverse languages. We observe that for languages closely related to English, no transformation is needed. The evaluated information is encoded in a shared cross-lingual embedding space. For other languages, it is beneficial to apply an orthogonal transformation learned separately for each language. We successfully apply our findings to zero-shot and few-shot cross-lingual parsing.


Introduction
The representations learned by language models have been successfully applied in various NLP tasks. Multilingual pre-training allows utilizing the representations for many languages, including low-resource ones. There is an open discussion about the extent to which contextual embeddings are similar across languages (Søgaard et al., 2018; Hartmann et al., 2019; Vulić et al., 2020). Our work is motivated by two questions: Q1 Is linguistic information uniformly encoded in the representations of various languages? And, if this assumption does not hold: Q2 Is it possible to learn an orthogonal transformation to align the embeddings?
We probe for the syntactic and lexical structures encoded in multilingual embeddings with the new Orthogonal Structural Probes (Limisiewicz and Mareček, 2021). Previously, Chi et al. (2020) employed structural probing (Hewitt and Manning, 2019) to evaluate cross-lingual syntactic information in mBERT and to visualize how it is distributed across languages. The advantage of our approach is learning an orthogonal transformation that maps the embeddings across languages based on monolingual linguistic information: dependency syntax and lexical hypernymy. This new capability allows us to test different probing scenarios. To answer our research questions, we measure how adding assumptions of isomorphism and uniformity of the representations across languages affects probing results.

Related Work
Probing. Probing is a method of evaluating linguistic information encoded in pre-trained NLP models. Usually, a simple classifier for the probing task is trained on the frozen model's representations (Linzen et al., 2016; Belinkov et al., 2017; Blevins et al., 2018). Hewitt and Manning (2019) introduced structural probes that linearly transform contextual embeddings to approximate the topology of dependency trees. Limisiewicz and Mareček (2021) proposed new structural tasks and introduced an orthogonality constraint that allows decomposing projected embeddings into parts correlated with specific linguistic features. Kulmizev et al. (2020) probed different languages to examine what type of syntactic dependency annotation is captured in a language model. Hall Maudslay et al. (2020) modified the loss function, improving syntactic probes' ability to parse.
Cross-lingual embeddings. There is an important branch of research studying the relationships of embeddings across languages. Mikolov et al. (2013) showed that distributions of word vectors in different languages can be aligned in a shared space. Subsequent research analyzed various methods of aligning cross-lingual static embeddings (Faruqui and Dyer, 2014; Artetxe et al., 2016; Smith et al., 2017) and gradually dropped the requirement of parallel data for alignment (Artetxe et al., 2018; Zhang et al., 2017; Lample et al., 2018).
Significant attention has also been devoted to the analysis of the multilingual and contextual embeddings of mBERT (Pires et al., 2019; Libovický et al., 2020). There is no conclusive answer to whether the alignment of such representations is beneficial for cross-lingual transfer. Wang et al. (2019) show that alignment facilitates zero-shot parsing, while the results of Wu and Dredze (2020) for multiple tasks cast doubt on the benefits of alignment.

Method
The Structural Probe (Hewitt and Manning, 2019) is a gradient-optimized linear projection of the contextual word representations produced by a pre-trained neural model (e.g., BERT (Devlin et al., 2019), ELMo (Peters et al., 2018)).
In a Distance Probe, the Euclidean distance between projected word vectors approximates the distance between words in a dependency tree:

$$d_B(h_i, h_j)^2 = \left(B(h_i - h_j)\right)^\top \left(B(h_i - h_j)\right),$$

where $B$ is the Linear Transformation matrix and $h_i$, $h_j$ are the vector representations of words at positions $i$ and $j$.
Another type of probe is the Depth Probe, where the token's depth in a dependency tree is approximated by the squared Euclidean norm of the projected word vector:

$$\|h_i\|_B^2 = \left(Bh_i\right)^\top \left(Bh_i\right).$$

Orthogonal Structural Probes. Limisiewicz and Mareček (2021) proposed decomposing the matrix $B$ and gradient-optimizing a scaling vector and an orthogonal matrix. The new formulation of the Orthogonal Distance Probe is:

$$d_{V,\bar{d}}(h_i, h_j)^2 = \left(\bar{d} \odot V(h_i - h_j)\right)^\top \left(\bar{d} \odot V(h_i - h_j)\right),$$

where $V$ is an orthogonal matrix (Orthogonal Transformation) and $\bar{d}$ is a Scaling Vector, which can be changed during optimization for each task to allow multi-task joint probing.
This decomposition allows optimizing a separate Scaling Vector $\bar{d}$ for each specific objective, enabling probing for multiple linguistic tasks simultaneously. In this work, an individual Orthogonal Transformation $V$ is trained for each language, facilitating multi-language probing. This approach assumes that the representations are isomorphic across languages; we examine this claim in our experiments.
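As an illustration, the following is a minimal PyTorch sketch of the probe under the formulation above. The class and method names are ours, and orthogonality of $V$ is enforced here with PyTorch's orthogonal parametrization; the authors' actual implementation (in particular, how orthogonality is enforced or regularized) may differ.

```python
import torch
import torch.nn as nn


class OrthogonalStructuralProbe(nn.Module):
    """Sketch: B is decomposed into an Orthogonal Transformation V
    (one per language) and a task-specific Scaling Vector d."""

    def __init__(self, dim: int = 768, tasks=("dep_distance", "dep_depth")):
        super().__init__()
        # Orthogonal Transformation V, kept orthogonal via a parametrization.
        self.V = nn.Linear(dim, dim, bias=False)
        torch.nn.utils.parametrizations.orthogonal(self.V)
        # One Scaling Vector d per probing task (shared V, task-specific d).
        self.d = nn.ParameterDict({t: nn.Parameter(torch.ones(dim)) for t in tasks})

    def distance(self, h: torch.Tensor, task: str) -> torch.Tensor:
        """Pairwise squared distances; h has shape (seq_len, dim)."""
        proj = self.V(h) * self.d[task]               # d ⊙ V(h), shape (seq_len, dim)
        diff = proj.unsqueeze(0) - proj.unsqueeze(1)  # (seq_len, seq_len, dim)
        return (diff ** 2).sum(-1)

    def depth(self, h: torch.Tensor, task: str) -> torch.Tensor:
        """Squared norms approximating tree depth; h has shape (seq_len, dim)."""
        proj = self.V(h) * self.d[task]
        return (proj ** 2).sum(-1)
```

Training then minimizes a regression loss between the predicted and gold squared distances and depths, as in standard structural probing.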

Experiments
We examine vector representations obtained from the multilingual cased BERT model (mBERT; Devlin et al., 2019).

Data and Probing Objectives
We probe for syntactic structures annotated in Universal Dependencies treebanks (Nivre et al., 2020) and for lexical hypernymy trees from WordNet (Miller, 1995). We optimize depth and distance probes for both types of structures jointly.
For both dependency and lexical probes, we use sentences from UD treebanks in nine languages. For each treebank, we sampled 4000 sentences to diminish the effect of varying dataset sizes on probe optimization. Lexical depths and distances for each sentence are obtained from the hypernymy trees available for each language in the Open Multilingual Wordnet (Bond and Foster, 2013).
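To make the lexical objective concrete, the sketch below shows one way gold hypernymy depths and distances could be derived, here via NLTK's WordNet interface. The function names are illustrative, and the paper's exact synset-selection and tree-construction procedure over the Open Multilingual Wordnet may differ.

```python
from typing import Optional

from nltk.corpus import wordnet as wn  # requires the 'wordnet' and 'omw-1.4' NLTK data


def lexical_depth(lemma: str, lang: str = "eng") -> Optional[int]:
    """Depth of the lemma's first synset in the hypernymy hierarchy."""
    synsets = wn.synsets(lemma, lang=lang)
    return synsets[0].min_depth() if synsets else None


def lexical_distance(lemma_a: str, lemma_b: str, lang: str = "eng") -> Optional[int]:
    """Number of hypernymy edges on the shortest path between the first synsets."""
    sa = wn.synsets(lemma_a, lang=lang)
    sb = wn.synsets(lemma_b, lang=lang)
    if not sa or not sb:
        return None
    return sa[0].shortest_path_distance(sb[0])


# Example: depth of "dog" and its hypernymy distance to "cat" in English WordNet.
print(lexical_depth("dog"), lexical_distance("dog", "cat"))
```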

Choice of Layers
We probe the representations of the 7th layer for dependency information and representations of the 5th layer for lexical information. These layers achieve the highest performance for the respective features.

Multilingual Evaluation
We utilize the new joint optimization capability of Orthogonal Structural Probes to analyze how linguistic phenomena are encoded across different languages in mBERT representations.
To answer our research questions, we evaluate three settings of multilingual Orthogonal Structural Probe training. The approaches are sorted by expressiveness; the most expressive one makes the weakest assumption about the likeness of representations across languages:

IN-LANG (no assumption). We train a separate instance of the Orthogonal Structural Probe for each language. Neither the Scaling Vector nor the Orthogonal Transformation is shared between languages.

MAPPEDLANGS (isomorphism assumption).
We train a shared Scaling Vector for each probing task and a separate Orthogonal Transformation per language. If the embedding subspaces across languages differ only by an orthogonal transformation, the mapping will be learned during probe training, and this setting will achieve results similar to the previous one.
ALLLANGS (uniformity assumption). Both the Scaling Vector and the Orthogonal Transformation are shared across languages. If the same embedding subspace encodes the probed information across languages, the results of this setting will be on par with the first approach.

The first and the last approaches were previously analyzed for Structural Probes by Chi et al. (2020). The MAPPEDLANGS setting is possible thanks to the new probing formulation of Limisiewicz and Mareček (2021). For evaluation, we compute Spearman's correlations between predicted and gold depths and distances. In this evaluation, we use supervision for the target language. Furthermore, we analyze the impact of two language-specific features on the results: a) the size of the mBERT training corpus in a given language; b) typological similarity to English. The former is expressed as the number of tokens in Wikipedia. The latter is the Hamming similarity between features in WALS (Dryer and Haspelmath, 2013).
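For concreteness, here is a small sketch of the two evaluation quantities mentioned above: per-sentence Spearman correlation and WALS Hamming similarity. The aggregation scheme and the handling of WALS features are our simplifying assumptions, not necessarily the exact procedure used in the paper.

```python
import numpy as np
from scipy.stats import spearmanr


def sentence_spearman(pred: np.ndarray, gold: np.ndarray) -> float:
    """Spearman correlation between predicted and gold values for one sentence
    (flattened pairwise distances, or per-token depths)."""
    return spearmanr(pred.ravel(), gold.ravel()).correlation


def corpus_spearman(sentence_pairs) -> float:
    """Hypothetical aggregation: average the per-sentence correlations."""
    return float(np.mean([sentence_spearman(p, g) for p, g in sentence_pairs]))


def wals_hamming_similarity(feats_a: dict, feats_b: dict) -> float:
    """Fraction of WALS features defined for both languages that have equal values."""
    common = [f for f in feats_a if f in feats_b]
    return sum(feats_a[f] == feats_b[f] for f in common) / len(common)
```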

Zero- and Few-shot Parsing
We extract directed trees from the predictions of dependency probes. For that purpose, we employ the Maximum Spanning Tree algorithm on the predicted distances and its extension by Kulmizev et al. (2020), which extracts directed trees based on predicted depths. We examine cross-lingual transfer for parsing sentences in Chinese, Basque, Slovene, Finnish, and Arabic. For each of them, we train the probe on the remaining eight languages. In the few-shot setting, we additionally optimize on 10 to 1000 examples from the target language.
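A minimal sketch of how a dependency tree could be read off the probe's outputs: build a spanning tree over the predicted pairwise distances and orient each edge toward the node with the larger predicted depth. The function name is ours, and this is a simplified variant of the Kulmizev et al. (2020) extension; the sketch uses SciPy's minimum spanning tree over distances (equivalent to a maximum spanning tree over negated distances).

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree


def extract_tree(distances: np.ndarray, depths: np.ndarray):
    """Undirected spanning tree over predicted pairwise distances, with each edge
    oriented from the shallower to the deeper node according to predicted depths.
    Note: csgraph treats exact zeros as missing edges; this is harmless for
    real-valued probe outputs apart from the zero diagonal."""
    mst = minimum_spanning_tree(distances).toarray()
    edges = []
    for i, j in zip(*np.nonzero(mst)):
        head, dep = (i, j) if depths[i] < depths[j] else (j, i)
        edges.append((int(head), int(dep)))
    root = int(np.argmin(depths))  # the shallowest token serves as the root
    return root, edges
```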

Results
Spearman's correlation. Using IN-LANG probes for each language gives high Spearman's correlations across the languages. The MAPPEDLANGS approach brings only a slight difference for most of the configurations, while imposing the uniformity constraint (ALLLANGS) deteriorates the results for some of the languages, as shown in Table 1. The drop in correlation is especially high for non-Indo-European languages (except for lexical distance, where the difference between the Indo-European and non-Indo-European groups is small).
In Fig. 1, we present the Pearson's correlations between the results from Table 1 and two language-specific features. The key observation is that typological similarity to English is strongly correlated with ∆ALLLANGS. Hence, a shared probe achieves relatively good results for English, Spanish, and French. It shows that lexical and dependency information is uniformly distributed in the embedding space for those languages. We bear in mind that European languages are over-represented in mBERT's pre-training corpus. However, the size of the pre-training corpus is correlated with ∆ALLLANGS to a lesser extent than WALS similarity, suggesting that typological similarity plays a more prominent role than corpus size. There is no significant correlation between ∆MAPPEDLANGS and typological similarity; the embeddings of diverse languages can be mapped similarly well into a shared space. Notably, we observe that some languages with lower IN-LANG probe performance can benefit from the mapping (e.g., Slovene, Finnish, and Basque for lexical depth). We view this as a benefit of cross-lingual transfer from more resourceful languages.
Zero-shot Parsing. For all languages except Finnish in the zero-shot configuration, our ALLLANGS approach is better than other works that utilize a biaffine parser (Dozat and Manning, 2017) on top of mBERT representations, as shown in Table 2. Without any supervision, our MAPPEDLANGS approach performs poorly because the mapping cannot be learned effectively. When some annotated data is added to the training, the difference between ALLLANGS and MAPPEDLANGS decreases. We observe that between 100 and 1000 training samples are needed to learn the Orthogonal Transformation effectively. Also, with more supervision, the results reported by Lauscher et al. (2020) notably outperform our approach. This outcome was anticipated because they fine-tune mBERT and use a biaffine parser with a larger capacity than a probe. For their approach, the introduction of even a small amount of supervision is more advantageous than for probing.

Conclusions
We propose an effective way to probe multilingually for syntactic dependency (UD) and lexical hypernymy (WordNet). Our algorithm learns probes for multiple tasks and multiple languages jointly. The formulation of the Orthogonal Structural Probe allows learning a cross-lingual transformation based on monolingual supervision. Our comparative evaluation indicates that the evaluated information is similarly distributed in mBERT's representations for languages typologically similar to English: Spanish, French, and Finnish. We show that aligning the embeddings with an Orthogonal Transformation improves the results for the other examined languages, suggesting that the representations are isomorphic. We show that the probe can be utilized in zero- and few-shot parsing. The method achieves better UAS results for Chinese, Slovene, Basque, and Arabic in the zero-shot setting than previous approaches, which use a more complex biaffine parser.
Limitations. In our choice of languages, we wanted to ensure diversity. Nevertheless, four of the analyzed languages belong to the Indo-European family, which could facilitate finding a shared encoding subspace for those languages.

Acknowledgments
We thank anonymous EMNLP reviewers for their valuable comments and suggestions for improvement. This work has been supported by grant 338521 of the Charles University Grant Agency and by the Progress Q48 grant of Charles University. We have been using language resources and tools developed, stored, and distributed by the LINDAT/CLARIAH-CZ project of the Ministry of Education, Youth and Sports of the Czech Republic (project LM2018101).

A Typological similarity
In Fig. 2, we present typological similarities between languages. Based on Fig. 3, we observe that typological similarity to languages related to English (Spanish, Finnish, French) is correlated with ∆ALLLANGS. Moreover, the correlation between similarity to these languages and the number of tokens in Wikipedia is smaller than for English. This supports our claim that typological similarity is more important for the uniformity assumption than the size of the pre-training corpus.

B Pre-training corpus size
The sizes of Wikipedia in the eight analyzed languages are presented in Table 3.

C Datasets
In Table 4, we list all the datasets used in our experiments.

D Information separation
In line with the findings of Limisiewicz and Mareček (2021), we have observed that in the multilingual setting, Orthogonal Structural Probes disentangle the subspaces responsible for encoding lexical and dependency structures.

E.1 Number of Parameters
A Scaling Vector for each of the 4 objectives has size 768 × 1, and an Orthogonal Transformation for each language is a matrix of size 768 × 768. In MAPPEDLANGS, our most memory-demanding setting, we train 8 Orthogonal Transformations. In this configuration, our probe has 4,721,664 parameters.
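For reference, the parameter count follows directly from the sizes above:

$$4 \times 768 + 8 \times 768^2 = 3{,}072 + 4{,}718{,}592 = 4{,}721{,}664.$$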

E.2 Computation Time
We optimized the probes on a GeForce GTX 1080 Ti GPU. Training a probe in the MAPPEDLANGS configuration takes about 3 hours.

F.1 UUAS results
Table 6 contains the results for undirected dependency trees. We use the same probing setting as in Section 3.2 but without assigning directions to the edges. Similarly to Chi et al. (2020), we exclude punctuation from the evaluation.

F.2 Validation Results
In Table 7, we present the validation results corresponding to the test results in Table 1.