Are Structural Concepts Universal in Transformer Language Models? Towards Interpretable Cross-Lingual Generalization

Large language models (LLMs) have exhibited considerable cross-lingual generalization abilities, whereby they implicitly transfer knowledge across languages. However, the transfer is not equally successful for all languages, especially for low-resource ones, which poses an ongoing challenge. It is unclear whether we have reached the limits of implicit cross-lingual generalization and if explicit knowledge transfer is viable. In this paper, we investigate the potential for explicitly aligning conceptual correspondence between languages to enhance cross-lingual generalization. Using the syntactic aspect of language as a testbed, our analyses of 43 languages reveal a high degree of alignability among the spaces of structural concepts within each language for both encoder-only and decoder-only LLMs. We then propose a meta-learning-based method to learn to align conceptual spaces of different languages, which facilitates zero-shot and few-shot generalization in concept classification and also offers insights into the cross-lingual in-context learning phenomenon. Experiments on syntactic analysis tasks show that our approach achieves competitive results with state-of-the-art methods and narrows the performance gap between languages, particularly benefiting those with limited resources.


Introduction
Cross-lingual generalization entails repurposing the knowledge acquired in one language to another with little supervision, thereby mitigating the digital language divide. Despite the vast variations across languages, it is possible to identify corresponding concepts among them, which provides a basis for cross-linguistic generalization (Croft, 1991; Haspelmath, 2010, 2021). This has been instantiated by frameworks such as Universal Dependencies (UD) (de Marneffe et al., 2021), where structural concepts including word classes (e.g., "noun" and "verb") and grammatical relations (e.g., "subject" and "object") are defined in a cross-linguistically consistent way. While large language models (LLMs) have demonstrated their capacity to induce these concepts within individual languages (Tenney et al., 2019; Liu et al., 2019; Chi et al., 2020; Linzen and Baroni, 2021), they encounter difficulties in generalizing the knowledge across languages (Joshi et al., 2020; Blasi et al., 2022; Majewska et al., 2022). This raises questions about whether LLMs are able to capture the underlying conceptual correspondence and how to harness that knowledge for improved generalization.
Previous work has shown that cross-linguistic similarities are automatically captured in the representation space of LLMs, enabling zero-shot cross-lingual transfer (Pires et al., 2019; Wu and Dredze, 2019; Chi et al., 2020; Papadimitriou et al., 2021; Muller et al., 2021; Xu et al., 2022). Efforts have been made to further enhance their generalization by exploiting high-resource languages for parameter and information sharing (Üstün et al., 2020; Nooralahzadeh et al., 2020; Choenni et al., 2023) or enforcing alignment between languages (Cao et al., 2019; Schuster et al., 2019; Sherborne and Lapata, 2022), but these approaches typically rely on structural and lexical similarities between languages and fall short when dealing with low-resource languages distant from high-resource ones (Ponti et al., 2021; de Lhoneux et al., 2022). With the ever-increasing size of LLMs, recent work has explored methods to elicit the multilingual ability of LLMs via in-context learning (Winata et al., 2021; Tanwar et al., 2023), alleviating the cost of parameter updates, but the generalization performance lags behind and is highly sensitive to prompt design (Lai et al., 2023; Ahuja et al., 2023). Overall, the generalization is predominantly realized implicitly, and it remains unclear whether the commonalities shared across languages have been fully exploited.
In this paper, we investigate the potential to explicitly leverage the conceptual correspondence between languages for cross-lingual generalization. We focus on the structural concepts outlined in UD, which are at the core of syntactic analyses across languages (Croft, 1991) and have been shown to be learned by LLMs, offering a valuable testbed for our analyses. For each language, we learn a linear transformation of an LLM's representation space, which defines a conceptual space where samples can be classified by their distances to prototypes for each concept. Concepts represented by the LLM are then regarded as clusters discernible from others based on their prototypes. Analyses across 43 typologically distinct languages reveal a high degree of alignability among their conceptual spaces, indicating that the conceptual correspondence is implicitly established in LLMs (Section 2). We then present a meta-learning-based method that learns to explicitly align different languages with limited data available, facilitating zero-shot and few-shot generalization in concept classification. We demonstrate the effectiveness of our approach for both encoder-only (Section 3) and decoder-only LLMs (Section 4), achieving encouraging results especially for low-resource languages.
In summary, our contributions are as follows: 1) We demonstrate that the conceptual correspondence between languages, in terms of the structural concepts defined in UD, is implicitly established in both encoder-only and decoder-only LLMs.
2) We propose a meta-learning-based approach to explicitly aligning conceptual correspondence between different languages, enabling cross-lingual generalization in zero-shot and few-shot scenarios without requiring parameter updates to the LLMs. Our method achieves competitive results with state-of-the-art methods and reduces the performance gap between languages, particularly benefiting low-resource ones. 3) Our approach provides insights into the cross-lingual in-context learning phenomenon. Integrated with the prompt-based learning paradigm, it achieves promising gains in generalizing to novel languages.

Correspondence between Structural Concepts in Transformers

We first examine whether LLMs implicitly capture the conceptual correspondence between different languages from plain text, which could lay the foundation for better cross-lingual generalization. Concretely, we first derive structural concepts within individual languages, and then evaluate whether these concepts are readily alignable across languages.

Method
Deriving concepts Let $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$ denote a dataset consisting of $N$ feature-label pairs in a language $L$, where the features $x_i \in \mathbb{R}^n$ are $n$-dimensional representations yielded by LLMs, the labels $y_i \in \{1, \dots, K\}$ are the corresponding structural concepts, and $\mathcal{D}_k$ is the set of samples with label $k$. We compute an $m$-dimensional prototype $c_k \in \mathbb{R}^m$ for each concept $k$ by learning a linear transformation $A \in \mathbb{R}^{n \times m}$, such that

$$c_k = \frac{1}{|\mathcal{D}_k|} \sum_{(x_i, y_i) \in \mathcal{D}_k} A^{\top} x_i,$$

and the label of a feature $x$ can be identified with respect to its distances to the prototypes in the transformed representation space. Specifically, the probability that $x$ is an instance of concept $k$ is given by

$$p_A(y = k \mid x) = \frac{\exp\left(-d(A^{\top} x, c_k)\right)}{\sum_{k'} \exp\left(-d(A^{\top} x, c_{k'})\right)},$$

where $d(\cdot, \cdot)$ is the squared Euclidean distance.
The parameters of the transformation, i.e., the matrix $A$, are learned by minimizing the negative log-probability $\mathcal{L}_A = -\log p_A(y = k \mid x)$ of the gold concept $k$ through gradient descent. Our probing method is inspired by Prototypical Networks (Snell et al., 2017), but we constrain the transformation to be linear, as our goal is to investigate the geometry of LLMs, akin to Hewitt and Manning (2019).
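The classification rule above can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the toy features and the transformation are stand-ins (here $A$ is fixed to the identity rather than learned by gradient descent on the negative log-probability), and the clusters are synthetic.

```python
import numpy as np

def prototype_probe(X, y, A):
    """Project features with the linear map A, compute class prototypes as
    the mean of transformed features, and return per-class probabilities
    from a softmax over negative squared Euclidean distances."""
    Z = X @ A                                                      # (N, m)
    classes = np.unique(y)
    protos = np.stack([Z[y == k].mean(axis=0) for k in classes])   # (K, m)
    d2 = ((Z[:, None, :] - protos[None, :, :]) ** 2).sum(-1)       # (N, K)
    logits = -d2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return p, protos

# toy demo: two tight clusters standing in for two structural concepts
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 8)), rng.normal(3, 0.1, (20, 8))])
y = np.array([0] * 20 + [1] * 20)
A = np.eye(8)          # identity stand-in for the learned transformation
p, protos = prototype_probe(X, y, A)
acc = (p.argmax(axis=1) == y).mean()
```

On this separable toy data the prototype classifier is perfect; in the paper, the quality of such a classifier on held-out LLM features is what indicates how well a concept is encoded.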
Measuring Alignability We employ two complementary methods to measure the alignability between structural concepts in different languages: representational similarity analysis (RSA) (Kriegeskorte et al., 2008) and Procrustes analysis.
RSA is non-parametric and has been widely used for measuring the degree of topological alignment between representation spaces based on their (dis)similarity matrices (Kriegeskorte and Diedrichsen, 2019). Procrustes analysis evaluates the extent to which two spaces can be aligned linearly by finding the optimal orthogonal transformation. Given two languages $L_1$ and $L_2$ with $K$ shared structural concepts, we derive prototypes for each concept, which serve as parallel points that allow for comparison between languages. Infrequently used concepts with fewer than 20 samples are excluded from our analysis. For RSA, we compute a dissimilarity matrix $M \in \mathbb{R}^{K \times K}$ for each language, where the entry $M_{i,j} = d(c_i, c_j)$ is the distance between the $i$-th and $j$-th prototypes ($1 \le i, j \le K$). The alignability is computed as the Spearman's correlation between the lower diagonal portions of the two matrices, ranging from $-1$ to $1$. For Procrustes analysis, we evaluate the fitness of the linear transformation through the average proportion of explained variance.
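The two alignability measures can be sketched as follows. This is an illustrative numpy implementation under stated assumptions (squared-distance dissimilarities, explained variance measured as one minus the normalized residual of the best orthogonal fit); the paper's exact scoring details may differ.

```python
import numpy as np

def rsa(P1, P2):
    """Spearman correlation between the lower-triangular parts of the
    pairwise squared-distance matrices of two prototype sets."""
    def lower_dists(P):
        d = ((P[:, None] - P[None, :]) ** 2).sum(-1)
        return d[np.tril_indices(len(P), k=-1)]
    def rank(v):                         # ranks (no ties in generic data)
        return np.argsort(np.argsort(v)).astype(float)
    a, b = rank(lower_dists(P1)), rank(lower_dists(P2))
    return np.corrcoef(a, b)[0, 1]

def procrustes_fit(P1, P2):
    """Fitness of the optimal orthogonal map P1 @ R -> P2, reported as
    the proportion of variance in P2 explained by the aligned P1."""
    U, _, Vt = np.linalg.svd(P1.T @ P2)
    R = U @ Vt                           # optimal orthogonal transformation
    resid = ((P1 @ R - P2) ** 2).sum()
    return 1.0 - resid / (P2 ** 2).sum()

# sanity check: a rotated copy of a prototype set is perfectly alignable
rng = np.random.default_rng(1)
P1 = rng.normal(size=(10, 6))            # 10 concept prototypes, 6-dim
Q, _ = np.linalg.qr(rng.normal(size=(6, 6)))
P2 = P1 @ Q                              # an exactly rotated copy
```

Since an orthogonal map preserves distances, the rotated copy yields an RSA of 1 and full explained variance, which is the upper bound the baselines below are contrasted against.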

Setup
Model We use Multilingual BERT (mBERT) (Devlin et al., 2019) and the LLaMA 7B model (Touvron et al., 2023) for our experiments. Both are pretrained on multiple languages without explicit cross-lingual information, enabling us to probe the cross-linguistic knowledge induced exclusively from raw text. A linear transformation $A \in \mathbb{R}^{n \times m}$ with varying $m$ is trained to project the features $x \in \mathbb{R}^n$ yielded by the LLM into an $m$-dimensional space, whereby we test what rank of transformation is needed to extract structural concepts.

Data
The data used in all our experiments is from UD v2.10. For mBERT, we select 43 typologically distinct languages that represent a diverse range of language families. For LLaMA, we test it on 20 languages, including one that is not included in its pretraining corpus for comparison. (See Appendix E for the languages and treebanks we use.)

Baselines The alignability between the structural concepts in two languages should contrast with cases where the structure of their spaces is deformed. Given the prototypes for $K$ structural concepts derived from two languages, we construct the following baselines: (i) RP, which randomly swaps each prototype for another in one of the two languages; (ii) RC, where we randomly select a sample of each concept instead of its prototype in one language; (iii) RS, where we randomly select $K$ samples in one language.
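A minimal sketch of how the three deformation baselines can be constructed, assuming synthetic prototypes and per-concept samples (the shapes and sampling here are illustrative, not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(2)
K, m = 8, 6
protos = rng.normal(size=(K, m))          # prototypes in one language
# 30 synthetic samples per concept, scattered around each prototype
samples = {k: rng.normal(protos[k], 0.5, size=(30, m)) for k in range(K)}

# RP: randomly swap each prototype for another
rp = protos[rng.permutation(K)]
# RC: a random sample of each concept in place of its prototype
rc = np.stack([samples[k][rng.integers(30)] for k in range(K)])
# RS: K samples drawn at random, ignoring concept identity
pool = np.concatenate(list(samples.values()))
rs = pool[rng.choice(len(pool), size=K, replace=False)]
```

Each baseline keeps the number of anchor points fixed while progressively destroying more of the space's structure, so alignability scores above all three indicate genuinely shared geometry.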

Results
Structural concepts can be identified based on prototypes Structural concepts, including word classes and grammatical relations, can be successfully distinguished according to their distances to the prototypes (Figure 2). The structural information can be effectively encoded within a relatively low-dimensional space, whose dimensionality varies across models. We note that more expressive probing models are needed to extract structural concepts, especially grammatical relations, from LLaMA; we leave exploration of this to future work.
Structural concepts are readily alignable across languages Figure 1 depicts the alignability between the structural concepts in different languages.
Both word classes and grammatical relations are highly correlated across languages and can be approximately aligned through an orthogonal transformation (rotation, reflection, etc.). Moreover, the alignability is significantly higher than the baselines, reinforcing that the conceptual correspondence between languages is reflected in the representation space.

Discussion
It has been suggested that word embeddings in different languages are approximately isometric and can be aligned through a linear transformation (Mikolov et al., 2013; Lample et al., 2018; Schuster et al., 2019). However, the meanings of individual words, rather than the underlying concepts (Youn et al., 2016; Xu et al., 2020), might not be truly alignable across languages (Thompson et al., 2020), and enforcing such alignment can hurt downstream performance (Glavaš et al., 2019; Wu and Dredze, 2020). We instead propose to establish the alignment based on conceptual correspondences that can serve as yardsticks for cross-linguistic comparison (Haspelmath, 2021), and the structural concepts defined in UD are designed to meet precisely this need. The alignability of structural concepts across different languages is relatively consistent, providing evidence that their correspondence is implicitly encoded in LLMs, though not well aligned. However, variations remain between different language pairs. Besides subtle cross-linguistic differences with regard to the structural concepts (e.g., Ponti et al., 2018), this might result from i) the lack of sufficient data in the UD treebanks for approximating the prototypes, and ii) the degenerate representation spaces of certain languages, which has been attributed to factors including inadequate pretraining data and deficiencies in tokenization methods for specific languages (Vulić et al., 2020; Rust et al., 2021; Blaschke et al., 2023; Purkayastha et al., 2023). The disparities among languages are also reflected in our experimental results on the classification of structural concepts, especially for languages not well represented in the pretraining corpora.

Aligning Conceptual Correspondence for Cross-Lingual Generalization

The previous section shows that universal structural concepts are readily alignable in LLMs. Next, we investigate how to leverage this knowledge for cross-lingual generalization. We rely on meta-learning to learn to align conceptual correspondence between different languages in both zero-shot and few-shot scenarios, analyzing how different factors, including the number of available examples and the languages used for meta-training, may impact the generalization.

Method
Learning to align with a few examples We first derive prototypes $c^S_k$ from a source language $L_S$ following the method in Section 2.1, and then learn to align samples in different languages with $c^S_k$ via meta-learning. We employ a composite function $F = g_\alpha \circ f_\phi$ to establish the alignment. The function $f_\phi: \mathbb{R}^n \to \mathbb{R}^m$ with parameters $\phi$ is language-agnostic and projects features yielded by LLMs into an $m$-dimensional space, where samples belonging to each concept in a target language $L_T$ are expected to cluster around their prototypes $c^T_k$. As shown in Section 2.3, an orthogonal transformation suffices to align the prototypes in two languages while preserving the geometry of the original spaces. We thus use a language-specific linear mapping $g_\alpha: \mathbb{R}^m \to \mathbb{R}^m$ to map $c^T_k$ onto the prototypes $c^S_k$, which allows us to identify the structural concepts according to

$$p(y = k \mid x) = \frac{\exp\left(-d(g_\alpha(f_\phi(x)), c^S_k)\right)}{\sum_{k'} \exp\left(-d(g_\alpha(f_\phi(x)), c^S_{k'})\right)}. \quad (3)$$

The parameters are jointly optimized by minimizing the negative log-probability of the gold concept $k$. We use labeled data in multiple languages to learn the function $F$ during meta-training. The language-agnostic function $f_\phi$ is optimized over the entire training procedure, while the language-specific function $g_\alpha$ is learned separately for each language. During meta-testing, $f_\phi$ is kept fixed while $g_\alpha$ is learned from scratch using the provided examples.
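The structure of the composite function $F = g_\alpha \circ f_\phi$ can be sketched as follows. This is a forward-pass illustration only: the perceptron weights are random stand-ins for the meta-learned parameters, $g_\alpha$ is initialized to the identity rather than fitted to a support set, and the prototypes are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, K = 16, 8, 5

# language-agnostic f_phi: a 2-layer perceptron (random stand-in weights;
# in the paper these are meta-learned across training episodes)
W1, W2 = rng.normal(0, 0.1, (n, 32)), rng.normal(0, 0.1, (32, m))
f_phi = lambda X: np.maximum(X @ W1, 0) @ W2

# language-specific g_alpha: a linear map learned per target language from
# its few support examples (identity here, before any adaptation)
G = np.eye(m)
g_alpha = lambda Z: Z @ G

protos_src = rng.normal(size=(K, m))   # prototypes from the source language

def classify(X):
    """Label queries by the distance of g_alpha(f_phi(x)) to the source
    prototypes, i.e., the argmax of the alignment-based probability."""
    Z = g_alpha(f_phi(X))
    d2 = ((Z[:, None] - protos_src[None]) ** 2).sum(-1)
    return d2.argmin(-1)

queries = rng.normal(size=(10, n))
preds = classify(queries)
```

During meta-testing only `G` would be re-estimated from the support sentences of a new language, while `W1`/`W2` stay frozen, which is what makes the adaptation cheap.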
Aligning with unified prototypes for zero-shot generalization In the zero-shot setting, instead of being given a few examples to learn the alignment, we rely on meta-learning to establish unified prototypes c^ω_k for each concept k, using a language-agnostic function f_ϕ to match samples with them. The classification is then performed based on d(f_ϕ(x), c^ω_k). We optimize the parameters during meta-training and directly apply the models to other languages for meta-testing.

Table 1: The zero-shot and few-shot generalization performance on POS tagging for a subset of languages unseen during meta-training and low-resource languages. Languages marked with "*" are not included in the pretraining corpus; "†" indicates that the language is involved in meta-training for M43. AVG and STD denote the average accuracy and standard deviation, respectively, for the 30 low-resource languages. (The results for all languages, together with the performance on the classification of grammatical relations, are given in Appendix B.)

Setup
Model We derive representations from the 7th layer of mBERT. The language-agnostic function f_ϕ is parameterized by a 2-layer perceptron with h hidden units that projects the features derived from mBERT into an m-dimensional space. We set h = 256 and m = 32 for word class identification, i.e., part-of-speech (POS) tagging. For grammatical relations, we set h = 384 and m = 64.
Data To investigate the impact of the languages involved in meta-training on performance, we examine two distinct settings: i) M28, with 28 high-resource languages included in meta-training, and ii) M43, with 15 additional languages, primarily low-resource ones. Unless otherwise stated, we use English as the source language. The languages and datasets used here are listed in Appendix E.
Evaluation We randomly select N sentences from the training set of a target language as the support set, and evaluate the model on its test set in terms of accuracy. We vary N from 0 to 200 to test the number of sentences needed for generalization.
Baselines We compare our method with the following baselines: (i) FT, where the mBERT model is fine-tuned on the N available sentences in the target language, with a linear classifier on top of it as the task-specific layer; (ii) UDapter, a state-of-the-art multilingual parser (Üstün et al., 2022) that extends mBERT with adapters whose parameters are generated based on language embeddings from URIEL (Littell et al., 2017).

Results
Explicit alignment benefits low-resource languages Our method achieves competitive results with UDapter without any parameter updates to the LLM (Table 1), and even surpasses it for some languages distant from high-resource ones, such as Marathi (mr) and Warlpiri (wbp). Moreover, our method helps mitigate the performance gap between different languages, as shown in the reduced standard deviation across languages.

Meta-learning supports efficient alignment
Comparing M28 and M43, we observe that incorporating low-resource languages like Marathi (mr) into meta-training, even with limited data, yields substantial performance gains without compromising performance in other languages. This suggests that the inclusion of diverse languages in meta-training is critical for learning to derive conceptual spaces from LLMs. Based on the knowledge meta-learned from various languages, the model efficiently learns to align unseen languages with little supervision, reaching near-optimal performance at about 50 sentences, in contrast to FT (Figure 3), supporting the effectiveness of our method.

Discussion
Our results tap into the potential for aligning systems that possess common structure, which has been shown to support generalization even in the absence of explicit supervision (Roads and Love, 2020; Aho et al., 2022). Previous work has suggested that word embeddings in different languages are automatically aligned through joint pretraining (Cao et al., 2019; Conneau et al., 2020). For structural concepts, their correspondence is reflected in the geometry of pretrained LLMs and has been aligned to a certain extent (Chi et al., 2020). However, the alignment is not optimal and can be improved with a few examples, as suggested by Lauscher et al. (2020). Our approach may be extended to other aspects of language and, by utilizing the knowledge acquired via self-supervision, holds promise for better accommodating the rich linguistic diversity despite the scarcity of labeled data.

Aligning Conceptual Correspondence during In-Context Learning
In this section, we further explore whether the correspondence between structural concepts can be harnessed for cross-lingual generalization in the in-context learning setting. Specifically, we first show that the cross-lingual syntactic abilities of LLMs can be elicited through in-context learning.
We then rely on our method to probe the underlying mechanisms and further enhance the cross-lingual generalization by aligning the structural concepts within different languages.

Method
Learning cross-lingual structural concepts in context We focus on the POS tagging task and use the structured prompting method proposed by Blevins et al. (2022) to evaluate the few-shot in-context learning ability of LLMs (Figure 4), where the model is given a small number of demonstration examples and is then required to label additional sentences in either the same language or a different one.
During in-context learning, an LLM is provided with N pairs of sentences and tagged sequences as task demonstrations, along with a query sentence to be labeled. It is then required to iteratively tag the words. Specifically, given an input sequence ℓ consisting of the N demonstration examples and the query sentence S = s_1, ..., s_n, at each time step t, the LLM M encodes [ℓ; s_t] and generates the label of s_t as ĉ_t = argmax_c P_M(c | ℓ, s_t). The input sequence ℓ is then updated by appending the predicted label ĉ_t and the following word s_{t+1} to its end.
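The iterative tagging loop above can be sketched as follows. This is an illustrative harness, not the paper's code: `score(context, label)` is a hypothetical stand-in for the LM's conditional log-probability P_M(c | ℓ, s_t), and the toy scorer below exists only to make the loop runnable.

```python
from typing import Callable, List

def tag_iteratively(prompt: str, sentence: List[str],
                    score: Callable[[str, str], float],
                    labels: List[str]) -> List[str]:
    """Structured-prompting loop: at each step, append the next word to the
    context, pick the highest-scoring label, and feed the prediction back
    into the context before moving to the following word."""
    ctx, preds = prompt, []
    for word in sentence:
        ctx += f"\n{word}: "                       # query this word's tag
        best = max(labels, key=lambda c: score(ctx, c))
        preds.append(best)
        ctx += best                                # append predicted label
    return preds

# toy scorer: prefers PROPN for capitalized words, NOUN otherwise
toy_score = lambda ctx, c: float(
    (c == "PROPN") == ctx.rsplit("\n", 1)[-1][:1].isupper())
preds = tag_iteratively("<demonstrations>", ["Paris", "is", "nice"],
                        toy_score, ["NOUN", "PROPN"])
# preds == ["PROPN", "NOUN", "NOUN"]
```

The key design point is that each predicted tag becomes part of the context for the next word, so errors can propagate, which is one reason the authors probe the contextualized representations rather than only the generated labels.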
Probing underlying mechanisms First, we take the demonstration examples as query sentences and evaluate the accuracy of the LLM in classifying these examples. We then investigate whether the representation space contextualized by the demonstrations effectively serves as a conceptual space in which samples can be classified based on their distances to prototypes for each concept. We use the N demonstration examples as query sentences and obtain their representations h_t at time step t for generating the label ĉ_t, whereby we construct a dataset D = {x_i, y_i}_{i=1}^N based on h_t and the gold labels c_t. We follow the approach in Section 2.1 to probe the extent to which structural concepts can be derived from the contextualized representation space, with the exception that the linear transformation is an identity matrix. Finally, we modify the labels and languages provided in the demonstrations to assess whether our results generalize across different settings.
Meta-learning for better generalization Our analysis demonstrates that the LLM learns to accurately label the demonstration examples through in-context learning, but the generalization performance falls short in both monolingual and cross-lingual settings. We thus rely on our meta-learning-based method to improve the LLM's generalization ability. As previously mentioned, we obtain prototypes by utilizing the demonstration examples as query sentences, and then learn to align the representations of other query sentences, contextualized by these demonstrations, with the prototypes. This resembles the zero-shot setting in Section 3.1, but we discard the linear mapping applied to the prototypes, as the prototypes themselves are projected into a contextualized representation space and serve as good anchor points. During meta-learning, we introduce varying demonstration examples to construct different training episodes, and the model is directly applied across different contexts for meta-testing.

Setup
Model We use LLaMA-7B as the underlying LLM. The network used for meta-learning resembles the one described in Section 3.2: a 2-layer perceptron with a hidden layer of size 512.
Data We employ 24 languages for our experiments, among which 5 are used for meta-training. By default, we represent each POS tag with the token corresponding to the surface form of the label defined in UD (UPOS), e.g., "NOUN". We investigate three additional settings where the label forms are modified: i) SHFL, which shuffles the surface forms of the labels; ii) PXY, which uses proxy labels where each class is represented by an arbitrary token (we employ capital alphabet letters here); and iii) WORD, which uses words as labels, e.g., "adverb".
Evaluation We randomly select 9 sentences from the training set of a source language as demonstrations, ensuring they cover the label space if possible. For in-context learning, the LLM is evaluated on 50 randomly selected sentences from the test set of each language. We report the average accuracy across 10 runs, where a sample is considered correctly labeled only if the first word the LLM generates after seeing the delimiter matches the form of the gold label. For the probing and meta-learning experiments, we focus on the setting where the demonstrations provided are in English and perform 10 runs for each language with 50 query sentences randomly sampled from the training set of UD. The evaluation set for each language consists of 10 runs under similar settings, with the exception that the query sentences are sampled from the test set of UD.

Results
The LLM successfully learns to label the demonstration examples, but with limited generalization abilities Table 2 shows that the LLM is able to accurately label the demonstration examples, regardless of changes in the label forms. Moreover, the contextualized representations of the demonstrations can be effectively classified based on the prototypes, indicating a good conceptual space for them. However, the performance significantly decreases when generalizing to unseen sentences, and the cross-lingual generalization performance can be even worse (Figure 5).
Aligning with demonstrations supports generalization By learning to align with prototypes derived from a few demonstration examples, our method achieves remarkable gains in generalization in both monolingual and cross-lingual scenarios (Figure 5). Extending the inclusion of diverse languages in our method holds promise for further improving cross-lingual generalization, particularly for languages that are not well represented among the meta-training languages, like Japanese (ja). These findings support the effectiveness of our method even in the face of variations introduced by changes in demonstrations, and suggest that better generalization performance can be achieved through explicit alignment.

Discussion
While in-context learning has proven an efficient way to leverage LLMs for various downstream tasks and enables few-shot generalization (Brown et al., 2020; Winata et al., 2022), the underlying mechanisms remain unclear. Previous research has suggested that prompting can be regarded as probing knowledge from LLMs (Li et al., 2022a; Blevins et al., 2022; Alivanistos et al., 2022), but the performance is sensitive to prompt engineering, including factors such as the label space and the input distribution (Zhao et al., 2021; Lu et al., 2022; Min et al., 2022; Mishra et al., 2022). We here investigate the representation space of LLMs and find that the demonstrations are effectively learned despite changes in the label forms. These demonstrations establish a conceptual space with which we may align different samples. Our findings are in line with Olsson et al. (2022), suggesting that LLMs learn to match the patterns in the context, and offer insights into improving the generalization abilities of LLMs beyond prompt design.

Related Work
Probing linguistic knowledge Pretrained LLMs have been shown to induce sophisticated linguistic knowledge via self-supervision (Manning et al., 2020; Linzen and Baroni, 2021). As evidenced by probing analyses, structural information including word classes, grammatical relations, and syntactic parse trees can be decoded from their representations to a remarkable extent (Conneau et al., 2018; Blevins et al., 2018; Liu et al., 2019; Tenney et al., 2019; Clark et al., 2019; Hewitt and Manning, 2019; Eisape et al., 2022). Extended to the multilingual setting, these models automatically capture nuanced similarities and differences between languages (Chi et al., 2020; Papadimitriou et al., 2021; Singh et al., 2019; Bjerva and Augenstein, 2021; Xu et al., 2022), enabling efficient zero-shot cross-lingual transfer. We here further explore whether the correspondence between structural concepts is reflected in LLMs' representation space, which could potentially be harnessed for better generalization.
Cross-lingual generalization While multilingual LLMs are capable of zero-shot cross-lingual transfer across various tasks (Pires et al., 2019; Wu and Dredze, 2019; Winata et al., 2021), the performance is sensitive to linguistic diversity. Efforts have been made to facilitate generalization to low-resource languages by learning proper information sharing (Ammar et al., 2016; Üstün et al., 2020; Nooralahzadeh et al., 2020) and optimizing data selection (Ponti et al., 2018; Lin et al., 2019; Glavaš and Vulić, 2021), but problems remain for outlier languages that have no related high-resource ones (Blasi et al., 2022). Another line of work strives to overcome the language barriers by imposing alignment of word embeddings (Lample et al., 2018; Ruder et al., 2019; Schuster et al., 2019; Cao et al., 2019). While the alignment typically ensures that semantically and syntactically similar words are clustered together, it remains implicit whether the underlying conceptual correspondence between languages is properly aligned.

Meta-learning for generalization Meta-learning has showcased great success in enabling effective generalization. Through the process of learning to learn, namely, improving learning over multiple learning episodes (Wang et al., 2020; Huisman et al., 2021; Hospedales et al., 2022), it facilitates rapid adaptation to novel contexts with limited data available. Prior work has exploited methods including Model-Agnostic Meta-Learning (Finn et al., 2017), Reptile (Nichol et al., 2018), and Prototypical Networks (Snell et al., 2017) for improved cross-lingual generalization (Ponti et al., 2021; Langedijk et al., 2022; Sherborne and Lapata, 2023; Cattan et al., 2021). Our method is similar to Prototypical Networks, but instead of estimating prototypes from the few available samples for each class, we derive prototypes from a source language and learn to align different languages with them, verifying the conceptual correspondence captured by LLMs.
Conclusion

We have demonstrated that multilingual LLMs are able to induce the correspondence between structural concepts in different languages without any explicit supervision. This knowledge is encoded in their geometry and can be exploited for generalization, whereby we rely on meta-learning to learn to align different languages with minimal examples available. Our approach can be used to evaluate the correspondence between different systems acquired by LLMs, and to explicitly leverage it for sample-efficient generalization, suggesting a new path toward measuring and manipulating the knowledge encoded in LLMs. Future research may generalize it to other contexts (e.g., other modalities) to probe the commonalities and differences shared between systems and develop more sophisticated ways of alignment and generalization.

Limitations
Our goal in this work is to measure the underlying conceptual correspondence between languages encoded in LLMs and to leverage it for generalization. While we have demonstrated the effectiveness of our approach, it is only a first step toward this more general goal. The foremost limitation of our approach is that it relies on comparable concepts defined by linguists and on manually created datasets to derive proper features for analysis. Continued research could expand it to other tasks where prior knowledge of association is available and explore how different kinds of alignment impact the performance of LLMs.
Despite the correspondence, our findings suggest that the alignability, as well as the generalization performance, still varies across languages, though not as pronounced as in the zero-shot cross-lingual transfer scenario. These variations may be attributed to factors like nuanced cross-linguistic differences, degraded representation spaces for some low-resource languages, and insufficient task-specific data. Further investigation is needed to fully understand the cause of these disparities and how they can be reduced to improve the generalization abilities of pretrained LLMs, e.g., how to improve the representation spaces of individual languages.

A Additional Materials for Correspondence between Structural Concepts in Transformers

A.1 Identification of Structural Concepts Based on Prototypes
Representations of structural concepts Given an input sequence $\ell$ of $n$ tokens $w^\ell_{1:n}$, an LLM produces contextual representations $h^\ell_{1:n}$, one for each token $w^\ell_i$ ($i = 1, \dots, n$). We take the representation $h^\ell_i$ corresponding to the word $w^\ell_i$ as the feature of its word class. The feature of the grammatical relation between a head-dependent pair of words $(w^\ell_{\text{head}}, w^\ell_{\text{dep}})$ is given by the difference between their representations:

$$x = h^\ell_{\text{head}} - h^\ell_{\text{dep}},$$

akin to previous work (Hewitt and Manning, 2019; Chi et al., 2020; Xu et al., 2022). The features and corresponding labels constitute the datasets $\mathcal{D} = \{x_i, y_i\}_{i=1}^{N}$, whereby we derive each structural concept $k$ and measure the alignability between languages (Section 2.1).
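The two feature types can be extracted as follows. The representations here are random stand-ins for an LLM's hidden states, and the head/dependent indices are chosen arbitrarily for illustration:

```python
import numpy as np

# hypothetical contextual representations for a 6-word sentence
# (one 768-dim row per word, standing in for an LLM's hidden states)
rng = np.random.default_rng(4)
hidden = rng.normal(size=(6, 768))

# word-class feature: the representation of the word itself
x_pos = hidden[2]

# grammatical-relation feature for a head-dependent pair (say head=1,
# dep=2): the difference between the two words' representations
head, dep = 1, 2
x_rel = hidden[head] - hidden[dep]
```

Pairing these vectors with the UD annotations for each word (or each dependency arc) yields the feature-label datasets used throughout Section 2.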

Validation of method
The middle layers of BERT-like models have been shown to be most effective in encoding syntactic information (Hewitt and Manning, 2019; Chi et al., 2020). We validate that this applies to our setting by probing the different layers of mBERT. Besides, to ensure that our method reflects the information about structural concepts encoded in the representation space, we compare the classification performance with the following baselines: 1) LAYER0, the 0th layer of mBERT, where no contextual information is given; and 2) RAND, a model sharing the same architecture as mBERT with its weights randomized. For these experiments, we set the maximum rank of the probe model to 768 to maximally extract the relevant information encoded in the representations.

Results
The 7th and 8th layers of the model are most effective in encoding grammatical relations (Figure 7). For word classes, performance is relatively consistent from the 3rd to the 9th layer, except for some low-resource languages like Marathi (mr) and Tamil (ta) (Figure 6). We thus take the 7th layer for our experiments. (As the performance of RAND is approximately equal across different layers, we consistently select the 7th layer for our analysis.) The comparison with the two baselines (Figure 8 and Figure 9) supports the efficacy of our method in deriving prototypes of structural concepts while reflecting the geometry of LMs. Moreover, we also observe disparities between languages reflected in the classification of structural concepts. As shown in Figure 6 and Figure 7, the performance on low-resource languages like Marathi and Tamil consistently lags behind, indicating insufficient representation of these languages in LLMs.

A.2 Alignability between Structural Concepts in Different Languages
Details of evaluation We use RSA and Procrustes analysis (PA) to measure the alignability between structural concepts in different languages.
The RSA between two languages is evaluated through the Spearman's rank correlation between the lower-diagonal portions of their dissimilarity matrices. The fitness of the linear transformation derived from PA is evaluated through the average proportion of explained variance.
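Both measures can be sketched as follows with SciPy (the cosine dissimilarity, the toy prototypes, and the function names are our own assumptions for illustration, not the paper's code):

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa(protos_l1, protos_l2):
    """Spearman correlation between the off-diagonal (lower-triangular)
    entries of the two languages' concept dissimilarity matrices."""
    d1 = pdist(protos_l1, metric="cosine")  # condensed pairwise distances
    d2 = pdist(protos_l2, metric="cosine")
    rho, _ = spearmanr(d1, d2)
    return rho

def procrustes_fit(protos_l1, protos_l2):
    """Fit an orthogonal map from L1 prototypes to L2 prototypes and
    return the proportion of variance in L2 it explains."""
    x = protos_l1 - protos_l1.mean(axis=0)
    y = protos_l2 - protos_l2.mean(axis=0)
    r, _ = orthogonal_procrustes(x, y)
    return 1.0 - np.sum((x @ r - y) ** 2) / np.sum(y ** 2)

rng = np.random.default_rng(0)
protos_en = rng.normal(size=(17, 64))  # e.g., 17 word-class prototypes
q, _ = np.linalg.qr(rng.normal(size=(64, 64)))
protos_fr = protos_en @ q  # a rotated copy: perfectly alignable

print(rsa(protos_en, protos_fr))             # ~1.0: shared geometry
print(procrustes_fit(protos_en, protos_fr))  # ~1.0: fully explained
```

The rotated copy is the ideal case; between real languages both scores fall below 1 but remain well above the random baselines.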

Details of baselines
Given the prototypes for K structural concepts derived from two languages L1 and L2, we construct the following three baselines: (i) RP, which randomly swaps each prototype for another in one of the two languages; (ii) RC, where we randomly select a sample of each concept instead of its prototype in one language; (iii) RS, where we randomly select K samples in one language. Each baseline thus creates a different mapping between the two languages, and we test 100 random mappings per baseline. We employ the Wilcoxon test to assess whether the alignability between a language pair in LLMs, computed based on our method, is significantly higher than these baselines.
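The sketch below illustrates the RP baseline and the significance test with a toy alignability score (the score function and the synthetic prototypes are our own stand-ins; the paper compares RSA/PA scores instead):

```python
import numpy as np
from scipy.stats import wilcoxon

def rp_baseline(protos, rng):
    """RP: randomly permute one language's prototypes, breaking the
    concept-to-concept correspondence."""
    return protos[rng.permutation(len(protos))]

rng = np.random.default_rng(0)
K, dim = 17, 64
protos_l1 = rng.normal(size=(K, dim))
protos_l2 = protos_l1 + 0.1 * rng.normal(size=(K, dim))  # well aligned

def alignability(a, b):
    # Toy stand-in for RSA/PA: higher when matched rows are close.
    return -np.mean(np.linalg.norm(a - b, axis=1))

true_score = alignability(protos_l1, protos_l2)
baseline_scores = np.array(
    [alignability(protos_l1, rp_baseline(protos_l2, rng)) for _ in range(100)]
)

# One-sided Wilcoxon signed-rank test: is the true alignability
# significantly higher than the 100 random-mapping scores?
stat, p = wilcoxon(baseline_scores - true_score, alternative="less")
print(p < 0.001)
```

RC and RS differ only in how the shuffled side is built (a random sample per concept, or K random samples).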

Results
The alignability between all 43 languages in mBERT with regard to word classes is shown in Figure 10 (RSA) and Figure 11 (PA). Figure 12 (RSA) and Figure 13 (PA) show the results for grammatical relations in mBERT. In terms of LLaMA, the results for word classes are shown in Section 2.3, and the results for grammatical relations are depicted in Figure 14 (RSA) and Figure 15 (PA). The alignability in both mBERT and LLaMA is significantly higher than the baselines, with p < 0.001 for almost all language pairs. The only exceptions arise when comparing the alignability with RC, where the samples are randomly taken from the same class (Table 3).

B Additional Materials for Aligning Conceptual Correspondence for Cross-Lingual Generalization
For POS tagging, Table 6 shows our results on 30 low-resource languages compared with UDAPTER. Results on other languages are shown in Table 7. The results for the classification of grammatical relations are shown in Figure 16, Table 8 and Table 9.
Additionally, we present the average accuracy and standard deviation for both low-resource and high-resource languages, as shown in Table 4 and Table 5. With an increasing number of available examples, our approach demonstrates consistent improvements and helps mitigate the performance gap across diverse languages. While our method lags behind UDAPTER on high-resource languages, it demonstrates increasing performance on low-resource ones when provided with additional examples. Moreover, the gap between different languages becomes smaller, especially for M28, which does not involve any low-resource languages in meta-training.
Languages for meta-training We employ five languages for meta-training: bg, en, fi, fr, and ru. The other languages and datasets used in our experiments are detailed in Appendix E.

C.2 Full Results
Figure 17 presents the full results for our meta-learning-based methods across 24 languages.
Deriving structural concepts from LLMs For all experiments that derive structural concepts from LLMs through a linear transformation, we train the linear probe with a batch size of 8 and a maximum sequence length of 128 for 20 epochs, validating it at the end of each epoch. We select the model performing best on the development set. We use the Adam optimizer with β1 = 0.9, β2 = 0.999, and a weight decay of 1 × 10^−6. The learning rate is set to 1 × 10^−4.
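To make the optimizer settings concrete, here is a minimal NumPy stand-in that trains a linear softmax probe with Adam and (decoupled) weight decay; the function, the toy data, and the larger learning rate used in the demo run are our own illustration, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_linear_probe(X, y, n_classes, lr=1e-4, wd=1e-6,
                       beta1=0.9, beta2=0.999, epochs=20, eps=1e-8):
    """Full-batch Adam on a linear probe W (the paper trains with
    mini-batches of 8 and selects the best epoch on a dev set)."""
    rng = np.random.default_rng(0)
    W = 0.01 * rng.normal(size=(X.shape[1], n_classes))
    m, v = np.zeros_like(W), np.zeros_like(W)
    onehot = np.eye(n_classes)[y]
    for t in range(1, epochs + 1):
        g = X.T @ (softmax(X @ W) - onehot) / len(X)  # cross-entropy grad
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        mhat, vhat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
        W -= lr * (mhat / (np.sqrt(vhat) + eps) + wd * W)
    return W

# Toy demo on two well-separated classes (larger lr and more steps
# than the paper's settings, purely so the toy run converges).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(3, 1, (50, 8)), rng.normal(-3, 1, (50, 8))])
y = np.array([0] * 50 + [1] * 50)
W = train_linear_probe(X, y, n_classes=2, lr=0.1, epochs=200)
acc = (np.argmax(X @ W, axis=1) == y).mean()
```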

Learning to align conceptual correspondence
Our meta-learning-based method follows the procedure described above to derive prototypes for each concept. The networks are then trained for 50 epochs, with a maximum sequence length of 128. During meta-training, given m languages, each epoch consists of m × 50 training episodes. These episodes are constructed using N labeled sentences as the support set and 30 labeled sentences as the query set. The method we use here bears resemblance to Li et al. (2022b). The parameters of the network are optimized with Adam, with β1 = 0.9, β2 = 0.999, and a weight decay of 1 × 10^−4. The learning rate is set to 5 × 10^−5. The hidden-layer dropout probability is 0.33. Note that as the conceptual correspondence holds across languages for concepts with at least 20 samples (Section 2.3), concepts with fewer than 20 samples in either the source or target language are excluded during meta-training. During meta-testing, examples belonging to these categories are considered misclassified.
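The episode construction and the prototype-based classification it feeds can be sketched as follows (simplified to example-level sampling over precomputed features; the names and toy data are our own, not the paper's code):

```python
import numpy as np

def make_episode(feats, labels, n_support, n_query, rng):
    """Split labeled examples into disjoint support and query sets."""
    idx = rng.permutation(len(feats))
    s, q = idx[:n_support], idx[n_support:n_support + n_query]
    return (feats[s], labels[s]), (feats[q], labels[q])

def proto_classify(support_x, support_y, query_x):
    """Nearest-prototype classification in the style of prototypical
    networks: each class prototype is the mean of its support features."""
    classes = np.unique(support_y)
    protos = np.stack([support_x[support_y == c].mean(axis=0)
                       for c in classes])
    dists = np.linalg.norm(query_x[:, None] - protos[None], axis=-1)
    return classes[dists.argmin(axis=1)]

# Toy episode: three well-separated "concepts" in a 16-dim space.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(4 * c, 1, (40, 16)) for c in range(3)])
labels = np.repeat(np.arange(3), 40)
(sx, sy), (qx, qy) = make_episode(feats, labels, 60, 30, rng)
acc = (proto_classify(sx, sy, qx) == qy).mean()
```

In the actual method the features pass through a trained alignment network before the distance computation, and episodes are sampled per language.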
Aligning conceptual correspondence during in-context learning For meta-learning during in-context learning, our networks are trained for 100 epochs, each consisting of m × 10 episodes, where m = 5 is the number of languages involved in training. The parameters of the network are optimized with Adam, with β1 = 0.9, β2 = 0.999, and a weight decay of 1 × 10^−4. The learning rate is set to 5 × 10^−4. The hidden-layer dropout probability is 0.33.

Table 6: The zero-shot and few-shot generalization performance on POS tagging for 30 low-resource languages. Languages marked with "*" are not included in the pretraining corpus. "†" indicates that the language is involved in meta-training for M43. The UD dataset for some languages comprises fewer than 100 or 200 sentences, and the few-shot performance for these languages is left unspecified.

E Data
The data used in all our experiments is from UD v2.10. We follow its split into training, development, and test sets.

Figure 1: Alignability between structural concepts (word classes) in different languages measured by RSA and Procrustes analysis, which is significantly higher than baselines. (The results for all languages, along with the alignability between grammatical relations, are presented in Appendix A.2.)

Figure 3: Accuracy of M28 in identifying word classes in languages used during meta-training (depicted by bluish lines) and novel languages encountered during meta-testing (depicted by reddish lines) given N sentences. (The results for the classification of grammatical relations are shown in Appendix B.)

Figure 4: Sequence tagging via structured prompting. The classification of structural concepts is performed in a sequential manner.

Figure 5: The few-shot generalization performance on POS tagging in the monolingual (MONO) and cross-lingual (EN) in-context learning settings, with English as the source language. META denotes our meta-learning-based method. Error bars represent the standard deviation calculated from 10 runs. Languages marked with "*" are not included in the pretraining corpus. "†" indicates that the language is involved in meta-training for META. (The results for all languages are presented in Appendix C.2.)

Figure 7: Accuracy in identifying grammatical relations of different languages across different layers of mBERT.

Figure 8: The distribution of the accuracy in deriving word classes from the 7th layer of mBERT, along with two baselines. The x-axis denotes the accuracy, whose distribution is derived from the results in 43 languages. The Wilcoxon test shows that the 7th layer exhibits a significantly higher performance (W = 0.0, p = 2.27 × 10^−13).

Figure 9: The distribution of the accuracy in deriving grammatical relations from the 7th layer of mBERT, along with two baselines. The x-axis denotes the accuracy, whose distribution is derived from the results in 43 languages. The Wilcoxon test shows that the 7th layer exhibits a significantly higher performance (W = 0.0, p = 2.27 × 10^−13).

Figure 10: The alignability between word classes within different languages in mBERT measured by RSA.

Figure 11: The alignability between word classes within different languages in mBERT measured by Procrustes analysis.

Figure 12: The alignability between grammatical relations within different languages in mBERT measured by RSA.

Figure 13: The alignability between grammatical relations within different languages in mBERT measured by Procrustes analysis.

Figure 14: The alignability between grammatical relations within different languages in LLaMA measured by RSA.

Figure 15: The alignability between grammatical relations within different languages in LLaMA measured by Procrustes analysis.

Figure 16: Results for the classification of grammatical relations.

Figure 17: The few-shot generalization performance on POS tagging in the monolingual (MONO) and cross-lingual (EN) in-context learning settings, with English as the source language. META denotes our meta-learning-based method. Error bars represent the standard deviation calculated from 10 runs. Languages marked with "*" are not included in the pretraining corpus. "†" indicates that the language is involved in meta-training for META.

Table 2: Accuracy in POS tagging in English (en), with the demonstrations taken as query sentences. ICL denotes the performance achieved through in-context learning, and Proto the classification based on prototypes computed with our method.

Table 3: Exceptions where the alignability between the language pair does not reach the significance threshold of p < 0.001.

Table 4: The average accuracy (AVG) and standard deviation (STD) with regard to POS tagging measured across the languages used in UDAPTER, where the low-resource languages (LR) are the 30 languages listed in Table 6 and the high-resource languages (HR) include: ar, en, eu, fi, he, hi, it, ja, ko, ru, sv, tr and zh. AVG and STD are computed over all languages. While our method lags behind UDAPTER on high-resource languages, it demonstrates increasing performance on low-resource ones when provided with additional examples. Moreover, the gap between different languages becomes smaller, especially for M28, which does not involve any low-resource languages in meta-training.

Table 5: The average accuracy (AVG) and standard deviation (STD) with regard to the classification of grammatical relations measured across different languages, where the low-resource languages (LR) are the 30 languages listed in Table 6 and the high-resource languages (HR) include: ar, en, eu, fi, he, hi, it, ja, ko, ru, sv, tr and zh. AVG and STD are computed over all languages. Our method demonstrates increasing performance on low-resource languages when provided with additional examples. Moreover, the gap between different languages becomes smaller, especially for M28, which does not involve any low-resource languages in meta-training.