Improving Cross-Lingual Transfer through Subtree-Aware Word Reordering

Despite the impressive growth of the abilities of multilingual language models, such as XLM-R and mT5, it has been shown that they still face difficulties when tackling typologically-distant languages, particularly in the low-resource setting. One obstacle for effective cross-lingual transfer is variability in word-order patterns. It can potentially be mitigated via source- or target-side word reordering, and numerous approaches to reordering have been proposed. However, they rely on language-specific rules, work on the level of POS tags, or only target the main clause, leaving subordinate clauses intact. To address these limitations, we present a powerful new reordering method, defined in terms of Universal Dependencies, that is able to learn fine-grained word-order patterns conditioned on the syntactic context from a small amount of annotated data and can be applied at all levels of the syntactic tree. We conduct experiments on a diverse set of tasks and show that our method consistently outperforms strong baselines over different language pairs and model architectures. This performance advantage holds true in both zero-shot and few-shot scenarios.


Introduction
Recent multilingual pre-trained language models (LMs), such as mBERT (Devlin et al., 2019), XLM-RoBERTa (Conneau et al., 2020), mBART (Liu et al., 2020b), and mT5 (Xue et al., 2021), have shown impressive cross-lingual ability, enabling effective transfer in a wide range of cross-lingual natural language processing tasks. However, even the most advanced LLMs are not effective when dealing with less-represented languages, as shown by recent studies (Ruder et al., 2023; Asai et al., 2023; Ahuja et al., 2023). Furthermore, annotating sufficient training data in these languages is not a feasible task,¹ and as a result speakers of underrepresented languages are unable to reap the benefits of modern NLP capabilities (Joshi et al., 2020).

¹ Code available at https://github.com/OfirArviv/ud-based-word-reordering
Numerous studies have shown that a key challenge for cross-lingual transfer is the divergence in word order between different languages, which often causes a significant drop in performance (Rasooli and Collins, 2017; Wang and Eisner, 2018; Ahmad et al., 2019; Liu et al., 2020a; Ji et al., 2021; Nikolaev and Pado, 2022; Samardžić et al., 2022). This is unsurprising, given the complex and interdependent nature of word order (e.g., verb-final languages tend to have postpositions instead of prepositions and place relative clauses before the nominal phrases that they modify, while SVO and VSO languages prefer prepositions and postposed relative clauses; see Dryer 1992) and the way it is coupled with the presentation of novel information in sentences (Hawkins, 1992). This is especially true for the majority of underrepresented languages, which demonstrate word-order preferences distinct from those of English and other well-resourced languages.
Motivated by this, we present a reordering method that is applicable to any language pair, can be efficiently trained even on a small amount of data, operates at all levels of the syntactic tree, and is powerful enough to boost the performance of modern multilingual LMs. The method, defined in terms of Universal Dependencies (UD), is based on pairwise constraints regulating the linear order of subtrees that share a common parent, which we term POCs, for "pairwise ordering constraints".
We estimate these constraints based on the probability that two subtree labels will appear in one order or the other when their parent has a given label. Thus, in terms of UD, we expect, e.g., languages that use pre-nominal adjectival modification to assign a high probability to amods preceding their headword, while languages with post-nominal adjectival modification are expected to assign a high probability to the other direction.
The estimated POCs are fed as constraints to an SMT solver³ to produce a general reordering algorithm that can effect reordering of all types of syntactic structures. In addition to being effective, POCs are interpretable and provide a detailed characterization of typical word-order patterns in different languages, shedding light on the effect of word order on cross-lingual transfer.
We evaluate our method on three cross-lingual tasks (dependency parsing, task-oriented semantic parsing, and relation classification) in the zero-shot setting. Such a setting is practically useful (see, e.g., Ammar et al. 2016; Schuster et al. 2019; Wang et al. 2019; Xu and Koehn 2021 for successful examples of employing zero-shot learning cross-lingually) and minimizes the risk of introducing confounds into the analysis.
We further evaluate our method in the scarce-data scenario on the semantic parsing task. This scenario is more realistic, as in many cases it is feasible to annotate small amounts of data in specific languages (Ruder et al., 2023).
Experiments show that our method consistently yields a noticeable performance gain over the baselines across different language pairs and model architectures, in both the zero-shot and few-shot scenarios. This suggests that despite recent advances, even strong multilingual models still face difficulties in handling cross-lingual word-order divergences, and that reordering algorithms such as ours can provide a much needed boost in performance in low-resource languages.
Additionally, we investigate the relative effectiveness of our reordering algorithm on two types of neural architectures: encoder-decoder (seq2seq) models vs. a classification head stacked on top of a pre-trained encoder. Our findings show that the encoder-decoder architecture underperforms in cross-lingual transfer and benefits more strongly from reordering, suggesting that it may struggle with projecting patterns over word-order divergences.
The structure of the paper is as follows: Section 2 surveys related work. The proposed approach is introduced in Section 3. Section 4 describes the setup for our zero-shot and few-shot experiments, the results of which are presented in Section 5. Section 6 investigates the comparative performance of encoder-based and sequence-to-sequence models, and Section 7 concludes the paper.

³ An extension of the SAT solver that can, among other things, include mathematical predicates such as + and < in its constraints and assign integer values to variables.

Related Work
A major challenge for cross-lingual transfer stems from word-order differences between the source and target language. This challenge has been the subject of many previous works (e.g., Ahmad et al., 2019; Nikolaev and Pado, 2022), and numerous approaches to overcoming it have been proposed.
One of the major approaches of this type is reordering, i.e., rearranging the word order in the source sentences to make them more similar to the target ones, or vice versa. Early approaches, mainly in phrase-based statistical machine translation, relied on hand-written rules (Collins et al., 2005), while later attempts were made to extract reordering rules automatically from parallel corpora by minimizing the number of crossing word alignments (Genzel, 2010; Hitschler et al., 2016).
More recent works focusing on reordering relied on statistics of various linguistic properties such as POS tags (Wang and Eisner, 2016, 2018; Liu et al., 2020a) and syntactic relations (Rasooli and Collins, 2019). Such statistics can be taken from typological datasets such as WALS (Meng et al., 2019) or extracted from large corpora (Aufrant et al., 2016).
Other works proposed architectural changes to the models. Thus, Zhang et al. (2017a) incorporated distortion models into attention-based NMT systems, while Chen et al. (2019) proposed learning reordering embeddings as part of Transformer-based translation systems. More recently, Ji et al. (2021) trained a reordering module as a component of a parsing model to improve cross-lingual structured prediction. Meng et al. (2019) suggested changes to the inference mechanism of graph parsers by incorporating target-language-specific constraints in inference.
Our work is in line with the proposed solutions based on source-sentence reordering, namely treebank reordering, which rearrange the word order of source sentences by linearly permuting the nodes of their dependency-parse trees. Aufrant et al. (2016) and Wang and Eisner (2018) suggested permuting existing dependency treebanks to make their surface POS-sequence statistics close to those of the target language, in order to improve the performance of delexicalized dependency parsers in the zero-shot scenario. While some improvements were reported, these approaches rely on short POS n-grams and do not capture many important patterns.⁴ Liu et al. (2020a) proposed a similar method but used a POS-based language model, trained on a target-language corpus, to guide their algorithm. This gave them the ability to capture more complex statistics, but relying on black-box learned models renders their method difficult to interpret.
Rasooli and Collins (2019) proposed a reordering algorithm based on UD, specifically on the dominant dependency direction in the target language, leveraging the rich syntactic information the annotation provides. Their method, however, exploits only a small part of UD's richness compared to ours.
We note that previous work on treebank reordering usually evaluated the proposed methods only on UD parsing, using delexicalized models or simple manually aligned cross-lingual word embeddings, which limited the scope of the analysis. In this paper, we experiment with two additional tasks that are not reducible to syntactic parsing: relation classification and semantic parsing. We further extend previous work by using modern multilingual LMs and experimenting with different architectures.

Approach
Given a sentence s = s_1, s_2, ..., s_n in source language L_s, we aim to permute its words to mimic the word order of a target language L_t. Similarly to previous works (Wang and Eisner, 2018; Liu et al., 2020a), we make the assumption that a contiguous subsequence that forms a constituent in the original sentence should remain a contiguous subsequence after reordering, while the inner order of words in it may change. This prevents subtrees from losing their semantic coherence and is also vital when dealing with tasks such as relation extraction (RE), where some of the subsequences must stay intact in order for the annotation to remain valid. Concretely, instead of permuting the words of sentence s, we permute the subtrees of its UD parse tree, thus keeping the subsequences of s, as defined by the parse-tree structure, intact.

We define a set of language-specific constraints based on the notion of pairwise ordering distributions: the tendency of words with specific UD labels to be linearly ordered before words with other specific labels, conditioned on the type of subtree they appear in. To implement a reordering algorithm, we use these constraints as input to an SMT solver.

⁴ Aufrant et al. (2016) further experimented with manually crafting permutation rules using typological data on POS sequences from WALS. This approach is less demanding in terms of data but is more labor-intensive and does not lead to better performance.

Pairwise Ordering Distributions
Let T(s) be the Universal Dependencies parse tree of sentence s in language L, and π = (π_1, ..., π_n) the set of all UD labels. We denote the pairwise ordering distribution (POD) in language L of two UD nodes with dependency labels π_i, π_j, in a subtree with root label π_k, by

P_{π_k, π_i, π_j} = p,

where p is the probability of a node with label π_i being linearly ordered before a node with label π_j in a subtree whose root has label π_k. Note that being linearly ordered before a node with index i means having an index j < i, and that both nodes are direct children of the subtree root. We include a copy of the root node in the computation as one of its own children. Thus we can distinguish between a node acting as a representative of its subtree and the same node acting as the head of that subtree.
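As a concrete illustration, the PODs can be estimated by simple counting over parsed trees. The sketch below is our own (the tree representation and helper names are illustrative, not from the paper's released code): for every head, it counts how often a child labeled π_i precedes a sibling labeled π_j, treating a copy of the head (under the pseudo-label 'root') as one of its own children.

```python
from collections import defaultdict

def pod_probabilities(trees):
    """Estimate P_{pi_k, pi_i, pi_j}: the probability that, under a head whose
    own dependency label is pi_k, a child labeled pi_i linearly precedes a
    sibling labeled pi_j. A copy of the head (pseudo-label 'root') is counted
    as one of its own children, so head-dependent order is captured too."""
    counts = defaultdict(int)

    def visit(node):
        siblings = [(node["pos"], "root")] + \
                   [(c["pos"], c["label"]) for c in node["children"]]
        for p1, l1 in siblings:
            for p2, l2 in siblings:
                if p1 < p2:  # l1 linearly precedes l2 under this head
                    counts[(node["label"], l1, l2)] += 1
        for c in node["children"]:
            visit(c)

    for t in trees:
        visit(t)
    # Normalize each ordered pair against its reverse.
    return {(k, i, j): n / (n + counts.get((k, j, i), 0))
            for (k, i, j), n in counts.items()}

# Toy parse of "The cat saw a dog" (positions are word indices):
tree = {"label": "root", "pos": 2, "children": [
    {"label": "nsubj", "pos": 1, "children": [
        {"label": "det", "pos": 0, "children": []}]},
    {"label": "obj", "pos": 4, "children": [
        {"label": "det", "pos": 3, "children": []}]},
]}
probs = pod_probabilities([tree])
print(probs[("nsubj", "det", "root")])  # 1.0: determiners precede their head
```

In English, determiners always precede their noun heads, so even this single toy tree yields a probability of 1.0 for det preceding the head of an nsubj subtree.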

Pairwise Ordering Constraints and Reordering
Given the pairwise ordering distribution of the target language L_t, denoted dist_{L_t} = P, we define a set of pairwise constraints based on it. Concretely, for dependency labels π_k, π_i, π_j, we define the constraint

π_k : (π_i < π_j) = 1 if P_{π_k, π_i, π_j} > 0.5, and 0 otherwise,

where π_k : (π_i < π_j) = 1 indicates that a node n with dependency label π_i should be linearly ordered before a node n′ with dependency label π_j if they are direct children of a node with label π_k. Using these constraints, we recursively reorder the tokens according to the parse tree T(s) in the following way. For each subtree T_i ∈ T(s) with UD label π_j and children n_1, n_2, ..., n_m with UD labels n_{1π}, n_{2π}, ..., n_{mπ}:

1. We extract the pairwise constraints that apply to T_i based on the UD labels of its root and children.
2. We feed the pairwise constraints to the SMT solver and use it to compute a legal ordering of the UD labels, i.e., an order that satisfies all the constraints.
3. If there is such an ordering, we reorder the nodes in T_i accordingly. Otherwise, we revert to the original order.
4. We proceed recursively, top-down, for every subtree in T(s), until all of T(s) is reordered to match dist_{L_t}.
For example, assume the constraints nsubj → root, obj → root, and obl → root for the main clause and root → case inside obl subtrees, corresponding to a typical SOV language with postpositions, and assume that the target language imposes no constraints on determiners. Under these constraints, the subject and objects of an English sentence are moved before the verb, and its prepositions turn into postpositions.
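Since the extracted constraints are purely pairwise precedence relations, the per-subtree ordering step can be sketched without a full SMT solver: a greedy topological sort over the children, falling back to the original order on conflicts (as in step 3 above), is enough for illustration. The code below is our own minimal sketch, not the paper's implementation; the tree and constraint representations are assumptions.

```python
def reorder(node, constraints):
    """Recursively linearize a UD subtree under pairwise ordering constraints.

    `constraints` maps a head's dependency label to a set of (before, after)
    label pairs over its children; the head itself participates under the
    pseudo-label 'root'. On conflicting (cyclic) constraints we fall back to
    the original order."""
    items = [("root", [node["word"]], node["pos"])]
    items += [(c["label"], reorder(c, constraints), c["pos"])
              for c in node["children"]]
    items.sort(key=lambda it: it[2])              # original linear order
    pairs = constraints.get(node["label"], set())
    out, remaining = [], items[:]
    while remaining:
        # Pick the first item (in original order) with no pending predecessor.
        free = next((it for it in remaining
                     if not any((other[0], it[0]) in pairs
                                for other in remaining if other is not it)),
                    None)
        if free is None:                          # cycle: keep original order
            for _, tokens, _ in remaining:
                out.extend(tokens)
            break
        remaining.remove(free)
        out.extend(free[1])
    return out

# Toy parse of "the cat sat on the mat", reordered with the SOV-style
# constraints from the running example (determiners unconstrained):
tree = {"word": "sat", "pos": 2, "label": "root", "children": [
    {"word": "cat", "pos": 1, "label": "nsubj", "children": [
        {"word": "the", "pos": 0, "label": "det", "children": []}]},
    {"word": "mat", "pos": 5, "label": "obl", "children": [
        {"word": "on", "pos": 3, "label": "case", "children": []},
        {"word": "the", "pos": 4, "label": "det", "children": []}]},
]}
constraints = {
    "root": {("nsubj", "root"), ("obj", "root"), ("obl", "root")},
    "obl": {("root", "case")},
}
result = reorder(tree, constraints)
print(" ".join(result))  # the cat the mat on sat
```

The oblique phrase moves before the verb and its preposition becomes a postposition ("the mat on sat"), while the unconstrained determiners keep their original positions relative to their heads.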

Estimating the Pairwise Ordering Constraints
In this section we describe two possible methods for estimating the POCs of a language: one relying on the availability of a UD corpus in the target language and one relying on the Bible Corpus.
Using the UD Treebank. The first method we use to estimate POCs is to extract them from the corresponding empirical PODs in a UD treebank.
When there are multiple treebanks for a language, we select one of them as a representative treebank. We use v2.10 of the Universal Dependencies dataset, which contains treebanks for over 100 languages.
Estimating POCs without a Treebank. While the UD treebank collection is vast, there are still hundreds of widely spoken languages missing from it. The coverage of our method can be improved by using annotation projection (Agić et al., 2016) on a massively parallel corpus, such as the Bible Corpus (McCarthy et al., 2020). Approximate POCs can then be extracted from the projected UD trees. While we do not experiment with this setting in this work due to resource limitations, we mention it as a promising avenue for future work, building on Rasooli and Collins (2019), who successfully used this approach to extract UD statistics and utilized them in their reordering algorithm on top of annotation projection.

Experimental Setup
We evaluate our reordering algorithm on three tasks (UD parsing, task-oriented semantic parsing, and relation extraction) and over 13 different target languages, with English as the source language. For each task and target language, we compare the performance of a model trained on the vanilla English dataset against that of a model trained on a transformed (reordered) version of the dataset, using the target-language test set in a zero-shot fashion.
We explore two settings: STANDARD, where we reorder the English dataset according to the target-language POCs and use it for training, and ENSEMBLE, where we train our models on both the vanilla and the reordered English datasets. The main motivation for the latter is that any reordering algorithm is bound to add noise to the data. First, the underlying multilingual LMs were trained on the standard word order of English, and feeding them English sentences in an unnatural word order will likely produce sub-optimal representations. Secondly, reordering algorithms rely on surface statistics, which, while rich, are a product of statistical estimation and thus imperfect. Lastly, the use of hard constraints may not be justified for target languages with highly flexible word order.¹⁰ The ENSEMBLE setting mitigates these issues and improves the "signal-to-noise ratio" of the approach.
We use the vanilla multilingual models and the reordering algorithm of Rasooli and Collins (2019) as baselines. To the best of our knowledge, Rasooli and Collins proposed the most recent preprocessing reordering algorithm that also relies on UD annotation. We re-implement the algorithm and use it in the same settings as our approach.
Lastly, we evaluate our method in the scarce-data setting. We additionally train the models fine-tuned on the vanilla and reordered English datasets on a small number of examples in the target language and record their performance. Due to the large number of experiments required, we conduct this experiment using only our method, in the context of the semantic-parsing task, which is the most challenging one in our benchmark (Asai et al., 2023; Ruder et al., 2023), on the mT5 model, and in the ENSEMBLE setting.

Estimating the POCs
We estimate the POCs (§3.2) by extracting the empirical distributions from UD treebanks (see §3.3). While this requires the availability of an external data source in the form of a UD treebank in the target language, we argue that for tasks other than UD parsing this is a reasonable setting, as UD corpora are available for a wide variety of languages. Furthermore, we experiment with various treebank sizes, including ones with as few as 1000 sentences. Further experimentation with even smaller treebanks is deferred to future work. Appendix B lists the treebanks used and their sizes.

Evaluation Tasks
In this section, we describe the tasks we use for evaluation, the models we use for performing the tasks, and the datasets for training and evaluation. All datasets, other than the manually annotated UD corpora, are tokenized and parsed using Trankit (Nguyen et al., 2021). Some datasets contain subsequences that must stay intact in order for the annotation to remain valid (e.g., a proper-name sequence such as The New York Times may have internal structure but cannot be reordered). In cases where these subsequences are not part of a single subtree, we manually alter the tree to make them so. Such cases mostly arise due to parsing errors and are very rare. The hyper-parameters for all the models are given in Appendix D.

¹⁰ Even for such languages, the algorithm selects a single ordering as the "correct" one. If this ordering contradicts the original English one, it will be both nearly 50% incorrect and highly unnatural for the encoder. Ensembling thus ensures that the effect of estimation errors is bounded.

UD Parsing
Dataset. We use v2.10 of the UD dataset. For training, we use the UD English-EWT corpus with the standard splits. For evaluation, we use the PUD corpora of French, German, Korean, Spanish, Thai, and Hindi, as well as the Persian-Seraji, Arabic-PADT, and Irish-TwittIrish treebanks.
We note that our results are not directly comparable to the vanilla baseline because our model has indirect access to a labeled target dataset, which is used to estimate the POCs. This issue is less of a worry in the other tasks, which are not defined in terms of UD. We further note that we do not use the same dataset for extracting the information about the target language and for testing the method.

Model. We use the AllenNLP (Gardner et al., 2018) implementation of the deep biaffine attention graph-based model of Dozat and Manning (2016). We replace the trainable GloVe embeddings and the BiLSTM encoder with XLM-RoBERTa-large (Conneau et al., 2020). Finally, we do not use gold (or any) POS tags. We report the standard labeled and unlabeled attachment scores (LAS and UAS), averaged over 5 runs.

Task-oriented Semantic Parsing
Datasets. We use the MTOP (Li et al., 2021) and Multilingual TOP (Xia and Monti, 2021) datasets. MTOP covers 6 languages (English, Spanish, French, German, Hindi, and Thai) across 11 domains. In our experiments, we use the decoupled representation of the dataset, which removes all the text that does not appear in a leaf slot. This representation is less dependent on word-order constraints and thus poses a greater challenge to reordering algorithms. The Multilingual TOP dataset contains examples in English, Italian, and Japanese and is based on the TOP dataset (Gupta et al., 2018). Similarly to MTOP, this dataset uses the decoupled representation. Both datasets are formulated as a seq2seq task. We use the standard splits for training and evaluation.
Models. We use two seq2seq models in our evaluation: a pointer-generator network (Rongali et al., 2020) and mT5 (Xue et al., 2021). The pointer-generator network was used in previous work on these datasets (Xia and Monti, 2021; Li et al., 2021); it includes XLM-RoBERTa-large (Conneau et al., 2020) as the encoder and an uninitialized Transformer as the decoder. In this model, the target sequence is comprised of ontology tokens, such as [IN:SEND_MESSAGE in the MTOP dataset, and pointer tokens representing tokens from the source sequence (e.g., ptr0, which represents the first source-side token). When using mT5, we use the actual tokens rather than the pointer tokens, as mT5 has a copy mechanism built in, which the model can exploit. For both models, we report the standard exact-match (EM) metric, averaged over 10 runs for the pointer-generator model and 5 runs for mT5.

Relation Classification
Datasets. We use two sets of relation-extraction datasets: (i) TACRED (Zhang et al., 2017b) (TAC) and Translated TACRED (Arviv et al., 2021) (Trans-TAC), and (ii) IndoRE (Nag et al., 2021). TAC is a relation-extraction dataset with over 100K examples in English, covering 41 relation types. Trans-TAC contains 533 parallel examples sampled from TAC and translated into Russian and Korean. We use the TAC English dataset for training and Trans-TAC for evaluation. As the TAC train split is too large for efficient training, we only use the first 30K examples. IndoRE contains 21K sentences in Indian languages (Bengali, Hindi, and Telugu) plus English, covering 51 relation types. We use the English portion of the dataset for training and the Hindi and Telugu portions for evaluation.

Model. We use the relation-classification part of the LUKE model (Yamada et al., 2020). The model uses two special tokens to represent the head and the tail entities in the text. The text is fed into an encoder, and the task is solved using a linear classifier trained on the concatenated representation of the head and tail entities. For consistency, we use XLM-RoBERTa-large (Conneau et al., 2020) as the encoder. We report the micro-F1 and macro-F1 metrics, averaged over 5 runs.

Results and Discussion
The results on the various tasks, namely UD parsing, semantic parsing, and relation classification, are presented in Tables 1, 2, and 3, respectively. The few-shot experiment results are in Table 4. Standard deviations are reported in Appendix E.
In UD parsing, the ENSEMBLE setting yields noticeable improvements for the languages that are more typologically distant from English (2.3-4.1 LAS points and 1.8-3.5 UAS points), with the exception of Arabic, where the scores slightly drop. No noticeable effect is observed for structurally closer languages.
In the STANDARD setting, a smaller increase in performance is present for most distant languages, with a decrease in performance for Persian and Arabic. This is in agreement with previous work showing that reordering algorithms are more beneficial when applied to structurally divergent language pairs (Wang and Eisner, 2018; Rasooli and Collins, 2019). The ENSEMBLE approach, therefore, seems to be essential for a generally applicable reordering algorithm.
The algorithm of Rasooli and Collins (2019) (RC19), in both settings, yields a smaller increase in performance for some typologically distant languages and no noticeable improvement for others, while sometimes harming the results. This suggests that for this task the surface statistics the algorithm uses are not enough, and a more fine-grained approach is needed.
In the semantic-parsing task, the reordering algorithm brings substantial improvements for all languages but Italian in the ENSEMBLE setting (a 2-6.1-point increase in exact match) for the RoBERTa-based model. Notably, the gains are achieved not only for typologically distant languages but also for languages close to English, such as French. On the MTOP dataset, the ENSEMBLE setting provides bigger gains than STANDARD for all languages. On Multilingual TOP, we surprisingly observe the opposite. Given that, in terms of word order, Japanese is comparable to Hindi and Italian to French, we tend to attribute this result to the peculiarities of the dataset. This, however, merits further analysis.
When compared to RC19, the proposed algorithm consistently outperforms it, by an average of about 2 points (in the ENSEMBLE setting).
For mT5, we observe an increase in performance in the ENSEMBLE setting of 2.5 and 5 points for Thai and Hindi, respectively. For the other languages, we do not observe a strong impact; we note, however, that for French and Spanish there is a slight drop in the score (less than 1 point). When compared to RC19, our method provides larger gains in Hindi and Thai.
In the few-shot scenario, we observe improved performance for all languages and sample sizes. Surprisingly, the improvements hold even when training on a sample as large as 500 examples, indicating that the model is not able to easily adapt to the target word order.
Lastly, in the relation-classification task, in the ENSEMBLE setting we observe an increase in performance for all languages (a 2.3-10.4-point increase in micro and macro F1). In the STANDARD setting, there is a drop in performance for Hindi and Telugu. Compared to RC19, our algorithm outperforms it in the ENSEMBLE setting by more than 5 points for Korean, but only by 0.5 points for Russian. For Hindi and Telugu, the performance of the two algorithms is close, and RC19 does perform better in some cases.

Comparison between Encoder-with-Classifier-Head and Seq2Seq Models
Past work has shown that the architecture is an important predictor of the ability of a given model to generalize over cross-lingual word-order divergences. For example, Ahmad et al. (2019) showed that models based on self-attention have better overall cross-lingual transferability to distant languages than those using RNN-based architectures.
One of the dominant trends in NLP in recent years has been the use of the sequence-to-sequence formulation to solve an increasing variety of tasks (Kale and Rastogi, 2020; Lewis et al., 2020). Despite that, recent studies (Finegan-Dollak et al., 2018; Keysers et al., 2019; Herzig and Berant, 2019) demonstrated that such models fail at compositional generalization, that is, they do not generalize to structures that were not seen at training time. Herzig and Berant (2021) showed that other model architectures can prove advantageous over the seq2seq architecture in this regard, but their work was limited to English.
Here, we take the first steps in examining the cross-lingual transfer capabilities of the seq2seq encoder-decoder architecture (S2S) vs. a classification head stacked over an encoder (E+C), focusing on their ability to bridge word-order divergences.

Experimental Setup
We compare the performance of an E+C model against an S2S one on the task of UD parsing over various target languages. As in §4, we train each model on the vanilla English dataset and compare it against a model trained on a version of the dataset reordered using our algorithm. We evaluate the models on the target-language test set in a zero-shot setting.
Dataset and POC Estimates. We use the same UD dataset and POCs as in §4. For the S2S task, we linearize the UD parse tree using the method of Li et al. (2018).
Models. For the E+C model, we use the deep biaffine attention graph-based model with XLM-RoBERTa-large as the encoder, as in §4.2.1. For the S2S model, we use the standard Transformer architecture with XLM-RoBERTa-large as the encoder and an uninitialized self-attention stack as the decoder. The hyper-parameters for the models are given in Appendix D.

Results and Discussion
The results for LAS (averaged over 5 runs), normalized by the base parser performance on the English test set, show that the S2S model transfers considerably worse than the E+C one, retaining less than 50% of the English performance on distant languages, despite relying on the same underlying multilingual LM. Furthermore, the S2S architecture benefits more strongly from reordering for distant languages (more than twice as much) compared to the E+C one. This suggests that the sequence-to-sequence architecture may be less effective in handling cross-lingual divergences, specifically word-order divergences, and may gain more from methods such as reordering.

Conclusion
We presented a novel pre-processing reordering approach defined in terms of Universal Dependencies. Experiments on three tasks and numerous architectures and target languages demonstrate that this method is able to boost the performance of modern multilingual LMs in both the zero-shot and few-shot settings. Our key contributions include: a new method for reordering sentences based on fine-grained word-order statistics, the pairwise ordering distributions; the use of an SMT solver to convert the learned constraints into a linear ordering; and a demonstration of the necessity of combining the reordered dataset with the original one (the ENSEMBLE setting) in order to consistently boost performance.
Our results suggest that despite recent improvements, multilingual models still face difficulties in handling cross-lingual word-order divergences, and that reordering algorithms such as ours can provide a much needed boost in performance in low-resource languages. This result holds even in the few-shot scenario, when the model is trained on a few hundred target-language examples, underscoring how hard it is for models to adapt to varying word orders and pointing to the need for more typologically diverse data, additional inductive bias at training time, or pre-processing approaches such as ours. Furthermore, our experiments suggest that seq2seq encoder-decoder architectures suffer from these difficulties to a greater extent than more traditional modular ones.
Future work will include, firstly, addressing the limitations of the proposed approach in order to make it less language-pair-dependent and to reduce the computational and storage overhead. Secondly, we plan to leverage the POCs to compute the word-order distance between languages in a rich, rigorous, corpus-based way, to predict more precisely when the reordering algorithm will be beneficial, and to provide a fine-grained analysis of the connection between word order and cross-lingual performance, in line with Nikolaev et al. (2020).

Limitations
There are several limitations to our work. First, as shown in the experimental results, for some tasks reordering, even with ensembling, is not beneficial. In addition, a constraint pair can be uninformative (both values set to 0) when not enough label-ordering data is present in the training treebank, which means that the ordering of the corresponding nodes is not subject to any constraint.
Moreover, it is possible to encounter loops or transitivity conflicts when joining different constraints, which makes it a priori impossible for the solver to satisfy them. To alleviate this problem, for each subtree we aim to reorder, we only consider the constraints that are relevant to it. For example, if the subtree does not contain any token with the label nmod, we discard all the constraints that include this label, such as amod : (nmod < amod). This, together with the tendency of languages to have a preferred ordering for their constituent elements, means that only a small percentage of subtrees cannot be ordered.
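The filtering described above can be made concrete with a short sketch (our own illustration, with assumed helper names): constraints whose labels do not all occur among a subtree's children are discarded, and a cyclic (unsatisfiable) constraint set can be detected with a depth-first search before it is handed to the solver.

```python
def relevant_constraints(pairs, child_labels):
    """Keep only (before, after) pairs whose labels both occur in the subtree."""
    present = set(child_labels)
    return {(a, b) for a, b in pairs if a in present and b in present}

def has_cycle(pairs):
    """Return True if the precedence pairs contain a cycle (unsatisfiable)."""
    graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
    visited, stack = set(), set()

    def dfs(n):
        visited.add(n)
        stack.add(n)
        for m in graph.get(n, ()):
            if m in stack or (m not in visited and dfs(m)):
                return True
        stack.discard(n)
        return False

    return any(n not in visited and dfs(n) for n in list(graph))

# A cyclic constraint set: amod < nmod < root < amod.
pairs = {("amod", "nmod"), ("nmod", "root"), ("root", "amod")}
print(has_cycle(pairs))  # True

# If the subtree has no nmod child, the offending constraints are dropped
# and the remaining set becomes trivially satisfiable.
filtered = relevant_constraints(pairs, ["amod", "root"])
print(has_cycle(filtered))  # False
```

In this example, the full set of three pairs forms a precedence loop, but once the nmod-related constraints are filtered out for a subtree without an nmod child, only root < amod remains and an ordering exists.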
Last, sparsity issues may prevent some constraints from being statistically justified, and rounding the constraints to a hard 0 or 1 may result in information loss and thus be detrimental. For example, if our pairwise distributions are P_{nmod,nmod,amod} = 0.51 and P_{nmod,amod,nmod} = 0.49, deriving the constraint nmod : (nmod < amod) may not be warranted. This may also happen in the case of a highly flexible order of a particular pair of syntactic elements in the target language. If the chosen ordering contradicts the original English one, it will be both nearly 50% incorrect and highly unnatural for the encoder. This is, however, partially mitigated by the ENSEMBLE method. For the purposes of this work we do not distinguish between POCs according to their statistical validity and defer this question to future work.

B Estimating the POCs
We estimate the POCs (§3.2) by extracting the empirical distributions from UD treebanks. The UD treebanks used are reported in Table 6.
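The estimation amounts to counting, for every pair of dependency labels, how often siblings with those labels appear in each order. The following toy sketch illustrates this; the representation (each sibling group as a list of (label, linear position) pairs) and the function names are our own, and treebank loading from CoNLL-U files is omitted.

```python
from collections import Counter
from itertools import combinations

def pairwise_counts(sibling_groups):
    """Count ordered label pairs over groups of siblings sharing a head.
    Each group is a list of (label, linear_position) pairs."""
    counts = Counter()
    for group in sibling_groups:
        ordered = sorted(group, key=lambda x: x[1])
        for (la, _), (lb, _) in combinations(ordered, 2):
            counts[(la, lb)] += 1  # la precedes lb in this group
    return counts

def poc_probability(counts, a, b):
    """Empirical P(a precedes b) among sibling pairs with labels a and b,
    or None if the pair was never observed together."""
    ab, ba = counts[(a, b)], counts[(b, a)]
    return ab / (ab + ba) if ab + ba else None
```

The None case corresponds to the uninformative constraints discussed in the limitations: a pair unattested in the treebank imposes no ordering.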

C Example of Learned Distributions
Here are the statistics of the pairwise ordering of the main elements of the matrix clause15 learned on the Irish-IDT treebank. As expected, Irish behaves as a strictly head-initial language: root overwhelmingly precedes all other constituents, including subordinate clauses, and modifier subordinate clauses (acl, advcl) follow nominal clause participants (nsubj, obj, obl). Adjectival modifiers, however, mostly precede nominal elements; this may be due to the fact that some frequent pronominal adjectives, such as uile 'all' and gach 'every', do not follow the general rule and precede the nouns they modify.
The position of adverbial modifiers is not restricted by the grammar: it generally follows nominal subjects but as often as not precedes direct objects, and in 3/5 of cases it precedes obliques, which suggests the general order root → nsubj → advmod/obj → obl.
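A global template like root → nsubj → advmod/obj → obl can be recovered from the pairwise probabilities by scoring every candidate permutation. The sketch below does this by brute force; the probability values are illustrative, loosely matching the Irish tendencies described above (they are not the actual treebank numbers), and this maximization is not the paper's constraint solver.

```python
from itertools import permutations

# Illustrative pairwise probabilities: P[(a, b)] ~ P(a precedes b).
P = {
    ("root", "nsubj"): 0.95, ("root", "obj"): 0.97, ("root", "obl"): 0.96,
    ("root", "advmod"): 0.90, ("nsubj", "obj"): 0.85,
    ("nsubj", "advmod"): 0.70, ("nsubj", "obl"): 0.90,
    ("advmod", "obj"): 0.50, ("advmod", "obl"): 0.60, ("obj", "obl"): 0.80,
}

def prob(a, b):
    """P(a precedes b), falling back to 1 - P(b precedes a)."""
    if (a, b) in P:
        return P[(a, b)]
    return 1.0 - P.get((b, a), 0.5)

def best_order(labels):
    """Return the permutation maximizing the product of pairwise
    precedence probabilities."""
    def score(order):
        s = 1.0
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                s *= prob(a, b)
        return s
    return max(permutations(labels), key=score)
```

With these values, root is placed first and obl last, while advmod and obj tie at 0.5 and can appear in either order, mirroring the flexible advmod/obj placement noted above.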

D Models Hyperparameters
The hyperparameters of the UD parser are given in Table 9; for the seq2seq pointer-generator network model, in Table 10; for mT5, in Table 11; and for the LUKE relation-classification model, in Table 12.
For the UD Seq2Seq parser, we use the same hyperparameters as for the seq2seq pointer-generator network model, with the following exceptions: we train for only 50 epochs and set the learning rate to 1e-5 for both the encoder and decoder.

E Standard Deviations
The standard deviations of the results of the experiments in UD parsing, semantic parsing, and relation classification are presented in Tables 14, 13, and 15, respectively.

F UD Seq2Seq Model Performances
The full results (averaged over 5 models) of the S2S model in UD parsing are presented in Table 7. The standard deviations are in Table 8.
(TAC) and Translated TACRED (Arviv et al., 2021) (Trans-TAC), and (ii) IndoRE (Nag et al., 2021). TAC is a relation-extraction dataset with over 100K examples in English, covering 41 relation types. The Trans-TAC dataset contains 533 parallel examples sampled from TAC and translated into Russian and Korean. We use the TAC English dataset for training and Trans-TAC for evaluation. As the TAC train split is too large for efficient training, we only use the first 30K examples.

Table 1 :
The results (averaged over 5 models) of the application of the reordering algorithm to cross-lingual UD parsing. Columns correspond to evaluation settings and score types; rows correspond to evaluation-dataset languages. The best LAS and UAS scores per language are shown in boldface and underlined, respectively. Abbreviations: RC19 - the algorithm by Rasooli and Collins; E - the ENSEMBLE setting.

Table 3 :
The results (averaged over 5 models) of the application of the reordering algorithm to Translated TACRED and IndoRE (above and below the horizontal line, respectively). Columns correspond to evaluation settings and score types; rows correspond to evaluation-dataset languages. The best Micro-F1 and Macro-F1 scores per language are shown in boldface and underlined, respectively. Abbreviations: Mic-F1 - Micro-F1; Mac-F1 - Macro-F1; RC19 - the algorithm by Rasooli and Collins; E - the ENSEMBLE setting.

Table 4 :
The results (averaged over 5 models) of the application of the reordering algorithm to MTOP in the few-shot scenario, on the mT5 model. Columns correspond to evaluation settings; rows correspond to evaluation-dataset languages. Values are exact-match scores. Abbreviations: E - the ENSEMBLE setting.

Table 5 (
UAS follow the same trends. See full scores in Appendix F). The zero-shot performance of the S2S model is subpar compared to the E+C one (less

Table 6 :
The UD treebanks used to estimate the POCs. Each row corresponds to a language-treebank pair. Treebank size is measured in sentence counts.
15 Elements that are directly under the root node.

Table 10 :
Hyperparameters for the seq2seq pointer-generator network model.

Table 15 :
The standard deviations of the results (averaged over 5 models) of the application of the reordering algorithm to Translated TACRED and IndoRE (above and below the horizontal line, respectively). Columns correspond to evaluation settings and score types; rows correspond to evaluation-dataset languages. Abbreviations: Mic-F1 - Micro-F1; Mac-F1 - Macro-F1; RC19 - the algorithm by Rasooli and Collins; E - the ENSEMBLE setting.

Table 16 :
The standard deviations of the results (averaged over 5 models) of the application of the reordering algorithm to MTOP in the few-shot scenario. Columns correspond to evaluation settings; rows correspond to evaluation-dataset languages. Values are exact-match scores. Abbreviations: E - the ENSEMBLE setting. Language abbreviations: Hi - Hindi, Th - Thai, Fr - French, Sp - Spanish, Ge - German.