Word Reordering for Zero-shot Cross-lingual Structured Prediction

Adapting word order from one language to another is a key problem in cross-lingual structured prediction. Current sentence encoders (e.g., RNNs, Transformers with position embeddings) are usually sensitive to word order. Even with uniform word form representations (MUSE, mBERT), word order discrepancies may hurt the adaptation of models. In this paper, we build structured prediction models with bag-of-words inputs and introduce a new reordering module that organizes words following the source language order; it learns task-specific reordering strategies from a general-purpose order predictor model. Experiments on zero-shot cross-lingual dependency parsing, POS tagging, and morphological tagging show that our model can significantly improve target language performance, especially for languages that are distant from the source language.


Introduction
Extracting linguistic structures from natural language usually relies on high-quality human annotations. To handle low-resource scenarios, efforts have been devoted to sharing resources among languages and adapting models from high-resource languages. One crucial step in these methods is unifying the input and output spaces across languages. For example, the Universal Dependencies project (McDonald et al., 2013) constructs a universal output space for cross-lingual dependency parsers, and cross-lingual word representation learning algorithms help align the word forms of different languages (Conneau et al., 2017; Devlin et al., 2019).
Beyond word form, word order is another important factor in cross-lingual structured prediction (Wang and Eisner, 2018b): it is possible that sentences in two different languages have similar parse trees, but their words are organized in different orders (e.g., SVO vs. SOV). To share annotations among them, we need to handle word order discrepancies carefully: if a model learned on the source language is tightly coupled with the source language word order, performance on target languages could be hurt because their word order could be incompatible (Wang et al., 2019). On the other hand, if one completely drops word order (e.g., bag-of-words), performance on the source language (and target languages) might be poor, as order-sensitive features could be essential. Trade-offs have been made by using weak word order information (e.g., relative positions instead of absolute positions (Ahmad et al., 2019a)), but we still want to seek better adaptation of word order without sacrificing source language performance.
In this work, we integrate new reordering modules to help cross-lingual structured prediction. Given a bag of words from the target language, the module tries to reorder them to best resemble a source language sentence. The structured prediction part then receives inputs with more familiar order information. Crucially,
• training the reordering model only requires unlabelled source language data, without parallel corpora or off-the-shelf word alignment tools (Tiedemann et al., 2014), thanks to the universal word forms;
• we do not actually need to perform the reordering action. Instead, the correct order can be implicitly encoded by multi-task learning: word order information enters the model as a supervision signal. The separation of the reordering module and the structured prediction module provides a new way to both explore and transfer order information.
We suggest a distillation framework (Hinton et al., 2015) for learning the reordering module.

[Figure 1: A diagram of the traditional zero-shot cross-lingual transfer approach (a) and the reordering-based approach (b) on the dependency parsing task ("Un mignon chiot" / "A cute puppy"). The structure in red is the output of the parser.]
A general order prediction model is first trained on large scale unlabelled source language data, then for each structured prediction task, we distill the knowledge from the general model to teach task specific reordering modules.
We evaluate our method on three zero-shot cross-lingual structured prediction tasks (dependency parsing, part-of-speech tagging, and morphological tagging). Taking English as the source language, we observe no obvious loss in monolingual results in most cases, while the cross-lingual results can be significantly improved. We also analyze the reordering perplexities of different target languages and show their correlation with the performance of cross-lingual representations and structured prediction tasks. The conclusion reflects that both word order and word form can be important in cross-lingual learning.

Method Overview
Given a sentence x = (B, w_1, ..., w_n, E) (B and E are synthetic begin/end marks), we can take two views of x: a bag of words ω and an order of words o. Structured prediction tasks map x to a linguistic structure y (e.g., a parse tree). Here, we consider a strict zero-shot cross-lingual setting: given a labelled source language corpus S = {(x^s, y^s)}, we train a structured prediction model on S and apply it directly to a target language sentence x^t (with ω^t, o^t), without seeing any labelled or unlabelled target language data. We assume a reasonable cross-lingual word representation that maps source and target language words into the same vector space.
Usually, we use S to estimate the probability p(y|x^s) = p(y^s|ω^s, o^s), which takes word order as an input. Models use the word order o^s as a natural reference to design hyper-parameters (e.g., it suggests the linear chain structure of RNNs and CRFs) or to extract order-sensitive features (e.g., position embeddings in the Transformer (Vaswani et al., 2017)). However, the target language may have a different word order, so this tight connection between word order patterns and model architectures is inflexible for adaptation.
Here, we propose to estimate p(y, o^s|ω^s) instead of p(y|ω^s, o^s) (i.e., moving the word order to the output). We can conceptually factorize p(y, o^s|ω^s) into p(y|ω^s) p(o^s|ω^s), where p(y|ω^s) is a structured prediction module and p(o^s|ω^s) is a reordering module whose job is to recover a source language word order from a bag of words ω (of either the source or the target language). The two modules share internal hidden states and model parameters. Compared with p(y|ω^s, o^s), we observe that:
• p(y|ω^s) decouples word order and model architectures: we are now free to use order-insensitive models (e.g., a Transformer without position embeddings). Instead of residing in hyper-parameters, word order information now appears in the (shared) model parameters of p(y|ω^s) and p(o^s|ω^s).
• p(o^s|ω^s) performs the reordering action implicitly.
When training on the source language, it guides the shared parameters to derive the correct order o^s from ω^s. When testing on the target language, since words in ω^t are in the same space as those in ω^s, the learned parameters in p(o^s|ω^s) will implicitly encode ω^t with a proper source language word order (as it does for ω^s). As a consequence, p(y|ω^t) will receive more familiar order information even if o^t is different from o^s. Figure 1 illustrates the two methods. Before moving to the details of p(y, o^s|ω^s), we describe possible representations of word order.

Word Order Representation
To share more parameters with downstream structured prediction tasks, we cast the word order recovery task as another (linear) structured prediction task. For two words w_i, w_j in the bag of words of a sentence x, we design two kinds of order objectives based on the adjacency matrix:
• an undirected adjacency matrix M, where M_ij = 1 means that w_i and w_j are adjacent in x;
• directed adjacency matrices →M and ←M, where →M_ij = 1 (←M_ij = 1) means that w_i is the previous (next) word of w_j.
Besides, as suggested by Mena et al. (2018), another word order objective is the assignment matrix, which assigns each word to its correct position in a shuffled sentence. They show that the assignment matrix has a strong ability to restore a sequence from a random order. All of the above objectives take the form of word-to-word matrices (Figure 2). They can easily supervise the same-shaped word-to-word self-attention matrices in a Transformer encoder.
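To make the three objectives concrete, here is a minimal sketch (plain Python; our own illustration rather than the paper's code) that builds the undirected adjacency matrix M, the directed matrices →M/←M, and the assignment matrix from a known word order. The B/E marks are omitted for brevity.

```python
def order_matrices(order):
    """order[i] = position of word i in the original sentence."""
    n = len(order)
    und = [[0] * n for _ in range(n)]   # M: w_i and w_j are adjacent in x
    fwd = [[0] * n for _ in range(n)]   # ->M: w_i directly precedes w_j
    bwd = [[0] * n for _ in range(n)]   # <-M: w_i directly follows w_j
    asg = [[0] * n for _ in range(n)]   # assignment: word i -> position order[i]
    for i in range(n):
        asg[i][order[i]] = 1
        for j in range(n):
            if order[i] + 1 == order[j]:
                fwd[i][j] = 1
                bwd[j][i] = 1
                und[i][j] = und[j][i] = 1
    return und, fwd, bwd, asg

# bag of words (a, b, c) whose true order is "b a c"
und, fwd, bwd, asg = order_matrices([1, 0, 2])
```

Note how the undirected matrix is simply the symmetrized union of the two directed matrices, which is why it discards direction information.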

Joint Prediction and Reordering
We now describe the implementation of p(y, o^s|ω^s). Basically, it requires the input to be a bag of words without order information. We therefore choose a Transformer encoder without position embeddings (Vaswani et al., 2017), which can be seen as an order-insensitive graph neural network, instead of order-sensitive networks like RNNs.
Input Features Denote x_1, ..., x_n as each word's vector representation. We obtain x_i by concatenating a fixed cross-lingual word embedding w_i (from MUSE or mBERT), an optional cross-lingual part-of-speech tag embedding t_i (used only for the dependency parsing task), and a trainable occurrence index embedding c_i. The c_i avoids the isomorphism problem of identical words by indicating which of them occurs first, which occurs second, and so on.

[Figure 3: A diagram of a reordering-based Transformer encoder for the dependency parsing task. Blue blocks are reordering blocks; red blocks are structured prediction blocks; yellow blocks are the supervision of the reordering or structured prediction module.]
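The occurrence index can be computed deterministically from the bag; a small sketch (our illustration, with a hypothetical helper name):

```python
def occurrence_indices(words):
    """Assign each repeated word a distinct index: 1st occurrence, 2nd, ..."""
    seen = {}
    idx = []
    for w in words:
        seen[w] = seen.get(w, 0)
        idx.append(seen[w])
        seen[w] += 1
    return idx

idx = occurrence_indices(["the", "dog", "saw", "the", "cat"])
# the two "the" tokens receive different indices, so an order-insensitive
# encoder can still distinguish them
```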
Transformer Encoder A Transformer encoder is obtained by stacking N identical Transformer blocks. We follow the standard notation to introduce how a Transformer block outputs a deep representation X' = [x'_1, ..., x'_n] from a shallow input X = [x_1, ..., x_n]. We start by mapping the input X to three spaces via linear transformations:

Q = X W_Q,  K = X W_K,  V = X W_V

A self-attention matrix A is then obtained by the scaled dot-product attention function:

A = softmax(Q K^T / √d)    (1)

where d is the column dimension of Q, and A_ij measures an interaction score between x_i and x_j. The self-attention output O = A · V contains a weighted average of the value vectors using weights from A, and the vector o_i in O is a deep hidden representation of x_i. We can extend the above single-head self-attention layer to multi-head by defining multiple sets of {W_Q, W_K, W_V}.
The output of a multi-head layer is the concatenation of the set of outputs O_k obtained from the corresponding Q_k, K_k, and V_k; we then obtain the dimension-reduced output O through a linear transformation. A residual connection and normalization (Add&Norm) layer follows the self-attention layer, then a feed-forward layer, and finally another residual connection and normalization layer. After these layers we obtain the deep representation X' = [x'_1, ..., x'_n]. The omitted details of these three layers can be found in Vaswani et al. (2017).
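Without position embeddings, such a block is permutation-equivariant: shuffling the input rows only shuffles the output rows. A pure-Python sketch of single-head scaled dot-product attention illustrates this (identity projection matrices are used purely for illustration; a real model learns W_Q, W_K, W_V):

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention; no position information is used."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K] for qi in Q]
    A = []
    for row in scores:                      # row-wise softmax -> attention matrix A
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        A.append([x / z for x in e])
    return matmul(A, V), A                  # O = A . V and the attention matrix

I2 = [[1.0, 0.0], [0.0, 1.0]]               # identity projections, illustration only
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
O, A = self_attention(X, I2, I2, I2)
O_swapped, _ = self_attention([X[1], X[0], X[2]], I2, I2, I2)
# swapping two input rows just swaps the corresponding output rows
```

This is exactly the property that makes word order available only through the supervision signal rather than through the architecture.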
Reordering Block A reordering block simply adds a reordering signal L_order to an ordinary Transformer block. As mentioned in Section 3, L_order can guide the attention matrix A (in Equation 1), which is also a word-to-word matrix, to display a linear chain structure. After aggregating the shallow representation using the matrix A as weights, the deep output X' of the reordering block carries contextual information with respect to the word order, since A is forced to match the desired word order matrix M:

L_order = CE(A, M)    (2)

where CE denotes the cross-entropy between the (normalised) order matrix M and the attention matrix A. In our experiments, we test L_order with each of the word order representation matrices and compare them. In addition, we use single-head self-attention to reduce computation, because preliminary experiments show that multiple heads are not helpful for reordering blocks. Michel et al. (2019) have also shown that replacing multi-head with single-head attention does not hurt performance.
We alternately stack reordering blocks with original Transformer blocks to build the complete encoder (Figure 3). The reordering blocks estimate the probability p(o^s|ω^s) by learning the linear word order of the source language. The original blocks estimate the probability p(y|ω^s) by learning a structured prediction task without direct reordering supervision. Given a bag of words ω^t in the target language, thanks to cross-lingual word representation techniques, the reordering blocks predict a source-language-like word order p(o^s|ω^t). Since the word order of the target language is never introduced, the encoder does not suffer from word order drift between the target and source languages.

Downstream Classifier For a downstream task, we add a structured prediction classifier over the last encoder block. We investigate three downstream tasks. For graph-based dependency parsing (dep), we follow Dozat and Manning (2017) in using two bi-affine classifiers. For universal part-of-speech (upos) and morphological (mor) tagging, we use a multi-layer perceptron (MLP) classifier. We train each downstream task with the corresponding cross-entropy loss function L_down ∈ {L_dep, L_upos, L_mor}. Let L be the set of reordering block IDs. The joint objective L_joint minimizes a weighted combination:

L_joint = L_down + λ Σ_{l∈L} L_order^(l)    (3)
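The weighted joint objective above can be sketched as follows. The weight λ and the row-normalised cross-entropy form of the per-block reordering loss are our assumptions, not the paper's exact formulation:

```python
import math

def order_loss(A, M, eps=1e-12):
    """Cross-entropy between attention rows A and the row-normalised order matrix M."""
    total = 0.0
    for Ai, Mi in zip(A, M):
        z = sum(Mi) or 1.0                  # normalise each gold row to a distribution
        total -= sum((m / z) * math.log(a + eps) for a, m in zip(Ai, Mi))
    return total / len(M)

def joint_loss(down_loss, attn_per_reorder_block, M, lam=1.0):
    # downstream loss plus the reordering signal summed over the reordering blocks
    return down_loss + lam * sum(order_loss(A, M) for A in attn_per_reorder_block)

M = [[0, 1], [1, 0]]                        # gold order: the two words are adjacent
A_good = [[0.05, 0.95], [0.95, 0.05]]       # attention close to the chain structure
A_bad = [[0.5, 0.5], [0.5, 0.5]]            # uninformative attention
```

An attention matrix that matches the chain structure incurs a much smaller reordering penalty than a uniform one.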

Distillation
In Equation 2, supervision signals on the word reordering blocks are obtained directly from the order objective (e.g., →M and ←M). Since we jointly perform reordering and structured prediction, the data for learning word order is constrained by the corpus size of the structured prediction task (e.g., treebanks). On the other hand, there are massive numbers of unlabelled sentences which could help build a more powerful reordering module. One challenge in using this unlabelled data is that directly feeding it into the joint learning model could be problematic: the severe imbalance between structured prediction signals and reordering signals would make the model focus on reordering. Furthermore, for every new structured prediction task we would need to repeat the learning on the unlabelled data, which could be unnecessary and time-consuming. Therefore, it is important to separate the learning on the unsupervised data and share it efficiently.
Here we design a generic p(o^s|ω^s) model as a teacher, and then distill the knowledge it learns from a large-scale unlabelled corpus into the reordering blocks of a structured prediction model, which play the role of the student (Figure 4). The student's reordering blocks attempt to learn the soft probability distribution produced by the teacher model, rather than the hard ground truth. Furthermore, multiple student models can share one teacher model, which avoids repeated training on the large-scale unlabelled corpus.
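The teacher-student step amounts to swapping the hard 0/1 order matrices for the teacher's soft rows as targets. A toy sketch of soft-target cross-entropy (the numbers are made up; this is our illustration, not the paper's loss code):

```python
import math

def soft_ce(target_rows, student_rows, eps=1e-12):
    """Cross-entropy of student rows against (soft or hard) target rows."""
    loss = 0.0
    for t_row, s_row in zip(target_rows, student_rows):
        loss -= sum(t * math.log(s + eps) for t, s in zip(t_row, s_row))
    return loss / len(target_rows)

hard = [[0.0, 1.0], [1.0, 0.0]]          # ground-truth order matrix rows
teacher = [[0.1, 0.9], [0.8, 0.2]]       # teacher's soft predictions
student = [[0.2, 0.8], [0.7, 0.3]]       # student's current predictions

hard_loss = soft_ce(hard, student)       # joint training without distillation
distill_loss = soft_ce(teacher, student) # student mimics the teacher instead
```

By Gibbs' inequality, the distillation loss is minimized exactly when the student reproduces the teacher's distribution, which is the intended behavior.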

The Teacher Model
The input features of the teacher model are the same as described in Section 4. We slightly modify the Transformer encoder by removing the feed-forward layer to reduce GPU memory usage and speed up training on the large-scale corpus. Only reordering supervision signals are present here, so all blocks of the encoder are reordering blocks.
Reordering Classifier We use a reordering classifier over the last encoder block's output X' = [x'_1, ..., x'_n] instead of the downstream classifier. To save space, we focus on describing the training and teaching of the undirected adjacency matrix M order objective (classifiers for the other objectives are given in Appendix A).
We feed the vector x'_i into an MLP to obtain a dimension-reduced vector v_i, and then compute the probability P_ij of words w_i and w_j being adjacent to each other with a bi-affine and sigmoid (σ) function:

P_ij = σ(v_i^T W v_j + b_1^T v_i + b_2^T v_j)    (4)

where {W, b_1, b_2} are parameters of the classifier.
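A sketch of this bi-affine scorer (our reading of the classifier; the exact parameterisation may differ):

```python
import math

def adjacency_prob(v, W, b1, b2):
    """P[i][j] = sigmoid(v_i^T W v_j + b1 . v_i + b2 . v_j)."""
    n, d = len(v), len(v[0])
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            bil = sum(v[i][a] * W[a][b] * v[j][b] for a in range(d) for b in range(d))
            lin = sum(x * y for x, y in zip(b1, v[i])) + sum(x * y for x, y in zip(b2, v[j]))
            P[i][j] = 1.0 / (1.0 + math.exp(-(bil + lin)))
    return P

# tiny example: orthogonal word vectors, identity W, zero biases
v = [[1.0, 0.0], [0.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
P = adjacency_prob(v, W, [0.0, 0.0], [0.0, 0.0])
```

Since the sigmoid is applied per pair, each P_ij is an independent adjacency probability rather than a row distribution.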
Training and Teaching In training, the reordering blocks are supervised by the order objective as in Equation 2, while the reordering classifier is supervised by a cross-entropy loss function:

L_reorder = −Σ_{i,j} [ M_ij log P_ij + (1 − M_ij) log(1 − P_ij) ]    (5)

In teaching, the objective of the student model's reordering blocks is the teacher model's output P instead of the hard order matrices. We modify Equation 2 as follows:

L_order = CE(A, P)    (6)

Experiments
We demonstrate the effectiveness of our approach on three structured prediction tasks in a strict zero-shot cross-lingual setting: dependency parsing (DEP), universal part-of-speech tagging (UPOS), and morphological tagging (MOR). We train our general reordering teacher and the three structured prediction models on the training set of the Universal Dependencies (UD) English-EWT treebank (v2.2) (Nivre et al., 2018). We use the development and test sets of the UD English-EWT treebank to validate source language performance. Following Ahmad et al. (2019a)'s setup, we take 30 other languages as target languages and use the development and test sets of their treebanks to evaluate target language performance.
For the reordering model, the Base training set is UD English, and the Extra set consists of automatically annotated raw texts (Ginter et al., 2017) generated by UDPipe v2.0 (Straka and Straková, 2017) from CommonCrawl and Wikipedia. Each sentence has automatic tokenization and syntactic annotations (including UPOS).
The hyperparameters we used in word reordering task and downstream tasks are summarized in Appendix B. The statistics of the UD treebanks are summarized in Appendix C.

Performances of the Reordering Model
Our Models and Baselines We explore the influence of input features, order representations, and unlabeled data size on the reordering model. For unlabeled data size, we gradually increase the training set in 10% increments.
Evaluation We use exact matching (EM) to measure reordering at the sentence level, which requires the whole sentence to be decoded correctly, and use order perplexity (PPL = 2^{H(x)}) to measure the cross-entropy H(x) of a sentence x. The cross-entropy of a sentence usually decomposes into local n-grams, so PPL measures partial matching at a finer granularity. During testing, we exclude sentences longer than 20 words (only for PPL).

Results We first compare the two adjacency matrix objectives (Table 1). The directed matrices →M/←M use two bi-affine functions to calculate forward and backward scores respectively, while the undirected M uses only one dot-product scoring function in order to satisfy symmetry. Previous work (Dozat and Manning, 2017, 2018) shows that the bi-affine function outperforms the dot product, and using two scoring functions can benefit from ensemble learning. We then compare the adjacency objectives with the assignment matrix. Secondly, since different downstream tasks assume different input features, we list four cases covering the two cross-lingual word embeddings and whether extra upos features are used (Table 2). Basically, using extra upos features improves the results because it alleviates the problem of out-of-vocabulary and low-frequency words. Surprisingly, although the mBERT representation nominally carries some order information from its positional encoding, the reordering results based on mBERT are lower than with MUSE. We suspect this may be due to the loss of some lexical and order information in the subword-to-word conversion.
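The two evaluation metrics can be sketched as follows (our illustration; `step_probs` stands for the model's probability of each gold reordering decision in a sentence):

```python
import math

def exact_match(pred_orders, gold_orders):
    """Sentence-level EM: the whole predicted order must equal the gold order."""
    hits = sum(p == g for p, g in zip(pred_orders, gold_orders))
    return hits / len(gold_orders)

def order_ppl(step_probs):
    """PPL = 2^{H(x)}, with H(x) the average negative log2-probability per step."""
    H = -sum(math.log2(p) for p in step_probs) / len(step_probs)
    return 2 ** H

em = exact_match([[0, 1, 2], [2, 0, 1]], [[0, 1, 2], [1, 0, 2]])  # one of two correct
```

Because PPL averages per-step log-probabilities, a sentence that is mostly ordered correctly still scores well, unlike the all-or-nothing EM.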
Finally, since reordering is an unsupervised task, we analyze the impact of the amount of Extra texts used (Table 2). Two phenomena are observed. First, more unannotated data leads to better reordering performance because of the improved expressive capacity of the neural network. Second, it becomes difficult to improve reordering further after using more than 50% of the data. This is because the reordering task is inherently ambiguous in some cases: for example, in "I like apples, bananas and oranges", any permutation of the three fruits yields a reasonable sentence. In the future, reordering models could track the progress of graph neural networks to further improve performance.

Results of The Downstream Tasks
Our Baselines We have five benchmark models from previous work (Ahmad et al., 2019a):
• RNN uses a biLSTM encoder;
• Abs uses a Transformer encoder with absolute position embeddings;
• Rel uses undirected relative position embeddings;
• NoP drops position embeddings entirely;
• mBERT directly fine-tunes the pre-trained language model mBERT.

Our Models We have two variants of our model:
• Reord pipelines the reordering model and the structured prediction model, feeding the reordered sequence directly into the Rel model;
• noDst models the two tasks as multi-task learning without distillation, using 10% of the Extra data.
Our final two models use MUSE and word-level mBERT (base version) as word representations, respectively. Note that our mBERT-based model stacks a 4-layer modified Transformer encoder on top of the mBERT encoder.
Evaluation First, we have three downstream task performance metrics. For DEP, following Ahmad et al. (2019a), we evaluate with the labeled attachment score (LAS), excluding punctuation. For UPOS, we evaluate with token-level accuracy (Acc). For MOR, we evaluate with token-level exact match (EM).
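The two tagging metrics can be sketched as follows (our illustration with hypothetical helper names; for MOR, a token counts only if its whole morphological feature bundle matches exactly):

```python
def token_accuracy(pred, gold):
    """UPOS: fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(p == g for ps, gs in zip(pred, gold) for p, g in zip(ps, gs))
    return correct / sum(len(gs) for gs in gold)

def token_em(pred, gold):
    """MOR: a token is correct only if its full feature bundle matches exactly."""
    correct = sum(set(p) == set(g) for ps, gs in zip(pred, gold) for p, g in zip(ps, gs))
    return correct / sum(len(gs) for gs in gold)

acc = token_accuracy([["DET", "NOUN"]], [["DET", "VERB"]])
em = token_em([[("Number=Sing", "Case=Nom")]], [[("Case=Nom", "Number=Sing")]])
```

Comparing the bundles as sets makes the exact-match check insensitive to feature ordering, which is how UD morphological features are conventionally compared.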
Second, we discuss three language distance metrics. Smith et al. (2017) report the "precision@5" of MUSE on a bilingual dictionary, which measures word form distance (S17). Ahmad et al. (2019a) compute a word order distance between languages (A19). We use the PPL of each target language under the reordering model as a distance, which measures the confidence of the reordering model in restoring the target order (PPL). All distances are calculated from the source (en) to the 30 targets.

[Table 3: LAS scores of our models and baseline models on the 31 test sets of the DEP task. '†' means that the best transfer model is statistically significantly better (by paired bootstrap test, p < 0.05) than the others.]
Results Firstly, we compare our model with the benchmark models on the DEP task (Table 3). The first part covers MUSE-based models; overall, our hard reordering approach achieves competitive performance on the source language, and our soft reordering approach achieves the best cross-lingual performance on 21 languages. First, the Rel model achieves the highest cross-lingual performance among the four baselines because it weakens order information by using undirected relative positions. The benchmark results demonstrate that word order is indeed a trade-off between source and cross-lingual performance. Second, we analyze the effectiveness of multi-task learning. Comparing the original Rel model with the Reord model, we observe an improvement in cross-lingual performance, and this approach does not affect English performance at all. This illustrates the effectiveness of the reordering model in increasing the similarity between target and source sentences. Nevertheless, the Reord model is still weaker than our approach. The main reason is that combining the reordering model in a pipeline way can cause error propagation, which hurts cross-lingual performance. Third, we analyze the effectiveness of the distillation method. Comparing noDst with Abs, we observe that noDst performs worse on both the English and the cross-lingual results. This shows that multi-task learning suffers from the data imbalance between the reordering task and the DEP task: the model overfits the reordering task, which has large amounts of data. In fact, this makes the parsing performance even weaker than the single-task learning setting, which demonstrates the effectiveness of the proposed knowledge transfer approach. The second part uses the mBERT representation. Overall, our approach outperforms the mBERT encoder, which shows that our method also works with mBERT representations. Comparing the mBERT baseline with the MUSE-based baselines (RNN, Abs, Rel), the zero-shot transfer performance with mBERT embeddings is weaker than with MUSE on the DEP task, possibly because the mBERT representation is not well aligned across languages. Previous work (Wang et al., 2019) also reaches this conclusion and uses parallel corpora of source-target language pairs as supervision to learn better-aligned contextual cross-lingual representations. Since using parallel corpora is not strictly a zero-shot setting, we do not compare with this work, but our model is compatible with their approach and can benefit from better-aligned representations.
Secondly, we discuss the correlation among the language distance metrics (A19, S17, PPL) and their correlation with parsing performance (LAS) (Figure 5). Basically, all three distances are positively correlated with LAS. We observe a clearly higher correlation between A19 and LAS than between S17 and LAS, possibly because the impact of word form on LAS is weaker than that of word order. We observe a slightly higher correlation between PPL and LAS than between A19 and LAS, which shows that the reordering module does learn word order information. We also observe a clearly higher correlation between PPL and S17 than between A19 and S17; since the reordering model uses MUSE as input and order as objective, it jointly represents the distances of word form and word order, which may indicate that it is not enough to consider word order alone and that word form also helps.
Thirdly, we report results on the UPOS and MOR tasks (Table 4). Overall, our approach further improves cross-lingual results on both tasks over the strong MUSE (+3.1 Acc / +1.1 EM) and mBERT (+0.9 Acc / +0.5 EM) baselines. This suggests that the reordering module is generally effective across different tasks and different cross-lingual word embeddings. In particular, the RNN model achieves the best source language performance on the UPOS task, which may indicate that RNNs capture word order information better than absolute position embeddings. Compared with the MUSE-based cross-lingual results on the DEP task, UPOS and MOR show worse performance, which suggests that upos, as a well-aligned cross-lingual feature, is useful for zero-shot transfer. However, the improvement after using the mBERT embeddings is more significant (e.g., from 47.7 to 74.5 Acc) than on the DEP task; the reason may be that the UPOS and MOR tasks rely more on local information. Finally, we analyze the reordering module by exploring three placements of the reordering blocks: odd, even, and bottom layers (Figure 6). We observe that odd is the most reasonable version, as it incorporates order information evenly into each layer's representation. The even version decreases performance less, since it is only an offset of the odd version; we believe the main part of the decrease comes from the reordering supervision in the last layer, which slightly hurts the learning of structured prediction. The bottom placement performs worst, and we conjecture the reason is that the higher-level representations are unable to receive fresh word order information.

Related Work
There has been much recent research on cross-lingual transfer of structured prediction tasks, including dependency parsing (Wang and Eisner, 2018a; Ahmad et al., 2019b) and POS tagging (He et al., 2019; Kim et al., 2017). Early work built delexicalized models for direct transfer, but at the expense of performance (McDonald et al., 2011; Rosa and Žabokrtský, 2015). With the development of cross-lingual word embedding techniques (Conneau et al., 2017; Devlin et al., 2019), recent work has utilized them to retain lexical information (Ahmad et al., 2019b; Wang et al., 2019).
Data augmentation can enrich the word orders that appear in the source language, thereby increasing the intersection with the target language's word order. Tiedemann and Agic (2016) and Wang and Eisner (2016, 2018a) create high-quality synthetic treebanks to increase source data, but data augmentation requires expert knowledge to build treebanks and extra training time, and it does not scale to a larger number of target languages. Annotation projection relies on cross-language annotation mapping using parallel corpora and automatic alignment (Rasooli and Collins, 2015; Agić et al., 2016; Plank and Agić, 2018). Our approach does not require these resources, only the source language's raw data. Tiedemann et al. (2014), Wang and Eisner (2018b), and Rasooli and Collins (2019) reorder the source treebanks to make them similar to the target language of interest before training on them. This is source-to-target reordering, which requires parallel corpora or automatic alignment tools. In contrast, we perform target-to-source reordering, training only on the source language.
Some previous work treats word order as a trade-off. Ahmad et al. (2019a) modify the Transformer encoder by using undirected relative positions to learn weak order information. Liu and Fung (2020) use Conv1d to capture local word order and take the positional embeddings from mBERT to initialize frozen positional embeddings. Our reordering module can fully learn the source language word order from a bag-of-words input, which is useful for tasks that are sensitive to global word order, such as parsing.

Conclusion
This work focuses on source-to-target word order adaptation in zero-shot transfer. We build structured prediction models containing a novel reordering module with bag-of-words input. The reordering module is distilled from a task-generic, unsupervised, large-scale pre-trained reordering teacher. Experiments show that our model can significantly improve cross-lingual performance on three tasks without obviously hurting source language performance. Future work includes two directions: extending to multi-source transfer, and extending to more structured prediction tasks such as NER, which requires span-level reordering.

A Other Reordering Classifiers
For the assignment matrix, similar to M, we only modify the sigmoid function in Equation 4 to a column-normalised softmax function:

P_ij = Softmax_j(v_i^T W v_j + b_1^T v_i + b_2^T v_j)    (7)

For the directed adjacency matrices, we need to handle the forward matrix →M and the backward matrix ←M. We therefore use two MLPs to generate a forward representation h_i and a backward representation d_i. With these two representations, the forward edge probability →P_ij is calculated by a bi-affine score function with a column-normalised softmax, and the backward edge probability ←P_ij by a bi-affine score function with a row-normalised softmax.
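The difference between the two normalisations can be sketched as follows (our illustration over an arbitrary score matrix S):

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    z = sum(e)
    return [x / z for x in e]

def row_softmax(S):
    """Row-normalised: each row of the result sums to 1."""
    return [softmax(row) for row in S]

def col_softmax(S):
    """Column-normalised: each column of the result sums to 1."""
    cols = [softmax(col) for col in zip(*S)]
    return [list(row) for row in zip(*cols)]

S = [[1.0, 2.0], [3.0, 0.5]]
R = row_softmax(S)
C = col_softmax(S)
```

Row normalisation distributes each word's outgoing probability mass over candidate successors, while column normalisation ensures each position (or predecessor slot) receives a total mass of 1.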

B Hyper-parameters
The hyper-parameters used in the reordering teacher model (Table 5) and the downstream structured prediction models (Table 6) are listed here.

C Details of Datasets
The statistics (number of sentences) of Universal Dependency (UD) treebanks are summarized in Table 7.