Typology Guided Multilingual Position Representations: Case on Dependency Parsing

Recent multilingual models benefit from strong unified semantic representations. However, due to conflicting linguistic regularities, ignoring language-specific features during multilingual learning may cause negative transfer. In this work, we analyze the relation between a language's position space and its typological characterization, and suggest deploying different position spaces for different languages. We develop a position generation network which combines prior knowledge from typology features and existing position vectors. Experiments on the multilingual dependency parsing task show that the learned position vectors exhibit meaningful hidden structures, and that they help achieve the best multilingual parsing results.


Introduction
With the recent progress on multilingual text representations, there has been a growing interest in developing unified models for NLP tasks across different languages (Ammar et al., 2016; Zeman et al., 2018; Conneau et al., 2020). For high-resource languages, a unified multilingual model is faster to train and easier to deploy than a set of independent monolingual models. For low(zero)-resource languages, a multilingual model may help build positive knowledge transfer among languages.
Words and their positions in the text are two main features of any language's sentences. For words, various multilingual pre-trained models (Devlin et al., 2019) and alignment algorithms (Lample et al., 2017) have been devoted to unifying lexical semantic spaces among languages. For positions, however, there has been much less study of their roles in joint multilingual learning: models simply adopt one universal position representation for all languages. Since word positions describe word orders, a single position space implies that all languages are compiled under the same word order system, which is not true according to linguistic priors. For example, adjectives are usually placed before nouns in English, while in French they are mostly placed after nouns. Such conflicts of linguistic regularities may break the effectiveness of word position features in multilingual learning, especially for tasks sensitive to word order (e.g., syntactic and semantic parsing (Ji et al., 2021)).
In this paper, we study the connection between a language's position space and its typological characterization (in particular, its word order characterization). By jointly learning position spaces with a syntactic parsing task, we first have two findings.
• When position representations are separately learned on each language, they can effectively help identify the language's typological feature on word order (e.g., noun-adjective or adjective-noun). Therefore, by replacing the universal position space with language-specific ones, we have more room for handling different linguistic regularities.
• The distances deduced from individually learned position representations correlate well with languages' typological distances (e.g., the position spaces of SVO languages are apart from those of SOV languages). Therefore, customized position spaces provide a clear and acknowledged path for positive transfer in multilingual learning.
We next develop methods to construct multilingual position representations. Options include attaching language ids to the existing universal position space (Östling and Tiedemann, 2017) and learning position representations from scratch (Bjerva et al., 2019). One main concern with those approaches is handling unseen languages: if a language doesn't appear in the training set, its position representations are totally unknown.
Our key technical contribution is a generation network for positions. It explicitly takes the word order typological features of a language as input and outputs a set of position vectors for that language. For unseen languages, we are free to obtain their position vectors through priors on their typology. During the generation process, we take the universal position representations as basis vectors for each language's position space, which makes the learned vectors still carry the prior of "representing a position in texts". Under this setting, we are able to examine the shift of languages' position spaces within a unified coordinate system.
We take multilingual dependency parsing as our demonstration task. The parser is trained on 13 languages from the Universal Dependencies treebanks, and is tested on both languages present (13) and absent (30) in the training set. The results show that, with the typology-guided position vectors, the parser achieves significant improvements on both seen (+4.1 LAS) and unseen (+1.2 LAS) languages compared with using universal position representations.

Preliminary
Our multilingual models take the Transformer (Vaswani et al., 2017) as the basic building block. There are two types of position spaces, absolute and relative.

Absolute Position Representation   Given a sequence of word vectors x_1, x_2, ..., x_n, for each absolute position i there is a position vector a_i. The vector can be obtained by a lookup function a_i = lookup(E_abs, i), where E_abs is a learnable matrix (Gehring et al., 2017), or by a fixed sinusoidal function (Vaswani et al., 2017),

a_{i,2t} = sin(i / 10000^{2t/d}),   a_{i,2t+1} = cos(i / 10000^{2t/d}),   (1)

where d is the dimension of the position vectors. Absolute position vectors are usually added to the input word vectors.
Relative Position Representation   Relative position is another widely applied positional feature (Shaw et al., 2018; Dai et al., 2019; Wang et al., 2020). Like absolute positions, for each relative position i, a relative vector r_i is obtained from a lookup function r_i = lookup(E_rel, i). Commonly, relative positions are clipped to a small range {−k, −k+1, ..., k}. Unlike absolute positions, relative position vectors usually access the Transformer block in the self-attention layers, specifically, in the computation of the attention scores α_ij between two words i, j and the output hidden vectors o_i (Shaw et al., 2018):

α_ij = softmax_j( (x_i W^Q)(x_j W^K + r^K_{j−i})^T / √d ),   o_i = Σ_j α_ij (x_j W^V + r^V_{j−i}),   (2)

where W^Q, W^K and W^V are parameter matrices and r^K_{j−i}, r^V_{j−i} denote the (clipped) relative position vectors. Relative position representations can be shared among the heads and layers of all self-attention modules.

We use the Transformer network to build the parser. The input word representations are collected from mBERT (Devlin et al., 2019); after passing through the Transformer, we use the biaffine scorer (Dozat and Manning, 2017) to score each possible head-dependent pair. Performance is evaluated by the head-dependent labeled attachment score (LAS).
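As an illustration of the two schemes, a minimal single-head PyTorch sketch follows. The function names are ours, and the relative-position form follows Shaw et al. (2018); it is not the parser's actual implementation.

```python
import math
import torch

def sinusoidal_positions(max_len=128, dim=128):
    # Fixed absolute position vectors (Equation 1): even dims use sin, odd dims
    # use cos, with geometrically spaced frequencies.
    pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)            # (max_len, 1)
    freq = torch.exp(torch.arange(0, dim, 2, dtype=torch.float)
                     * (-math.log(10000.0) / dim))                         # (dim/2,)
    a = torch.zeros(max_len, dim)
    a[:, 0::2] = torch.sin(pos * freq)
    a[:, 1::2] = torch.cos(pos * freq)
    return a                                                               # rows are a_0 ... a_{max_len-1}

def relative_attention(x, E_rel_K, E_rel_V, W_Q, W_K, W_V, k=4):
    # Single-head self-attention with clipped relative positions (Equation 2).
    # x: (n, d) word vectors; E_rel_K, E_rel_V: (2k+1, d) relative position tables.
    n, d = x.shape
    q, key, v = x @ W_Q, x @ W_K, x @ W_V
    # Index the relative tables with clip(j - i, -k, k), shifted into [0, 2k].
    idx = torch.clamp(torch.arange(n)[None, :] - torch.arange(n)[:, None], -k, k) + k
    r_K, r_V = E_rel_K[idx], E_rel_V[idx]                                  # (n, n, d)
    scores = (q @ key.T + torch.einsum('id,ijd->ij', q, r_K)) / math.sqrt(d)
    alpha = torch.softmax(scores, dim=-1)                                  # attention scores alpha_ij
    o = alpha @ v + torch.einsum('ij,ijd->id', alpha, r_V)                 # output vectors o_i
    return o
```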

Position Spaces for Multilingual Learning
Existing multilingual models use a universal position space for all languages. It is questionable whether one position space is enough to handle languages with different linguistic constraints. In order to inspect the relations between position representations and typological features, we experiment with a multilingual parser with language-specific position vectors. For each language, the model assigns a set of learnable vectors for each position (absolute or relative), and the position vectors are jointly learned with the parser. First, we examine whether the learned position vectors carry information about word order. Taking the order of subject (S), verb (V) and object (O) as an example, we merge the datasets of English_en (SVO) and Hindi_hi (SOV), and train a binary probing classifier to discriminate the two word orders. The classifier is a 2-layer MLP taking the mean pooling of the parser's final-layer hidden vectors as input. Following the general probing workflow, we could use the testing accuracies of the classifier to assert whether word order information is encoded. However, a high probing accuracy is not trustworthy here, because the vocabulary overlap of the two languages is usually small, and the probing classifier can achieve high accuracies by simply ignoring the actual word order features and only recognizing differences between the two distinct vocabularies.
We adopt a different probing strategy. After training the probing classifier on English and Hindi, the remaining languages are divided into two groups, the SVO group (Chinese_zh, Finnish_fi, Hebrew_he, Italian_it, Russian_ru, Swedish_sv) and the SOV group (Basque_eu, Japanese_ja, Korean_ko, Turkish_tr). We replace the position vectors of English with those of the two groups, and investigate the accuracy of SVO recognition on English (a sketch of this substitution probe is given below). If the position vectors have successfully learned the concept of word order across different languages, we can expect better probing performance when the replaced vectors come from the same group. Figure 2 shows that on English, position vectors from SVO languages perform much better than those from SOV languages. The results on noun-adjective order (NA or AN) are similar.
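The sketch below follows the probe architecture described above; set_position_vectors and encode are hypothetical hooks on the parser, named only for illustration.

```python
import torch
import torch.nn as nn

class OrderProbe(nn.Module):
    # 2-layer MLP probe over mean-pooled final-layer hidden states of the parser.
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, hidden_states, mask):
        # hidden_states: (batch, len, dim); mask: (batch, len) float, 1.0 for real tokens.
        pooled = (hidden_states * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)
        return self.net(pooled)

def probe_with_substitution(parser, probe, en_batches, donor_positions):
    # Re-encode English sentences with another language's position vectors swapped
    # in, and measure how often the frozen probe still predicts the SVO class.
    parser.set_position_vectors(donor_positions)   # hypothetical setter on the parser
    correct = total = 0
    for words, mask in en_batches:
        hidden = parser.encode(words, mask)        # hypothetical encoder call
        pred = probe(hidden, mask).argmax(-1)
        correct += (pred == 0).sum().item()        # assume label 0 = SVO
        total += pred.numel()
    return correct / total
```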
Second, we can further ask whether the distance between two position spaces reflects the typological distance between the two languages. We choose a linguistic distance metric defined by Scholivet et al. (2019). For position spaces, we compute the average cosine similarity of corresponding position vectors (a minimal sketch is given below). Figure 1 shows that the two distances are highly correlated: similar languages have similar position spaces. This suggests that the customized position vectors may be consulted for avoiding negative transfer in multilingual learning. In fact, we perform another substitution experiment directly on the learned parser (Figure 1). When replacing English position vectors with those of distant languages (e.g., Japanese), the parsing performance drops considerably. Therefore, if we unify languages with a universal position space, the conflict of language regularities may cause negative transfer.
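A minimal sketch of the position-space similarity, assuming each language's learned position table is available as a tensor; the comparison against the Scholivet et al. (2019) distances is then a simple per-language correlation.

```python
import torch
import torch.nn.functional as F

def position_space_similarity(P_a, P_b):
    # P_a, P_b: (num_positions, dim) learned position tables of two languages.
    # Average cosine similarity between corresponding position vectors.
    return F.cosine_similarity(P_a, P_b, dim=-1).mean().item()

# Example use (P is a dict of per-language tables, 'en' the reference language):
# sims = {lang: position_space_similarity(P[lang], P['en']) for lang in P}
# These similarities can then be plotted against the linguistic distances to English.
```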
Finally, we compare the overall multilingual parsing performance when learning with universal position vectors and when learning with different position vectors. Figure 1 shows that the latter always performs better (+2.8 average LAS).

Typology-guided Position Generation
The analyses above suggest applying different position spaces for different languages. However, naively assigning learnable vectors for positions cannot generalize to unseen languages. In order to make the multilingual model applicable to languages that do not appear in the training set, we propose to generate position vectors under the explicit guidance of typological features.
Table 1 lists the six features. For example, feature 81A indicates the order of subject, object and verb. It takes four values (SOV, SVO, VSO and Mixed). In English, 81A is SVO, while in Japanese, it is SOV. We assign a 3-dimensional vector to each feature value. The typological vector l of a language is obtained by concatenating the value vectors of the six features (a sketch is given below). The vectors are randomly initialized and are learned with the multilingual model.
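A minimal sketch of this construction, assuming each WALS feature is mapped to an integer value index; the per-feature value counts and the example indices in the comments are placeholders, not the actual WALS encoding.

```python
import torch
import torch.nn as nn

class TypologyEmbedding(nn.Module):
    # Concatenate a learned 3-dim vector per WALS feature value into one
    # typology vector l per language (6 features -> 18 dims for WALS-6).
    def __init__(self, num_values_per_feature, value_dim=3):
        super().__init__()
        self.tables = nn.ModuleList(
            [nn.Embedding(n, value_dim) for n in num_values_per_feature])

    def forward(self, feature_values):
        # feature_values: LongTensor of shape (num_features,), the value index
        # of each WALS feature (81A, 85A, ...) for one language.
        parts = [table(v) for table, v in zip(self.tables, feature_values)]
        return torch.cat(parts, dim=-1)

# e.g. six features with 4 possible values each (placeholder value counts):
# typo = TypologyEmbedding([4, 4, 4, 4, 4, 4])
# l_en = typo(torch.tensor([1, 0, 2, 1, 0, 3]))   # hypothetical English values
```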
We also experiment with two other sets of typological features. The second set is an extension of the above six features provided by Scholivet et al. (2019), which contains 19 features from WALS. The third feature set is taken from the URIEL typology database (Littell et al., 2017), which is a collection of binary features extracted from multiple typological and phylogenetic databases (WALS, PHOIBLE (Steven et al., 2014), and Glottolog (Hammarström et al., 2021)). This set contains 103 syntactic typological features.

Position Generation
Given the typological vector l of a language, we train position generation networks (jointly with the multilingual model) to output position vectors customized for that language (absolute positions a_i^(l) or relative positions r_i^(l) for language l). Throughout the paper, we set the dimension of position vectors to 128, the range of absolute positions to {0, 1, ..., 127}, and the range of relative positions to {−4, −3, ..., 4}. We describe two position generation models, a simple MLP network and a self-attention network enhanced with a prior on positions.

MLP Position Generator
For each absolute position i, we deploy a two-layer MLP to learn a nonlinear transformation from the typology vector space to the position space. Specifically, the i-th position vector is

a_i^(l) = W^2 g(W^1 l + b^1_i) + b^2_i,   (3)

where W^1 and W^2 are transformation matrices, g(·) is a non-linear activation function, and b^1_i and b^2_i are independent parameters for each position. Relative positions are generated in a similar way: for α_ij and o_i in Equation 2, we use two different position vectors r^K,(l)_i and r^V,(l)_i generated by two MLPs.

Self-attention Generator   The MLP generator learns position vectors based only on the position index i and the typological vector l. It is possible (Figure 3) that the learned vectors no longer carry the semantics of "position" (e.g., that vectors of two close positions are more similar than vectors of two distant positions). Therefore, we also try to include prior knowledge on positions to regularize the learned vectors. We build a new position generator based on multi-head self-attention layers.
For absolute position vectors, we assign one head of the self-attention layer to each position i. The typological vector l is treated as the query vector, and a set of prior position vectors [c_0, c_1, ..., c_127] serve as key and value vectors. The absolute position representation a_i^(l) is obtained by a weighted average over the prior vectors:

a_i^(l) = softmax( (l W^Q_i)(C W^K_i)^T / √d ) (C W^V_i),   (4)

where C stacks the prior vectors [c_0; c_1; ...; c_127], and W^Q_i, W^K_i and W^V_i are the parameter matrices of position i, shared among all languages. The self-attention operation can be seen as a soft version of selecting a vector from an existing position vector set. In experiments, we set the prior position vectors via the sinusoidal functions (Equation 1). We can also build relative position vectors using the prior vectors.
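The two generators can be sketched as follows. This is a minimal PyTorch rendering of Equations 3 and 4: the class names, the initialization scheme, and the implementation of per-position parameter matrices as stacked weight tensors are our assumptions, not the exact parameterization used in our experiments.

```python
import torch
import torch.nn as nn

class MLPGenerator(nn.Module):
    # Equation 3: a two-layer MLP from the typology vector l to position vectors,
    # with independent bias terms b1_i, b2_i for each position.
    def __init__(self, typo_dim, pos_dim=128, num_pos=128, hidden=256):
        super().__init__()
        self.W1 = nn.Linear(typo_dim, hidden, bias=False)
        self.W2 = nn.Linear(hidden, pos_dim, bias=False)
        self.b1 = nn.Parameter(torch.zeros(num_pos, hidden))
        self.b2 = nn.Parameter(torch.zeros(num_pos, pos_dim))

    def forward(self, l):
        # l: (typo_dim,) -> generated positions: (num_pos, pos_dim)
        return self.W2(torch.relu(self.W1(l) + self.b1)) + self.b2

class AttnGenerator(nn.Module):
    # Equation 4: one attention head per position; the typology vector is the
    # query, the prior vectors c_0..c_{num_pos-1} are keys and values, so each
    # generated vector is a weighted average of the priors.
    def __init__(self, typo_dim, pos_dim=128, num_pos=128, prior=None):
        super().__init__()
        self.register_buffer('prior', prior if prior is not None
                             else torch.randn(num_pos, pos_dim))
        self.W_Q = nn.Parameter(torch.randn(num_pos, typo_dim, pos_dim) * 0.02)
        self.W_K = nn.Parameter(torch.randn(num_pos, pos_dim, pos_dim) * 0.02)
        self.W_V = nn.Parameter(torch.randn(num_pos, pos_dim, pos_dim) * 0.02)

    def forward(self, l):
        d = self.prior.size(-1)
        q = torch.einsum('f,ifd->id', l, self.W_Q)            # query of each position i
        k = torch.einsum('pf,ifd->ipd', self.prior, self.W_K)
        v = torch.einsum('pf,ifd->ipd', self.prior, self.W_V)
        attn = torch.softmax(torch.einsum('id,ipd->ip', q, k) / d ** 0.5, dim=-1)
        return torch.einsum('ip,ipd->id', attn, v)             # (num_pos, pos_dim)
```

With the sinusoidal vectors from the earlier sketch as priors (e.g., AttnGenerator(18, prior=sinusoidal_positions())), the generated vectors stay anchored to the basic position space.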
Introducing prior vectors has another advantage regarding interpretability: they provide a coordinate system in which we can compare the learned position spaces of different languages. In other words, for a newly learned position i, we can compare its attention patterns in two different languages. For example, if the learned absolute position vector a_i shifts left (attending more to its left positions [c_{i−1}, c_{i−2}, ...] than to its right positions [c_{i+1}, c_{i+2}, ...]), this position feature may explicitly guide the multilingual model to attend more to the left contexts of i. We depict the attention patterns of each position in Figure 4 and find that (a sketch of the shift computation follows this list):

• For almost all languages, positions near the front end (0) and the back end (127) always attend inwards: the front positions shift right, and the back positions shift left. Therefore, for short sentences, position vectors will always push the model to see the whole input, and for long sentences, they will suggest the model replay the input at the end.

• For the middle positions, their attention patterns correlate well with the language's branching type: for left-branching languages (i.e., head words follow their complements), they usually shift left, while for right-branching languages (i.e., head words precede their complements), they shift right.

• If we perturb the typological vector, the distribution of attention patterns changes accordingly. For example, on Italian, when we freeze all its typological features and only change its noun-adjective order feature from NA to AN, 5% of its position vectors change from shifting right to shifting left.

The above observations suggest that, guided by the typological features, position vectors are endowed with meaningful and language-specific hidden structures, and these structures can be visualized with the help of the prior position vector bases.
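As referenced above, a minimal sketch of the shift computation follows; it assumes the per-position attention weights over the prior vectors (the softmax inside Equation 4) are exposed, e.g., via a hypothetical return_attn flag on the generator.

```python
import torch

def shift_pattern(attn):
    # attn: (num_pos, num_pos) attention of each generated position i over the
    # prior positions c_0..c_{num_pos-1}. For each i, compare the mass placed on
    # positions to the right of i with the mass placed to its left; positive
    # values mean the position "shifts right" (cf. the y-axis of Figure 4).
    num_pos = attn.size(0)
    shifts = []
    for i in range(num_pos):
        left = attn[i, :i].sum()
        right = attn[i, i + 1:].sum()
        shifts.append((right - left).item())
    return shifts
```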

Training and Testing
During training, we sample each batch from the 13 high-resource languages with equal probability, which increases the diversity of training for the multilingual positional encodings. During testing, we first generate the corresponding positional encodings for a language once and then use them directly as position vectors in the parsing task, which means that our generation network adds almost no computational cost during testing.
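A minimal sketch of this procedure, assuming per-language batch iterators and a trained generator; the function names are ours.

```python
import random
import torch

def sample_training_batch(loaders_by_language):
    # Draw the next batch from one of the training languages, chosen uniformly,
    # so every language contributes with equal probability per step.
    # loaders_by_language: dict mapping a language code to a batch iterator.
    lang = random.choice(list(loaders_by_language))
    return lang, next(loaders_by_language[lang])

def cache_position_vectors(generator, typology_vectors):
    # At test time the generator runs once per language; the resulting tables
    # are then used as ordinary (frozen) position embeddings by the parser.
    with torch.no_grad():
        return {lang: generator(l) for lang, l in typology_vectors.items()}
```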

Experiments
Dataset   Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech and syntactic dependencies) across different human languages (Zeman et al., 2018). Following Kulmizev et al. (2019) and Üstün et al. (2020), we choose 13 representative training languages (high-resource) and 30 testing languages (zero-resource). Statistics for the treebanks are listed in the supplemental material. The crosslingual word representations are derived from mBERT (Devlin et al., 2019). Since the mBERT representation is subword-level, we follow previous work in taking the first subword as the word-level representation (a sketch is given below).
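A minimal sketch of the first-subword pooling, under the assumption that the sentence-level subword states and the word-to-first-subword mapping are already available.

```python
import torch

def first_subword_representation(subword_states, word_start_index):
    # subword_states: (num_subwords, dim) mBERT outputs for one sentence;
    # word_start_index: list giving, for each word, the index of its first
    # subword. Each word is represented by that first subword's vector.
    return subword_states[torch.tensor(word_start_index)]

# e.g. "unbelievable story" -> ["un", "##believ", "##able", "story"]:
# word_start_index = [0, 3] picks the vectors of "un" and "story".
```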
Evaluation   Parsing performance is measured with the labeled attachment score (LAS). We use the official evaluation scripts provided in the CoNLL 2018 shared task (Zeman et al., 2018). All of our results are averaged over three runs.
Supplemental Material   Full results for the 30 zero-shot languages (A), experimental details including hyperparameters, training time, and model size (B), more visualizations of position representations (C), and dataset statistics (D) are placed in the supplementary material.

Main Results
Baselines   We use T_abs and T_rel to denote Transformers with absolute and relative position representations respectively, and denote the two position generation methods as MLP and ATTN. We conduct experiments with the following six baseline methods:

• udpipe (Straka, 2018), a monolingually trained multi-task parser;

• uuparser (Kulmizev et al., 2019), a monolingually trained BiLSTM parser using mBERT as an additional crosslingual feature;

• udify (Kondratyuk and Straka, 2019), an mBERT parser with all parameters fine-tuned, which can be trained both monolingually and multilingually;

• udapter (Üstün et al., 2020), a multilingually trained parser which only fine-tunes additional adapter parameters in mBERT;

• ID, which assigns each language a vector (learned from scratch) that is added directly to the universal position vectors (Östling and Tiedemann, 2017);

• Feat, which directly adds typology feature vectors (constructed in Section 4) to the universal position vectors (Scholivet et al., 2019).

Results
We train parsers both monolingually (one model per language) and multilingually (one model for all languages) (Table 2).
For the udify model, which does not include any linguistic prior, the multilingual version underperforms the monolingual version on high-resource languages, which indicates negative transfer. On the other hand, the four methods that add typological priors (URIEL), namely ID, Feat, MLP, and ATTN, can reduce the gap to the best monolingual result: language ID embeddings (ID) have the smallest effect, followed by typological features (Feat), and our two proposed position generation methods are more effective. In particular, the ATTN method significantly improves the performance of the multilingual parser, gaining 4.0 LAS with T_abs and 4.1 LAS with T_rel (compared with Feat). It also outperforms monolingual training by 1.2 LAS and 1.6 LAS. By simply adding the prior to all position vectors (which approximates a bias term), ID and Feat can hardly control the learning of the single prior parameter, so their performance gains are marginal. ATTN always outperforms MLP, which suggests that keeping the correct semantics of "position" is crucial for learning an effective position space.
A major advantage of introducing language-specific information in multilingual training is the ability to parse languages that have not been seen during training. On the 30 widely selected zero-resource languages (a subset is shown in Table 3), all methods except ID improve performance. The ATTN method still achieves the highest zero-resource parsing scores, which could be attributed to both its effective way of encoding typological features (self-attention) and its use of a proper position prior.
The current best parser, udapter, is based on adapter fine-tuning. Similar to MLP, udapter uses a multi-layer perceptron to generate adapter parameters from generic typological information (URIEL). Unlike udapter, our methods focus on guiding the position vectors, which account for a smaller number of parameters in the parser. Compared with udapter, ATTN leads on 11 out of 13 high-resource languages, and on zero-resource languages it further improves by 0.5 LAS. These results suggest that explicitly associating typological information with the learning of position vectors makes better use of typological information.

Analyses
The Effect of Typological Features   To analyse the effect of typological features, we conducted experiments on four feature settings with two position generation methods (Figure 5(a)), including the WALS-6 set (6 word order features from WALS (Naseem et al., 2012)), the WALS-19 set (Scholivet et al., 2019), and the URIEL set (Littell et al., 2017) of word order typological features. We observe that the largest boost comes from URIEL because it has the richest typological features.

Table 3: We select a subset of the zero-resource languages for demonstration. Eight languages are in the pre-training process of mBERT; six languages are not (complete zero-shot).
Since our method adds additional network parameters (from the MLP or positional attention networks), comparisons may be unfair. We therefore conduct parameter-size fairness experiments by replacing the URIEL values with random values, which means that our generation networks are guided by nonsensical information. The results show that, even though we retain the additional parameters, the random feature values severely hurt the gain from the generation networks. Therefore, our models' performance improvements are not due to the additional parameters.
The Effect of Sinusoidal Priors   In the self-attention position generator (Equation 4), we introduce the sinusoidal prior vectors c_i. As the sinusoidal prior describes some properties of positions, it keeps the generated positions from deviating too far from the basic position space (Figure 3). It might also speed up convergence and improve the model's inductive bias. Figure 5(b) compares the loss curves and validation LAS curves of three models, ATTN, ATTN-sinusoidal, and MLP, where ATTN-sinusoidal replaces the sinusoidal priors with randomly initialized learnable vectors. The results show that ATTN-sinusoidal degenerates to be comparable to MLP. This demonstrates that the sinusoidal priors not only help ATTN converge faster, but also help it end up with higher performance.

Related Work
Multilingual Parsing   Dong et al. (2015) and Johnson et al. (2017) identify the (positive) transfer versus (negative) interference trade-off problem in multilingual neural machine translation. Early multilingual dependency parsing studies consider word representations as a negative transfer factor and learn delexicalized parsers (McDonald et al., 2013; Naseem et al., 2012; Duong et al., 2015). Although they avoid negative transfer, valuable lexical information is lost. With the development of multilingual word representations, Ammar et al. (2016) and Straka (2018) train multilingual parsers using multilingual word embeddings, and Kondratyuk and Straka (2019) and Üstün et al. (2020) train multilingual parsers using multilingual pre-trained representations (mBERT (Devlin et al., 2019)). Once word representations became a positive factor, recent studies found that word order became a new negative factor. Ahmad et al. (2019) and Ji et al. (2021) observe the negative transfer phenomenon of word order in a zero-shot cross-lingual scenario. Previous work simply considers word order features as input (Östling and Tiedemann, 2017; Scholivet et al., 2019; Üstün et al., 2020). Instead, we explicitly associate them with the order-related parameters (i.e., position representations) in the Transformer network.

Conclusions
We studied the role of position spaces in multilingual learning. By comparing a universal position space and language-specific position spaces, we showed that the latter can both handle the linguistic constraints of different languages effectively and provide a clear path for positive transfer in multilingual learning. We developed a self-attention based position space generator and showed that, by utilizing a typological prior and an existing position space prior, the multilingual dependency parser enjoys positive transfer on both high-resource and zero-resource languages. One direction for future work is to investigate whether the obtained position vectors can help other multilingual and monolingual tasks. It is also interesting to compare the position spaces induced from different multilingual tasks (supervised or unsupervised).

Limitations
An obvious limitation is that our work relies on the typological features of languages. Some extremely rare languages might lack typology studies (their features are missing values in the WALS database), and our approach is limited for these languages. Another, non-critical, limitation is that the technical contribution of our work is modest. After detailed analyses of position vectors, our methods for generating position vectors are not that complex, but we believe that an effective method is not necessarily complex, and designing experiments to reveal key properties of position features and their connection with linguistic knowledge can still make solid contributions to the NLP community.

A Zero-Shot Results
Table 4 shows LAS scores on all 30 zero-resource languages for the two types of Transformers guided by the three typological methods, as well as for UDapter (Üstün et al., 2020) and udify (Kondratyuk and Straka, 2019). Languages marked with "*" are not present in the mBERT training data. Overall, our ATTN approach achieves state-of-the-art performance, especially with the T_abs model. Our MLP approach also shows visible improvements and is able to compete with udapter. This suggests that position representations guided by typological features can be successfully transferred to unseen zero-shot languages. In addition, we specifically look at languages that are not in the mBERT training set, for which the cross-lingual word representations are not well aligned. The performance on these languages is almost always unacceptably low. This suggests that multilingual word representations are the foundation of multilingual position representations.

B Experimental Details
Implementation   Our parser's implementation is based on the framework proposed by Kulmizev et al. (2019). The only difference is that we replace their BiLSTM context encoder with the now-popular Transformer context encoder, because the position representations we focus on are an important part of the Transformer encoder. The hyperparameters of the parser classifier are identical to those of Udapter (Üstün et al., 2020) (Table 5), without applying a new hyperparameter search. Unlike Udapter, which fixes mBERT and trains extra adapter modules added into it, we train the extra Transformer context encoder with multilingual position encodings on top of the fixed mBERT.

Training Time and Model Size   In terms of training time, on an NVIDIA RTX3090 GPU, our parser takes about 15 minutes per epoch.

C Additional Visualization
Figure 6 shows a visualization of the position vectors for three languages: English, Italian and Chinese. We can see that they do not exhibit loose similarity patterns and still keep some properties of the sinusoidal position prior (e.g., symmetry), and that these patterns clearly differ between languages. This further suggests that it is not reasonable to use the same position representation for different languages; it is necessary to guide multilingual position representations with appropriate methods (e.g., our word order features).

Figure 1: a) The correlation between position space similarities (to English) and linguistic distance. b) The parsing performances on English when substituting different languages' position vectors. The x-axis is the linguistic distance defined in Scholivet et al. (2019). c) The parsing accuracies of customized multilingual position vectors (MPR) and universal position vectors (UPR).

Figure 2: Probing accuracies of SVO recognition on English when substituting position vectors from other languages (sv, it, ru, fi, he, zh, eu, ko, tr, ja).

Figure 3: Cosine similarity of the first 50 position vectors. We can see that vectors learned from the MLP exhibit loose similarity patterns, while those learned with prior position knowledge and self-attention (ATTN) still keep some properties of the sinusoidal position prior (e.g., symmetry). We can also observe that, compared with the sinusoidal vectors, the self-attention vectors contain more stripes along the diagonal. This means that they contain more locality constraints: if position i is similar to position j, it may also be similar to positions around j.

Figure 4: Attention patterns of position vectors. The x-axis is the 128 positions, and the y-axis shows the difference between the attention scores allocated to the right and to the left (positive values mean a right shift). The x-axis is divided into 3 regions: the front part contains the first 30 positions, the back part contains the last 30 positions, and the middle part contains the remaining 68 positions. Percentages report how many positions in the middle part shift left/right.

Figure 5: (a) The average LAS of the 13 high-resource languages for the four different typological feature settings. (b) The value of the loss function during training as well as the performance on the dev set.

Figure 6: Cosine similarity of the first 50 position vectors for English, Italian and Chinese. We can see that the vectors learned for the three languages keep some properties of the sinusoidal position prior (e.g., symmetry), and that the three languages have distinctly different patterns.

Table 1: Six word order typological features from WALS (above), and the typological vectors l of Arabic and Bulgarian (below).

Table 2: Multilingual parsing performances. The last two columns show the average LAS of the 13 high-resource (HR) and 30 zero-resource (ZR) languages respectively.

Table 5: Dimensions of the MLP position generators (D_in → D_hid → D_out): MLP with WALS-6: 18 → 256 → 768; MLP with WALS-19: 57 → 256 → 768; MLP with URIEL: 103 → 256 → 768.

Table 6: Statistics of the high-resource languages from UD v2.3. We chose the same treebanks as Kulmizev et al. (2019).