Evaluating morphological typology in zero-shot cross-lingual transfer

Cross-lingual transfer has improved greatly through multilingual language model pretraining, reducing the need for parallel data and increasing absolute performance. However, this progress has also brought to light differences in performance across languages: certain language families and typologies seem to consistently perform worse in these models. In this paper, we examine the effect of morphological typology on zero-shot cross-lingual transfer for two tasks: part-of-speech tagging and sentiment analysis. We perform experiments on 19 languages from four morphological typologies (fusional, isolating, agglutinative, and introflexive) and find that transfer to another morphological type generally implies a higher loss than transfer to another language of the same morphological typology. Furthermore, POS tagging is more sensitive to morphological typology than sentiment analysis, and, on this task, models perform much better on fusional languages than on the other typologies.


Introduction
Cross-lingual transfer uses available annotated resources in a source language to learn a model that will transfer to a target language. Earlier work used machine translation (Mihalcea et al., 2007), parallel data (Padó and Lapata, 2009), or delexicalized models (Zeman and Resnik, 2008; McDonald et al., 2011; Søgaard, 2011) to bridge the gap between languages. However, recent improvements (Devlin et al., 2019) have reduced the need for parallel data, instead relying on multilingual language models trained on the concatenation of monolingual corpora. Fine-tuning these multilingual language models on a task in a source language can lead to strong performance when they are applied directly to the target-language task (zero-shot transfer). This progress has uncovered gaps in performance, as transfer is generally easier between similar languages, and some language families consistently perform worse (Artetxe et al., 2020; Conneau et al., 2020a). So far, however, the analysis of these differences has been anecdotal, rather than treated as a research question in its own right. For these cases, linguistic typology has important implications, as it gives us ways to quantify the similarity of languages along certain variables, such as shared morphological or syntactic features (Bender, 2013). While previous work has studied the effects of morphological typology on language modeling (Gerz et al., 2018; Cotterell et al., 2018; Mielke et al., 2019), its effect on cross-lingual transfer has not been examined in detail.
In this paper we attempt to answer (RQ1) to what degree morphological typology affects the performance of state-of-the-art cross-lingual models, (RQ2) whether morphological typology has a stronger effect than other variables, e.g., the amount of data for pretraining the LM or domain mismatches between source and target, (RQ3) whether there is a different effect on a low-level structural task (POS tagging) vs. a semantic task (sentiment analysis).
To answer these questions we experiment with two state-of-the-art cross-lingual models: multilingual BERT and XLM-RoBERTa. We fine-tune the models for part-of-speech tagging and sentiment analysis on 19 languages from four morphologically diverse typologies. Our results show that POS tagging is more sensitive to morphological typology than sentiment analysis and that the models perform much better on fusional languages, such as German, than on the other typologies. We release the code and data in order to reproduce the experiments and facilitate future work in this area.

Related Work
Although these approaches have led to large improvements on many cross-lingual tasks, it is clear that the success of zero-shot cross-lingual transfer depends on the typological similarity of the source and target languages (Conneau et al., 2020b; Libovický et al., 2020). Pires et al. (2019) find that POS performance correlates with word-order features taken from the World Atlas of Language Structures (WALS) database (Dryer and Haspelmath, 2013). Similarly, morphologically complex languages tend to achieve poorer performance (Artetxe et al., 2020; Conneau et al., 2020a).
Similar to this work, Lauscher et al. (2020) perform zero-shot and few-shot transfer on 20 languages and 5 tasks. However, their choice of languages does not allow one to isolate the effect of morphological typology.
The effect of morphological typology on NLP tasks is well known (Ponti et al., 2019), with several dedicated workshop series (Nicolai et al., 2020; Zampieri et al., 2018). More recently, attention has turned to larger-scale analyses of morphological typology effects on language modeling (Gerz et al., 2018; Cotterell et al., 2018; Mielke et al., 2019).
In contrast to these previous works, we are interested in how morphological typology affects cross-lingual transfer for two supervised tasks, namely part-of-speech (POS) tagging and sentiment analysis. We choose these two tasks because (1) both have data available in typologically diverse languages, and (2) they represent a lower-level structural and a higher-level semantic task, respectively. Our experimental setup reduces some of the complexity of comparing test results across languages, as we compare relative differences instead of absolute differences. At the same time, it is necessary to take several other variables into account, namely the presence of each language in pretraining, the amount of training data, the effect of byte-pair tokenization, the length of train and test examples, and any domain mismatches across languages.

Morphological typology
Although it is a simplification of the variation in morphological features (Plank, 1999), languages have traditionally been grouped into four morphological categories: isolating, fusional, introflexive, and agglutinative. These categories describe a language's tendency to group concepts together into a single word or to disperse them into separate words. Purely isolating languages have maximally one morpheme per word. In agglutinative languages, morphemes tend to be neatly segmentable and carry a single feature, whereas in fusional languages a single morpheme often carries multiple grammatical, syntactic, and semantic features. Finally, in introflexive languages root words are based on consonant stems, where vowels introduced around and between them lead to syntactic and semantic changes (see Plank (1999), Bickel and Nichols (2005), and Gerz et al. (2018) for a more in-depth discussion).

Data
We select five languages from each category, except for the introflexive group, where we select four; the full list is shown in Table 1.

Part-of-speech
We obtain the data for the part-of-speech tagging task from the Universal Dependencies project (Zeman et al., 2020), which currently provides data annotated with universal POS tags for more than 90 languages, although the treebanks differ in size and domain. For Algerian, we use the annotations from Seddah et al. (2020). No training sets are available for Thai and Cantonese, so we use these languages for testing only. For more details on these datasets, see Table 5 in the Appendix.

Methods
We fine-tune both multilingual BERT (mBERT) (Devlin et al., 2019) and XLM-RoBERTa (XLM-R) (Conneau et al., 2020a) models on the available training data in each language, using a shared set of hyperparameters selected from recommended values according to the characteristics of our data. We set the learning rate to 2e-5 and the maximum sequence length to 256, use a batch size of 8 or 16, and perform early stopping once the validation score has stopped improving over the most recent epochs, saving the model that performs best on the dev set. We then test each model on all languages, giving us a matrix of test scores, where the diagonal is in-language and all other cells are cross-lingual. We use accuracy as our metric for POS and macro F1 for sentiment, as the latter often has unbalanced classes, and define a baseline as the result of predicting the majority class.
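As an illustration, a minimal fine-tuning sketch with the HuggingFace transformers library might look as follows. This is our reconstruction under stated assumptions, not the paper's exact training code: `train_ds` and `dev_ds` are placeholder tokenized datasets (max sequence length 256), and the early-stopping patience and epoch cap are not specified in the paper.

```python
import numpy as np
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"   # or "xlm-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=17)                # 17 universal POS tags

def compute_metrics(p):
    """Token-level accuracy, ignoring padded/ignored positions (label -100)."""
    preds = np.argmax(p.predictions, axis=-1)
    mask = p.label_ids != -100
    return {"accuracy": float((preds[mask] == p.label_ids[mask]).mean())}

args = TrainingArguments(
    output_dir="pos-model",
    learning_rate=2e-5,                       # shared hyperparameters (Methods)
    per_device_train_batch_size=16,           # 8 or 16 in the paper
    num_train_epochs=20,                      # upper bound; early stopping below
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,              # keep the best dev-set checkpoint
    metric_for_best_model="accuracy",
)

# train_ds / dev_ds: tokenized datasets with aligned POS labels, assumed given.
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds,
                  compute_metrics=compute_metrics,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
trainer.train()
```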

Results
Once the score matrix is built, we average the score of each fine-tuned model (the language-to-language cross-lingual scores) over the other languages in each morphological group, thus obtaining each model's average cross-lingual performance per target group (language-to-group cross-lingual scores). Next, we average again over each source-language group. This yields the average cross-lingual performance per training and testing language group (group-to-group cross-lingual scores), which we report in Table 3.
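As an illustration of this two-step averaging, assuming the test scores live in a pandas DataFrame (the column names and the `GROUP` mapping are ours, not from the paper):

```python
import pandas as pd

# scores: one row per fine-tuned model / test language pair, e.g.
#   source  target  score
#   "de"    "tr"    61.2
# GROUP: hypothetical dict mapping each language to its morphological group.
scores["source_group"] = scores["source"].map(GROUP)
scores["target_group"] = scores["target"].map(GROUP)

# Language-to-group: average each model over the target languages of a group.
lang_to_group = scores.groupby(
    ["source", "source_group", "target_group"])["score"].mean()

# Group-to-group: average again over the source languages of each group.
group_to_group = (lang_to_group
                  .groupby(["source_group", "target_group"]).mean()
                  .unstack("target_group"))
```

Averaging in two steps, rather than pooling all language pairs at once, keeps groups of different sizes comparable.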
In the part-of-speech task, the best group-to-group cross-lingual performance always corresponds to models fine-tuned on a language of the same morphological group, regardless of the model architecture. Fusional models, in particular, obtain a markedly higher score when tested on other fusional languages (over 80%). On the other hand, the group-to-group cross-lingual scores where the target language is introflexive are considerably lower than the rest (always below 50%).
In contrast, the two model architectures show different patterns in the sentiment analysis task. For the XLM-R models, the best group-to-group cross-lingual scores are all achieved by models trained on a fusional language, while for mBERT it is mainly models trained on an isolating language that achieve the best scores. In any case, all scores lie within a similar range. In fact, the main difference in this task seems to be XLM-R's considerably higher scores.
In order to capture the cross-lingual phenomenon more accurately, we introduce transfer loss, a relative metric defined in Equation 1:

$$TL_{x \to y} = S_{x \to x} - S_{x \to y} \tag{1}$$

where $TL_{x \to y}$ is the transfer loss experienced by a model fine-tuned on language $x$ when transferring to language $y$ (language-to-language transfer loss) and $S_{x \to y}$ is the score achieved when testing a model fine-tuned on language $x$ on language $y$. Transfer loss thus measures the performance lost in the zero-shot transfer process: the better the transfer between the two languages, the lower it will be. We also define its averaged variants:

$$TL_{x \to A} = \frac{1}{N_A} \sum_{y \in A,\, y \neq x} TL_{x \to y} \tag{2}$$

$$TL_{A \to B} = \frac{1}{|A|} \sum_{x \in A} TL_{x \to B} \tag{3}$$

where $TL_{x \to A}$ denotes the average transfer loss from language $x$ to languages belonging to morphological type $A$ (language-to-group transfer loss), $TL_{A \to B}$ refers to the average transfer loss experienced by languages from morphological group $A$ when transferring to languages from group $B$ (group-to-group transfer loss), and $N_A$ is the number of languages (other than $x$) included in the experiment that belong to group $A$. Table 4 shows the resulting group-to-group transfer loss values for each task.

Table 3: Group-to-group cross-lingual accuracy scores (%) in part-of-speech tagging (top) and macro F1 scores (%) in sentiment analysis (bottom) for each fine-tuning (column) and testing (row) morphological group, and each model architecture. Maximum values in each test group and architecture are highlighted. Higher is better.
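Continuing the sketch above, with the scores arranged as a square DataFrame `S` (rows: fine-tuning language, columns: test language), the three transfer-loss variants reduce to a few lines (variable names are again ours):

```python
import numpy as np
import pandas as pd

# S.loc[x, y]: score of the model fine-tuned on x when tested on y.
in_language = pd.Series(np.diag(S.values), index=S.index)

# Eq. 1: TL[x, y] = S[x, x] - S[x, y]
TL = S.rsub(in_language, axis=0)

# Eq. 2: average over target languages y in group A, excluding y == x.
TL_off_diag = TL.mask(np.eye(len(S), dtype=bool))      # drop x -> x entries
lang_to_group = TL_off_diag.T.groupby(GROUP).mean().T  # TL[x -> A]

# Eq. 3: average over the source languages of each group.
group_to_group = lang_to_group.groupby(GROUP).mean()   # TL[A -> B]
```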
Models fine-tuned on all groups except agglutinative experience the lowest performance drop when transferring to fusional languages in the part-of-speech task, whereas in the sentiment analysis task there is no clear pattern. It is also worth noting that the XLM-R models tend to transfer better than mBERT, only slightly in part-of-speech tagging but more markedly in sentiment analysis. Additionally, the worst transfer occurs when the target language is introflexive (especially for XLM-R).
Next, to address RQ1 more directly, we compare two types of transfer: intra-group transfer, where the fine-tuning and target languages belong to the same morphological group, and inter-group transfer, where they differ in morphological type. We calculate the average for both types of transfer for each training group, model architecture, and task, and present the resulting values in Figure 1.
Generally, transfer to another morphological type implies a higher cost in terms of performance, except for the introflexive models. This difference in transfer loss appears to be similar for all groups in the sentiment task, yet it varies considerably in the part-of-speech task. More specifically, there are two extremes in the latter case: fusional models suffer large performance drops when switching morphological groups, whereas isolating models experience similar transfer losses in both conditions.
Finally, we average again to obtain a single transfer loss value for each task and model, and use it to establish a comparison in Figure 2. Here we observe that: (1) the difference in transfer loss between intra-group and inter-group transfer is higher on the part-of-speech task, (2) transfer is also generally worse on this task, (3) XLM-R models perform better cross-lingual transfer in general (especially on the sentiment analysis task), and (4) the difference between intra-group and inter-group transfer is similar for both model architectures.

Analysis
In this section, we run several statistical tests to verify our conclusion regarding RQ1 and detail several points of analysis that relate to RQ2 and RQ3, namely, to what degree other variables contribute to the observed effects on cross-lingual transfer.

Testing the effect of transfer type
We run a set of statistical tests to validate the observations made from Figure 2 in Section 5. In the part-of-speech tagging task, an analysis of variance (ANOVA) reveals a statistically significant, although weak, difference in transfer loss between the intra- and inter-group conditions for both model architectures (η² ≈ 0.06, p < 0.01 in both cases). In contrast, a Kruskal-Wallis analysis of variance finds no significant difference between the two types of transfer in the sentiment analysis task, for either mBERT or XLM-R (p > 0.01 in both cases). We also test for differences in transfer loss between model architectures and find a significant difference in the sentiment analysis task (Kruskal-Wallis, p < 0.01), but not in the part-of-speech tagging task (ANOVA, p > 0.01). This is all consistent with our previous observations.

Table 4: Group-to-group transfer loss (in percentage points) in the part-of-speech tagging (top) and sentiment analysis (bottom) tasks for each fine-tuning (column) and testing (row) language's morphological group, as well as each model architecture. Minimum values in each fine-tuning group and architecture are highlighted. Lower is better.
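For reference, a minimal sketch of these tests with scipy. The arrays `intra` and `inter` are assumed to hold the language-to-language transfer-loss values for one task and architecture; η² is computed by hand, since scipy does not report it:

```python
import numpy as np
from scipy.stats import f_oneway, kruskal

# intra, inter: 1-D arrays of language-to-language transfer-loss values.
f_stat, p_anova = f_oneway(intra, inter)

# Effect size eta^2 = SS_between / SS_total (two-group case).
pooled = np.concatenate([intra, inter])
ss_total = ((pooled - pooled.mean()) ** 2).sum()
ss_between = sum(len(g) * (g.mean() - pooled.mean()) ** 2
                 for g in (intra, inter))
eta_sq = ss_between / ss_total

# Non-parametric alternative used for the sentiment analysis task.
h_stat, p_kw = kruskal(intra, inter)
```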

Linear regression model for transfer loss
Additionally, we model language-to-language transfer loss with a linear regression model, using transfer type and other variables as possible predictors. This allows us to (a) test whether the intra-/inter-group difference retains its statistical significance in the presence of other variables and (b) evaluate its effect in comparison to other predictors. First, we select a set of variables that might be relevant to cross-lingual transfer and remove those that are highly correlated with the rest, to avoid multicollinearity in the model (see Table 7 in the Appendix for the final list of selected variables). We standardize all of the remaining features so that their units, and consequently their regression coefficients, are comparable.
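A sketch of this regression with statsmodels; the predictor names here are illustrative (the actual set of retained variables is listed in Table 7 in the Appendix), and `df` is assumed to hold one row per source/target language pair:

```python
import statsmodels.api as sm

# df: one row per (source, target) pair; "transfer_loss" is the response.
predictors = ["inter_group", "target_in_pretraining", "in_language_score",
              "avg_test_example_length", "word_split_proportion_test"]

# Standardize the features so that the coefficients are comparable.
X = df[predictors].apply(lambda col: (col - col.mean()) / col.std())
X = sm.add_constant(X)

ols = sm.OLS(df["transfer_loss"], X).fit()
print(ols.summary())  # standardized coefficients and p-values per predictor
```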
Again, we find transfer type (intra-/inter-group) to be a significant predictor in both regression models for part-of-speech tagging (p < 0.01), but not in sentiment analysis. In the former case, it has the second strongest effect, with a standardized coefficient of 8.6; the strongest is the presence of the target language in pretraining, with a coefficient of -25.9. In other words, transferring to a language on which the model has not been pretrained implies an additional performance drop of 25.9 percentage points, while transferring to another morphological group incurs an additional 8.6.
The remaining predictors for this task are average test example length (measured in tokens, coefficient of 4.0) and in-language score (3.3). The first is a complex variable, because differences in text length can be due to the domain or to the languages themselves; in either case, its coefficient confirms our intuition that longer sequences generally make the task more difficult. The second could indicate some overfitting to the fine-tuning language, as a higher in-language score entails slightly poorer transfer.
XLM-R adds another predictor: the proportion of words that are split into subword tokens in the test data (2.1). This variable is related to the size of each language's pretraining corpus: a richer pretraining vocabulary ensures that more words are considered frequent during Byte-Pair Encoding and are therefore assigned a single token, instead of being broken down into subword tokens by the tokenizer. This means that high-resource languages have a lower word-split probability and, hence, are slightly easier to transfer to. However, it is worth pointing out that this bias has little effect and is only statistically significant for XLM-R.
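This proportion can be estimated directly from the tokenizer. A possible sketch (our helper, not code from the paper):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def word_split_proportion(sentences):
    """Fraction of whitespace-separated words broken into >1 subword token."""
    words = [w for sent in sentences for w in sent.split()]
    n_split = sum(len(tokenizer.tokenize(w)) > 1 for w in words)
    return n_split / len(words)

# e.g. word_split_proportion(test_sentences) -> 0.31 (hypothetical value)
```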
In the case of sentiment analysis, the relevant predictors are: presence of the fine-tuning language in pretraining (coefficient of -11.8 for mBERT and -18.7 for XLM-R), presence of the target language in pretraining (-10.3 and -16.3), in-language score (6.8 and 6.5), proportion of words split into subword tokens in the training data (3.3 and 2.7), and proportion of examples labeled as positive in the test set (-2.8, XLM-R only).
Curiously, sentiment analysis is more sensitive than part-of-speech tagging to variables related to the training data, whereas sequence length only affects the latter. On the other hand, language inclusion in pretraining and in-language score are useful predictors in both tasks, although the former is far stronger in POS tagging and the latter is more relevant in sentiment analysis. In summary, we verify that transferring to a different morphological type has a relevant effect in part-of-speech tagging but not in sentiment analysis, regardless of the model architecture.

Testing pretrained languages only
Given the considerable effect pretraining seems to have on transfer loss (discussed in Section 6.2), we re-evaluate our results after removing the languages that were not present during the pretraining of either of the two model architectures (Cantonese, Algerian, and Maltese) and check whether there are relevant differences from our previous results.

Of course, we observe an improvement in cross-lingual scores involving either an isolating or an introflexive language, because these are the groups to which the excluded languages belong. Overall, however, re-running the statistical tests does not modify our previous conclusions (see Figure 3).

Balanced in-language scores
Since in-language score is relevant in all regression models considered in Section 6.2 (and the value of transfer loss is relative to it), we re-train all models, this time preventing them from increasing said score above a fixed threshold value (we choose the minimum in-language score previously achieved in each task and model architecture) and re-evaluate our previous conclusions. The intra-/inter-group difference in transfer loss is still statistically significant in part-of-speech tagging and not in sentiment analysis. Similarly, there is still a statistically significant difference in transfer loss between the two models only in the sentiment analysis task. All of this can be seen in Figure 3. The only notable difference is in the part-of-speech task, where the average inter-group transfer loss values for all morphological groups seem to converge to the same value (see Figure 5 in the Appendix). For more information, see Figures 5 and 6, as well as Tables 8 and 9, all of which can be found in the Appendix.

Effect of fine-tuning data size
We also test the effect that training with considerably more data has on cross-lingual transfer. We select two languages, each with around 150,000 examples available: German for the part-of-speech tagging task and Korean for sentiment analysis. We train four models with increasingly more data and then test them on all languages.
For German, we notice a considerable decline in cross-lingual scores when increasing the training size from 80,000 to 150,000 examples (see Figure 4). More specifically, in mBERT models there is an average decrease of 15.6 and 9.0 points when the cross-lingual transfer is intra- and inter-group, respectively. In XLM-R, the corresponding values are 25.4 and 19.5. Hence, a phenomenon of language specialization appears to take place, one to which XLM-R is more susceptible and which has greater consequences for intra-group transfer. To ensure this is language rather than domain/dataset specialization, we test these models on another German dataset (PUD) and find no decrease in performance.
In contrast, the average Korean cross-lingual scores remain relatively constant (see Figure 4). Therefore, the language specialization phenomenon may be more characteristic of part-of-speech tagging than of sentiment analysis.

Domain effects
Conneau et al. (2020b) find that domain mismatch in the pretraining of multilingual LMs is more problematic than domain mismatch in fine-tuning. Yet, given the variety of domains present in the sentiment data, we decided to test its effect. Proxy A-distance (Glorot et al., 2011) measures the generalization error of a linear SVM trained to discriminate between two domains. We translate 1,000 sentences from each dataset into English using Google Translate and then compute the proxy A-distance (implementation adapted from the code available at https://github.com/rpryzant/proxy-a-distance). For POS tagging, there are small but insignificant negative effects of proxy A-distance on results for both models (Pearson coefficients of -0.07, p > 0.01, for both mBERT and XLM-R). On the sentiment task, there is no significant domain effect for mBERT (-0.06, p > 0.01), while there is a small negative effect for XLM-R (-0.27, p < 0.01). This suggests that most of the transfer loss is not due to domain mismatch.
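Concretely, the proxy A-distance is 2(1 − 2ε), where ε is the held-out error of a linear classifier trained to separate the two domains. A rough sketch, simplified relative to the referenced implementation (the featurization and cross-validation setup here are our assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def proxy_a_distance(texts_a, texts_b):
    """Proxy A-distance between two samples of (translated) sentences."""
    X = TfidfVectorizer().fit_transform(list(texts_a) + list(texts_b))
    y = np.array([0] * len(texts_a) + [1] * len(texts_b))
    acc = cross_val_score(LinearSVC(), X, y, cv=5).mean()
    error = 1.0 - acc
    return 2.0 * (1.0 - 2.0 * error)   # 0 = indistinguishable domains
```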

Discussion and Future Work
In this paper, we have conducted an extensive analysis of the effects of morphological typology on cross-lingual transfer and attempted to isolate these effects from other variables. We have compared the performance of two state-of-the-art zero-shot cross-lingual models on two tasks (part-of-speech tagging and sentiment analysis) for 19 languages across four morphological typologies. We have found that transfer to another morphological type generally implies a higher performance loss than transfer to another language with the same morphological typology. Additionally, part-of-speech tagging is more sensitive to morphological differences than sentiment analysis, while sentiment analysis is more sensitive to variables related to the fine-tuning data and is less predictable in general.
We have tested this sensitivity to morphology after balancing other influential factors, such as in-language score, and the intra-/inter-group difference still remains. However, the effect of morphological typology, while significant, is not strong, given that most of the variability in transfer loss is due to other factors.
We have also confirmed that XLM-R generally transfers better than mBERT, especially on sentiment analysis. In part-of-speech tagging, we have reported considerably better transfer within fusional languages, as well as easier transfer from the other groups towards the fusional type. Moreover, we have found a case suggesting that fine-tuning on large training sets may lead to language specialization and, consequently, be detrimental to cross-lingual transfer.
It is worth noting that we do not explore whether the script used by a language has an effect on cross-lingual transfer. This is hard to control in our experimental setup, as some scripts are either unique to a single language or represented by only one language with enough data, making comparisons impossible.
The recent cross-lingual suite XTREME (Hu et al., 2020) includes a number of benchmark tasks in 40 languages. While this dataset is a useful collection of cross-lingual tasks, it is unfortunately not sufficient for our purposes. The POS data is the same as we use, while the other tasks either (a) do not contain a representative sample of language typologies, (b) use translation, introducing problems of 'translationese', or (c) consist of automatically created, not manually curated, Named Entity Recognition data. Our experimental setup avoids these problems by focusing on binary sentiment analysis, a task that has data available in many languages and does not require translation to obtain multilingual data.
Finally, this work ties in with the increasing interest in typological questions in NLP (Takamura et al., 2016; Ponti et al., 2019; Bjerva et al., 2019; Nooralahzadeh et al., 2020; Bjerva and Augenstein, 2021), where the aim is often to directly predict typological features or to use them to analyze model performance.
In the future, it would be interesting to train multilingual language models on specific language families in order to find maximal benefits from shared morphology. Additionally, as typology seems to affect tasks differently, it would be interesting to explore other tasks, e.g., dependency parsing or semantic role labeling.

Table 6: Detailed description of the data used in sentiment analysis. "Train %" and "Dev/Test %" indicate what percentage of the language's training and validation/test data, respectively, comes from the dataset in question.

Figure 5: Average transfer loss (in percentage points) to other languages of the same group (intra-group) and to languages that belong to the other groups (inter-group) in the part-of-speech tagging task after balancing in-language scores. Lower is better.

Table 9: Group-to-group transfer loss (in percentage points) in the POS tagging (top) and sentiment analysis (bottom) tasks (after balancing in-language scores) for each fine-tuning (column) and testing (row) language's morphological group, as well as each model architecture. Minimum values in each fine-tuning group and architecture are highlighted. Lower is better.

Figure 6: Average transfer loss (in percentage points) to other languages of the same group (intra-group) and to languages that belong to the other groups (inter-group) in the sentiment analysis task after balancing in-language scores. Lower is better.