Quantifying Synthesis and Fusion and their Impact on Machine Translation

Theoretical work in morphological typology offers the possibility of measuring morphological diversity on a continuous scale. However, literature in Natural Language Processing (NLP) typically labels a whole language with a strict type of morphology, e.g. fusional or agglutinative. In this work, we propose to reduce the rigidity of such claims by quantifying morphological typology at the word and segment level. We consider Payne (2017)'s approach to classifying morphology using two indices: synthesis (ranging from analytic to polysynthetic) and fusion (from agglutinative to fusional). For computing synthesis, we test unsupervised and supervised morphological segmentation methods for English, German and Turkish, whereas for fusion, we propose a semi-automatic method using Spanish as a case study. Then, we analyse the relationship between machine translation quality and the degree of synthesis and fusion at the word level (nouns and verbs for English-Turkish, and verbs for English-Spanish) and the segment level (the previous language pairs plus English-German in both directions). We complement the word-level analysis with human evaluation, and overall, we observe a consistent impact of both indices on machine translation quality.


Introduction
One of the first barriers to developing language technologies is morphology, i.e., how systematically diverse a language's word formation processes are. For instance, agglutination and fusion are two kinds of morphological processes that concatenate morphemes to a root with explicit or non-explicit boundaries, respectively. Processing morphologically-diverse languages and evaluating morphological competence in NLP models is relevant for language generation and understanding tasks, such as machine translation (MT). It is unfeasible to develop models with capacity large enough to encode the full vocabulary of every language, and it is necessary to rely on subword segmentation approaches that help constrain the capacity when generating rare, or even new, words (Sennrich et al., 2016). Hence, understanding morphology is essential to develop robust subword-based models and evaluate the quality of their outputs (Vania and Lopez, 2017). Nevertheless, there is a potential gap between probing whether an NLP model can handle "morphological richness", and what a proper measure of "morphological richness" is from the perspective of linguistic typology.

* Work started when the first author was doing a research internship with JB at Aalborg University, Campus Copenhagen.
In most of the recent NLP literature, different types of languages (e.g. agglutinative, polysynthetic) are chosen to test a more diverse handling of morphological richness (Ponti et al., 2019). There is, however, a debate as to whether languages can indeed be classified into discrete morphological categories. Payne (2017) proposed measuring morphological typology on a continuous spectrum using the indices of synthesis and fusion. Synthesis measures whether a segment is highly analytic or synthetic (from 1 upwards), whereas fusion measures whether it is highly agglutinative or fusional (from 0 to 1). Perhaps surprisingly, given the labels used in NLP publications, it is possible to identify English segments with a very low fusion index, meaning that they are highly agglutinative.1
From a more applied perspective, if the references of an evaluation set (in any language generation task) are labelled with the indices, we can perform a stratified analysis (e.g. low fusion vs. high fusion) to determine how well an NLP model handles morphology across languages. For example, we could assess whether a machine translation model fails to generate fusional segments more often than agglutinative ones for a specific target language. Knowing and quantifying that problem is the first step towards proposing a solution. Our contributions are as follows:
• We present the first computational quantification of synthesis and fusion using standard NLP evaluation sets.
• We analyse the relationship between the two indices and machine translation quality at the word level, and observe that a higher degree of synthesis or fusion usually corresponds to less accurate translations for specific word types (studying nouns and verbs in English-Turkish, and verbs in English-Spanish).
• We complement this evaluation with manual annotation of synthesis and fusion.
• We extend the analysis to the segment level, using the aforementioned language pairs plus English-German in both directions, and identify that some synthesis- and fusion-based predictors are significant for MT system outputs.
Furthermore, we release all the annotated data and evaluation results.2
2 Background and related work

Morphological typology
The field of morphological typology characterises languages in terms of their word and sentence building strategies (Payne, 2017), such as agglutination or fusion. In current NLP literature, Turkish is labelled as a highly agglutinative language due to the explicit boundaries between its morphemes, whereas Spanish is labelled as fusional for the opposite reason.
However, early typological studies already started to quantify these strategies with parameters, avoiding characterising languages holistically with a single type (e.g. Sapir (1921); Greenberg (1960); Comrie (1989)). In this context, Payne (2017) recently highlighted the indices of synthesis and fusion, which are defined as follows.

Synthesis
The index of synthesis offers a scale to contrast highly analytic and highly synthetic languages, i.e. whether words are composed of one (analytic) or several (synthetic) morphemes (Payne, 2017). Synthesis can be computed as the ratio of the number of morphemes to the number of words: it is close to 1 when the language is more analytic (e.g. Mandarin, or English to a lesser degree), and grows the more synthetic the language is (e.g. Turkish, Inuktitut). Polysynthesis can be considered present when the synthesis degree is higher than 3, although this boundary is arguable. Besides, as we claim in this study, any language can present different levels of synthesis if we evaluate it at a more fine-grained level.
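The ratio above is straightforward to compute once a segmentation is available. A minimal sketch (the helper name and the segmented examples are our own illustrations, not from the paper):

```python
def synthesis_index(segmented_words):
    """Morphemes per word: close to 1 for analytic text, higher for synthetic text."""
    total_morphemes = sum(len(morphemes) for morphemes in segmented_words)
    return total_morphemes / len(segmented_words)

# English-like, mostly analytic: "the dog-s bark-ed"
analytic = [["the"], ["dog", "s"], ["bark", "ed"]]
# Turkish-like, synthetic: "ev-ler-im-de" ('in my houses') as a single word
synthetic = [["ev", "ler", "im", "de"]]

print(synthesis_index(analytic))   # 5/3 ≈ 1.67
print(synthesis_index(synthetic))  # 4.0
```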

Fusion
Fusion is the ratio of fusional morpheme joints3 to the total number of joints. The index goes from 0 to 1, i.e. from highly agglutinative (e.g. Turkish) to highly fusional (e.g. Spanish) cases. However, we noticed that the computation of fusion is complex to automatise. For instance, Payne (2017) indicates potential cases in which to identify fusional joints, such as prefixes, suffixes, infixes, circumfixes, compounding, non-concatenative processes (reduplication, apophony, subtractive morphology) or autosegmental morphemes. Current automatic tools are not designed to identify these cases for most languages.
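The definition above can be written compactly. With $J_{\mathrm{fus}}$ the number of fusional joints and $J_{\mathrm{aggl}}$ the number of agglutinative (explicit-boundary) joints:

$$\mathrm{fusion} = \frac{J_{\mathrm{fus}}}{J_{\mathrm{fus}} + J_{\mathrm{aggl}}}$$

A fully agglutinative word thus scores 0, and the index approaches 1 as fusional joints dominate.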

Morphological typology on NLP
A survey by Ponti et al. (2019) on computational typology for NLP pointed out that morphological knowledge is potentially helpful for analysing the difficulty of generation tasks such as language modelling and neural MT, in both unsupervised and supervised settings. More specifically, they suggested that the degree of fusion (related to the index of fusion proposed by Payne (2017)) impacts the rate of less frequent words, which is a relevant parameter for generation tasks.
Besides, the studies that address morphological typology relate either to the development of morphological analysis systems or to the evaluation of typologically diverse languages in terms of morphology (e.g. Vania and Lopez (2017); Xu et al. (2020)). However, the typology used to distinguish languages varies across studies. For instance, Vania and Lopez (2017) consider four phenomena to label languages: fusionality, agglutination, reduplication and root-pattern, whereas Xu et al. (2020) consider more fine-grained elements such as affixation (prefixation, infixation and suffixation) or partial reduplication. Similarly, a fine-grained analysis of non-concatenative morphology for MT was performed by Amrhein and Sennrich (2021). It is important to note that none of the previous studies addressed these phenomena as continuous indices, but rather as discrete features.
Furthermore, other studies refer to morphological typological features only as part of the task of typological feature prediction from linguistic databases (Bjerva and Augenstein, 2018; Bjerva et al., 2019a,b, 2020; Bjerva and Augenstein, 2021), and further applications of general typological concepts to MT are scarce and do not focus on morphology (Oncevay et al., 2020).

Morphological segmentation and analysis
Morphological segmentation (Harris, 1951) aims to split a word into morphemes. There are both supervised approaches (e.g. pointer generator networks (Mager et al., 2020)) and unsupervised ones (e.g. the Morfessor family of methods (Creutz and Lagus, 2002; Poon and Domingos, 2009) or Adaptor Grammars (Eskander et al., 2019)), where the former have outperformed the latter.
Besides, the most widespread unsupervised segmentation methods (Byte-Pair-Encoding (BPE; Sennrich et al., 2016) and a method based on unigram language modelling (Kudo, 2018)) are not linked to morphological segmentation at all; rather, they are used to constrain the vocabulary size for neural generation tasks.
Finally, it is important to note that the index of synthesis can be computed with a robust morphological analyser or segmentation model (to count the number of morphemes), but neither of them is built to compute the index of fusion directly.
3 How to compute Synthesis and Fusion?

Synthesis: automatic computation
To automatically compute the index of synthesis, we need to perform robust morphological segmentation. A rule-based morphological analyser and disambiguator might be the best option where available (we use one later for Turkish in §4.2), but for the purpose of this study, we compare well-known supervised and unsupervised methods:
• Byte-Pair-Encoding (BPE) and Unigram Language Model (uniLM)4 from SentencePiece (Kudo and Richardson, 2018).
• Morfessor (Poon and Domingos, 2009).
• Pointer Generator Network (PtrNet) from the implementation of Mager et al. (2020).
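To illustrate why frequency-driven subword methods such as BPE do not necessarily align with morpheme boundaries, here is a minimal sketch of BPE merge learning (our own toy implementation, not the SentencePiece one):

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Greedy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    vocab = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges, vocab

# Frequent character sequences get merged regardless of morphology:
merges, vocab = learn_bpe_merges(["low", "lower", "lowest", "low"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

Note that the learned merges track corpus frequency, not morpheme structure, which is why BPE and uniLM tend to over-split roots in the evaluation below.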

Datasets and evaluation
We used the CELEX dataset of segmented words for English and German (Steiner, 2016, 2017), which we randomly split into training, development and evaluation data (80-10-10). Besides, for the unsupervised methods, we use the newscommentary-v15 (Barrault et al., 2019) and EuroParl-v10 (Koehn, 2005) corpora.5 Furthermore, we define two metrics to assess the performance on computing synthesis:
• Accuracy count: evaluates whether the number of morphemes in the hypothesis segmentation is the same as in the reference.
• Exact segmentation precision: analyses whether the split morphemes are the same. We first perform an automatic alignment between the hypothesis and reference segments with the parallel Needleman-Wunsch algorithm for sequences (Naveed et al., 2005), and then compute the exact match at the morpheme level.

Table 2: Annotation example in Spanish. Example (es): Hablaremos de la propuesta con la que se condenó a la ex primer ministra y fue apoyada por 147 diputados en la votación. We first identify the verbs (in bold) and obtain their morphological features (using spaCy and the UniMorph schema). Then, we split each verb into its morphemes (segmentation), and identify which features are fused in each morpheme (feats. per morph). Finally, we compute the index of fusion by dividing the fusional morpheme joints by the total joints (which include the agglutinative or explicit boundaries). On a side note, examples of verbs with zero fusion are the infinitive (e.g. hablar (to talk)) and gerund (e.g. hablando (talking)) forms.

Results and discussion

Table 1 shows the scores on morphological segmentation for both English and German. We observe that both BPE and uniLM under-perform when the word is not expected to be split (column "1"). This pattern was also observed by Bostrom and Durrett (2020), who noted that unsupervised segmentation methods tend to over-split the roots of words. Both methods improve their accuracy and precision when the number of expected morphemes is larger. Unexpectedly, Morfessor also under-performs in the "1" case for both languages, and only surpasses the other unsupervised methods when we measure precision for many morphemes. Furthermore, the supervised PtrNet method outperforms the rest in almost all scenarios. We conclude that, to compute synthesis, we should prioritise, besides a rule-based morphological analyser, a supervised segmentation method like PtrNet if data is available. We take advantage of this for the segment-level analysis in §5.
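The accuracy count and exact segmentation precision metrics can be sketched as follows; the alignment is a standard Needleman-Wunsch dynamic program over morpheme sequences (our own illustrative implementation, with arbitrary match/gap scores):

```python
def needleman_wunsch(hyp, ref, match=1, mismatch=-1, gap=-1):
    """Globally align two morpheme sequences; returns aligned pairs (None = gap)."""
    n, m = len(hyp), len(ref)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if hyp[i-1] == ref[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback from the bottom-right corner
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (
                match if hyp[i-1] == ref[j-1] else mismatch):
            pairs.append((hyp[i-1], ref[j-1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((hyp[i-1], None)); i -= 1
        else:
            pairs.append((None, ref[j-1])); j -= 1
    return pairs[::-1]

def accuracy_count(hyp, ref):
    """1 if the hypothesis has the same number of morphemes as the reference."""
    return int(len(hyp) == len(ref))

def exact_segmentation_precision(hyp, ref):
    """Share of aligned positions where hypothesis and reference morphemes match."""
    pairs = needleman_wunsch(hyp, ref)
    return sum(h == r for h, r in pairs) / len(pairs)

print(accuracy_count(["wal", "ked"], ["walk", "ed"]))  # 1 (same count, wrong split)
print(exact_segmentation_precision(["un", "believ", "able"],
                                   ["un", "believe", "able"]))  # ≈ 0.667
```

The two metrics are deliberately complementary: a segmenter can get the morpheme count right while placing every boundary in the wrong spot.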

Fusion: semi-automatic computation
Calculating fusion must be approached case by case, as there are different considerations provided by Payne (2017); therefore, no automatic tool can obtain the fusion score directly. We decided to focus on Spanish6 as a case study, where verbs and auxiliary verbs contain the highest degree of fusion of all the parts-of-speech (POS).
Procedure We observed that we could perform one annotation per paradigm and verb termination (-ar, -er, -ir), as the fusion degree remains the same regardless of the lemma.7 Then, on a chosen Spanish corpus:
1. Perform an automatic annotation of POS and morphological features.8
2. Review the automatic annotation of special cases. For instance, there are specific verb forms that are mis-tagged as adjectives. We corrected the POS and morphological annotation of those cases in a manual step.
3. Obtain the set of all unique verb paradigms and morphological features in the corpus, considering the three different types of verb terminations in Spanish as different elements.9
Now there is a list of unique verb paradigms and terminations that can be annotated for both synthesis and fusion. The steps are as follows:
1. For each unique verb paradigm and termination, segment a sample verb into its morphemes. E.g. the verb habló ('talked') is split into habl-ó, and habláramos ('we were to speak') into habl-ára-mos.
2. Analyse how many morphological features are fused in each morpheme: if we change the value of a feature, will the surface form of the morpheme change? E.g. in habl-ó, -ó participates in 5 features (mood (indicative), subject person (third person), subject number (singular), tense (past) and aspect (perfective)). For habl-ára-mos, -ára includes the past and subjunctive, whereas -mos denotes the person and number. If any of the aforementioned features changes its value, the surface will change too.
3. Count and aggregate the results per morpheme and obtain the fusion for each verb paradigm. E.g. the fusion for habl-ó is 4/5 = 0.8, and for habl-ára-mos it is 2/4 = 0.5.
Finally, with the annotation of the unique list of verb inflections and terminations, we can extend the degree of fusion to all the verbs in the original Spanish corpus. An example of the annotation process is shown in Table 2.

8 We use the spaCy model …news_trf, which has an accuracy of 0.99 in POS and morphological tagging on the UD Spanish AnCora dataset (Taulé et al., 2008), which contains mostly news texts.
9 Using the UniMorph database (McCarthy et al., 2020) is another alternative for extracting all the possible unique inflections. We aligned and considered both tag sets for the annotation, as shown in Table 2.
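The fusion arithmetic in step 3 can be sketched as follows. This is our own helper and reflects our reading of the joint counts in the paper's examples: a morpheme fusing k features contributes k-1 fusional joints, and each explicit morpheme boundary contributes one agglutinative joint.

```python
def fusion_index(feature_counts):
    """Fusion = fusional joints / total joints.

    feature_counts: number of morphological features fused in each morpheme
    of a word (0 for the bare root).
    """
    fusional = sum(max(0, k - 1) for k in feature_counts)
    agglutinative = len(feature_counts) - 1  # explicit boundaries between morphemes
    total = fusional + agglutinative
    return fusional / total if total else 0.0

# habl-ó: root + one morpheme fusing 5 features -> 4 / (4 + 1) = 0.8
print(fusion_index([0, 5]))     # 0.8
# habl-ára-mos: two morphemes fusing 2 features each -> 2 / (2 + 2) = 0.5
print(fusion_index([0, 2, 2]))  # 0.5
# A purely agglutinative form (one feature per morpheme) scores 0:
print(fusion_index([0, 1]))     # 0.0
```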

Word-level analysis of Synthesis and Fusion in Machine Translation
In this analysis, we ask the following question: how difficult is it to translate a word given its index of synthesis or fusion? For evaluating synthesis, we work with Turkish10 nouns and verbs, and for fusion, we continue working with Spanish verbs. In both cases, English is the source language in the translation task.

Experimental design
The experiment consists of comparing a gold-standard reference with machine translation system outputs at the word level:
1. For both the reference and the system output, automatically tag all the words with a morphological analyser (the Boun morphological analyser and disambiguator (Sak et al., 2008) for Turkish, and a spaCy model trained on the AnCora Universal Dependencies treebank (Taulé et al., 2008) for Spanish). The POS is needed to filter the target words. For synthesis in Turkish, the number of morphemes works as a proxy, as we are working at the word level. For fusion in Spanish, we need the inflection to obtain the degree of fusion from the annotated unique list (see §3.2).
2. Align the words between the reference and the system output. We use the awesome-align tool (Dou and Neubig, 2021), fine-tuning the multilingual BERT model (Devlin et al., 2019) for word alignment, using the reference and system output as a parallel corpus.
3. Calculate the translation accuracy (exact match of the word, 0 or 1) for the target POS. We then break down the results by degree of synthesis (number of morphemes) or fusion.
Additionally, we control for different confounds: the frequency of the word in the training set, and whether the full word is part of the input vocabulary of the model or not. Finally, we complement the analysis with a human evaluation (see §4.4).

10 Turkish presents high synthesis and agglutination (Zingler, 2018), meaning that words are composed of several morphemes and the morpheme boundaries are explicit, respectively. We focus on verbs and nouns, which usually contain more morphemes than other parts-of-speech. We chose this language due to the availability of an open-source rule-based morphological analyser and an expert annotator.
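Steps 2-3 of the procedure reduce to a stratified exact-match accuracy. A minimal sketch over already word-aligned pairs (the data structure and Turkish toy examples are our own illustration):

```python
from collections import defaultdict

def accuracy_by_synthesis(aligned_pairs):
    """Average exact-match accuracy, bucketed by the reference word's morpheme count.

    aligned_pairs: (reference_word, system_word, num_morphemes) triples,
    already word-aligned and filtered to the target POS.
    """
    buckets = defaultdict(list)
    for ref_word, sys_word, n_morphemes in aligned_pairs:
        buckets[n_morphemes].append(int(ref_word == sys_word))
    return {n: sum(hits) / len(hits) for n, hits in sorted(buckets.items())}

pairs = [
    ("ev", "ev", 1),                  # analytic noun, exact match
    ("ev", "ev", 1),
    ("evlerimde", "evlerde", 4),      # synthetic noun, wrong inflection
    ("evlerimde", "evlerimde", 4),
]
print(accuracy_by_synthesis(pairs))  # {1: 1.0, 4: 0.5}
```

The same bucketing works for fusion by keying on the annotated fusion degree instead of the morpheme count.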

Synthesis analysis: English→Turkish
Data We use the NEWSTEST2018.EN-TR evaluation set from WMT (Bojar et al., 2018), with 3,000 samples. On the Turkish side there are 45,944 tokens, and Table 3 shows the distribution of the number of morphemes obtained with the analyser of Sak et al. (2008).
Model We use an English-Turkish system trained with the TIL corpus of 39.9M parallel sentences (Mirzakhalov et al., 2021).
On the NEWSTEST2018.EN-TR set, the performance is 13.06 and 49.54 in BLEU and chrF, respectively.
Results and discussion Figure 1 shows the average accuracy (exact translation, 0 or 1) of nouns and verbs in NEWSTEST2018.EN-TR, where the number of morphemes is a proxy for the index of synthesis. In most cases, especially with a higher training frequency, we observe that the average accuracy drops as the number of morphemes increases from 1 upwards. This is clearer for nouns than for verbs, which have fewer cases to analyse overall. Between 2, 3 or more than 4 morphemes the differences are not significant, and sometimes they are not consistent (e.g. verbs with the highest frequency). However, we can argue that analytic nouns (synthesis=1) are easier to translate than synthetic nouns (synthesis>1) for the English→Turkish direction. The pattern holds whether the word is part of the vocabulary of the model or not, although rare words (frequency in [0, 10³]) have generally lower translation accuracy than more frequent words (frequency > 100).
Results and discussion Figure 2 shows the average accuracy of verbs in NEWSTEST2013.EN-ES, for verbs with and without some degree of fusion. In the two higher-frequency subplots (middle and right), we observe that the average accuracy of the non-fusional verbs is higher than that of the fusional ones, and the pattern holds whether the verb is present in the input vocabulary of the model or not. The exception is the least frequent verbs, although this is explained by the model not having enough information to learn from, regardless of their degree of fusion.

Human evaluation
Exact translation accuracy has limitations, as there are potential translations that could be acceptable in a specific context (e.g. a synonym). For that reason, we performed a human evaluation on a sample (10%) of each evaluation set, focusing on two scores:11
1. Semantic score: evaluates the meaning of the word used in the automatic translation (system output) and how it compares with the gold-standard translation. The scale goes from 1 (no relationship at all) to 4 (same lemma).
2. Grammar score: evaluates the grammatical form and how it compares with the gold-standard translation. The scale goes from 1 (different inflection) to 3 (same inflection).

Synthesis
In Figure 3, we show the annotation scores for the semantic and grammar metrics, for both nouns (top) and verbs (bottom). We also divide the analysis w.r.t. the frequency of the word in the training data. For nouns, we observe patterns similar to the automatic analysis: words with one morpheme (synthesis=1) have higher semantic and grammar scores than the rest, suggesting they are easier for the model to generate, except in the least frequent block, which still cannot be translated well. The verbs tend to have more distributed scores, suggesting that the difficulty of generating inflected forms may remain high even when the words are more frequent. Single-morpheme verbs are very rare in Turkish and generally contain exceptional forms, which is reflected in the low translation accuracy (see Figure 1). We also observe that a good proportion of translated words with 'zero' accuracy (not the exact translation, see the orange inner bubbles) has been annotated with the highest semantic (same lemma) or grammar (same inflection) score, suggesting that in some cases the model succeeds in generalisation, although we see this mainly when the words are relatively short (1 to 3 morphemes).
Fusion Figure 4 shows the semantic and grammar annotation scores for Spanish verbs. For the semantic scores (top), the gap between the non-fusional and fusional verbs is reduced across all frequency groups. This means that the model is indeed able to generalise and offer alternative translations (not the exact verb), which is harder to measure with automatic metrics. On the grammar scale (bottom), however, we still note a slight advantage in the maximum score (3) for the non-fusional verbs over the fusional ones in the two highest-frequency subplots (middle and right). This indicates that, even for highly frequent verbs, it is still more difficult to translate correct forms with a fusion degree higher than zero. Similarly to synthesis, we observe a significant proportion of 'zero'-accuracy cases (orange inner bubbles) for the highest scores in most cases. This indicates that the model could generalise and translate verbs with similar meanings, producing not the exact but close forms.

Segment-level Analysis of Synthesis and Fusion in Machine Translation
Following up on the word-level analysis, we study the relationship between machine translation difficulty and the degree of synthesis or fusion at the segment level. For this purpose, we process a set of translation systems for each language pair we want to evaluate. The general steps are as follows:
1. For each system output, we compute automatic evaluation metrics (BLEU (Papineni et al., 2002), chrF (Popović, 2015) and/or COMET (Rei et al., 2020)) with respect to the reference set, per sentence.12
2. For each sentence of the evaluation set, we compute potential predictor variables for the automatic metric, such as the degree of synthesis or fusion. We complement the predictor variable list with other heuristics, such as the length of the sentence in characters (char.count) or words (word.count). The full list of predictors per language pair is in the Appendix.
3. With the previous inputs, we fit generalised linear models per system output and evaluation metric, in which the metric is regressed on the predictor variables. The goal is to identify which predictors affect each system's performance.
4. After fitting the models, we extract the significant predictors of each one. This provides an indication of which variables (in our case, the degree of synthesis or fusion, or any heuristic) can be used to predict the model's dependent variable.13

Figure 5: Overview of significant predictors for degree of synthesis across our TR-EN and EN-TR models.

Synthesis on En-Tr and Tr-En We first evaluate the English-Turkish and Turkish-English language pairs. The evaluated models are EnTr1, EnTr2, and TrEn2 (details in the Appendix). Also, as we are studying synthesis in Turkish, all predictors are computed on the Turkish side, regardless of the translation direction. Figure 5 presents an overview of the significant predictors for the En-Tr and Tr-En systems, where we observe a large impact of the synthesis variable on the chrF scores of two different systems (EnTr1 and TrEn2). The only other heuristic with a notable impact on a system output is morph.count, the length of the Turkish sentence in morphemes as split by a morphological analyser. Other predictors have a minor effect.

12 Based on the analysis of Kocmi et al. (2021), we prefer to report COMET and chrF over BLEU.
13 For simplification purposes, in the following analysis and plots, we only show the predictors that have a significant effect on the system outputs.
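The paper fits full generalised linear models over many predictors; as a minimal single-predictor illustration of the idea (per-sentence metric regressed on a synthesis predictor, with fabricated toy numbers clearly labelled as such):

```python
def fit_simple_ols(x, y):
    """Least-squares slope and intercept for y ≈ a*x + b (one predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var = sum((xi - mean_x) ** 2 for xi in x)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical per-sentence data: degree of synthesis vs. chrF score.
synthesis = [1.0, 1.5, 2.0, 2.5, 3.0]
chrf      = [60.0, 57.0, 54.0, 51.0, 48.0]
slope, intercept = fit_simple_ols(synthesis, chrf)
print(slope, intercept)  # -6.0 66.0 : chrF drops as synthesis grows
```

A negative, significant coefficient on the synthesis predictor is exactly the kind of effect the GLM analysis looks for; in practice one would use a statistics package that also reports p-values for significance testing.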

Fusion on En-Es and Es-En
In a similar way, we evaluate the impact of fusion in the English-Spanish (EnEs1, EnEs2) and Spanish-English (EsEn1, EsEn2) models (see the Appendix for details). Again, as we are studying fusion in Spanish, all predictors are computed on the Spanish side, regardless of the translation direction. Figure 6 presents an overview of the significant predictors, where we observe that R.fusion.verb, the ratio of the degree of fusion to the number of verbs in the sentence, is the predictor with the highest impact on most system outputs (EnEs1, EnEs2 and EsEn2). Additionally, R.fusion.swEsEn2 (the ratio of the degree of fusion to the number of input subwords in the EsEn2 model) also has a high impact on one system output (EnEs2, which uses the same segmentation model as EsEn2).
Analysis on En-De and De-En Finally, we extend the analysis to the English-German and German-English language pairs, using the respective evaluation sets of the WMT2018 campaign (Bojar et al., 2018) and the system outputs provided by all the participants (measured in BLEU). For computing synthesis, we use the different segmentation methods compared in §3.1. However, for fusion, we only use a shallow proxy: the number of morphological features tagged by a morphological analyser. In this case, the predictors are computed for both the source and the target side.
We present an overview of the significant predictors for German-English in Figure 7 (the Appendix contains the results for English-German in Figure 8). We can observe that ref.SYN.uniLM and ref.SYN.PtrNet are the predictors that impact most of the different system outputs. These variables refer to the synthesis computed on the reference side (English) using uniLM or PtrNet as the morpheme segmentation method, respectively. Furthermore, we observe that src-ref.R.feat.token also has some effect on one system output; it is a shallow proxy for the fusion degree of the source w.r.t. the reference segment (the ratio of the number of features to the number of tokens).

Discussion
It is important to note the limitations of this study. The overall results do not suggest that translating into more analytic languages (e.g. Chinese) or more agglutinative ones (e.g. Turkish) is easier than into their counterparts: highly analytic languages present the significant issue of word coverage and the vocabulary size of the model. Besides, we cannot entirely isolate the degree of fusion from synthesis. For instance, Turkish is a highly agglutinative language but also highly synthetic, and there are languages that present both agglutinative and fusional traits, like Navajo. Moreover, the language scope is another limitation: is it possible to extend the approach to further languages in a practical way? Synthesis can be calculated directly only if the morphological analyser splits words into morphemes, and fusion poses several issues, as mentioned before. Furthermore, Payne (2017) also indicated that discourse can impact the computed degrees due to the diversity of the vocabulary. This study focuses on news data only, and it will be relevant to extend it to a multi-domain approach.
To address the limitations, we consider that our word-level analysis, which targets specific POS, has been fundamental to enabling the study of the indices and to partially isolating them from each other (e.g. Spanish verbs do not present more than three morphemes, keeping a low synthesis value across the whole analysis). Moreover, to rapidly extend the evaluation to new languages and domains, we could follow a less fine-grained analysis for each index. For instance, we can compare synthesis=1 vs. synthesis>1, or fusion=0 vs. fusion>0, as in this work.

Conclusion and future work
In conclusion, we proposed methods to quantify the indices of synthesis and fusion in automatic and semi-automatic ways, respectively. Besides, for the chosen language pairs, we observed that the studied degrees have an impact on machine translation performance at both the word and segment level, and we included a human evaluation for the former.
Our analysis opens the possibility of new fine-grained evaluation approaches for MT and other NLP generation tasks. For instance, as future work, we can ask: are we improving the automatic translation of highly fusional words? Or, are our proposed models more aware of fusional joints (non-explicit boundaries)? Following our methodology, which targets specific POS tags, could help analyse whether new models improve their performance on highly fusional words or segments. This could also be helpful for evaluation approaches in morphological segmentation. Furthermore, another potential research avenue is aiding model training for MT: e.g. knowing which segments are more or less synthetic and/or fusional could be beneficial for sampling strategies.
Ethical considerations

The annotation work in this paper was compensated accordingly (see Appendix). Also, for all the datasets used in this research, we adhere to ethical standards by giving credit to the original authors. We encourage future work that takes advantage of these resources to also cite the original sources of the data. We also see other ethical risks in this work: for the downstream task of MT, a translation system should not be deployed with low-quality translations, as it can mislead users and carry implicit biases.

A.1 Annotation Protocol
This study measures the quality of translations generated by a translation system. You are given a list of sentences where one column lists each word in the gold-standard (correct) translation and the corresponding column the system-generated translation. The evaluation of the translations relies on the two scores described below. Semantic score evaluates the meaning of the word used in the automatic translation (system output) and how it compares with the gold-standard translation.
Please assign each word in the output the score you find most appropriate. Grammar score evaluates the grammatical form and how it compares with the gold-standard translation.
Please assign each word in the output the score you find most appropriate:
1. The word is inflected in a different way and it is not necessarily correct.
2. The word has a different inflection but is still grammatically correct.
3. The words have the same inflection, and it is correct.
Please annotate all words in the translations in the file shared with you. In your evaluation, try to assign the two scores to each word independently. The inflection of the word measures the morphological features and should also be evaluated independently from the analyser output, which is automated and may contain errors.
The file contains example annotations for your reference, please ask any questions related to unresolved annotation examples by contacting the project coordinators.

A.2 Annotators
For both Turkish and Spanish, the annotators were contacted directly due to their expertise in morphology (they are PhD students in Linguistics and Computational Linguistics, respectively), in addition to being native speakers of the target languages. They were paid more than the minimum hourly wage of their country of residence, and were told that the annotated data would be released upon acceptance of the study.

B Segment-level Analysis of Synthesis and Fusion
B.1 List of machine translation systems
• EnTr1: the same system used in §4.2.
• EnTr2: Transformer-base model (Vaswani et al., 2017) with a joint vocabulary size of 8k pieces (unigram language modelling from SentencePiece (Kudo and Richardson, 2018)), trained with a sample (10%) of the corpus of EnTr1.
• EnEs1: the same system used in §4.3.
• EsEn1: similar configuration to EnEs1 but in the opposite direction.
• EnEs2: same configuration as EnEs1 (model and vocabulary) but with smaller training data; it uses only newscommentary-v8 data, with around 300k sentences.
• EsEn2: similar configuration to EnEs2 but in the opposite direction.

B.2 List of predictors
Tables 4, 5 and 6 describe all the predictors used in the segment-level analysis of English-Turkish, English-Spanish and English-German (both directions), respectively. Figure 8 shows the analogous results for English to German, where the synthesis-based variables present a high impact w.r.t. the other predictors.