A Study of Morphological Robustness of Neural Machine Translation

In this work, we analyze the robustness of neural machine translation (NMT) systems to grammatical perturbations in the source, focusing on perturbations related to morphological inflection. While this has recently been studied for English→French (MORPHEUS; Tan et al., 2020), it is unclear how the findings extend to Any→English translation systems. We propose MORPHEUS-MULTILINGUAL, which utilizes UniMorph dictionaries to identify morphological perturbations to the source that adversely affect translation models. Along with an analysis of state-of-the-art pretrained MT systems, we train and analyze systems for 11 language pairs using the multilingual TED corpus (Qi et al., 2018). We also compare these synthetic perturbations to actual errors made by non-native speakers, using Grammatical Error Correction datasets. Finally, we present a qualitative and quantitative analysis of the robustness of Any→English translation systems.


Introduction
Multilingual machine translation is commonplace, with high-quality commercial systems available in over 100 languages (Johnson et al., 2017). However, translation from and into low-resource languages remains a challenge (Arivazhagan et al., 2019). Additionally, translation from morphologically-rich languages to English (and vice versa) presents new challenges due to the wide differences in the morphosyntactic phenomena of the source and target languages. In this work, we study the effect of noisy inputs on neural machine translation (NMT) systems. A concrete practical application is the translation of text written by non-native speakers. While the brittleness of NMT systems to input noise is well studied (Belinkov and Bisk, 2018), most prior work has focused on translation from English (English→X; Alam and Anastasopoulos, 2020).
With over 800 million second-language (L2) speakers of English, it is imperative that translation models be robust to potential errors in the source English text. Tan et al. (2020) have shown that English→X translation systems are not robust to inflectional perturbations in the source. Inspired by this work, we aim to quantify the impact of inflectional perturbations on X→English translation systems. We hypothesize that inflectional perturbations of source tokens should not adversely affect translation quality. However, morphologically-rich languages tend to have freer word order than English, and small perturbations in word inflections can lead to significant changes in the overall meaning of a sentence; this poses a challenge to our analysis.
We build upon Tan et al. (2020) and induce inflectional perturbations of source tokens using the unimorph_inflect tool (Anastasopoulos and Neubig, 2019) along with UniMorph dictionaries (McCarthy et al., 2020) (§2). We then present a comprehensive evaluation of the robustness of MT systems for languages from different language families (§3). To understand the impact of the size of the parallel corpus available for training, we experiment on a spectrum of high-, medium- and low-resource languages. Furthermore, to understand the impact in real settings, we run our adversarial perturbation algorithm on learners' text from Grammatical Error Correction datasets for German and Russian (§3.3).

Methodology
To evaluate the robustness of X→English NMT systems, we generate inflectional perturbations of the tokens in the source-language text. Our methodology aims to identify adversarial examples that lead to the maximum degradation in translation quality. We build upon the recently proposed MORPHEUS toolkit (Tan et al., 2020), which evaluated the robustness of NMT systems translating from English→X. For a given source English text, MORPHEUS greedily searches for inflectional perturbations by sequentially iterating through the tokens of the input text. For each token, it identifies the inflectional edit that leads to the maximum drop in BLEU score.
We extend this approach to test X→English translation systems. Since their toolkit is limited to perturbations of English only, we develop our own reinflection methodology, relying on UniMorph (McCarthy et al., 2020).

Reinflection
The UniMorph project provides morphological data for numerous languages under a universal schema. The project supports over 100 languages and provides morphological inflection dictionaries for up to three part-of-speech tags: nouns (N), adjectives (ADJ), and verbs (V). While some UniMorph dictionaries include a large number of types (or paradigms), e.g., German (≈15k) and Russian (≈28k), many dictionaries are relatively small, e.g., Turkish (≈3.5k) and Estonian (<1k). This limits the number of tokens we can perturb via UniMorph dictionary look-up. To overcome this limitation, we use the unimorph_inflect toolkit, which takes as input a lemma and a morphosyntactic description (MSD) and returns a reinflected word form. This tool was trained on UniMorph dictionaries and generalizes to unseen types. An illustration of our inflectional perturbation methodology is given in Table 1.
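The dictionary-first lookup with a learned fallback can be sketched as follows. The toy `UNIMORPH` dictionary and the `generator_fallback` stub are our own illustrative stand-ins (in practice, the dictionary is read from UniMorph TSV files and the fallback is the trained unimorph_inflect model):

```python
# Toy stand-in for a UniMorph dictionary: lemma -> {MSD: inflected form}.
# Real dictionaries are loaded from UniMorph's "lemma \t form \t MSD" TSVs.
UNIMORPH = {
    "gehen": {"V;PRS;3;SG": "geht", "V;PST;3;SG": "ging"},
}

def generator_fallback(lemma, msd):
    """Stand-in for the trained unimorph_inflect generator, which
    generalizes to paradigms missing from the dictionary.
    Here it simply reports failure."""
    return None

def reinflect(lemma, msd):
    """Return the inflected form for (lemma, MSD): prefer an exact
    dictionary lookup, then fall back to the learned generator."""
    forms = UNIMORPH.get(lemma, {})
    if msd in forms:
        return forms[msd]
    return generator_fallback(lemma, msd)
```

Dictionary hits are exact and therefore preferred; the generator only handles out-of-dictionary lemmas, mirroring the accuracy constraint (>80%) used for language selection later.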

MORPHEUS-MULTILINGUAL
Given an input sentence, our proposed method, MORPHEUS-MULTILINGUAL, identifies adversarial inflectional perturbations of the input tokens that lead to maximum degradation in the performance of the machine translation system. We first iterate through the sentence to extract all possible inflectional forms of each constituent token. Since we rely on UniMorph dictionaries, we are limited to perturbing only nouns, adjectives, and verbs. Then, to construct a perturbed sentence, we iterate through each token and uniformly sample one form from its candidate inflections. We repeat this process N (=50) times to compile a pool of perturbed sentences. To identify the adversarial sentence, we compute the chrF score (Popović, 2017) using the sacrebleu toolkit (Post, 2018) and select the sentence that results in the maximum drop in chrF score (if any). In our preliminary experiments, we found chrF to be more reliable than BLEU (Papineni et al., 2002) for identifying adversarial candidates: while BLEU compares the translation output with the reference using word n-grams, chrF uses character n-grams, which helps match morphological variants of words.
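The sampling procedure can be sketched as follows. To keep the sketch self-contained we use a simplified character n-gram F-score as a stand-in for sacrebleu's chrF, and assume a `translate` callable wrapping the MT system; both are our own simplifications:

```python
import random

def char_ngram_f(hyp, ref, n=3):
    """Toy character n-gram F1, a self-contained stand-in for chrF."""
    grams = lambda s: {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    h, r = grams(hyp), grams(ref)
    overlap = len(h & r)
    if not overlap:
        return 0.0
    p, rc = overlap / len(h), overlap / len(r)
    return 2 * p * rc / (p + rc)

def attack(tokens, inflections, translate, reference, n_samples=50, seed=0):
    """Sample n_samples random reinflections of the source and return the
    one whose translation scores lowest against the reference.
    inflections[i] lists candidate forms for tokens[i] (incl. original)."""
    rng = random.Random(seed)
    best_sent = list(tokens)
    best_score = char_ngram_f(translate(tokens), reference)
    for _ in range(n_samples):
        cand = [rng.choice(inflections[i]) for i in range(len(tokens))]
        score = char_ngram_f(translate(cand), reference)
        if score < best_score:  # bigger drop = more adversarial
            best_sent, best_score = cand, score
    return best_sent, best_score
```

Because all N sampled sentences are independent, the `translate` calls can be batched, which is the efficiency advantage discussed below.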
The original MORPHEUS toolkit follows a slightly different algorithm to identify adversaries. Similar to our approach, they first extract all possible inflectional forms of each constituent token. Then, they sequentially iterate through the tokens in the sentence and, for each token, select the inflectional form that results in the worst BLEU score. Once an adversarial form is identified, they directly substitute it into the sentence and continue to the next token. While a similar approach is possible in our setup, we found their algorithm to be computationally expensive, as its sequential dependence prevents efficient batching.
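For comparison, a minimal sketch of this greedy search, assuming the same kind of `translate` callable and a sentence-level `score_fn` (both our own simplified interfaces):

```python
def greedy_attack(tokens, inflections, translate, reference, score_fn):
    """MORPHEUS-style greedy search: commit each token in turn to the
    inflection that hurts the metric most, keeping earlier substitutions.
    inflections[i] lists candidate forms for tokens[i]."""
    current = list(tokens)
    for i in range(len(tokens)):
        best_form = current[i]
        best = score_fn(translate(current), reference)
        for form in inflections[i]:
            current[i] = form  # try this form with earlier choices fixed
            s = score_fn(translate(current), reference)
            if s < best:
                best_form, best = form, s
        current[i] = best_form  # commit the worst-scoring form
    return current
```

Each candidate translation depends on the substitutions committed for earlier tokens, so the `translate` calls must run one token position at a time, which is precisely what blocks efficient batching.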
It is important to note that neither MORPHEUS-MULTILINGUAL nor the original MORPHEUS exhaustively searches over all possible sentences, due to memory and time constraints. However, our approach can be implemented efficiently and reduces inference time by almost a factor of 20. Since we experiment on 11 different language pairs, run time and computational costs are critical to our experiments.

Experiments
In this section, we present a comprehensive evaluation of the robustness of X→English machine translation systems. Since NMT models are naturally more robust when trained on large amounts of parallel data, we experiment with two sets of translation systems. First, we use state-of-the-art pre-trained models for Russian→English and German→English from fairseq (Ott et al., 2019). Second, we use the multilingual TED corpus (Qi et al., 2018) to train transformer-based translation systems from scratch. Using the TED corpus allows us to expand our evaluation to a larger pool of language pairs.

WMT19 Pretrained Models
We evaluate the robustness of the best-performing systems from the WMT19 news translation shared task (Barrault et al., 2019), specifically for Russian→English and German→English (Ott et al., 2019). We follow the original work and use newstest2018 as our test set for adversarial evaluation.
Using the procedure described in §2.2, we create adversarial versions of newstest2018 for both language pairs. In Table 2, we present the baseline and adversarial results using the BLEU and chrF metrics. For both language pairs, we notice significant drops on both metrics. Before diving further into a qualitative analysis of these MT systems, we first present a broader evaluation on MT systems trained on the multilingual TED corpus. Due to resource constraints, we only experiment with the single models and leave the evaluation of ensemble models for future work. For the selected languages, we train an MT model with the 'transformer_iwslt_de_en' architecture from fairseq, using a SentencePiece vocabulary of size 8000 and training for up to 80 epochs with the Adam optimizer (see A.2 in the Appendix for more details).

TED corpus
The multilingual TED corpus (Qi et al., 2018) provides parallel data for over 50 language pairs; in our experiments, we use a subset of these. We selected our test language pairs (X→English) to maximize diversity in language families as well as in the resources available for training MT systems. Since we rely on UniMorph and unimorph_inflect for generating perturbations, we only select languages for which unimorph_inflect achieves reasonably high accuracy (>80%). Table 3 presents an overview of the chosen source languages, along with information on language family and training resources.
We also quantify the morphological richness of the languages listed in Table 3. As we are not aware of a standard metric for gauging the morphological richness of a language, we define one based on the reinflection dictionaries: a Type-Token Ratio (TTR), computed as the number of types (paradigms) in the dictionary divided by the total number of inflected word forms it contains. In Table 3, we report TTR_lg scores measured on the UniMorph dictionaries as well as on UniMorph-style dictionaries constructed from the TED dev splits using the unimorph_inflect tool. Note that TTR_lg, as defined here, differs slightly from the widely known Type-Token Ratio used for measuring the lexical diversity (or richness) of a corpus.
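Under this reading of the metric, TTR_lg over a reinflection dictionary could be computed as follows (the dictionary representation, lemma mapped to its set of inflected forms, is our own simplification):

```python
def type_token_ratio(dictionary):
    """TTR over a UniMorph-style dictionary mapping each lemma
    (type / paradigm) to its set of inflected word forms (tokens)."""
    n_types = len(dictionary)
    n_tokens = sum(len(forms) for forms in dictionary.values())
    return n_types / n_tokens
```

A morphologically richer language has more inflected forms per paradigm, so its dictionary-level TTR is lower.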
We run MORPHEUS-MULTILINGUAL to generate adversarial sentences for the validation splits of the TED corpus. We term a sentence adversarial if it leads to the maximum drop in chrF score; note that it is possible for perturbed sentences not to lead to any drop in chrF. In Figure 1, we plot the fraction of perturbed sentences along with the adversarial fraction for each source language. We see considerable perturbations for most languages, with the exception of Swedish, Lithuanian, Ukrainian, and Estonian.

Table 3: List of languages chosen from the multilingual TED corpus. For each language, the table presents the language family, the resource level, and the Type-Token Ratio (TTR_lg). We measure the ratio using the types and tokens present in the reinflection dictionaries (UniMorph, lexicon from TED dev).

In preparing our adversarial set, we retain the original source sentence if we fail to create any perturbation or if none of the identified perturbations leads to a drop in chrF score. This ensures the adversarial set has the same number of sentences as the original validation set. In Table 4, we present the baseline and adversarial MT results. We notice a considerable drop in performance for Hebrew, Russian, Turkish, and Georgian. As expected, the % drops correlate with the perturbation statistics from Figure 1.

Translating Learner's Text
In the previous sections (§3.1, §3.2), we have seen the impact of noisy inputs on MT systems. While these results indicate a need to improve the robustness of MT systems, the adversarial sets constructed above are synthetic. In this section, we evaluate the impact of morphological-inflection-related errors directly on learners' text.

Figure 2: Schematic of our preliminary evaluation on learners' text. This is similar to the methodology used in Anastasopoulos (2019).
To this end, we utilize two grammatical error correction (GEC) datasets: German Falko-MERLIN-GEC (Boyd, 2018) and Russian RULEC-GEC (Rozovskaya and Roth, 2019). Both datasets contain labeled error types relating to word morphology. Evaluating robustness on these datasets gives us a better understanding of performance on actual text produced by second-language (L2) speakers.
Unfortunately, we do not have gold English translations for the grammatically incorrect (or corrected) text from the GEC datasets. While related prior work has annotated Spanish translations for English GEC data, we are not aware of any prior work that provides gold English translations for grammatically incorrect data in non-English languages. Therefore, we propose a pseudo-evaluation methodology that allows us to measure the robustness of MT systems. A schematic overview is presented in Figure 2. We take the ungrammatical text and use the gold GEC annotations to correct all errors except the morphology-related ones. This yields ungrammatical text that contains only morphology-related errors, similar to the perturbed outputs of MORPHEUS-MULTILINGUAL. Since we do not have gold translations for the input Russian/German sentences, we use the machine translation output of the fully grammatical text as the reference and the translation output of the partially corrected text as the hypothesis. In Table 5, we present the results on both Russian and German learners' text.
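The partial-correction step can be sketched as follows, assuming each GEC edit is available as a (start, end, replacement, error_type) token span, a simplification of the usual M2 annotation format; `translate` and `score_fn` are assumed callables:

```python
def apply_edits(tokens, edits, keep_error_types):
    """Apply GEC edits (start, end, replacement, error_type) to tokens,
    skipping edits whose type is in keep_error_types so those errors
    stay in the text. Edits are applied right-to-left so earlier
    token offsets remain valid."""
    out = list(tokens)
    for start, end, repl, etype in sorted(edits, reverse=True):
        if etype in keep_error_types:
            continue
        out[start:end] = repl
    return out

def pseudo_eval(tokens, edits, morph_types, translate, score_fn):
    """Translate the fully corrected text (reference) and the text with
    only morphology errors left in (hypothesis); score the hypothesis."""
    reference = translate(apply_edits(tokens, edits, set()))
    hypothesis = translate(apply_edits(tokens, edits, morph_types))
    return score_fn(hypothesis, reference)
```

Passing an empty `keep_error_types` corrects everything; passing the set of morphology-related error labels leaves exactly those errors in place.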
Overall, we find that the pre-trained MT models from fairseq are quite robust to the noise in learners' text. We manually inspected several examples and found the MT systems to be sufficiently robust to morphological perturbations; the changes in the output translation (if any) are mostly warranted. Viewing these results in combination with the results on the TED corpus, we believe that X→English systems are robust to morphological perturbations at the source as long as they are trained on sufficiently large parallel corpora.

Analysis
To better understand what makes a given MT system robust to morphology-related grammatical perturbations in the source, we present a thorough analysis of our results and also highlight a few limitations of our adversarial methodology.
Adversarial Dimensions: To quantify the impact of each inflectional perturbation, we perform a fine-grained analysis of the adversarial sentences obtained from the multilingual TED corpus. For each perturbed token in an adversarial sentence, we identify the part-of-speech (POS) and the feature dimension(s) (dim) perturbed in the token. We uniformly distribute the % drop in sentence-level chrF score across the (POS, dim) perturbations in the adversarial sentence. This allows us to quantitatively compare the impact of each perturbation type (POS, dim) on the overall performance of the MT model. Additionally, as seen in Figure 1, not all inflectional perturbations cause a drop in chrF (or BLEU) scores; the adversarial sentences only capture the worst-case drop in chrF. Therefore, to analyze the overall impact of each perturbation type (POS, dim), we also compute the impact score on the entire set of perturbed sentences explored by MORPHEUS-MULTILINGUAL. Table 8 (in the Appendix) presents the results for all the TED languages. The trends for adversarial perturbations are quite similar to those for all explored perturbations. This indicates that the adversarial impact of a perturbation is not determined by the perturbation type (POS, dim) alone but is lexically dependent.
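The uniform distribution of a sentence's chrF drop over its (POS, dim) perturbations can be sketched as follows (the input representation is our own simplification):

```python
from collections import defaultdict

def impact_scores(adversarial_sentences):
    """Distribute each sentence's chrF drop uniformly over its
    (POS, dim) perturbations and aggregate per perturbation type.
    Each entry is (chrf_drop, [(pos, dim), ...])."""
    totals = defaultdict(float)
    for drop, perturbations in adversarial_sentences:
        share = drop / len(perturbations)  # equal share per perturbation
        for key in perturbations:
            totals[key] += share
    return dict(totals)
```

The same aggregation can be run either over the adversarial sentences only (worst-case impact) or over all explored perturbed sentences (overall impact).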
Evaluation Metrics: In §3, we reported performance using the BLEU and chrF metrics (following prior work (Tan et al., 2020)). We noticed significant drops on these metrics even for high-resource languages like Russian, Turkish, and Hebrew, including the state-of-the-art fairseq models. To better understand these drops, we inspected the output translations of adversarial source sentences. We found a number of cases where the new translation is semantically valid but both metrics incorrectly score it low (see S2 in Table 6). This is a limitation of surface-level metrics like BLEU/chrF. Additionally, we require the new translation to be as close as possible to the original translation, but this can be too strict a requirement on many occasions. For instance, if we change a noun in the source from its singular to its plural form, it is natural to expect a robust translation system to reflect that change in the output translation. To account for this behavior, we compute the Target-Source Noise Ratio (NR) metric of Anastasopoulos (2019), which relates the change induced in the target to the change made in the source:

NR = d(t, t̃) / d(s, s̃),

where d measures the difference between the original and perturbed versions of a sentence (e.g., via chrF). The ideal NR is ∼1: a change in the source (s → s̃) results in a proportional change in the target (t → t̃). For the adversarial experiments on the TED corpus, we compute the NR metric for each language pair; the results are presented in Table 4. Interestingly, while Russian sees a major drop in BLEU/chrF score, its noise ratio is close to 1. This indicates that the Russian MT system is actually quite robust to morphological perturbations. Furthermore, in Figure 3, we present a correlation analysis between the size of the parallel corpus available for training and the noise ratio metric. We see a very strong negative correlation, indicating that high-resource MT systems (e.g., heb, rus, tur) are quite robust to inflectional perturbations, in spite of the large drops in BLEU/chrF scores.
Additionally, we notice that the morphological richness of the source language (measured via TTR in Table 3) does not play any significant role in MT performance under adversarial settings (e.g., rus, tur vs. deu). A scatter plot of TTR against NR for the TED translation task is presented in Figure 4.
Morphological Richness: To analyze the impact of the morphological richness of the source, we look deeper into the Slavic language family.

Figure 4: Correlation between the Target-Source Noise Ratio (NR) on TED machine translation and the Type-Token Ratio (TTR_lg) of the source language (from UniMorph). The results indicate that the morphological richness of the source language does not necessarily correlate with NMT robustness.

We experimented with four languages within the Slavic family: Czech, Ukrainian, Russian, and Slovenian. All except Slovenian are high-resource. These languages differ significantly in their morphological richness (TTR), with TTR_ces < TTR_slv << TTR_rus << TTR_ukr. As we have already seen (see Figure 4), morphological richness is not indicative of the noise ratio (NR), and this also holds for the Slavic languages. We then check whether morphological richness determines the drop in BLEU/chrF scores, and find that this is not the case either: we see a larger % drop for rus than for slv or ukr. Instead, we notice that the % drop in BLEU/chrF depends on the % of edits we make to the validation set, which follows the order δ_rus >> δ_ces > δ_slv >> δ_ukr (see Figure 1). While NR is driven by the size of the training set, the % drop in BLEU is driven by the % of edits to the validation set; the % of edits in turn depends on the size of the UniMorph dictionaries, not on the morphological richness of the language. Therefore, we conclude that both metrics, the % drop in BLEU/chrF and NR, depend on the resource size (parallel data and UniMorph dictionaries) and not on the morphological richness of the language.
Semantic Change: In our adversarial attacks, we aim to create an ungrammatical source via inflectional edits and evaluate the robustness of systems to these edits. While such attacks can help us discover significant biases in the translation systems, they can often lead to unintended consequences. Consider the example Russian sentence S1 (s) from Table 6. The sentence is grammatically correct, with the subject Тренер ('Coach') and the object игрока ('player') in the NOM and ACC cases respectively. If we perturb this sentence to A-S1 (s̃), the new words Тренера ('Coach') and игрок ('player') are in the ACC and NOM cases respectively. Due to the case assignment phenomenon in Russian, this perturbation (s → s̃) has essentially swapped the subject and object roles in the Russian sentence. As we can see in the example, the English translation t̃ (A-T1) does in fact correctly capture this change. This indicates that our attacks can sometimes significantly change the semantics of the source sentence. Handling such cases would require a deeper understanding of each language's grammar, and we leave this for future work.

Figure 5: Elasticity score for TED languages.
Elasticity: As discussed regarding the noise ratio, it is natural for MT systems to transfer changes in the source to the target. However, inspired by Anastasopoulos (2019), we want to understand how this behavior changes as we increase the number of edits in the source sentence. For this purpose, we first bucket all the explored perturbed sentences by the number of edits (or perturbations) relative to the original source. Within each bucket, we compute the fraction of perturbed source sentences that result in the same translation as the original source. We define this fraction as the elasticity score, i.e., the degree to which the translation remains the same under changes in the source. Figure 5 presents the results: the elasticity score drops quickly to zero as the number of edits increases. Notably, ukr drops quickly to zero, while rus retains a reasonable elasticity score for higher numbers of edits.

Aggressive Edits: Our algorithm does not place any restriction on the number of tokens that can be perturbed in a given sentence. This can lead to aggressive edits, especially in languages like Russian that are morphologically rich and whose reinflection lexicons are sufficiently large. As we illustrate in Figure 6, the median number of edits per sentence for rus is 5, significantly higher than for the next language (tur, at 1). Such aggressive edits in Russian can lead to unrealistic sentences, far from our intended simulation of learners' text. We leave the idea of thresholding the number of edits to future work.
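The elasticity score described above can be sketched as follows, assuming each explored perturbation is summarized by its edit count and its translation (a representation of our own choosing):

```python
from collections import defaultdict

def elasticity(perturbed, base_translation):
    """Bucket perturbed sentences by edit count and return, per bucket,
    the fraction whose translation equals the original translation.
    `perturbed` holds (n_edits, translation) pairs."""
    same = defaultdict(int)
    total = defaultdict(int)
    for n_edits, translation in perturbed:
        total[n_edits] += 1
        same[n_edits] += int(translation == base_translation)
    return {k: same[k] / total[k] for k in total}
```

A perfectly elastic system would keep the score at 1 for all edit counts; in practice, the score decays as edits accumulate.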
Adversarial Training: In an attempt to improve the robustness of NMT systems against morphological perturbations, we propose training NMT models on data augmented with adversarially perturbed sentences. Due to computational constraints, we evaluate this setting only for slv. We follow the strategy outlined in Section 2 to obtain adversarial perturbations of the TED training data. We observe that the adversarially trained model performs marginally worse (BLEU 10.30, from 10.48 when trained without data augmentation). We hypothesize that this could be due to the small training set, and believe that this training setting may better benefit models with already high BLEU scores. We leave extensive evaluation and further analysis of adversarial training to future work.

Conclusion
In this work, we propose MORPHEUS-MULTILINGUAL, a tool to analyze the robustness of X→English NMT systems under morphological perturbations. Using this tool, we experiment with 11 different languages selected from diverse language families with varied training resources.

We evaluate NMT models trained on the TED corpus as well as pretrained models readily available in the fairseq library. We observe drops in performance of 0-50% under the adversarial setting. We further supplement our experiments with an analysis of GEC learners' corpora for Russian and German. We qualitatively and quantitatively analyze the perturbations created by our methodology, presenting its strengths as well as its limitations, and outline avenues for future research towards building more robust NMT systems.