Sometimes We Want Ungrammatical Translations

Rapid progress in Neural Machine Translation (NMT) systems over the last few years has focused primarily on improving translation quality, and as a secondary focus, improving robustness to perturbations (e.g., spelling errors). While performance and robustness are important objectives, by over-focusing on these, we risk overlooking other important properties. In this paper, we draw attention to the fact that for some applications, faithfulness to the original (input) text is important to preserve, even if it means introducing unusual language patterns in the (output) translation. We propose a simple, novel way to quantify whether an NMT system exhibits robustness or faithfulness, by focusing on the case of word-order perturbations. We explore a suite of functions to perturb the word order of source sentences without deleting or injecting tokens, and measure their effects on the target side. Across several experimental conditions, we observe a strong tendency towards robustness rather than faithfulness. These results allow us to better understand the trade-off between faithfulness and robustness in NMT, and open up the possibility of developing systems where users have more autonomy and control in selecting which property is best suited for their use case.


Introduction
Recent advances in Neural Machine Translation (NMT) have resulted in systems that are able to effectively translate across many languages (Fan et al., 2020a), and we have already seen many commercial deployments of NMT technology. Yet some studies have also reported that NMT systems can be surprisingly brittle when presented with out-of-domain data (Luong and Manning, 2015), or when trained with noisy input data containing small orthographic (Sakaguchi et al., 2017; Belinkov and Bisk, 2018; Vaibhav et al., 2019; Niu et al., 2020) or lexical perturbations (Cheng et al., 2018). Uncovering these sorts of errors has led the research community to develop new NMT models that are more robust to noisy inputs, using techniques such as targeted data augmentation (Belinkov and Bisk, 2018) and adversarial approaches (Cheng et al., 2020). Unfortunately, an approach that (over-)emphasizes robustness can lead to "hallucinations": translating source input to an output that is not faithful to the source, and sometimes is even factually incorrect (Vinyals and Le, 2015; Koehn and Knowles, 2017; Wiseman et al., 2017; Nie et al., 2019; Kryscinski et al., 2020; Maynez et al., 2020; Tian et al., 2020; González et al., 2020; Xiao and Wang, 2021). Moreover, such an approach hinges on the key assumption that orthographic, lexical, or grammatical variants in the input are mistakes, to be corrected by the translation system. This ignores the wealth of applications where it may be preferable for a system to offer more faithfulness to the original text.
It is worthwhile to consider the diversity of applications where having a faithful translation (opting for literal translation over paraphrasing) is desirable. First, consider an automatic language tutoring system: a (human) second-language learner will often produce language that has grammatical mistakes of various types. This learner can be empowered by having an (AI-produced) faithful translation, so that s/he can see what mistakes were made vs. what would be the more common phrasing. Second, recall that many languages, including English, use word order to encode argument structure information (cf. Isabelle et al. 2017): while "the dog bit the man" might be more frequent compared to "the man bit the dog", the latter has a very clear meaning that we may wish to preserve in some (albeit rarer) cases. Third, consider poetry: it is often the case that unusual word order is used to influence rhythm and rhyme. It would be a shame if all our state-of-the-art NMT systems lost such poetic beauty in translation.
In short, by their very design, NMT systems preferentially output "normative" language (regardless of whether the nonstandard language affects spelling, word order, or choice of vocabulary). Isozaki et al. (2010) note that word order is an important problem in distant-language translation. When we increase model robustness (at least with the solutions proposed to date), we generally enforce even stronger tendencies towards the norm, at the expense of diversity of language, of thought, and, perhaps, of our very culture. Although Bisazza et al. (2021)'s observation that word order flexibility only minimally affects the performance of NMT systems is encouraging for building robust systems, the trade-off against preserving diversity in expression is seldom understood. We believe it will be necessary in future to propose solutions that can explicitly enable a better trade-off between robustness and faithfulness, and can give the user autonomy and control in specifying their preference. It is therefore our goal with this work to draw attention to this important compromise, and to provide tools to detect, quantify, and compare such aspects of NMT systems.

Although models of different sizes were analysed, we did not find a strong correlation between robustness or faithfulness and model size. However, M2M-100-1.2B showed a higher tendency to be robust when compared with M2M-418M or mBART (both smaller than the M2M-1.2B model).
More specifically, this paper is not only the first to deeply analyze the effects of particular perturbations on existing NMT systems, but also the first to investigate their effects in the sphere of generation. We investigate 16 unique perturbations that fall into three categories: Dependency tree based, PoS-tag based, and Random Shuffles. We introduce two novel metrics for evaluating machine translation models' preference for robustness or faithfulness. Taking English as the common source, we run a case study with three widely used Transformer-based machine translation models (the Helsinki/OPUS machine translation model (Tiedemann and Thottingal, 2020), the multilingual BART model (Liu et al., 2020a), and the Many-to-Many Multilingual translation model (Fan et al., 2020a), the last in two sizes), translating into 7 target languages from several families: German, French, Spanish, Italian, Russian, Chinese, and Japanese.
Across several experimental conditions, we observe a strong tendency towards robustness rather than faithfulness (Figure 1), which varies somewhat depending on the particular perturbation (Figure 2). More specifically, we observe that (1) state-of-the-art NMT systems tend to produce translations that are unaffected by the noisy source (more robust), (2) accuracy (BLEU score) correlates with model robustness, (3) certain perturbations involving part-of-speech-based word reordering tend to further encourage robustness, and (4) results vary somewhat by target language, with the models producing translations into Japanese that are more faithful than for the other languages (except for Helsinki-OPUS). Overall, our analysis suggests that over-emphasizing accuracy and robustness may limit richer development and broader usefulness of NMT systems.

Related Work
The idea to randomly shuffle linguistic elements to evaluate NLP model performance goes back fairly far (Barzilay and Lee, 2004; Barzilay and Lapata, 2008), and has even been used to determine which tasks are "syntax-light" in human sentence processing (Gauthier and Levy, 2019). Recent work on classification tasks, such as those on the GLUE benchmark (Wang et al., 2018), has shown that pre-trained Transformer-based models trained with a masked language modeling objective are shockingly insensitive to word order permutations (Si et al., 2019; Sinha et al., 2020; Pham et al., 2020; Gupta et al., 2021; Sinha et al., 2021). Given these recent findings, we might expect insensitivity to word order permutation in the sphere of generation as well, leading to robust machine translations.
The mismatching of default word orders between target and source has long been an important consideration for multilingual tasks, including automatic machine translation. Ahmad et al. (2019) find that word order agnostic models (recurrent neural networks) trained to dependency parse can transfer better than word order sensitive ones (self-attention) to distantly related languages. Also in the context of transfer, Zhao et al. (2020) propose, for reference-free MT evaluation, using the delta between originally ordered and permuted sentences. Even when considering multilingual sequence labeling tasks in general, Liu et al. (2020c) and Kulshreshtha et al. (2020) find that limiting word order information in the multilingual setting can enable models to achieve better zero-shot cross-lingual performance. Taken together, these works also suggest that our models tend to overfit on source word order to the detriment of that of the target, which might lead one to predict that our models will be more robust than they are faithful in our case as well.
However, NMT systems have use cases in diverse applications that require the preservation of word order, local syntax, and other linguistic components (Zhang et al., 2020). Translation systems that are contingent on preserving syntax and semantics are used as interpreters to decode the interaction between components of a neural network (Andreas et al., 2017). Further, practical applications like translating a sentence that mixes two different languages require the MT system to strike some balance between preserving L1 syntax and/or word order and correctly adhering to the grammatical rules of L2 (Renduchintala et al., 2016).
In NLP tasks where the end-user could be a human, the robustness of NLP systems is benchmarked by evaluating a model's performance on willfully perturbed examples that could expose the fragility of the systems (Goodfellow et al., 2014; Fadaee and Monz, 2020). To avert such scenarios, efforts along the lines of building robust models with adversarial training have been a common topic of study in natural language processing (Rajeswar et al., 2017; Wu et al., 2018).
Our word order perturbations also share some points of synergy with work across NLP that aims to devise supplementary heuristics to explicate the inner workings of our machine learning systems. For specific NLP tasks, probe tasks are engineered to measure specific kinds of linguistic knowledge encoded in the systems (Conneau et al., 2018; Sheng et al., 2019; Kim et al., 2019; Jeretic et al., 2020; Parthasarathi et al., 2020; Ribeiro et al., 2020). Swapping the arguments of verbs is a classic way to measure the effects of word order both in humans (Frankland and Greene, 2015; Snell and Grainger, 2017) and in models, largely because changing the order of verbal arguments maintains high word overlap between related examples (Wang et al., 2018; Kann et al., 2019; McCoy et al., 2019). However, although some word order permutation is applied in these cases, it is generally restricted to licit, grammatical sequences of words. When perturbation has been used to evaluate model performance, the perturbation functions used have been predominantly simple, including reversal and word shuffling, and usually target only single sentences (Ettinger, 2020; Li et al., 2020; Sinha et al., 2020). For tasks like dialogue prediction that require multiple input sentences, perturbation functions like reordering the conversation history have been adopted (Sankar et al., 2019). To the best of our knowledge, the set of perturbation functions we propose is the most detailed set explored thus far, perturbing not only tokens, but PoS and dependency structure.
Changing the order of words in the context of NMT also has its roots in classical, syntactically sophisticated models that used parses (of various kinds) to pre-order abstract syntactic representations as an early step in a multi-step translation pipeline from source to target (Collins et al. 2005; Khalilov et al. 2009; Dyer and Resnik 2010; Genzel 2010; Khalilov and Sima'an 2010; Miceli-Barone and Attardi 2013, i.a.). Our approach differs from these approaches in that our main aim is not to incorporate word order changes into the translation pipeline itself, but instead to use them to better understand the behavior of NMT models.

Metrics
Let g x be a sentence, where x takes one of two values: e if it is a sentence in the source language (English) or o if it is a gold target sentence. Let Φ e→o denote a translation pipeline from the English source (e) to a target language (o), and let Ψ denote a perturbation function such that g − x ← Ψ(g x ); then let ĝ o ← Φ e→o (g − e ) denote the translation of the perturbed source. Let κ(s i , s j ) be a scoring function that rates the similarity between two sentences s i and s j , where s i , s j ∈ L x . The choice of κ can be any of the widely used sentence similarity metrics, such as BLEU (Papineni et al., 2002a), METEOR (Lavie and Agarwal, 2007), ROUGE (Lin, 2004), or Levenshtein distance (Levenshtein, 1966). For our purposes, we experiment with BLEU-4, BLEURT (Sellam et al., 2020), BERT-Score (Zhang et al., 2019), and the Levenshtein score as choices of κ, denoting the BLEU-based and Levenshtein-based variants by a B or L in the superscript respectively (but see §7 for discussion of other κ). The value of κ scales linearly with the similarity between s i and s j .
We define three metrics, β 1 , β 2 , and α, alongside the standard translation score β. β 1 is our measure of robustness to perturbation: it quantifies the similarity, according to κ, between the translation of a perturbed source sentence and the gold sentence in the target language:

β 1 = (1/N) Σ i κ( Φ e→o (Ψ(g e (i) )), g o (i) )

where N denotes the number of samples 1 perturbed by Ψ that we used (see Table 1 in the Appendix for more information on N by perturbation and language). β 2 measures the degree of faithfulness of the translations produced by the machine translation system: it is the similarity between the translation of the perturbed source sentence and the result of applying the same perturbation operation to the gold target sentence:

β 2 = (1/N) Σ i κ( Φ e→o (Ψ(g e (i) )), Ψ(g o (i) ) )

The difficulty of a perturbation function is measured with α, which scores the similarity between the perturbed sentence and the unperturbed sentence in the source language:

α e = (1/N) Σ i κ( Ψ(g e (i) ), g e (i) )

Finally, β measures the standard translation performance metric on any given source-target sentence pair:

β = (1/N) Σ i κ( Φ e→o (g e (i) ), g o (i) )
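The definitions above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: κ is stood in for by `difflib`'s character-overlap ratio (the paper uses BLEU-4, BLEURT, BERT-Score, or Levenshtein), and `translate` and `perturb` are assumed to be deterministic callables supplied by the caller.

```python
import difflib

def kappa(s_i, s_j):
    # Stand-in similarity function κ: normalized overlap ratio in [0, 1].
    # Like the metrics in the paper, higher means more similar.
    return difflib.SequenceMatcher(None, s_i, s_j).ratio()

def beta1(translate, perturb, pairs):
    # Robustness β1: similarity between the translation of the perturbed
    # source and the unperturbed gold target, averaged over N pairs.
    return sum(kappa(translate(perturb(e)), o) for e, o in pairs) / len(pairs)

def beta2(translate, perturb, pairs):
    # Faithfulness β2: similarity between the translation of the perturbed
    # source and the gold target perturbed the same way.
    return sum(kappa(translate(perturb(e)), perturb(o)) for e, o in pairs) / len(pairs)

def alpha(perturb, pairs):
    # Perturbation difficulty α: similarity between perturbed and
    # unperturbed source (higher alpha means an easier perturbation).
    return sum(kappa(perturb(e), e) for e, _ in pairs) / len(pairs)
```

A robust system scores high on `beta1`; a faithful one scores high on `beta2`. With a word-order perturbation such as reversal, the two targets generally differ, which is what lets the metrics discriminate.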

Perturbations
We propose 16 different functions to perturb the structure of an input sentence. The perturbations can be broadly classified into three categories (Random Shuffles, PoS-tag Based, and Dependency Tree Based) comprising 4, 8, and 4 perturbation functions respectively. The functions vary in complexity and linguistic sophistication so that we can score whether a model translates faithfully or stays robust to the perturbed inputs. We applied all perturbations in seven languages (de, fr, ja, ru, zh, it, and es) and describe each perturbation in turn below. See Figure 3 for a selection of examples.
Some of the perturbations we explore are "possible", in the sense that applying them will result (in most cases) in a grammatical sentence (either in the source language, or in some version of another existing language that is instead supplied with the words of the source). Others are "impossible" (Moro, 2015, 2016). For example, it has long been noticed that human grammar rules operate on hierarchical structure, resulting in rules of the form "move the hierarchically closest auxiliary" as opposed to "move the linearly closest auxiliary" when forming questions (Chomsky 1962/2013; Ross 1967; Crain and Nakayama 1987, i.a.). Standard American English exemplifies this: when we form a question from "The man who is tall was happy", we say "Was the man who is tall happy?" not "Is the man who tall was happy?" (McCoy et al. 2020; cf. Chomsky 1957). To explore more fully the behavior of the NMT models, we include several permutations that neither adhere to the descriptive rules of the source language nor to any grammars across all known human languages (i.e., are "impossible").

Examples from Figure 3, perturbing the source sentence "Tom said he could n't find a decent place to live .":

(a)
TreeMirrorPost: to live a decent place he could n't find Tom said .
TreeMirrorPre: said find place live to a decent he could n't Tom .
TreeMirrorIn: live to place a decent find he could n't said Tom .
RotateAroundRoot: live find said Tom he could n't a decent place to .
WordShuffle: place to could live said decent a Tom n't find he .
Reversed: live to place decent a find n't could he said Tom .

(b)
VerbSwaps: Tom live he find n't said a decent place to could .
NounSwaps: Tom said a decent place could n't find he to live .
NounVerbSwaps: said Tom could he n't a decent place find to live .
NounVerbMismatched: live a decent place find could n't he said to Tom .
ShuffleFirst: he Tom find could said n't a decent place to live .
ShuffleLast: Tom said he could n't find a decent live place to .

Random Shuffles
The perturbations in the Random bin treat the sentence as though it were a mere sequence of tokens; they reorder the tokens without any reference to their higher order linguistic properties (i.e., PoS or dependency information). Thus, random perturbations can be seen as the most basic type of "impossible" word order perturbation. We use four different random shuffles (Word-Shuffle, Shuffle-First-Half, Shuffle-Last-Half, and Reversed), none of which result in any recognizable linguistic structure. Word-Shuffle shuffles the entire sentence at random (cf. Sinha et al. 2020); for a sentence of length n, there are (n − 1)! possible random permutations. Shuffle-First- and Shuffle-Last-Halves shuffle only the corresponding half of a sentence while keeping the other half unperturbed. Reversed reverses the token ordering in a sentence.
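The four random shuffles are straightforward to implement on a token list. The sketch below is illustrative (the fixed seed for reproducibility is our assumption, not something the paper specifies):

```python
import random

def word_shuffle(tokens, seed=0):
    # WordShuffle: permute the whole token sequence at random.
    rng = random.Random(seed)
    out = tokens[:]
    rng.shuffle(out)
    return out

def shuffle_first_half(tokens, seed=0):
    # Shuffle-First-Half: shuffle only the first half of the sentence,
    # keeping the second half unperturbed.
    mid = len(tokens) // 2
    return word_shuffle(tokens[:mid], seed) + tokens[mid:]

def shuffle_last_half(tokens, seed=0):
    # Shuffle-Last-Half: shuffle only the second half of the sentence.
    mid = len(tokens) // 2
    return tokens[:mid] + word_shuffle(tokens[mid:], seed)

def reversed_order(tokens):
    # Reversed: reverse the token ordering of the sentence.
    return tokens[::-1]
```

All four functions reorder tokens without deleting or injecting any, matching the constraint stated in the abstract.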

Part-of-Speech tag Based Perturbations
This set of perturbations uses the PoS tags from a parser to generate perturbations for a sentence, so that we can localize any effects of robustness or faithfulness to particular linguistic categories.
PoS Swaps. When a sentence has more than one token with a particular PoS, the positions of those tokens are exchanged without affecting the rest of the sentence structure. 2 Although the meanings of the sentences are altered, the result generally is grammatical (or near grammatical, see Figure 3(b)), meaning that these swaps are "possible". In this class of permutations, we consider Noun swaps and Verb swaps.
In a second class of swaps, the position of a token with a particular PoS tag X ∈ {noun, adv} is interchanged with that of the linearly closest token with PoS tag Y ∈ {verb, adj}, leaving the rest of the sentence unperturbed. In this class, we consider Adverb-Verb swaps and Noun-Adjective swaps (which tend to result in grammatical sentences).

Functional Shuffle. Functional tokens (i.e., conjunctions, prepositions, and determiners) are reordered so that each occupies the original position of another functional token in the perturbed sentence.
Verb-At-Beginning. This perturbation moves a verb to the beginning of the sentence as a prefix without disturbing the remaining relative positions within the text. If the sentence has multiple verbs, the first verb found when parsing the sentence will be moved to the beginning.
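Two of the PoS-based perturbations can be sketched as below. This is a hedged illustration: it assumes the sentence has already been tagged by some PoS tagger and is given as (token, tag) pairs, and the tag names "NOUN"/"VERB" are placeholders for whatever tag set that tagger uses.

```python
def pos_swap(tagged, pos):
    # PoS Swaps (e.g. NounSwaps, VerbSwaps): exchange the positions of
    # tokens sharing the given PoS tag, leaving the rest of the
    # sentence untouched. Here we reverse the order of the matching
    # tokens among their original positions.
    idx = [i for i, (_, p) in enumerate(tagged) if p == pos]
    words = [w for w, _ in tagged]
    for i, j in zip(idx, reversed(idx)):
        words[i] = tagged[j][0]
    return words

def verb_at_beginning(tagged):
    # Verb-At-Beginning: move the first verb found to the front of the
    # sentence, preserving the relative order of the remaining tokens.
    for i, (w, p) in enumerate(tagged):
        if p == "VERB":
            rest = [t for j, (t, _) in enumerate(tagged) if j != i]
            return [w] + rest
    return [w for w, _ in tagged]  # no verb: sentence unchanged
```

For "Tom bit the dog" tagged as NOUN/VERB/DET/NOUN, `pos_swap(..., "NOUN")` exchanges "Tom" and "dog" while everything else stays in place.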

Dependency Tree Based
The dependency tree structure of a sentence conveys its grammatical structure. Perturbing the dependency tree in a language like English-which expresses verb-argument relationships largely via word order-could have several effects: the semantics of the sentence will be changed, and the base word order might now be indicative of a different family of languages. Therefore, we investigate dependency tree perturbations with an eye towards determining whether perturbations that result in sentence structures from another family (e.g., Japanese) will be more faithfully translated.
Tree Mirror (Pre/Post/In). While an In-Order traversal of a sentence's dependency tree (Figure 4) recovers the original token order of the sentence, we perform Pre-Order, Post-Order, and In-Order traversals on the mirrored dependency tree. Although the perturbed sentences largely preserve each word's position with respect to its local neighbors, since they are ungrammatical, their meanings (if any) are much harder to understand.
Rotate Around Root. The sentence is perturbed by rotating the tree around its root and then subsequently performing an In-Order traversal.
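The mirror-and-traverse operations can be sketched as follows. The `Node` class and the convention that in-order traversal splits children by their linear position relative to the head are our illustrative assumptions; the paper presumably works on trees produced by a real dependency parser.

```python
class Node:
    def __init__(self, word, index, children=None):
        self.word = word
        self.index = index          # linear position in the sentence
        self.children = children or []

def mirror(node):
    # Mirror the dependency tree: reverse the child order at every node.
    return Node(node.word, node.index,
                [mirror(c) for c in reversed(node.children)])

def pre_order(node):
    # Head first, then each subtree.
    out = [node.word]
    for c in node.children:
        out += pre_order(c)
    return out

def post_order(node):
    # Each subtree first, then the head.
    out = []
    for c in node.children:
        out += post_order(c)
    return out + [node.word]

def in_order(node):
    # In-order for an n-ary dependency tree: subtrees whose heads
    # precede this head in the sentence, then the head, then the rest.
    left = [c for c in node.children if c.index < node.index]
    right = [c for c in node.children if c.index >= node.index]
    out = []
    for c in left:
        out += in_order(c)
    out.append(node.word)
    for c in right:
        out += in_order(c)
    return out
```

On a small tree for "the dog bit the man" (root "bit" with dependents "dog" and "man"), `in_order` on the original tree reproduces the sentence, while traversals of the mirrored tree yield the scrambled orders the TreeMirror perturbations produce.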

Distribution
We observe in Figure 5 that the dependency tree-based perturbation functions have less overlap with the PoS tag-based perturbations across languages, but higher intra-category similarity scores. Similarly, the PoS tag-based functions understandably have higher similarity with other PoS tag-based functions than with the Shuffle or Dependency tree perturbation functions.

Experiments
We experiment with several state-of-the-art translation models: the OPUS translation models (Tiedemann and Thottingal, 2020), mBART (Liu et al., 2020b), and Facebook's M2M (Fan et al., 2020b) (both the 418M and 1.2B models). We construct the perturbed dataset using the eval set of the OPUS corpus (Tiedemann and Thottingal, 2020) in 7 different languages paired with English as the source: French (fr), German (de), Russian (ru), Japanese (ja), Chinese (zh), Spanish (es), and Italian (it). Our experiments 3 have a twofold objective: (1) compute the robustness (β 1 ) and faithfulness (β 2 ) of the translations into different languages when the input is perturbed, and (2) analyse the β 1 and β 2 scores under different levels of perturbation.

Faithfulness vs. Robustness
For each language paired with English, we perturb the source English and the gold target sentences with the perturbation functions proposed in §4. We measure β 1 and β 2 with BLEU-4 (Papineni et al., 2002b), BLEURT (Sellam et al., 2020), and BERT-Score (Zhang et al., 2019) as the choice for κ. As BERT-Score and BLEURT were forgiving of the flaws 4 in predictions, scoring models as more robust than they are, we base our analysis on BLEU-4 as the choice of κ.
We observe that β 1 scores are generally higher than β 2 scores across the perturbation functions and across all the languages, indicating that the translation systems are largely unfazed when presented with unnatural, ungrammatical input (see Figure 1) 5 . Given these results, the model acts as though it either recreates the unperturbed input in an intermediate "hallucination" step before translating it, or "hallucinates" an unperturbed target without much reference to the perturbed source.
Patterns in β 1 and β 2 , and Length

Given our results, we would like to know whether there are particular properties of examples or of permutations that lead models to be more or less robust. To that end, we examine the correlations between (a) β vs. β 1 /β 2 , (b) β 1 vs. β 2 , and (c) β 1 /β 2 vs. the length of the source sentence.
β vs. β 1 /β 2 . We find that β 1 correlates with BLEU-4 computed on the translation of the original, unperturbed gold English sentence against the gold target. We show correlations of β 1 and β 2 with β in Figure 7. The Spearman's rank correlation between β 1 and β is larger than that between β 2 and β; in the former we observe a medium-strength effect and in the latter a small effect, although language does play a role (e.g., Chinese has the largest β 1 correlation with BLEU, but among the smallest β 2 correlations with BLEU).

β 1 vs. β 2 . Figure 6(a) shows the correlation between robustness and faithfulness to be present, but weak. By definition, the model can be either faithful or robust; when it is both, that suggests a higher α e , i.e., a lower perturbation difficulty. This usually occurs when sentences are very short: for short sentences, fewer permutations are possible, and different permutation functions are more likely to collapse onto the same word orders.
β 1 /β 2 vs. Length. The length of the source sentence affects the scores differently depending largely on the language. Intuitively, the model is better able to fix a word-order perturbation when the sentence is short, resulting in higher β 1 scores for shorter sentences. The opposite is true for β 2 , where longer sentences generally have higher scores.
There is some relationship between the permutation function that generated a permuted example and its α e score (Figure 12). The five permutation functions with the highest α e scores are {shuffleHalvesLast, shuffleHalvesFirst, verbAtBeginning, nounVerbSwap, nounVerbMismatched}, and those with the lowest are {treeMirrorPost, wordShuffle, reversed, treeMirrorIn, treeMirrorPre}. The mix of examples from different perturbation categories at different levels of α e , together with the fact that β 1 scores are higher than β 2 , suggests that the models' attempts to correct the perturbed input may not arise because they understand language, but instead from correlations between certain n-grams in the sentence. We also observe that β 1 decreases with increasing α L e , which further supports this argument.

Discussion
Languages Vary. One way to think about the models' tendency towards behaving robustly is to take them to be hallucinating an unperturbed response even when the word order of the original is perturbed. The difference between β 1 and β 2 (Figure 1) shows a ranking across languages and across perturbation functions. Among the languages analysed, Japanese in Helsinki is generally more robust than the other languages. However, we note that our findings could also be attributed to the strength of the translation system: Japanese in Helsinki has the highest performance (Table 2), and the correlation between β and β 1 supports this argument. Likewise, the weak β 1 and β 2 scores of the Chinese translation model could be attributed to the generally poor performance of the translation systems for that language (Table 2 shows that the β scores of the Chinese model are very low).

Figure 6: We observe the length of the source sentences to correlate differently with the two scores. The robustness score, β 1 , is higher for shorter source sentences, while the opposite is true for β 2 , suggesting that the model's ability to see through syntactic errors is limited by sentence length. Also, the model staying faithful on longer sentences can be explained by their higher α e , hinting at their lower difficulty.
Perturbation Functions. Among the perturbation functions, FunctionalShuffle evoked the most robust generation across all languages, while models were most faithful on TreeMirrorIn and Reversed. Recall, however, that because all languages fall to the left of 0 in Figures 1 and 2, all models are reasonably robust. More work is needed to suggest clear ways of training a model to control its faithfulness or robustness. We believe our perturbation methods can be used to guide model selection by helping to determine just how faithful or robust a model should be based on specific downstream requirements.
Across Models. Although the models have different numbers of parameters, we observe in Figure 1 that they are all in general more robust than faithful. The performance of the non-Helsinki models suggests that their slightly higher NMT performance could be attributed to their greater representational capacity. In Figure 1 we also observed robustness to correlate largely with NMT performance (β).
Alternate choices for κ. To further understand the role of the metric in our results, we explored a few other translation metrics, including BERT-Score and BLEURT. However, we found that these metrics 6 overlook minor errors, scoring perturbed inputs as if the model were robust. This makes it unclear whether it is the model's tendency or the metric that is inflating the robustness. Hence, we found BLEU to be a more stable metric for this study.

Unnatural translations. Although rare, examples for which reordering the source results in a better target translation do exist. Similarly to the prediction flips observed by Sinha et al., a fraction of the translations have β 1 scores greater than β 7 . This suggests that the model might require the source sentence to be in a particular order to attain the expected translation. Our work opens up potential avenues for probing datasets for flips as a way to measure the "unnaturalness" of models' translation algorithms.

Conclusion
Overall, it is important to understand how NMT systems behave on such malformed input: should a model be robust and risk "hallucinating" an input, or should it be faithful, taking the input at face value and providing word-by-word translations? Particular examples might differ in whether a robust or a strongly faithful approach is warranted; for example, we wouldn't want to badly translate poetry that was using nonstandard word order for creative effect. Our novel metrics and perturbation functions allow one to quantify how systems strike a balance between robustness and faithfulness.

7 In some corner cases, we observed β 1 to be greater than β. This suggests that the model, at least in those cases, opts for an unnatural understanding of the syntax for the translation.

β scores by language across the four evaluated models (columns in the original table order):

German   0.40 ± 7.77 × 10⁻⁶   0.30 ± 7.10 × 10⁻⁶   0.25 ± 7.96 × 10⁻⁶   0.34 ± 8.80 × 10⁻⁶
Russian  0.39 ± 9.51 × 10⁻⁶   0.24 ± 8.00 × 10⁻⁶   0.23 ± 8.36 × 10⁻⁶   0.28 ± 8.53 × 10⁻⁶
French   0.45 ± 7.66 × 10⁻⁶   0.35 ± 7.15 × 10⁻⁶   0.30 ± 6.89 × 10⁻⁶   0.37 ± 8.33 × 10⁻⁶
Japanese 0.69 ± 4.01 × 10⁻⁶   0.07 ± 1.64 × 10⁻⁶   0.07 ± 1.77 × 10⁻⁶   0.10 ± 2.72 × 10⁻⁶
Italian  0.39 ± 9.74 × 10⁻⁶   0.37 ± 9.67 × 10⁻⁶   0.30 ± 9.93 × 10⁻⁶   0.35 ± 9.52 × 10⁻⁶
Spanish  0.47 ± 8.34 × 10⁻⁶   0.30 ± 7.47 × 10⁻⁶   0.34 ± 7.75 × 10⁻⁶   0.39 ± 9.96 × 10⁻⁶
Chinese  0.08 ± 2.95 × 10⁻⁶   0.09 ± 3.25 × 10⁻⁶   0.07 ± 2.96 × 10⁻⁶   0.10 ± 5.07 × 10⁻⁶

Table 3: Number of flips by language and model. We found no relation between the number of flips a model exhibits when presented with perturbed data and its size or its performance on the NMT task (β). At this point we believe this is just noise and may have more to do with the dataset than with the models themselves. The scores computed using BLEU-4 record the differences better, showing that harder perturbations have lower β 1 and β 2 scores, while for the other perturbations the models' robustness is highlighted well.
Figure 10: The BERT-score can be observed to be too forgiving of the perturbations in the text, thereby showing no difference in scores across languages. The sheer lack of discrimination between perturbed and unperturbed inputs makes BERT-score a less suitable candidate for the task.

Figure 13: Models ignore the precise word order they are presented with: compare the heat maps showing higher β 1 than β 2 values on average across languages. Models tend to recover more when faced with PoS tag-based perturbations: Figure 12 generally shows darker shades for PoS tag-based perturbations than for the others. This means that models find it harder to ignore word order for sentences perturbed with Dependency tree-based and Random perturbations than with PoS tag-based ones.

Figure 15: Models tend to be more robust and more faithful for easier perturbations (higher α e ). Longer sentences having higher α e has more to do with most of our perturbation functions targeting specific sentence constituents, leaving the majority of the sentence unperturbed. (Length is normalized by the length of the longest sentence in every language, plus 1, to compute a value in [0, 1].)

Table 4: Samples from across different languages and perturbations where the models translated better when the source sentence was perturbed (à la Sinha et al. 2020). Although such flips constitute only a small fraction, we observed this unnaturalness in the models' understanding of syntactic structure in the translation task.

Table 5: Samples from across different languages and perturbations where the models translated better when the source sentence was perturbed. Although such flips constitute only a small fraction, we observed unnaturalness in the understanding of the syntactic structure in the translation task. This is similar to the observations made by Sinha et al. (2020).