Pulling Out All The Full Stops: Punctuation Sensitivity in Neural Machine Translation and Evaluation



Introduction and Related Work
Since the advent of the Transformer model (Vaswani et al., 2017), machine translation (MT) has seen tremendous improvement in performance, with several claims of parity with human translations (Wu et al., 2016; Hassan et al., 2018; Popel et al., 2020). However, one issue that is common to most deep learning models but does not hinder humans is sensitivity to small changes in the input, or a lack of robustness.
Robustness in machine translation refers to the ability of models to produce consistent translations that preserve the meaning of the source sentence regardless of any noise in the input (Heigold et al., 2018). Changes in the input that preserve the semantics should not significantly change the output of the models. This can be a particularly critical quality for commercial machine translation systems, which are expected to translate real-world data including social media or internet text, which tend to be non-standard and noisy (Li et al., 2019). Models are typically tested for robustness by changing the input to introduce noise, called a perturbation, and checking whether the output is different. Several works have documented the sensitivity of machine translation models to various kinds of noise which commonly occur in real-world data (Belinkov and Bisk, 2018; Niu et al., 2020; Tan et al., 2020). There has also been work on adversarial attacks, where algorithms with access to model gradients try to find optimal perturbations that result in a significant performance drop, or manipulate the model into producing malicious output (Ebrahimi et al., 2018; Wallace et al., 2020; Zhang et al., 2021). Most of these works have concentrated on robustness to variations in orthography and grammar. Table 1 shows some examples.
There has also been some work on MT evaluation metric robustness that has included similar perturbations at the character and word level, and other linguistic phenomena such as synonyms, named entities, negation, numbers, and others (Sun et al., 2022; Freitag et al., 2022; Karpinska et al., 2022).
However, Michel et al. (2019) argue that many of these perturbations do not preserve the meaning on the source side. They propose that "meaning-preserving" perturbations should be limited to nearest neighbours in the embedding space and out-of-vocabulary word-internal character swaps.
In this work, we take a further step back from meaning-preserving spelling and grammatical perturbations, and ask: are machine translation models robust to trivial changes in sentence-final punctuation? Are the metrics used to evaluate machine translation robust to the same changes?
To investigate this, we test basic punctuation variation for which robustness may have been taken for granted. We perform simple sentence-final punctuation perturbations, restricting the experiments to two settings: insertion and deletion. Mimicking a very common form of natural noise, we insert or delete full stops, exclamation marks and question marks at the end of the input sentence (§2; see Table 1 for an example). Unlike common perturbation strategies, we make no changes to the content, words, or characters which may cause out-of-vocabulary or unseen tokens in the input. Our goal in this work is not to induce as drastic a drop in performance as possible, but to investigate the changes in translation that result from extremely minimal perturbations, and whether we are adequately able to detect these changes.
We test commercial MT systems from Google, DeepL and Microsoft on 3 language pairs from across resource levels and scripts: German (De), Japanese (Ja) and Ukrainian (Uk) to English (En). These systems are intended for real-world use, and can therefore be expected to already be robust to common noise in real-world data.
We first investigate whether commonly used evaluation metrics are robust to our perturbations, in order to ensure that our subsequent evaluation of the MT systems is fair (§3). We find that both string-based and model-based evaluation metrics are not robust to trivial sentence-final punctuation perturbations, significantly penalizing text with mismatched full stops, question marks or exclamation marks, sometimes more than text with more severe perturbations such as the insertion or deletion of random characters.
Based on these results, we deviate from the standard robustness testing regime of perturbing the inputs and expecting the translations of both the original and the perturbed source text to match exactly. In the MT setting, adding a punctuation mark to the source text can naturally induce the model to also produce the corresponding punctuation in the translation. We therefore reset the punctuation changes in the translations in order to perform evaluation, and call for a review of standard MT robustness evaluation in such settings.
More importantly, we show that even commercial machine translation systems are extremely sensitive to trivial punctuation changes, particularly in languages such as Japanese and Ukrainian (§4). We show that both insertion and deletion of punctuation cause performance drops, which indicates that models may be biased to expect (or not expect) punctuation in certain types of sentences. We conduct a manual analysis and find that in more severe cases, a mere punctuation change can cause complete changes in the meaning of the translation or introduce hallucinations such as negation, with less severe changes including pronouns, named entities, tense, number, and others (§5). Søgaard et al. (2018) provide some common examples of punctuation variation in real-world data and demonstrate how dependency parsers are sensitive to such punctuation differences. Ek et al. (2020) demonstrate the sensitivity of neural models to punctuation in Natural Language Inference tasks. Though there has also been some work on punctuation-based perturbation for machine translation (Bergmanis et al., 2020; Karpinska et al., 2022), the tendency has been to make more extreme perturbations than we adopt. Unlike previous work, we do not combine all punctuation changes into one bucket, and instead analyse each punctuation mark separately. We find that models are more sensitive to some punctuation marks than others. We also unify the usually independent work on machine translation robustness and evaluation metric robustness, and adjust our evaluation based on our observations.
Our work exposes serious implications for real-world use cases and serves to show that while great strides have been made in both machine translation and its evaluation, we are a long way from building systems that are reliable for real-world use.

Test Set Creation
In this section, we describe the original test sets and the perturbation operations we perform to build our test sets. Our perturbations reflect natural noise in punctuation occurrence: we only insert or delete punctuation such as full stops, exclamation marks and question marks from the ends of sentences.

Original Test Data
In order to build our perturbation test sets, we need a large test set with naturally occurring noise, e.g., sentences which originally do not have full stops at the end (for insertion) or sentences ending with question marks (for deletion). Test sets typically have a majority of sentences ending with full stops, while other punctuation or punctuation-less sentences occur less often. In order to maximize these sentences, we combine test sets across FLORES101 (Goyal et al., 2021) and WMT 2020-2022 (Barrault et al., 2020, 2021; Kocmi et al., 2022) in both directions for German (De, high-resource), Japanese (Ja, medium-resource) and Ukrainian (Uk, medium-resource) to English (En). We choose these 3 language pairs to optimize for diversity in resource levels and scripts, while ensuring we have adequate test data and commercial MT system support. FLORES101 and WMT2022 are general domain test sets, while WMT2020-2021 are news domain.
We then split the final combined test set based on whether the sentences originally end with a (i) full stop, (ii) exclamation mark, (iii) question mark, or (iv) no punctuation. In order to balance the test set sizes, we randomly choose 1000 sentences ending with a full stop. All test set sizes are given in Appendix A.1.
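The splitting and balancing procedure above can be sketched as follows. This is our own illustrative code, not the paper's released implementation; the function names and the "none" bucket label are ours, and we include full-width marks on the assumption that the Japanese data uses them.

```python
import random

# Map both ASCII and full-width (Japanese) sentence-final marks to one key.
PUNCT_MAP = {".": ".", "。": ".", "!": "!", "！": "!", "?": "?", "？": "?"}

def split_by_final_punct(sentences):
    """Bucket sentences into the four splits described above: ending in a
    full stop, exclamation mark, question mark, or no final punctuation."""
    buckets = {".": [], "!": [], "?": [], "none": []}
    for sent in sentences:
        sent = sent.strip()
        key = PUNCT_MAP.get(sent[-1], "none") if sent else "none"
        buckets[key].append(sent)
    return buckets

def balance(bucket, n=1000, seed=0):
    """Randomly subsample the (usually oversized) full-stop split to n."""
    if len(bucket) <= n:
        return list(bucket)
    return random.Random(seed).sample(bucket, n)

sents = ["Is there?", "Stop!", "Fine.", "No final punctuation here"]
splits = split_by_final_punct(sents)
```

The full-stop bucket would then be passed through `balance` to cap it at 1000 sentences, matching the balancing step in the text.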

Perturbation Tests
Insertion. For the insertion perturbation, we start with the test set split that originally occurs with no ending punctuation, and then insert at the end of each sentence: a (i) full stop, (ii) exclamation mark, (iii) question mark, or a (iv) random character for comparison. The insertion of a single punctuation mark at the end of a sentence is an extremely minimal perturbation that does not change any content. We contrast this with the insertion of a random character at the end, which changes the final word.
Deletion. For the deletion perturbation, we start with the test set splits that originally occur with a punctuation mark at the end of the sentences (full stop, exclamation or question mark) and delete it. We also contrast this with deleting the final character.
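The three perturbation operations are simple enough to sketch directly; a minimal version (our own, with full-width Japanese marks included as an assumption about the data) could look like:

```python
import random

SENT_FINAL = ".!?。！？"  # ASCII plus full-width marks for Japanese

def insert_punct(sentence, mark):
    """Insertion test: append a single sentence-final mark, leaving all
    other content untouched."""
    return sentence.rstrip() + mark

def delete_punct(sentence):
    """Deletion test: remove the sentence-final mark, if one is present."""
    s = sentence.rstrip()
    return s[:-1] if s and s[-1] in SENT_FINAL else s

def insert_random_char(sentence, seed=0):
    """Contrastive perturbation: append a random ASCII letter, which,
    unlike the punctuation perturbations, changes the final word."""
    return sentence.rstrip() + random.Random(seed).choice(
        "abcdefghijklmnopqrstuvwxyz")
```

Because no interior characters are touched, `insert_punct` and `delete_punct` never introduce out-of-vocabulary word forms, which is the property the paper's minimal setup relies on.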

Evaluation Metrics
Before we evaluate the machine translation systems on our punctuation perturbation test sets, we first evaluate the evaluation metrics themselves to see if they are robust to these variations. This meta-evaluation is crucial; if the metrics are not reliable, we cannot be sure if changes in the scores are due to changes in translation content. We include the string-based metric BLEU (Papineni et al., 2002) for convention, and based on the recommendations from Kocmi et al. (2021), we use chrF (Popović, 2015), which is another string-based metric, and COMET (Rei et al., 2020), which is a model-based metric shown to have high correlations with human judgements, and also include BLEURT-20 (Sellam et al., 2020) and BERTScore (Zhang* et al., 2020). Metric versions can be found in Appendix A.2.

Meta-Evaluation
Typical robustness tests for machine translation evaluate the translations of both the original and the perturbed source texts against the original reference text (Belinkov and Bisk, 2018; Michel et al., 2019; Bergmanis et al., 2020). The implicit assumption here is that given that the semantics are preserved, the ideal MT system should produce the same or a similar translation for both, and that the automatic metrics used to perform evaluation against the original reference translation will accurately measure the translation quality.
However, adding or deleting punctuation from the source input can lead to a predictable corresponding presence or absence of punctuation in the machine translation, which the reference translation lacks, since the reference matches the punctuation of the original source. In such circumstances, it is unclear whether this significantly influences the translation quality perceived by the metrics.

Setup.
In order to investigate whether automatic metrics are robust to the "translation of perturbed source but original reference" discrepancy, we conduct experiments comparing the scores produced by the metrics using the original and perturbed source texts as the "reference" and "translation" texts. More concretely, given the original source text X, its perturbed version X′, and a scoring metric f(Y, R) where Y is the translation and R is the reference, we compute the score f(X, X) (perfect match) and f(X′, X) (single punctuation mismatch). We conduct this comparison for both the insertion and deletion tests, across all 4 languages (De, Ja, Uk and En).

Table 2: Results comparing the punctuation insertion perturbed source texts against the original source texts using various metrics and showing the difference in scores. All comparisons use the original source text as the "reference" translation. COMET, BERTScore and BLEURT are reported ×100 to match all score scales. Note that COMET and BLEURT do not always produce a score of 100.0 for perfect matches as they were not trained to produce scores within a specific range.
The goal here is to measure, given all else is equal, whether punctuation insertion/deletion at the end of the sentence significantly affects the scores produced by the automatic metrics, and how this compares against a more typical perturbation of inserting or deleting a random final character. Ideally, the metrics should not produce statistically significantly different scores given trivially perturbed inputs. We can then rely on scores produced by the metrics to perform robustness evaluations.
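The f(X, X) versus f(X′, X) comparison can be illustrated with a toy score. The sketch below is ours: `toy_chrf` is a deliberately simplified character n-gram F1 stand-in for chrF (the real chrF weights recall with β = 2 and is computed via sacreBLEU), used only to show the shape of the meta-evaluation.

```python
from collections import Counter

def char_ngrams(text, n):
    """Multiset of character n-grams of a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def toy_chrf(hyp, ref, max_n=6):
    """Simplified chrF-like score: average character n-gram F1 over
    n = 1..6. (A stand-in; the real chrF uses recall-weighted F-beta.)"""
    f_scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue
        overlap = sum((h & r).values())
        prec = overlap / sum(h.values())
        rec = overlap / sum(r.values())
        f_scores.append(0.0 if prec + rec == 0 else
                        2 * prec * rec / (prec + rec))
    return 100 * sum(f_scores) / len(f_scores) if f_scores else 0.0

x = "Se kyllä tuntuu sangen luultavalta"   # original source X
x_prime = x + "."                           # perturbed source X'

perfect = toy_chrf(x, x)          # f(X, X): perfect match
mismatch = toy_chrf(x_prime, x)   # f(X', X): single punctuation mismatch
```

Even this toy metric assigns f(X′, X) < f(X, X) = 100 for a single appended full stop; the question the meta-evaluation asks is how large that gap is for real metrics, and whether it exceeds the gap for genuinely severe perturbations.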
Insertion Results. The meta-evaluation results for the punctuation insertion tests are shown in Table 2. We see that the metrics produce significantly different scores even though the only difference is a single additional punctuation mark at the end of the sentence. The difference is particularly stark for BLEU, COMET and BLEURT, while less pronounced for chrF and BERTScore, and is equally poor across languages. More interestingly, we see that while string-based matching metrics such as BLEU and chrF treat all punctuation equally, model-based metrics assign drastically lower scores for exclamation and question marks. In the case of BERTScore, punctuation insertion results in lower scores than random character insertion for all languages except English.

Table 3: Results comparing the punctuation deletion perturbed source texts against the original source texts using various metrics and showing the difference in scores. All comparisons use the original source text as the "reference" translation. COMET, BERTScore and BLEURT are reported ×100 to match all score scales. Note that COMET and BLEURT do not always produce a score of 100.0 for perfect matches as they were not trained to produce scores within a specific range.
Deletion Results. The meta-evaluation results for the punctuation deletion tests are shown in Table 3. A similar trend is seen here, where the lack of a single punctuation mark at the end of the sentence causes a significant drop in scores across all metrics. We also see the same trend where missing exclamation or question marks result in more significant drops in scores. Furthermore, punctuation deletion more often results in lower scores than deleting a random final character, compared to punctuation insertion. Surprisingly, results for Uk are relatively more stable than for En, particularly for COMET. Note that all score differences here will register as statistically significant: the original source will always "win" against the perturbed source in all comparisons performed by tests such as paired bootstrap resampling or randomization.
Some issues with BLEU have been highlighted previously (Reiter, 2018; Kocmi et al., 2021); COMET, BLEURT and BERTScore presumably suffer from robustness issues as neural models. chrF scores display smaller variations that are consistent across punctuation and languages, and therefore seem more reliable for robustness evaluations, corroborating the findings from Michel et al. (2019). Overall, we expand the metric sensitivity issues highlighted in Karpinska et al. (2022) for English in finer detail for punctuation, and further confirm them for German, Japanese and Ukrainian.
Comparison with severe translation errors. We performed a manual segment-wise analysis of a subset of machine translation outputs. We find that in several cases, particularly for shorter sentences, translations with punctuation differences are penalized similarly to translations with severe errors. See Table 4 for an example.
Broader Implications. More broadly, these results indicate that (i) statistically significant differences can be obtained merely by changing a single punctuation mark, (ii) models that fail to match the reference punctuation may be penalized more than they should be, and (iii) models that mistranslate a single word but correctly match the punctuation may be getting more credit than they should. For example, we found that up to 5% of the sentences in the WMT2022 Uk-En test set and up to 10% of the sentences in the WMT2022 Ja-En test set had mismatches between the ending punctuation in the source and reference. This could mean that model performance on these instances may be undervalued if the model reproduces the source punctuation. Conversely, we also found many instances of models producing acceptable punctuation that was not present in the original source (e.g., ≈ 13% of Microsoft's Uk-En output for the full stop deletion perturbation test set had full stops), which may also get unfairly penalized.
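A mismatch rate like the 5-10% figures above can be estimated with a few lines of bookkeeping. The sketch below is our own illustration of how such a check could be run over a parallel test set; the helper names are ours, and the toy sentences are invented.

```python
SENT_FINAL = ".!?。！？"  # ASCII plus full-width marks for Japanese

def final_mark(text):
    """Return the sentence-final punctuation mark, or '' if none."""
    t = text.rstrip()
    return t[-1] if t and t[-1] in SENT_FINAL else ""

def punct_mismatch_rate(src_sents, ref_sents):
    """Percentage of segments whose source and reference disagree on
    the sentence-final punctuation mark."""
    mismatched = sum(
        final_mark(s) != final_mark(r)
        for s, r in zip(src_sents, ref_sents)
    )
    return 100.0 * mismatched / len(src_sents)

# Toy parallel data: segments 1 and 3 disagree on final punctuation.
src = ["ありますか？", "Пример.", "No mark here"]
ref = ["Do you have it", "Example.", "No mark here."]
rate = punct_mismatch_rate(src, ref)
```

On such segments, a system that faithfully reproduces the source punctuation is penalized against the reference through no fault of its own, which is exactly implication (ii) above.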
More importantly, it may be worthwhile to reexamine how machine translation models are evaluated in robustness tests and after adversarial training, since resultant differences in scores may not be a reflection of actual translation quality.

Machine Translation Experiments
We now test the publicly available commercial machine translation systems of Google, DeepL and Microsoft through their paid APIs on our test sets. Some of these commercial systems have previously been claimed to have reached human parity (Wu et al., 2016; Hassan et al., 2018). Commercial systems are generally expected to deal with non-standard inputs as they are targeted for real-world use cases. We therefore expect that these systems have already been trained to be somewhat robust to various kinds of input noise.
For the insertion tests, we compare the translation of the original source text without punctuation against the translation of the perturbed source with sentence-final punctuation.For deletion tests, we compare the translation of the original source text with punctuation against the translation of the perturbed source without sentence-final punctuation.

Evaluation
Results from our meta-evaluation in §3.1 mean that we cannot get reliable results from evaluation metrics if we directly use the perturbed source translations and original references for evaluation; it will be hard to identify if changes in scores originate from translation differences or merely punctuation changes. One solution is to add the same punctuation perturbation to the reference that we add to the source. We find that this increases the overall scores since there is now an additional character that matches the reference in each sentence, rendering the score incomparable to the original translation.
Another solution is to reset the punctuation changes in the translations. We therefore remove corresponding sentence-final punctuation produced in the translations for the source inputs perturbed through insertion that is not also produced for the original source inputs, and vice versa for deletion, thereby making the two translations comparable. Henceforth we use chrF scores due to the metric's relative robustness, and include COMET scores as COMET has been shown to have high correlations with human judgements (Kocmi et al., 2021; Freitag et al., 2022).
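The reset step above can be sketched as follows. This is our own reading of the procedure, not the paper's released code: when the two translations disagree on their sentence-final mark, we strip the mark from whichever side has one, so that only content differences remain.

```python
SENT_FINAL = ".!?。！？"  # ASCII plus full-width marks for Japanese

def _split_final(text):
    """Split a translation into (body, final punctuation mark or '')."""
    t = text.rstrip()
    if t and t[-1] in SENT_FINAL:
        return t[:-1], t[-1]
    return t, ""

def reset_final_punct(y, y_prime):
    """Make the translation Y of the original source and the translation
    Y' of the perturbed source comparable: if their sentence-final marks
    disagree, drop the marks; otherwise leave both unchanged."""
    y_body, y_mark = _split_final(y)
    yp_body, yp_mark = _split_final(y_prime)
    if y_mark != yp_mark:
        return y_body, yp_body
    return y.rstrip(), y_prime.rstrip()

# Insertion side: the perturbed source induced a final "?" in Y'.
y, y_prime = reset_final_punct("Deauthorize the reader",
                               "Deauthorize the reader?")
```

After this reset, any remaining score difference between Y and Y′ reflects content changes rather than the punctuation mark that the perturbation predictably induced.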
Inconsistency. Apart from measuring whether perturbations cause degradation in translations compared to a reference, another important criterion is consistency. That is, given the original and the perturbed sources as input, we measure how different the translations produced for each are. Since here we also want to account for surface-level changes, we choose the string-based matching metric chrF based on the results in §3.1 and findings from Michel et al. (2019). Given a source X and its translation Y, and the perturbed source X′ and its translation Y′, we measure consistency at the sentence level as the score chrF(Y′, Y), where Y acts as the "reference". We designate a score < 75 to be a significant deviation in translation, and measure the percentage of inconsistency by counting the number of Y′ which have chrF < 75.
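The %Inconsistent figure reported in the results tables is then a simple thresholded count. The sketch below is ours; for brevity it uses stdlib `difflib.SequenceMatcher` as a 0-100 similarity stand-in where the paper uses sentence-level chrF, since only the bookkeeping is being illustrated.

```python
import difflib

def similarity(y_prime, y):
    """Stand-in for sentence-level chrF(Y', Y): any 0-100 similarity
    works for illustrating the thresholding; the paper uses chrF itself."""
    return 100.0 * difflib.SequenceMatcher(None, y_prime, y).ratio()

def inconsistency_rate(pairs, score_fn=similarity, threshold=75.0):
    """Percentage of (Y, Y') pairs whose perturbed translation Y' scores
    below the threshold against the original translation Y."""
    flagged = sum(1 for y, y_prime in pairs
                  if score_fn(y_prime, y) < threshold)
    return 100.0 * flagged / len(pairs)

# Toy pairs: the first is a minor rewording, the second a full rewrite.
pairs = [
    ("And I have to pay shipping again",
     "And do I have to pay shipping costs again"),
    ("Is there?", "do you have"),
]
rate = inconsistency_rate(pairs)
```

With chrF plugged in as `score_fn`, this reproduces the paper's definition: a pair counts as inconsistent exactly when chrF(Y′, Y) < 75.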

Results
The results for the punctuation insertion perturbation tests are given in Table 5. We see that in general, the insertion of sentence-final punctuation results in statistically significant drops in scores, but also some significant improvements. The results for the punctuation deletion perturbation tests are given in Table 7. Overall, deletion causes more drops in performance than insertion, and far fewer improvements in scores.

Effect of Language.
Unsurprisingly, based on inconsistency, we see that the models are far more robust to insertion perturbations for the high-resource language pair De-En, with generally < 10% inconsistency. More interestingly, we see that while Ja-En and Uk-En are both medium resource, the models are far more robust for Uk-En at 0-23% inconsistency, as compared to Ja-En, which has between 18-35% inconsistency across models.
We see the same inconsistency trends for deletion as for insertion: models are more robust to perturbations in De (0-23%) and Uk (0-25%) source texts than Ja (10-37%). Overall, deletion leads to a higher range of inconsistency than insertion.
Effect of Punctuation. We see that the models are more likely to be robust to full stop insertion than exclamation and question marks: statistically significant differences in performance occur more often for the latter. In fact, DeepL and Microsoft models seem to benefit from having full stops and exclamation marks added, with results improving for Ja-En and Uk-En. In the case of question marks, insertion causes a universal drop in scores across models and languages. For Uk-En, question mark insertion almost always causes more significant drops in scores than inserting a random character.
Unlike insertion, full stop deletion causes significant drops in scores, particularly for the DeepL and Microsoft models for Ja-En and Uk-En. Interestingly, question mark deletion does not cause a significant score drop in Ja-En for any of the models. This is possibly because the question mark is mostly optional in Ja, which uses the particle 'か' as a question marker.
Pre-processing. We see that both insertion and deletion can cause degradation in performance. This means that while pre-processing of the inputs to ensure consistent punctuation may lead to more consistent translations, it is unlikely to result in better quality translations.

Analysis and Discussion
Some examples of translation changes caused by the perturbations are given in Table 6. Both insertion and deletion cause a wide range of translation changes, with a few severe errors where the meaning is completely changed, such as by hallucinating or omitting negation. Others include changes in number, tense, pronouns, named entities, etc.
Reordering. Often, inserting or deleting punctuation leads to a reordering of the words in the sentence. In many cases the reordering leads to mostly similar but slightly off translations (Example 4), with some cases causing significant differences in meaning (Example 8).
While we might expect punctuation perturbation to ideally cause no other changes in translation apart from the difference in punctuation itself, there could be cases of valid translation changes caused by the perturbation. For example, while "1) Heben Sie die Autorisierung des Lesegeräts auf" is originally translated as "1) Deauthorize the reader", adding a question mark does not produce "1) Deauthorize the reader?" but instead "1) Are you deauthorizing the reader?". This word reordering for an interrogative sentence, typical particularly for English, can be considered a valid change even though the chrF (65.5 → 52.7) and COMET (28.2 → −25.8) scores drop. There are also cases when the resultant reordering actually improves the scores of the translation despite being wrong, e.g., adding a question mark to "Und ich muss nochmal Versandkosten zahlen" changes the translation from "And I have to pay shipping again" to "And do I have to pay shipping costs again?" (instead of "And I have to pay shipping again?") and improves both chrF (17.5 → 23.4) and COMET (26.3 → 45.6) scores, presumably due to the presence of the word "costs" that now matches the reference ("And I still need to pay the delivery costs"). Similarly for question mark deletion, removing the question mark from "Заняття в понедiлок i середу вiдрiзняються?" changes the translation from "Are Monday and Wednesday classes different?" to "Monday and Wednesday classes are different", dropping the chrF (76.2 → 74.9) and COMET (91.0 → 83.7) scores. Expecting translations of both original and perturbed source texts to match is a standard evaluation setting for robustness tests, even for more severe perturbations resulting in drastic changes and out-of-vocabulary inputs (see Table 1). Given these results, we reiterate our call from §3 to re-examine this evaluation setup for settings similar to this work.

Table 6: Examples of changes in translation caused by perturbations. Punctuation perturbations at the end of the sentence are highlighted in blue, original translations are highlighted in yellow and the changes in the translations are highlighted in red. Given a translation Y of the original source X, a translation Y′ of the perturbed source X′ and a reference R, the ∆ scores show the differences in chrF and COMET scores, obtained as f(Y′, R) − f(Y, R), while Con. chrF measures the consistency through chrF scores between the two translations, obtained as chrF(Y′, Y). Best viewed in color.

Table 7: Results for the punctuation deletion task for De/Ja/Uk-En for Google, DeepL and Microsoft MT systems, showing the differences in scores of the translations for perturbed source texts. Lg. indicates language pair, while %Inconsistent is the percentage of sentences which have chrF < 75 with respect to the original translation. Results in bold are statistically significant (paired bootstrap resampling, p < 0.05).
However, there are several cases where the interrogative nature of the source is not dependent on the question mark and the model correctly produces a translation that is also interrogative but different. For example, deleting the question mark from "ありますか？" changes the translation from "Is there?" to "do you have". Example 12 shows another case where the model correctly recognizes the perturbed source as a question, but produces a significantly different translation. Example 5 and Example 6 are also cases of translation differences that are more severe than reordering.
Sentence Style Association. Although we see some critical translation changes due to perturbing full stops (Example 2), a majority of the translations underwent a change in sentence style. In particular, we found that inserting a full stop resulted in models producing longer, complete sentences, while deleting the full stop resulted in shorter, headline-style sentences. This was observed across systems (Examples 1 and 7), which indicates that this stylistic change presumably comes from what is commonly seen in training data: the models have seemingly learnt to associate a lack of full stop with article headlines from news domain data. In the case of Example 7, the changed translation better matches the reference ("Water heater temp and bath issue."), leading to improvements in both chrF (46.8 → 58.6) and COMET (36.7 → 75.3) scores.
Robustness. Previous works have correlated consistency with robustness (Niu et al., 2020; Wallace et al., 2020), the implication being that less consistent outputs are lower in translation quality. We find that this is not necessarily the case for our perturbation setting. For instance, Example 1 shows a translation that has high consistency (91.7 chrF compared to the original translation), while Example 7 has low consistency (56.0 chrF). However, in both cases the translations of the perturbed sources score significantly higher than the original translations.
Similarly, Example 12 has a very low consistency score (25.9), but the chrF reduces (−8.5) while the COMET increases (+14.1). COMET is more reflective of the translation quality here: given the reference ("How many LINE messages are okay to send in a day?"), the translation of the perturbed source is closer to the actual translation. Conversely, instances with relatively high consistency (Examples 3, 5, 8, 10) all drop in scores and have significant translation issues.
Other Changes. Some other changes in the translations include changes in number, tense, pronouns, named entities, capitalization, and so on. Some of the less severe errors, such as changes in capitalization or extra demonstratives, also incur heavy drops in chrF and COMET scores. Some more examples of the translations produced for perturbed inputs can be found in Appendix A.3.

Conclusions
In this work, we unite the robustness evaluation of both machine translation systems and their evaluation metrics, and discuss ways in which both fail to be adequately robust to trivial punctuation changes. This shows that models and metrics are in fact far more sensitive and a lot less reliable in real-world use cases than is commonly expected. We show that both metrics and machine translation systems treat each punctuation mark differently, with machine translation systems showing associations between punctuation and sentence styles. We also highlight the implications of these sensitivities for robustness research and evaluation for machine translation. Although it may not necessarily be a hard task to train systems that are robust to punctuation, our goal is to highlight one of the issues that has possibly been overlooked due to its triviality. We hope that future research in robustness, evaluation metrics and machine translation accounts for these sensitivities while performing evaluation and model training.

Limitations
Test Set Size. One of the main limitations of our work is the relatively small test set sizes. This stems from the way our perturbation experiments are set up: we can only use existing test sentences which already end with specific punctuation in order to measure the effect of deleting it, or start with sentences which do not have sentence-final punctuation in order to measure the effect of inserting it. In general, a majority of the official test sets have sentences ending in full stops; this results in having a smaller test set to work with. This is also the same issue that presumably gives rise to sensitivity issues in the trained models.
However, given that our focus has been on each particular punctuation mark, instead of merging them all together, we find that our test sets are larger than the ones used in previous work for each punctuation mark. Combined with the fact that we perform significance testing and manual analysis, we believe our results are reliable. Appendix A.1 includes details and a discussion.
Target Language. Although we test models across several source languages, the target language is always English. This limits our analysis of induced errors to phenomena that occur in English, for example, changes in number, reordering of words for question marks, or changes in capitalization. Languages without capitalization or number marking but with morphological richness and other phenomena are likely to have different errors. For example, inserting a full stop changes the translation to include 'Please' and makes the sentence more polite in Example 1 in Table 6. For languages like Japanese, which have complex systems of marking varying levels of honorifics, punctuation perturbations may result in more interesting changes to the translations.
A vast majority of previous work has performed perturbations on languages using the Latin alphabet, so we consider our work a step forward, considering that we also evaluate metrics on Japanese and Ukrainian texts. However, it is also important to evaluate sensitivity when both directions are non-English, for example, Ukrainian to Japanese translation. A lack of adequate parallel data in such directions usually precludes such experiments. We hope to undertake this in future work.

A.1 Test Set Sizes

The test set sizes reflect a general imbalance in sentence-final punctuation in parallel corpora that may be causing the sensitivity in the models. In order to be able to insert or delete punctuation, we are limited to sentences which originally have no punctuation or the specific punctuation we intend to delete. This is a requirement unique to our extremely minimal setup, since more indiscriminate punctuation perturbations can be carried out on a larger scale.
For comparison, the FLORES101 dataset has 1012 sentences, and the WMT2020-2022 test sets range from 785 to 2037 sentences. Some challenge sets for metrics in the WMT Metrics Tasks (Freitag et al., 2022) included 50 sentences per phenomenon for 3 language pairs (Alves et al., 2022) and 721 sentences covering 5 error types for Zh-En (Chen et al., 2022).

A.2 Metric Versions
Metric signatures and versions used for evaluation are given in Table 9.

Table 1 :
Common perturbations used in various robustness tests compared to our punctuation insertion. Original words in bold, perturbed text highlighted in red.
Niu et al. (2020): Se kyllä tuntuu sangen luultavalta. → Se kyllä tumtuu sangen luultavalta.
Michel et al. (2019): Si seulement je pouvais me muscler aussi rapidement. → Si seulement je pouvais me muscler asusi rapidement.
Ebrahimi et al. (2018): ... er ist Geigenbauer und Psychotherapeut. → ... er ist Geigenbauer und Psy6hothearpeiut.
Tan et al. (2020): When is the suspended team scheduled to return? → When are the suspended team schedule to returned?

Table 5 :
Results for the punctuation insertion task for De/Ja/Uk-En for Google, DeepL and Microsoft MT systems, showing the differences in scores of the translations for perturbed source texts. Lg. indicates language pair, while %Inconsistent is the percentage of sentences which have chrF < 75 with respect to the original translation. Results in bold are statistically significant (paired bootstrap resampling, p < 0.05).

Table 8 :
Test set sizes for our perturbation tests. Note that all punctuation insertion tests use the No Final Punctuation split, while the deletion tests use the respective ending punctuation splits. Random insertion and deletion both use the No Final Punctuation split.

Table 9 :
Metric versions and signatures. We use the sacreBLEU (Post, 2018) implementations for BLEU and chrF, and the huggingface implementations for BLEURT and BERTScore.

Table 10 shows some more examples of translation changes in response to perturbations. We see more instances of changes in sentence style, fluency, hallucination and others.

Table 10 :
Examples of changes in translation caused by perturbations. Punctuation perturbations at the end of the sentence are highlighted in blue, original translations are highlighted in yellow and the changes in the translations are highlighted in red.
DeepL (Uk-En): "In Ukraine, military service is compulsory for all men aged 16 to 29." → "In our country, military service is compulsory for all men aged 16 to 29."
Google (Uk-En, Example 15), source "Яблуко вiд яблунi недалеко, як вiдомо, пада.. ." → "Яблуко вiд яблунi недалеко, як вiдомо, пада..": "As you know, the apple does not fall far from the apple tree..." → "As you know, the apple falls far from the apple tree."
DeepL (Uk-En, Example 19), source "Гришко вже пiшов у яслi але не все так просто. . . Дуже сильно плаче": "Grishko has already gone to the nursery, but it's not so easy ..." → "Hryshko has already gone to the nursery, but not everything is so simple He cries a lot ."