As Easy as 1, 2, 3: Behavioural Testing of NMT Systems for Numerical Translation

Mistranslated numbers have the potential to cause serious effects, such as financial loss or medical misinformation. In this work we develop comprehensive assessments of the robustness of neural machine translation systems to numerical text via behavioural testing. We explore a variety of numerical translation capabilities a system is expected to exhibit and design effective test examples to expose system underperformance. We find that numerical mistranslation is a general issue: major commercial systems and state-of-the-art research models fail on many of our test examples, for high- and low-resource languages. Our tests reveal novel errors that have not previously been reported in NMT systems, to the best of our knowledge. Lastly, we discuss strategies to mitigate numerical mistranslation.


Introduction
Just as neural machine translation (NMT) systems have achieved tremendous benchmark results, they have been proven brittle when faced with irregular inputs such as noisy text (Belinkov and Bisk, 2018; Michel and Neubig, 2018) or adversarial inputs (Cheng et al., 2020). Among such errors, mistranslation of numerical text constitutes a crucial but under-explored category that may have profound implications. For example, in the medical domain, mistranslating the number of confirmed cases of a contagious disease like COVID-19 may exacerbate public health misinformation. Numerical errors made in financial document translation, e.g., an extra or omitted digit or decimal point, could lead to significant monetary loss. Surprisingly, we find that numerical mistranslation is a general issue faced by state-of-the-art NMT systems, including commercial and research systems, with evidence present across contexts: for both high and low resource languages, and for both close and distant languages.
* This work was conducted while author was working at Facebook AI
Table 1: Numerical errors discovered by our method when behavioural testing two popular commercial translation systems using their public APIs.
De facto standard metrics such as BLEU (Papineni et al., 2002) may fail to flag a numerical translation error: because such an error is typically a single-token mistranslation, it contributes only a very minor penalty. To facilitate the discovery of numerical errors made by NMT systems, we propose a black-box test method for assessing and debugging the numerical translation of NMT systems in a systematic manner. Our method extends the CheckList behavioural testing framework (Ribeiro et al., 2020) by designing automatic test cases to assess a suite of fundamental capabilities a system should exhibit in translating numbers.
Our tests on state-of-the-art NMT systems expose novel error types that have evaded close examination (Table 1). These error types greatly extend the number category (NUM) of the catastrophic errors of NMT systems with richer error types. Finally, the abuse of these errors constitutes a vector of attack: error-prone numerical tokens injected into monolingual data may corrupt back-translation-based training, as the resulting back-translated sentences are very likely to contain the desired errors.

Method
We follow the CheckList approach of Ribeiro et al. (2020) in designing our evaluation suite for NMT systems: we present several basic capabilities an NMT system should be expected to exhibit in translating common everyday numerical text; we then generate test examples specific to each capability to benchmark performance and find bugs in NMT systems.

Capabilities of Translating Numbers
We explore four capabilities (see Table 2), demonstrating expected translation ability of a system on common types of numerical text. Concretely, the Integers and Decimals represent basic capabilities; they can be manifested by testing on sequences of digits with variable lengths (e.g., 100 vs. 10000) or decimals with the decimal mark placed at varying locations (1.001 vs. 10.01). We find that the tested NMT systems are more likely to malfunction when translating larger integers and decimals with longer fractional parts. The Numerals capability pertains to whether a system is able to translate numbers that are presented as words. The Separators capability checks if a model can deal with numbers containing decimal or thousands separators. 2 Systems that fail to manifest one or more of these capabilities may produce wrong numbers that can be inconspicuous to users and become a ready, exploitable source of misinformation.
2 While Decimals and Separators may have overlapping instances (e.g., the decimal mark), their specific formats in our testing are different (Table 2), which leads us to find non-overlapping error types: most Decimals errors involve translating numbers into wrong digits, whereas Separators errors pertain to mistranslation in localisation usage (e.g., German and English use different decimal and thousands separators).

Test Examples
To efficiently test the identified capabilities across multiple systems on distinct language pairs, we generate desired test examples using templates. For example, to test the Numerals capability, we use a template sentence such as "CNBC reported there were at least [NUM] cases worldwide.", where "[NUM]" is a number with the format "ddd.ddn", consisting of multiple digits and a numeral (e.g., 100.01 million).
We experiment with formats of various lengths and decimal-point positions. We fill a format with random digits and numerals, and explore 25 different formats across all capabilities. This allows us to generate a diversity of numbers at scale, akin to fuzzing a program with random inputs to uncover bugs. We also note that all the numbers created for a format can be seen as a set of "adversarial" examples, as they are small perturbations of each other. Details about the test examples for each capability and the testing process can be found in Supplementary material.
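The template-filling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' generator: the function names, the `NUMERALS` vocabulary, the space inserted before the numeral, and the rule that the first digit is non-zero are all assumptions made for the example. Only the template sentence and the "ddd.ddn" format come from the text.

```python
import random
from typing import List

# Template sentence quoted in the text; "[NUM]" marks the slot to fill.
TEMPLATE = "CNBC reported there were at least [NUM] cases worldwide."

# Hypothetical numeral vocabulary (assumption; the paper's list is not given here).
NUMERALS = ["thousand", "million", "billion"]


def fill_format(fmt: str, rng: random.Random) -> str:
    """Fill a format such as 'ddd.ddn': 'd' becomes a random digit,
    'n' becomes a space plus a numeral word, other symbols are copied."""
    pieces = []
    for position, symbol in enumerate(fmt):
        if symbol == "d":
            # Assumption: avoid a leading zero in the first position.
            lowest = 1 if position == 0 else 0
            pieces.append(str(rng.randint(lowest, 9)))
        elif symbol == "n":
            pieces.append(" " + rng.choice(NUMERALS))
        else:
            pieces.append(symbol)
    return "".join(pieces)


def generate_examples(fmt: str, count: int, seed: int = 0) -> List[str]:
    """Generate `count` test sentences for one format, reproducibly."""
    rng = random.Random(seed)
    return [TEMPLATE.replace("[NUM]", fill_format(fmt, rng)) for _ in range(count)]
```

Seeding the generator keeps a test suite reproducible, so that a failure found on one system can be re-run unchanged on another.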

Evaluation
Before presenting experimental results and discussion of our test framework, we first detail our evaluation setup.
Language pairs. We test both high-resource (HR) and low-resource (LR) scenarios. For HR, we consider two language pairs: English-German and English-Chinese, and for LR, we focus on English-Tamil and English-Nepali. We test both translation directions for each pair.
SOTA systems. We conduct behavioural testing against two popular commercial translation systems (denoted by A and B). As research systems, we use pre-trained models that were shown to perform well in WMT competitions (denoted by R), specifically, fairseq's transformer for English-German (Ng et al., 2019), English-Tamil (Chen et al., 2020), and English-Chinese/Nepali (Fomicheva et al., 2020).
The evaluation metric. For each capability we curate a list of test examples (sentences containing numbers), which are taken from various sources, including existing corpora, or are manually crafted (details in Supplementary material). From each sentence we remove the number component and replace it with a number based on the specific capability being tested. This test collection is then input to a translation system, and we report the Pass Rate (PR), the fraction of inputs where the system translates the numerical component perfectly.
Table 3 shows the results of testing the three SOTA systems across the HR/LR language pairs. Among the four capabilities, Numerals turns out to be the most challenging across the systems tested, with an average PR of 70.8%. This is probably because, compared to other forms, numbers are less frequently written as words, resulting in insufficient examples available for training. At the other extreme, Integers, which tests on pure digits, is the easiest capability, as expected. Despite this, it is not a 'solved problem', given that every system reports a PR below 100% on at least one language.
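As an illustration of the Pass Rate metric, the sketch below counts a test case as passed when the expected target-side number appears verbatim among the numbers found in the system output. Both the matching criterion and the simplified `NUMBER_RE` pattern are assumptions made for this example; the precise criterion used in the paper is not reproduced here.

```python
import re
from typing import Iterable, Tuple

# Simplified number pattern (assumption): digit runs optionally joined by "," or ".".
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def passes(expected_number: str, translation: str) -> bool:
    """A case passes if the expected number string occurs as a whole number
    in the translation (not merely as a substring of a longer number)."""
    return expected_number in NUMBER_RE.findall(translation)


def pass_rate(cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (expected_number, translation) pairs that pass."""
    cases = list(cases)
    if not cases:
        raise ValueError("pass_rate needs at least one test case")
    return sum(passes(expected, output) for expected, output in cases) / len(cases)
```

Matching against whole extracted numbers, rather than substrings, means that an output of "1000" does not count as a correct translation of "100".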

Testing Performance
Across the systems, the research system R (PR: 77.8%) underperforms the two commercial ones (PR_A: 80.6%, PR_B: 90.4%). This is largely caused by the fact that the research system fails markedly on the En→Ne direction.
Per language, the results are similar in both translation directions, implying that numerical translation is a symmetric problem. Note that the results on LR are not always worse than those on HR (PRs on En-Ta are surprisingly the highest of all). This suggests that the size of training data is not the sole factor for high-quality numerical translation.

Error Analysis
We present analysis of novel types of mistranslations discovered from testing.
Decimal/thousands separators. We find that the decimal/thousands separators are prone to be mistranslated in localisation scenarios, when conventions differ between the languages (e.g., "," and "." are the thousands and decimal separators in English while they are swapped in German). A common type of error is that a separator remains the same after translation (Table 4, row 1). This is probably due to the lack of sufficient training data to learn the translation of the separators in the target language.
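To make the separator convention concrete, the following sketch converts a number between the English and German conventions by swapping "," and ".", and flags the error pattern described above, where the separators are copied unchanged. The function names are hypothetical, and the sketch assumes the input already follows one of the two conventions.

```python
# Swapping "," and "." maps English-style numbers to German-style ones and back.
SWAP = str.maketrans({",": ".", ".": ","})


def localise_separators(number: str) -> str:
    """Convert e.g. English '1,234.56' to German '1.234,56' (or the reverse)."""
    return number.translate(SWAP)


def separator_left_unchanged(source_number: str, translated_number: str) -> bool:
    """True if the source contains a separator and the translation copied
    the number verbatim instead of localising it."""
    has_separator = any(ch in ",." for ch in source_number)
    return has_separator and translated_number == source_number
```

For example, `localise_separators("1,234.56")` yields `"1.234,56"`, the form an English-to-German system would be expected to produce.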
Cardinal numerals. Cardinal numerals are commonly used in commercial and financial contexts. For example, the financial characters (e.g., "壹" meaning one) are typical in Chinese financial documents. However, we find that the tested translation systems perform fairly poorly in translating cardinal numerals (Table 4, row 2). Common errors include mistranslation or under-translation of the unit words (e.g., hundred) or the number words (e.g., "陆拾"). Most often, the errors appear to be caused by the unique unit words used in different languages (e.g., "万" in Chinese equals 10 thousand), where a system needs to "compute" the correct amount for translation.
Digits. The pure digit translation (10→10) is expected to be easy, since a system may opt to copy the entire number as the translation. However, we find that the digit translation between English and low-resource languages can be far from satisfactory. An example is the translation between English and Nepali (Table 4, row 3). One reason for this result is that Nepali has its own numerals for digits. As a result, a system would try to convert a digit into a Nepali digit (instead of keeping it unchanged) when translating numbers, which is difficult given limited training resources (Guzmán et al., 2019). Another common issue in digit translation is handling repeats of the same digit. A system is prone to omit or add one or more digits in the translation.
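The digit conversion discussed above is a fixed one-to-one mapping, which the sketch below illustrates using the Devanagari digits (U+0966 to U+096F) used for Nepali. The helper names are hypothetical. Normalising both sides to ASCII digits before comparison also exposes the dropped or added repeated digits mentioned in the text.

```python
import unicodedata

# One-to-one mapping from ASCII digits to Devanagari digits.
ASCII_TO_DEVANAGARI = str.maketrans("0123456789", "०१२३४५६७८९")


def to_devanagari(number: str) -> str:
    """Rewrite ASCII digits as Devanagari digits; other characters are kept."""
    return number.translate(ASCII_TO_DEVANAGARI)


def normalise_digits(number: str) -> str:
    """Map any Unicode decimal digit back to ASCII so that numbers written
    in different scripts can be compared digit by digit."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdigit() else ch for ch in number
    )


def same_digits(source_number: str, translated_number: str) -> bool:
    """True if both numbers contain exactly the same digit sequence."""
    return normalise_digits(source_number) == normalise_digits(translated_number)
```

A check such as `same_digits("1000", "१००")` returns `False`, flagging a translation that silently dropped one of the repeated zeros.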
Units. This error often occurs when translating numbers accompanied by units of measurement (e.g., 10 meters), especially when the target unit is unique to the language, e.g., "角" in Chinese means "10 cents". In such cases (Table 4, last row), the system may need to learn the implicit conversion rules and then use them to "calculate" the correct numbers with the target unit of measurement. For example, when translating "10.01 million" into "1001万" in Chinese, the system has to convert "10.01" into "1001" and then use the correct unit "万". An error may occur if the system fails either or both stages of this process (i.e., mistranslating the numbers and/or units).
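The two-stage conversion in the "10.01 million" example can be worked through explicitly: 10.01 × 10^6 = 10,010,000, and 10,010,000 / 10^4 = 1001, giving "1001万". The sketch below performs this arithmetic with exact decimals. It is a simplification that always uses "万" as the target unit; the `SCALE` table and function name are assumptions for illustration, and larger amounts would normally call for other units such as "亿".

```python
from decimal import Decimal

# English scale words and their values (illustrative subset).
SCALE = {
    "thousand": Decimal(10) ** 3,
    "million": Decimal(10) ** 6,
    "billion": Decimal(10) ** 9,
}
WAN = Decimal(10) ** 4  # "万" equals 10 thousand


def to_wan(amount: str, numeral: str) -> str:
    """Stage 1: compute the absolute value; stage 2: re-express it in 万."""
    value = Decimal(amount) * SCALE[numeral]
    in_wan = (value / WAN).normalize()  # drop trailing zeros
    return format(in_wan, "f") + "万"   # "f" avoids scientific notation
```

Using `Decimal` rather than floating point keeps the digits exact, which matters here because the point of the test is that every digit must survive the conversion.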

Potential Mitigation Strategies
Finally, we discuss several strategies that may mitigate the above errors discovered by our method.
Separate treatment of numbers. Although NMT models have been shown capable of performing basic arithmetic or bracket matching (Suzgun et al., 2019), this paper demonstrates that handling the various forms of numerical text in reality is still challenging. It may be worth separating numerical translation out into an individual process, as in Statistical MT (Koehn, 2009), that identifies numbers in the input, applies specific translation rules to them, and incorporates the translation into the output (Tu et al., 2012).
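One way to picture the separate treatment of numbers is a mask-and-restore wrapper around the translation step. The sketch below is an assumption-laden illustration rather than the method of the cited works: it assumes the NMT system copies placeholder tokens such as `__NUM0__` through unchanged, and the placeholder format, the `NUMBER_RE` pattern, and the function names are all hypothetical.

```python
import re
from typing import Callable, List, Tuple

# Simplified number pattern (assumption): digit runs optionally joined by "," or ".".
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")
PLACEHOLDER_RE = re.compile(r"__NUM(\d+)__")


def mask_numbers(sentence: str) -> Tuple[str, List[str]]:
    """Replace each number with an indexed placeholder and return the originals."""
    numbers: List[str] = []

    def replace(match: "re.Match[str]") -> str:
        numbers.append(match.group(0))
        return f"__NUM{len(numbers) - 1}__"

    return NUMBER_RE.sub(replace, sentence), numbers


def unmask_numbers(
    translation: str, numbers: List[str], convert: Callable[[str], str]
) -> str:
    """Put the numbers back, applying a rule-based conversion to each one."""
    return PLACEHOLDER_RE.sub(
        lambda match: convert(numbers[int(match.group(1))]), translation
    )
```

In use, the masked sentence would be sent to the translation system and `convert` would hold the language-specific rule, for instance a separator swap for English to German.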

Data augmentation. Training with more high-quality data leads to better translation quality (Barrault et al., 2020). In our testing, we observe a large proportion of errors (e.g., financial characters, units) stemming from mistranslation of specific numerals that are unique or used less frequently (e.g., "角", decimetres) in a language. Such errors could potentially be reduced if more numeral-specific instances were added to training.
Tailoring BPE segmentation. Byte Pair Encoding (BPE) is used by most leading NMT systems. However, long sequences of digits or numbers with separators (e.g., ",", ".") are often split into fragments of varying size by BPE. This would render learning more difficult, as the system has to account for the dependency between the partitions. To circumvent this, one may wish to segment numbers differently, e.g., to encode all numbers as character sequences, or as meaningful groupings of components (e.g., segment into groups of 3 digits when processing English).
Sanity checks. It is helpful to post-check whether all numbers in a translation are correct by comparing them to the inputs. This could be automated in the same way as we measure the Pass Rate (§3), and once again drawing parallels to software testing, could be fully automated via continuous integration of NMT systems.
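Returning to the suggestion under "Tailoring BPE segmentation", the two alternative segmentations (single characters, or groups of 3 digits) can be sketched with one helper. This is only an illustration of the grouping itself, applied before any subword tokeniser; the function name is hypothetical and no particular BPE library is assumed.

```python
from typing import List


def segment_number(number: str, group: int = 3) -> List[str]:
    """Split a digit string into right-aligned groups of `group` digits.
    With group=1 the number is encoded as a character sequence."""
    if not number.isdigit():
        raise ValueError("segment_number expects a string of digits")
    if group < 1:
        raise ValueError("group must be at least 1")
    head = len(number) % group  # length of the shorter leading group, if any
    parts = [number[:head]] if head else []
    parts += [number[i:i + group] for i in range(head, len(number), group)]
    return parts
```

Right-aligning the groups mirrors how thousands are written in English, so "1234567" becomes "1", "234", "567" regardless of what merges a learned BPE vocabulary would otherwise prefer.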

Conclusion
In this paper, we propose an evaluation method to systematically assess four fundamental capabilities of NMT systems in translating numbers using a variety of test cases. Our tests reveal novel types of errors that are general across multiple SOTA translation systems for both high and low resource languages. We hope that our study will help improve numerical translation quality and reduce misinformation caused by numerical mistranslation.

Impact Statement
This work aims to improve the performance of NMT systems. The impact of poor numerical translation may go beyond poor user experience, potentially leading to financial loss and medical misinformation, and even providing a vector for poisoning NMT systems. This paper's behavioural testing could be used by an attacker to uncover flaws in a commercial NMT system. However, as in attack research in the security community, responsible highlighting of such flaws serves the purpose of improving systems: knowledge of systemic flaws in numerical translations helps vendors improve their systems to mitigate these effects in the first place, while concerted attackers are likely to discover vulnerabilities independently.