Automating Behavioral Testing in Machine Translation

Behavioral testing in NLP allows fine-grained evaluation of systems by examining their linguistic capabilities through the analysis of input-output behavior. Unfortunately, existing work on behavioral testing in Machine Translation (MT) is currently restricted to largely handcrafted tests covering a limited range of capabilities and languages. To address this limitation, we propose to use Large Language Models (LLMs) to generate a diverse set of source sentences tailored to test the behavior of MT models in a range of situations. We can then verify whether the MT model exhibits the expected behavior through matching candidate sets that are also generated using LLMs. Our approach aims to make behavioral testing of MT systems practical while requiring only minimal human effort. In our experiments, we apply our proposed evaluation framework to assess multiple available MT systems, revealing that while pass rates generally follow the trends observable from traditional accuracy-based metrics, our method is able to uncover several important differences and potential bugs that go unnoticed when relying only on accuracy.


Introduction
Automatic evaluation metrics such as BLEU (Papineni et al., 2002) and COMET (Rei et al., 2020) are the primary means of measuring the translation quality of MT systems. Researchers and practitioners rely on them for comparing systems, detecting regressions, and making deployment decisions. This poses an important concern: such metrics typically aggregate the performance of systems across a set of sentences into single scores. Unfortunately, these metrics by design tend to overlook specific infrequent but important error cases, making it difficult to reliably detect such issues in practice.
Behavioral testing, originally developed as a type of software testing (Beizer and Wiley, 1996), has been proposed as an approach that can alleviate such kinds of problems in natural language processing (Ribeiro et al., 2020). Behavioral tests focus on assessing a system's fine-grained linguistic capabilities by validating input-output behavior in a controlled fashion.
Table 1 shows examples of typical issues of MT systems that could be covered by behavioral tests. We argue that the availability of a comprehensive behavioral test suite for MT would be of high practical use: it would allow us to understand how exactly two available MT models differ, or to block an MT system from being deployed if a passing threshold for a certain linguistic capability is not met.
However, there are currently two major limitations that arise when attempting to apply behavioral testing to MT.
First of all, behavioral testing was originally designed for evaluating systems characterized by a relatively small output space. For instance, Ribeiro et al. (2020) investigate sentiment classification, duplicate question detection, and machine comprehension. In contrast, the output space of MT systems grows exponentially as tokens are generated. Secondly, behavioral testing often requires rigid templates to create examples and their corresponding labels, which involves a costly human effort to develop and expand to additional use cases; otherwise, the diversity of sentences in the resulting test suite is too limited. Several recent works have partially addressed these limitations. For example, Wang et al. (2021) and Raunak et al. (2022) propose MT-specific test sets which account for the large output space. Yang et al. (2022) address the limitation of rigid templates. To the best of our knowledge, no prior work has addressed both limitations for MT.

Figure 1: Pipeline of the proposed approach. Left (Test Set Creation): for each property type, a test set is created via an LLM, composed of source sentences x with property values x_v (§3). Subsequently, a candidate set C_{x_v} of valid translations of each property value is generated (§4). Right (MT Model Evaluation): during evaluation, the translation ŷ generated by an MT model is compared against the candidate sets, and a pass-fail decision is made (§5).
In this study, we aim to bridge this gap by leveraging LLMs with in-context learning to help automate the design of behavioral tests in MT for the first time. The main contributions of this work are as follows:
• We use LLMs to automate the generation of a diverse set of source sentences for behavioral testing. Sentences are generated to exhibit the specific language property that is being tested.
• We verify whether an MT system's output contains an accurate translation of the language property that is being tested. To this end, we propose using LLMs to generate candidate sets of ground-truth translations of the property values in cases where exhaustive candidate sets are feasible. Otherwise, we generate contrastive candidates and evaluate via semantic similarity measures.
• We present an evaluation framework to robustly compute pass rates of MT models across various language properties, and show results for widely used open-source models on three language pairs.

Behavioral Testing for MT
Behavioral testing, as proposed by Ribeiro et al. (2020), uses input-output pairs tailored to evaluate a model's capability to correctly handle certain language properties. The goal is to complement traditional aggregated accuracy scores, which, while useful by themselves, often fail to capture long-tail phenomena. In practice, manual inspection of system outputs is often crucial to make up for this shortcoming. Automated behavioral testing provides a more reliable and less cumbersome alternative that can reduce or eliminate the need for manual inspection, provided that a sufficient range of language properties is covered. Test results are presented to the researcher or practitioner in the form of a table of pass rates (one pass rate for each tested property) that is informative for deciding on consequent steps, e.g. whether bugs must be addressed before deployment. Note that the creation of a sufficiently comprehensive behavioral test suite depends crucially on whether its creation can be automated to a high degree, which is also our main design goal in this work.
We are particularly interested in a specific type of behavioral test introduced by Ribeiro et al. (2020): minimal functionality tests (MFTs). In the context of MT, an MFT measures how well a model is able to translate particular property values that appear naturally embedded in a given set of source sentences.
Figure 1 illustrates our proposed framework. First, a source sentence x = {x_1, ..., x_|x|} that contains a tagged property value x_v ⊆ x is generated (§3). For instance, if our test property is physical unit translation, we might have x = "I ran 3 miles." and x_v = miles. A main challenge comes from the fact that there is a potentially large space of correct translations. However, note that by design MFTs need only check whether the property under test is translated correctly, while unrelated translation errors should be ignored. In many cases, this reduces the space of correct translations to a manageable size. We therefore propose to automatically generate a candidate set C_{x_v} (either exhaustive or contrastive; see §4) and then apply a pass-fail detector that uses either string matching or semantic similarity measures (§5). In our example, we might generate an exhaustive candidate set C_{x_v} = {Meilen, mi} for the case of translating into German. We now aim to evaluate an MT model f: x → ŷ. To do so, we compare ŷ against C_{x_v}. A correct translation ŷ = "Ich lief 3 Meilen." would match the candidate set and therefore pass the test, while a typical incorrect translation ŷ = "Ich lief 3 km." does not match the candidate set and therefore fails the test.
Given this general overview of our method, we now turn to a more precise description of each proposed step in the following sections.

Source Sentence Generation
To create source sentences for testing a certain language property, we pose several desiderata: sentences should be diverse (e.g. not rely on only a handful of templates), natural, numerous enough to yield reasonable statistical significance, and contain a property value associated with the tested property.
Note that existing approaches often struggle to generate diverse test sets due to their reliance on hand-crafted templates (Wang et al., 2021). To overcome this shortcoming, we design a general template for prompting LLMs, in our case ChatGPT (the gpt-3.5-turbo API, accessed in May 2023), OpenAI's model built on InstructGPT (Ouyang et al., 2022). This allows us to generate diverse source-language sentences that contain property values suitable for testing different capabilities (see the prompt in Figure 2). We instantiate the prompt once for every language property that we wish to include in our test suite.

Table 2 gives examples of the generated En→Es candidate sets:
  kilometers → kilómetros | km
  watts → vatios | W
  meters per second → metros por segundo | m/s
To simplify the later verification step, we generate sentences that contain exactly one such property value x_v. Moreover, we generate source sentences with brackets around the property value so that it can be easily parsed. For example, for the property of translating decimal numbers, test sentences might look like this:

  The company received [4200.4] €.   (1)

Note that brackets are removed before passing the sentence to the MT model. We apply basic filters to remove duplicated sentences, examples with more than one property value, and those composed of more than one sentence. We repeatedly feed the same prompt to the LLM, and stop the generation process when reaching 1,000 sentences after filtering. Our experiments (§9) indicate that ChatGPT is able to generate a large number of source sentences without becoming repetitive.
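The bracket-parsing and filtering steps described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names and the crude single-sentence heuristic are ours.

```python
import re

BRACKET = re.compile(r"\[([^\[\]]+)\]")

def parse_test_sentence(line):
    """Extract the bracketed property value and return (clean_sentence, value),
    or None if the line violates the single-value / single-sentence filters."""
    values = BRACKET.findall(line)
    if len(values) != 1:  # exactly one property value is required
        return None
    # Remove the brackets before the sentence is passed to the MT model.
    clean = BRACKET.sub(values[0], line).strip("- ").strip()
    # Crude filter against multi-sentence outputs (hypothetical heuristic):
    # reject if sentence-final punctuation is followed by more text.
    if re.search(r"[.!?]\s+\w", clean):
        return None
    return clean, values[0]

def filter_generated(lines, seen=None):
    """Deduplicate and filter one batch of LLM-generated lines."""
    seen = set() if seen is None else seen
    out = []
    for line in lines:
        parsed = parse_test_sentence(line)
        if parsed and parsed[0] not in seen:
            seen.add(parsed[0])
            out.append(parsed)
    return out
```

In practice the same prompt would be fed repeatedly, with `seen` carried across batches until 1,000 filtered sentences are collected.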

Candidates Generation
Next, in order to be able to verify whether an MT system correctly translated the property value in the source sentence, we automatically generate valid translation candidate sets for each property value. For some properties, such as number translation, we create exhaustive or near-exhaustive candidate sets. For other properties, where the number of valid translations would be too large to do so, we instead create contrastive candidate pairs that demonstrate desired and undesired behavior. Note that candidate sets only need to be created once and can then be re-used for every tested system.

Near-Exhaustive Candidate Sets
In this approach, we follow Raunak et al. (2022) in creating a set of all valid translations of each property value in the test (see the example in Table 2). However, instead of manually designing candidate sets, we propose using the in-context learning (Brown et al., 2020) and multilingual capabilities of instruction-tuned LLMs (Wei et al., 2022) to accomplish the task. For each property value x_v, we generate a set of translation candidates C_{x_v} with ChatGPT (gpt-3.5-turbo; see the prompt in Figure 3). We designed the demonstrations to encompass both correctness and completeness, including possible inflections. An example of the demonstrations used for the currencies test can be seen in Appendix A. Note that while we aim for completeness, i.e. all valid translations should be included in the candidate set, in practice we found that some rare translation choices may not be included in the automatically generated candidate sets. However, this will not impact pass rates much, because by nature rare translation choices appear in an MT system's output only in rare situations. In §9 we perform a human assessment of the reliability of the generated candidate sets.
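Since the candidate-generation prompt asks the model to separate valid translations with "|" and to answer "NA" when it cannot comply, parsing the reply into a candidate set is straightforward. A minimal sketch (the function name is ours):

```python
def parse_candidate_set(response):
    """Parse the pipe-separated LLM reply into a set of candidate
    translations; the prompt instructs the model to answer "NA"
    when it is unable to accomplish the task."""
    response = response.strip()
    if response.upper() == "NA":
        return set()
    return {c.strip() for c in response.split("|") if c.strip()}
```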

Contrastive Candidate Pairs
Some property values can span multiple words on the source side, potentially increasing the number of valid translations drastically. An example is idiomatic expressions, where there is an increased risk that the candidate set cannot exhaust all possibilities. To mitigate this issue, we propose an alternative approach, which we call contrastive candidate pairs.
String Matching for Near-Exhaustive Candidate Sets

Given a translation ŷ and a near-exhaustive candidate set C_{x_v}, we define the pass-fail function as:

  c(ŷ, C_{x_v}) = pass if ∃ c ∈ C_{x_v} such that c is a substring of ŷ, and fail otherwise.   (2)

Specifically, we consider as a pass an exact case-insensitive substring match. Following Example 1, where x_v = 4200.4, if we are evaluating the En→De decimal number translation capabilities of the model, we consider that the model passes the test if it outputs '4200,4' or '4.200,4'.
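The case-insensitive substring detector is a one-liner; a minimal sketch (the function name is ours):

```python
def string_match_pass(translation, candidates):
    """Pass iff any candidate occurs as a case-insensitive substring
    of the MT output; unrelated translation errors are ignored."""
    t = translation.lower()
    return any(c.lower() in t for c in candidates)
```

For the decimal example, a German output containing '4.200,4' passes, while one that keeps the English-style '4200.4' fails.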

Semantic Similarity for Contrastive Candidate Pairs
For measuring the closeness of the property value translation to the contrastive candidates, we propose relying on the semantic similarity of word-sequence representations extracted by a multilingual encoder (Reimers and Gurevych, 2019, 2020). However, directly measuring the similarity between the translation of the property value and the candidates may be unreliable, since they may differ in length and the location of the translation is unknown due to the lack of word-level alignment. Instead, we propose that, for each candidate c_{x_v}^corr or c_{x_v}^foil, we split the model's translation into n-grams, where n is the number of words of the current candidate. Then, we measure the similarity between each of the n-grams and the candidate. Given a translation ŷ and the contrastive candidate set C_{x_v}^contra formed by the correct and foil candidates, we define the pass-fail function as:

  c(ŷ, C_{x_v}^contra) = pass if max_sim(ŷ, c_{x_v}^corr) > max_sim(ŷ, c_{x_v}^foil), and fail otherwise.   (3)

In Algorithm 1 we show the computation of the max_sim function, depicted with an example in Figure 4.
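The n-gram similarity search can be sketched as follows. This is an illustration, not the paper's implementation: `embed` stands in for the multilingual sentence encoder (e.g., a Sentence-BERT model) and is passed in as a function returning a vector; the function names are ours.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-9)

def max_sim(translation, candidate, embed):
    """Highest similarity between the candidate and any n-gram of the
    translation, with n = candidate length in words (cf. Algorithm 1)."""
    words = translation.split()
    n = len(candidate.split())
    c_emb = embed(candidate)
    grams = [" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))]
    return max(cosine(embed(g), c_emb) for g in grams)

def contrastive_pass(translation, c_corr, c_foil, embed):
    """Pass iff the output is closer to the correct candidate than
    to the foil (Equation 3)."""
    return max_sim(translation, c_corr, embed) > max_sim(translation, c_foil, embed)
```

Sliding a window of exactly n words sidesteps the missing word-level alignment: the candidate is compared against every span of matching length rather than against the whole sentence.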

Evaluation Metrics
Having established pass-fail detection for individual sentences, the final step is to compute aggregated pass rates across test sets. Appealingly, pass rates are naturally expressed as percentages, making them intuitive to interpret.

Macro Pass Rate
Let us assume that we have computed pass-fail results across a behavioral test set consisting of N test cases (sentences). From a statistical viewpoint, we have access to a sample X = {c(ŷ_n, C_{x_v,n})}_{n=1..N} drawn from some unknown distribution over test cases, F. The true pass rate can be estimated as the sample mean:

  PR = (1/N) Σ_{n=1}^{N} 1[c(ŷ_n, C_{x_v,n}) = pass]

One issue that arises in practice is that property values themselves follow a long-tail pattern: certain values appear relatively frequently, while many other values appear only once across the generated test set. This can make pass rates overly sensitive to whether models happen to perform well on these particular frequent values. To mitigate this issue, we assume a generative story in which property values are drawn from a uniform distribution, and consequently compute the expected pass rate as the macro average across property values:

  MPR = (1/|V|) Σ_{v ∈ V} (1/N_v) Σ_{n: x_v,n = v} 1[c(ŷ_n, C_v) = pass]

where V refers to the set of distinct property values, and N_v to the number of examples associated with each specific property value.
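The macro pass rate amounts to averaging per-value pass rates uniformly, so that frequent property values do not dominate the score. A minimal sketch (the function name and input format are ours):

```python
from collections import defaultdict

def macro_pass_rate(results):
    """Macro pass rate (MPR) over an iterable of
    (property_value, passed) pairs: average the per-value pass
    rates uniformly across distinct property values."""
    by_value = defaultdict(list)
    for value, passed in results:
        by_value[value].append(bool(passed))
    rates = [sum(r) / len(r) for r in by_value.values()]
    return sum(rates) / len(rates)
```

For example, three "miles" cases with two passes and one "kg" case with a fail give (2/3 + 0)/2 ≈ 0.33, whereas the micro average would be 0.5.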

Confidence Intervals
Although previous work performing behavioral testing for MT reports point estimates, confidence intervals provide a more reliable approach to statistical analysis, as they quantify the uncertainty associated with the estimate and help ensure the sample size is large enough. To compute confidence intervals for our estimator MPR, we use the bootstrap method (Efron, 1979), which samples with replacement from X, generating K resamples {Y_1, ..., Y_K}, from which we compute the corresponding macro pass rates {MPR_1, ..., MPR_K} to construct the bootstrap distribution MPR_boot. Assuming the distribution of X is a reasonable approximation of the population distribution F, confidence intervals can be derived from MPR_boot. For that purpose, we compute the percentile bootstrap interval for α = 0.05, as provided by the SciPy library (Virtanen et al., 2020).
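The percentile bootstrap can be sketched without any library support; this illustrative version (names are ours; the paper uses SciPy's implementation) resamples the test cases with replacement and reads the interval off the empirical quantiles:

```python
import random

def percentile_bootstrap_ci(sample, stat, k=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval: recompute `stat` on k resamples
    drawn with replacement, then take the alpha/2 and 1 - alpha/2
    quantiles of the resulting bootstrap distribution."""
    rng = random.Random(seed)
    n = len(sample)
    boots = sorted(
        stat([sample[rng.randrange(n)] for _ in range(n)]) for _ in range(k)
    )
    lo = boots[int((alpha / 2) * (k - 1))]
    hi = boots[int((1 - alpha / 2) * (k - 1))]
    return lo, hi
```

Here `stat` would be the macro pass rate; with α = 0.05 this yields a 95% interval.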

Paired Bootstrap
The paired bootstrap is a statistical resampling technique used to assess uncertainty and make inferences about the difference between two samples. It allows us to compare the samples of passes and fails on a given property for two different models (Koehn, 2004). Following the resampling process outlined in the previous section, if one model outperforms the other in at least 95% of the resamples, we can assert with 95% statistical significance that it is superior.
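A paired bootstrap comparison can be sketched as follows: the same test-case indices are drawn for both systems in each resample, so per-sentence difficulty is controlled for. The function name and the 0/1 input encoding are ours, not from the paper.

```python
import random

def paired_bootstrap_win_rate(pass_a, pass_b, k=1000, seed=0):
    """Fraction of paired resamples in which system A attains a strictly
    higher number of passes than system B. pass_a and pass_b are aligned
    lists of 0/1 outcomes over the same test cases."""
    rng = random.Random(seed)
    n = len(pass_a)
    wins = 0
    for _ in range(k):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(pass_a[i] for i in idx) > sum(pass_b[i] for i in idx):
            wins += 1
    return wins / k
```

A win rate of at least 0.95 corresponds to the 95%-significance criterion described above.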

Properties to Test
In this work, we design different tests and use our proposed framework to evaluate MT models on multiple properties. These properties have two important qualities that make them useful for evaluating translation systems: they are vital for producing high-quality translations, yet challenging to assess through conventional evaluation metrics.
Numbers. We conduct independent assessments for integers (e.g. 1887), decimals (e.g. 154.32), and large numbers (e.g. 200 billion). Large numbers have the format "integer/decimal million/billion/trillion". We create near-exhaustive candidate sets of valid number translations and check whether the translation matches any candidate.
Physical Units. We build near-exhaustive candidate sets for evaluating the translations of diverse units, including those related to weight, length, time, or temperature, inter alia (e.g. inches). Translations are evaluated by string matching.
Emojis, Names, and Web Terms. Via string matching, we check whether the translated text retains the same property instantiation found in the source text. Candidate sets for these tests are thus considered to be exhaustive.
Currencies. We consider currencies appearing in the ISO code format (e.g. EUR). Near-exhaustive candidate sets are built allowing translations into the same ISO code or variations of the currency name or its symbol (e.g. for En→Es: EUR/euro/euros/€); string-matching pass-fail detection is then employed.
Idioms. Idiomatic expressions pose significant challenges for MT systems due to their non-literal nature and potentially long spans. We use idioms as a test bench for contrastive candidate pairs (an incorrect literal translation candidate vs. a correct meaning-preserving translation) and the semantic similarity detection procedure.

Model Comparison
In this section, we introduce the tested models and present results obtained via standard metrics as well as our proposed framework.

Experimental Setup
We test widely used open-source MT models, as well as a commercial system. We aimed to select models that perform very strongly, while also differing in some important aspects (e.g. bilingual vs. multilingual).
• No Language Left Behind project model (NLLB) (Team et al., 2022). We experiment with the 600M- and 3.3B-parameter models.
• Many-to-Many (M2M-100) family of multilingual models (Fan et al., 2021). We conduct our experiments with the 418M- and 1.2B-parameter models.

OPUS MT (Bil): The article I read on www.scientificjournal.org was very informative.
Commercial system: ... our town's population was counted as 12,577.

Table 4: Examples flagged as failed translations.

General Translation Accuracy
We first measure general translation performance across language pairs with standard reference-based metrics (Table 3). The commercial system performs best across the board, followed by the WMT21 model. In the following sections, we dive deeper into the different capabilities.

Behavioral Tests Results
As an illustrative example, macro pass rate confidence intervals across property types and models for the En→De direction are presented in Figure 5.The complete results can be found in Appendix D.
The commercial system is most consistent across properties. This is especially true for emoji translations, where open-source models lack most emojis in their vocabulary. However, it is noteworthy that its performance is subpar on En→Es integers and En→Ja large numbers. After manual inspection (see examples in Table 4), we attribute the lower integer translation performance to the fact that it uses the comma as the thousands separator. Note that this behavior can be acceptable depending on the country; behavioral tests must be designed to reflect the intended behavior.
Bilingual models struggle with web terms. Although the multilingual models mostly manage to preserve web terms without alteration, both tested bilingual models (for En→Es and En→De) underperform on this property (see Figure 5 and Figure 6, top). Most failure cases contain Spanish words inside the translated web terms (Table 4). We hypothesize that this occurs because they are trained to translate exclusively into Spanish, which consequently hinders their ability to generate content in other languages, and is therefore an intrinsic limitation of bilingual models.
Scaling models helps increase capabilities. In most settings, scaling a model within the same family increases performance, for instance on the physical units and idioms tests in Figure 5. However, there are some counter-examples, as in the En→Ja decimals and integers tests. On the En→De integers test (Table 5), the paired bootstrap allows a 95% statistically significant conclusion that the WMT21 system is better than the rest of the models. Idioms are known to be particularly difficult for MT systems (Dankers et al., 2022). It is worth noting that results are similar in the three language directions, with the commercial system and NLLB 3.3B showing comparable performance.

Reliability of the Proposed Approach
To assess the reliability of the proposed approach, in this section, we analyze the robustness of the source sentence generation and the pass-fail detection in detail.

Analysis of Source Sentence Generation
One potential concern with the proposed method is whether the generated source sentences are diverse enough and do not become repetitive after a few rounds of generation. A standard method for quantifying the diversity of a corpus is distinct n-grams (Li et al., 2016), which computes the ratio of unique n-grams to the total number of n-grams present. In our case, we are interested in assessing the diversity of each generated source sentence compared to the previous generations. To that end, we propose a metric that enables precise measurement of this aspect. Given the set of unique n-grams generated up to sentence x_t, denoted G^n_{x<t}, we measure the proportion of n-grams in each newly generated sentence, G^n_{x_t}, that are not present in G^n_{x<t}:

  D^n(x_t) = |G^n_{x_t} \ G^n_{x<t}| / |G^n_{x_t}|

Figure 8 shows 3-gram diversity along 1,000 generated sentences after fitting a polynomial regression. We observe that the diversity drop is mild even after 500 sentences, where for most of the tests, 60% of newly generated 3-grams are novel.
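The per-sentence novelty metric can be sketched as follows (the function names are ours):

```python
def ngram_set(sentence, n):
    """Set of word n-grams of a whitespace-tokenized sentence."""
    toks = sentence.split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def novelty_curve(sentences, n=3):
    """For each sentence, the proportion of its n-grams absent from
    all previously generated sentences."""
    seen, curve = set(), []
    for s in sentences:
        grams = ngram_set(s, n)
        if grams:  # skip sentences shorter than n words
            curve.append(len(grams - seen) / len(grams))
            seen |= grams
    return curve
```

A flat curve near 1.0 indicates the generator keeps producing novel material; a repeated sentence scores 0.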
Furthermore, we observe that the sentence generator produces sentences that comply with the instructions, as indicated by the high proportion of the original sentences that pass filtering. In the majority of cases, over 70% of the LLM-generated sentences successfully pass the filtering steps outlined in §3, as seen in Table 6 (middle column). The right column shows the percentage of unique values, which naturally varies strongly depending on the property.

Analysis of Pass-Fail Detection
The reliability of the proposed pass-fail detection depends mainly on whether candidate sets (1) are complete and (2) do not contain wrong candidates.
We analyze this by sampling 100 random test cases that were marked as pass (positives) and another 100 examples marked as fail (negatives). We manually annotate whether test results were correct or incorrect. Figure 7 shows false positives and false negatives (FP initial and FN initial). We observe that while these were low for most properties, for some tests (namely physical units, large numbers, and currencies) there was a significant number of FNs, which would lead to underestimated pass rates. We argue that erring on the side of FNs is generally preferable, because it prevents us from overestimating the strength of models, and because it would trigger a debugging effort that would quickly surface issues stemming from FNs.
To obtain more accurate pass rates for all properties, we can manually remove candidates causing an FP and add missing candidates producing an FN. We do this for the test cases analyzed above, and then draw another random sample from both pass and fail categories. Figure 7 shows that the updated FPs and FNs are now negligible.
While in our experience the human intervention outlined above is only a minor effort, the question remains whether systems can be compared to one another without human intervention, even in the presence of existing FNs. To understand this better, we plot macro pass rates with confidence intervals across annotation iterations in Figure 6. As expected, for physical units, large numbers, and currencies, pass rates move upwards. However, the effect is consistent across models, suggesting that the relative ordering between models can be reasonably approximated in the initial attempt, i.e. without human intervention.
In addition, we assess the pass-fail detection for idioms. Given that the decision is made via semantic similarity over contrastive pairs, addressing issues in the candidate sets is more challenging. Consequently, we conducted a single evaluation iteration for the En→Es and En→De pairs. For En→De, we observed 59 FPs and 16 FNs, while for En→Es, the results showed 50 FPs and 11 FNs. We hypothesize that the high FP counts are due to both the idiom and its figurative meaning being present within the source sentence, interfering with the n-gram comparison. We intend to explore this avenue in future research.

Related Work
Recent works have applied behavioral testing to evaluate machine translation systems. Wang et al. (2021) designed tests for numerical translation capabilities by relying on fixed templates for source sentence generation. Raunak et al. (2022) proposed SALTED, a set of manually designed error detectors that are applied to millions of sentences from standard datasets. Although useful, these evaluation tools tend to require major human effort to create tests and expand to other languages. While there have been attempts to automate the creation of behavioral tests (Yang et al., 2022), these have been limited to simple NLP tasks.
Our work also relates to the use of LLMs as evaluators for Machine Translation systems (Kocmi and Federmann, 2023), as well as for text generation in a broader sense (Liu et al., 2023; Xu et al., 2023), which extends the growing body of research on multi-dimensional text generation evaluation (Zhong et al., 2022; Yuan et al., 2021).
Behavioral testing aims to evaluate the behavior of systems under realistic conditions, distinguishing it from the literature on adversarial data generation (Belinkov and Bisk, 2018; Zhang et al., 2021).

Conclusions
In this work, we have presented a method that automates the creation of behavioral tests for fine-grained evaluation of MT system capabilities. We use Large Language Models to generate source sentences containing instances of specific language properties (integers, web terms, etc.), as well as translations of those property values. For property types formed by multiple words, we further extend the proposed method to a contrastive setting and show its usefulness in evaluating idiomatic expressions. To the best of our knowledge, our research represents the first attempt to develop MT behavioral tests by leveraging LLMs. Finally, we apply the proposed framework to evaluate open-source models on three language pairs.

Limitations
While the proposed evaluation framework seeks to address a broad spectrum of languages, the experiments conducted in this study are limited to three language pairs. Because the method relies on the capacity of LLMs to produce high-quality candidate translations, we cannot guarantee accurate results when it is applied to language pairs involving a low-resource language using current LLMs. Moreover, the method is designed to work only on properties that appear as a contiguous chunk of text in both source and target languages, rather than scattered across a sentence.

Figure 2: General template of the prompt used for generating batches of source sentences:

  You are an assistant that generates sentences where only appears one B = {property}. Don't be repetitive, change the topic and B between sentences. Write every B inside []. B must happen only once in each sentence and can only contain {property}.
  - {Source sentence demonstration #1}
  - {Source sentence demonstration #2}
  - {Source sentence demonstration #3}
  Now write 10 more diverse sentences itemizing them with '-':

Figure 3: General template of the prompt used for generating near-exhaustive sets of candidate translations.

Figure 4: Example computation of the max_sim function over translation n-grams.

Figure 6: From top to bottom, En→Es confidence intervals after each annotation iteration.

Table 1: Subset of linguistic properties tested with our proposed method, and examples (source → translation) of translation errors found in En→De MT models.

Table 2: Examples of En→Es candidate sets generated by ChatGPT.
Algorithm 1 (excerpt): for each n-gram g of the translation: if sim(g_emb, c_emb) > max_sim then max_sim ← sim(g_emb, c_emb).

Equipped with these candidate sets, we now wish to mark every MT-translated sentence as either pass or fail. Depending on whether near-exhaustive candidate sets or contrastive candidate pairs are used, we design different pass-fail detectors.

Figure 3 prompt: You are a {source_lang}-{target_lang} translator. Given a {property}, write as many valid {target_lang} translations as you can. Use "|" to separate between valid translations. Write "NA" if unable to accomplish the task.

Table 3: Translation scores of the different models on the FLORES-200 devtest set.
Figure 5: En→De macro pass rates and confidence intervals across tested systems.

Table 5: Paired bootstrap En→De integers test results. The WMT21 model is the strongest open-source model and consistently exhibits superior results on this test.

Table 6: Percentage of source sentences that pass filtering, and percentage of filtered sentences that introduce a new property value.