Tolerant BLEU: a Submission to the WMT14 Metrics Task

This paper describes a machine translation metric submitted to the WMT14 Metrics Task. It is a simple modification of the standard BLEU metric using a monolingual alignment of reference and test sentences. The alignment is computed as a minimum weighted maximum bipartite matching of the translated and the reference sentence words with respect to the relative edit distance of the word prefixes and suffixes. The aligned words are included in the n-gram precision computation with a penalty proportional to the matching distance. The proposed tBLEU metric is designed to be more tolerant to errors in inflection, which usually do not affect the understandability of a sentence, and therefore to be more suitable for measuring the quality of translation into morphologically richer languages.


Introduction
Automatic evaluation of machine translation (MT) quality is an important part of the machine translation pipeline. The possibility to run an evaluation algorithm many times while training a system enables the system to be optimized with respect to such a metric (e.g., by Minimum Error Rate Training (Och, 2003)). By achieving a high correlation of the metric with human judgment, we expect the system performance to be optimized also with respect to the human perception of translation quality.
In this paper, we propose an MT metric called tBLEU (tolerant BLEU) that is based on the standard BLEU (Papineni et al., 2002) and designed to be better suited for translation into morphologically richer languages. We aim to have a simple, language-independent metric that correlates with human judgment better than the standard BLEU.
Several metrics try to address this problem as well and usually succeed in gaining a higher correlation with human judgment (e.g. METEOR (Denkowski and Lavie, 2011), TerrorCat (Fishel et al., 2012)). However, they usually rely on language-dependent tools and resources (METEOR uses a stemmer and paraphrasing tables, TerrorCat uses lemmatization and needs training data for each language pair), which prevents them from being widely adopted.
In the next section, the previous work is briefly summarized. Section 3 describes the metric in detail. The experiments with the metric are described in Section 4 and their results are summarized in Section 5.

Previous Work
BLEU (Papineni et al., 2002) is an established and the most widely used automatic metric for evaluation of MT quality. It is computed as a geometric mean of the n-gram precisions multiplied by the brevity penalty coefficient, which also ensures high recall. Formally:

$$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right),$$

where BP is the brevity penalty defined as follows:

$$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r, \\ e^{1-r/c} & \text{if } c \leq r, \end{cases}$$

c is the length of the test sentence (number of tokens), r is the length of the reference sentence, and p_n is the proportion of n-grams from the test sentence found in the reference translations.
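For concreteness, this computation can be sketched in a few lines of Python (a minimal sentence-level illustration with uniform weights w_n = 1/N and clipped n-gram counts; real BLEU implementations aggregate counts over a whole corpus and support multiple references):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(test, ref, max_n=4):
    """Sentence-level BLEU with uniform weights (illustration only)."""
    log_p_sum = 0.0
    for n in range(1, max_n + 1):
        test_counts = Counter(ngrams(test, n))
        ref_counts = Counter(ngrams(ref, n))
        # clipped n-gram matches, as in the original definition
        matches = sum(min(c, ref_counts[g]) for g, c in test_counts.items())
        total = sum(test_counts.values())
        if total == 0 or matches == 0:
            return 0.0
        log_p_sum += math.log(matches / total) / max_n
    c, r = len(test), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_p_sum)
```

A perfect hypothesis scores 1.0; any mismatched word lowers every n-gram precision it participates in.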
The original experiments with Chinese-to-English translation (Papineni et al., 2002) reported a very high correlation of BLEU with human judgments. However, these scores were computed using multiple reference translations (to capture translation variability); in practice, only one reference translation is usually available and the BLEU scores are therefore often underestimated.

Figure 1: An example of the unigram and bigram precision computation for translation from English to Czech with the test sentence having minor inflection errors and an additional preposition. The first two lines contain the source sentence in English and a correct reference translation in Czech. On the third line, there is an incorrectly translated sentence with errors in inflection. Between the second and the third line, the matching with respect to the affix distance is shown. The fourth line contains the corrected test sentence with the word weights. The bottom part of the figure shows the computation of the unigram and bigram precisions. The first column contains the original translation n-grams, the second one the corrected n-grams, the third one the n-gram weights, and the last one indicates whether a matching n-gram is contained in the reference sentence.
The main disadvantage of BLEU is the fact that it treats words as atomic units and does not allow any partial matches. Therefore, words which are inflectional variants of each other are treated as completely different words although their meaning is similar (e.g. work, works, worked, working). Further, the n-gram precision for n > 1 penalizes differences in word order between the reference and the test sentences even though in languages with free word order both sentences can be correct (Bojar et al., 2010; Condon et al., 2009).
There are also other widely recognized MT evaluation metrics: The NIST score (Doddington, 2002) is also an n-gram based metric, but in addition it reflects how informative particular n-grams are. A metric that achieves a very high correlation with human judgment is METEOR (Denkowski and Lavie, 2011). It creates a monolingual alignment using language-dependent tools such as stemmers and synonym dictionaries and computes a weighted harmonic mean of precision and recall based on the matching. Some metrics are based on measuring the edit distance between the reference and test sentences. The Position-Independent Error Rate (PER) (Leusch et al., 2003) is computed as a length-normalized edit distance of sentences treated as bags of words. The Translation Edit Rate (TER) (Snover et al., 2006) is the number of edit operations needed to change the test sentence into the most similar reference sentence. In this case, the allowed editing operations are insertions, deletions, and substitutions, as well as shifts of words within a sentence.
A different approach is used in TerrorCat (Fishel et al., 2012). It uses frequencies of automatically obtained translation error categories as base for machine-learned pairwise comparison of translation hypotheses.
In the Workshop on Statistical Machine Translation (WMT) Metrics Task, several new MT metrics compete annually (Macháček and Bojar, 2013). In the competition, METEOR and TerrorCat scored better than the other mentioned metrics.
Metric Description
tBLEU is computed in two steps. Similarly to the METEOR score, we first make a monolingual alignment between the reference and the test sentences and then apply an algorithm similar to the standard BLEU but with modified n-gram precisions.
The monolingual alignment is computed as a minimum weighted maximum bipartite matching between words in a reference sentence and a translation sentence using the Munkres assignment algorithm (Munkres, 1957).
We define the weight of an alignment link as the affix distance of the test sentence word w^t_i and the reference sentence word w^r_j. Let S be the longest common substring of w^t_i and w^r_j. We can rewrite the strings as a concatenation of a prefix, the common substring, and a suffix:

$$w^t_i = p_t S s_t, \qquad w^r_j = p_r S s_r.$$

Further, we define the affix distance as:

$$AD(w^r_j, w^t_i) = \frac{L(p_r, p_t) + L(s_r, s_t)}{|S|}$$

if |S| > 0, and AD(w^r_j, w^t_i) = 1 otherwise, where L is the Levenshtein distance between two strings.
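The affix distance can be sketched in Python, using the standard library's difflib to find the longest common substring (the function names are ours, not from the paper):

```python
from difflib import SequenceMatcher

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def affix_distance(w_ref, w_test):
    """Relative edit distance of the affixes around the longest common substring."""
    m = SequenceMatcher(None, w_ref, w_test).find_longest_match(
        0, len(w_ref), 0, len(w_test))
    if m.size == 0:
        return 1.0
    pre_r, suf_r = w_ref[:m.a], w_ref[m.a + m.size:]
    pre_t, suf_t = w_test[:m.b], w_test[m.b + m.size:]
    return (levenshtein(pre_r, pre_t) + levenshtein(suf_r, suf_t)) / m.size
```

Identical words get distance 0, and words sharing no substring get distance 1; the worked example below yields 3/7.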
For example, the affix distance of the two Czech words vzpomenou and zapomenout (different forms of the verbs remember and forget) is computed in the following way: The longest common substring is pomenou, which has a length of 7. The prefixes are vz and za, with an edit distance of 2. The suffixes are the empty string and t, with an edit distance of 1. The total edit distance of the prefixes and suffixes is thus 3. By dividing the total edit distance by the length of the longest common substring, we get the affix distance 3/7 ≈ 0.43.

We denote the resulting set of matching pairs of words as M = {(w^r_i, w^t_i)}_{i=1}^{m}, and for each test sentence S^t = (w^t_1, ..., w^t_m) we create a corrected sentence Ŝ^t = (ŵ^t_1, ..., ŵ^t_m) such that

$$\hat{w}^t_i = \begin{cases} w^r_i & \text{if } (w^r_i, w^t_i) \in M \text{ and } AD(w^r_i, w^t_i) < \varepsilon, \\ w^t_i & \text{otherwise.} \end{cases}$$

This means that the words from the test sentence which were matched with an affix distance smaller than the threshold ε are "corrected" by substituting them with the matching words from the reference sentence. The threshold ε is a free parameter of the metric. When the threshold is set to zero, no corrections are made and the metric is therefore equivalent to the standard BLEU.
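The matching and correction steps can be sketched as follows. For the illustration, the minimum-weight matching is found by brute force over permutations (a standard-library stand-in for the Munkres algorithm, feasible only for short sentences), the dist argument stands for the affix distance defined above, and we assume a corrected word is weighted by one minus its affix distance, following the proportional penalty of the weights described next:

```python
from itertools import permutations

def min_matching(test, ref, dist):
    """Minimum-weight matching of test words to distinct reference words.
    Brute force over permutations; assumes len(test) <= len(ref)."""
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(len(ref)), len(test)):
        cost = sum(dist(ref[j], test[i]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm))

def correct_sentence(test, ref, dist, threshold):
    """Substitute matched test words whose distance is below the threshold;
    a corrected word is weighted 1 - distance, others keep weight 1."""
    corrected, weights = list(test), [1.0] * len(test)
    for i, j in min_matching(test, ref, dist):
        d = dist(ref[j], test[i])
        if d < threshold:
            corrected[i] = ref[j]
            weights[i] = 1.0 - d
    return corrected, weights
```

In a real implementation, the Munkres (Hungarian) algorithm, e.g. scipy.optimize.linear_sum_assignment, would replace the brute-force search.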
The words in the corrected sentence are assigned weights as follows:

$$\lambda_i = \begin{cases} 1 - AD(w^r_i, w^t_i) & \text{if } \hat{w}^t_i \neq w^t_i, \\ 1 & \text{otherwise.} \end{cases}$$

In other words, the weights penalize the corrected words proportionally to the affix distance from the original words. While computing the n-gram precision, two matching n-grams (ŵ^t_1, ..., ŵ^t_n) and (w^r_1, ..., w^r_n) contribute to the n-gram precision with a score of

$$\prod_{i=1}^{n} \lambda_i$$

instead of one as in the standard BLEU. The rest of the BLEU score computation remains unchanged. When multiple reference translations are used, the matching is done for each of the reference sentences, and while computing the n-gram precision, the reference sentence with the highest weight is chosen. The computation of the n-gram precision is illustrated in Figure 1.
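A sketch of the modified n-gram precision (we assume here that a matching n-gram contributes the product of its word weights, in line with the single n-gram weight column of Figure 1, and that repeated n-grams are clipped as in standard BLEU):

```python
from collections import Counter

def weighted_ngram_precision(corrected, weights, ref, n):
    """Each test n-gram found in the reference contributes the product of
    its word weights instead of 1; uncorrected words have weight 1."""
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    total = len(corrected) - n + 1
    if total <= 0:
        return 0.0
    score = 0.0
    for i in range(total):
        gram = tuple(corrected[i:i + n])
        if ref_counts[gram] > 0:
            ref_counts[gram] -= 1  # clip repeated matches
            prod = 1.0
            for w in weights[i:i + n]:
                prod *= w
            score += prod
    return score / total
```

With all weights equal to 1 this reduces to the standard BLEU n-gram precision.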

Evaluation
We evaluated the proposed metric on the dataset used for the WMT13 Metrics Task (Macháček and Bojar, 2013). The dataset consists of 135 systems' outputs in 10 directions (5 into English, 5 out of English). Each system's output and the reference translation contain 3000 sentences. Following the WMT14 guidelines, we report the Pearson correlation coefficient instead of the Spearman coefficient that was used in previous years. Twenty values of the affix distance threshold were tested in order to estimate the most suitable threshold setting. We report only the system-level correlation because the metric is designed to compare only whole system outputs.
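For reference, the system-level Pearson correlation between metric scores and human judgments is the standard sample correlation coefficient, which needs only a few lines of stdlib Python:

```python
def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)
```

Here x would hold one metric score per system and y the corresponding human judgment scores.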

Results
The tBLEU metric generally improves the correlation with human judgment over the standard BLEU metric for directions from English to languages with richer inflection.
Examining the various threshold values showed that the dependence between the affix distance threshold and the correlation with human judgment varies across language pairs (Figure 2). For translation from English into languages morphologically richer than English (Czech, German, Spanish, and French), using the tBLEU metric increased the correlation over the standard BLEU. For Czech, the correlation quickly decreases for threshold values larger than 0.1, whereas for the other languages it still grows. We hypothesize that this is because large morphological changes in Czech can entirely change the meaning.
For translation into English, the correlation slightly increases with the threshold value for French and Spanish, but decreases for Czech and German.
There are different optimal affix distance thresholds for different language pairs. However, the threshold of 0.05 was used for our WMT14 submission because it had the best average correlation on the WMT13 dataset. Tables 1 and 2 show the results of tBLEU for the particular language pairs with the threshold 0.05. Compared to the BLEU score, the correlation is slightly higher for translation from English and approximately the same for translation into English. The results on the WMT14 dataset did not show any improvement over the BLEU metric. The reasons for these results will be further examined.

Conclusion and Future Work
We presented tBLEU, a language-independent MT metric based on the standard BLEU metric. It introduced the affix distance: the relative edit distance of the prefixes and suffixes of two strings after removing their longest common substring. Finding a matching between the translation and reference sentences with respect to this distance allows a penalized substitution of words which have most likely been wrongly inflected, and therefore penalizes errors in inflection less.
This metric achieves a higher correlation with human judgment than the standard BLEU score for translation into morphologically richer languages without the necessity to employ any language-specific tools.
In future work, we would like to improve word alignment between test and reference translations by introducing word position and potentially other features, and implement tBLEU in MERT to examine its impact on system tuning.