Extrinsic Evaluation of Machine Translation Metrics

Automatic machine translation (MT) metrics are widely used to distinguish the quality of machine translation systems across relatively large test sets (system-level evaluation). However, it is unclear if automatic metrics are reliable at distinguishing good translations from bad translations at the sentence level (segment-level evaluation). In this paper, we investigate how useful MT metrics are at detecting segment-level quality by correlating metrics with how useful the translations are for a downstream task. We evaluate the segment-level performance of the most widely used MT metrics (chrF, COMET, BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state tracking, question answering, and semantic parsing). For each task, we only have access to a monolingual task-specific model and a translation model. We calculate the correlation between the metric's ability to predict a good/bad translation and the success/failure of the final task for the machine-translated test sentences. Our experiments demonstrate that all metrics exhibit negligible correlation with the extrinsic evaluation of the downstream outcomes. We also find that the scores provided by neural metrics are not interpretable, in large part due to having undefined ranges. We synthesise our analysis into recommendations for future MT metrics to produce labels rather than scores for more informative interaction between machine translation and multilingual language understanding.


Introduction
Although machine translation (MT) has typically been seen as a standalone application, in recent years it has more frequently been deployed as a component of a complex NLP platform delivering multilingual capabilities such as cross-lingual information retrieval or automated multilingual customer support (Gerz et al., 2021). When an erroneous translation is generated by the MT system, it may introduce new errors into the pipeline of the complex task, leading to a poor user experience. For example, consider the user's request in Chinese "有牙加菜？" ("Is there any good Jamaican food in Cambridge?"), which is machine translated into English as "Does Cambridge have a good meal in Jamaica?". An intent detection model will erroneously consider "Jamaica" as a location, instead of a cuisine, and prompt the search engine to look up restaurants in Jamaica. To avoid this cascade of errors, it is crucial to detect an incorrect translation before it propagates through the task pipeline.
One way to approach this problem is to use segment-level scores provided by MT metrics. There has been considerable progress in the development of automatic MT metrics, with some metrics demonstrating > 0.9 correlation with human judgements at the system level for some language pairs (Ma et al., 2019). Metrics seem to be able to pick up subtle differences between MT systems that emerge over a relatively large test corpus. However, there has been little success at correlating human judgements with metrics at the segment level (Ma et al., 2019; Freitag et al., 2021b). These segment-level judgements are crucial for integrating MT into multilingual platforms. Segment-level evaluation of MT is indeed more difficult, as it is a task on which even humans find it hard to agree. However, despite MT systems being a crucial intermediate step in several applications, characterising the behaviour of these metrics under task-oriented evaluation remains an open problem.
In this work, we provide a complementary evaluation of MT metrics. We focus on the segment-level performance of metrics, and we evaluate their performance extrinsically, by correlating them with the accuracy of downstream tasks, which we can calculate reliably with accuracy metrics. We consider the Translate-Test setting, where at test time the input from the target language is translated to the source language for further processing. We assume access to a parallel task-oriented dataset, a task-specific monolingual model, and a translation model that can translate from the target language into the language of the monolingual model. At test time, the target language input is translated into the source language and then executed on a task-specific model. We use the outcome of this extrinsic task to construct a binary classification benchmark for the metrics. Within this new dataset, we exclude all examples that fail on the extrinsic task in the source language. The metrics are then evaluated on this new classification task.
We use dialogue state tracking, semantic parsing, and extractive question answering as our extrinsic tasks. We evaluate nine metrics consisting of string overlap metrics, embedding-based metrics, and metrics trained using scores obtained through human evaluation of MT outputs. Surprisingly, we find this setup is hard for the existing metrics, which demonstrate poor performance in discerning good and bad translations across tasks. We present a comprehensive analysis of the failures of the metrics through quantitative and qualitative evaluation. We also investigate specific properties of the metrics, such as generalisation to new languages and alternatives to references for reference-based metrics in an online setting, and conduct task-specific ablation studies.
Our findings are summarised as follows:
• We derive a new classification task measuring how indicative segment-level scores are for downstream performance of an extrinsic cross-lingual task (Section 3).
• We identify that segment-level scores from nine metrics have minimal correlation with extrinsic task performance (Section 4.1). Our results indicate that these scores are uninformative at the segment level (Section 4.3), clearly demonstrating a serious deficiency in the best contemporary MT metrics. In addition, we find variable task sensitivity to different MT errors (Section 4.2).
• We propose recommendations on developing MT metrics to produce useful segment-level output by predicting labels instead of scores, and suggest reusing existing post-editing datasets and explicit error annotations (Section 5).

Related Work
Evaluation of machine translation has been of great research interest across different research communities (Nakazawa et al., 2021; Fomicheva et al., 2021). Notably, the Conference on Machine Translation (WMT) has been organising annual shared tasks on automatic MT evaluation since 2006 (Koehn and Monz, 2006; Freitag et al., 2021b), which invite metric developers to evaluate their methods on the outputs of several MT systems. A common protocol in metric evaluation is to compare the output scores with human judgements collected for the output translations. Designing protocols for human evaluation of machine-translated outputs and for meta-evaluation is challenging (Mathur et al., 2020a), leading to the development of several different methodologies and analyses over the years. Human evaluation of MT systems has been carried out based on guidelines for fluency, adequacy and/or comprehensibility (White et al., 1994), evaluating every generated translation, often on a fixed scale of 1 to 5 (Koehn and Monz, 2006) or 1 to 100 (Graham et al., 2013) (direct assessments). For some years, the ranking of MT systems was based on a binary comparison of outputs from two different MT systems (Vilar et al., 2007). More recently, expert-based evaluation has been carried out based on Multidimensional Quality Metrics (Lommel et al., 2014, MQM), where translation outputs are scored on the severity of errors using a fine-grained error ontology (Freitag et al., 2021a,b). Over the years, different methods to compute the correlation between the scores produced by the metrics and this human evaluation have been suggested, each addressing drawbacks of the previous ones (Callison-Burch et al., 2006; Bojar et al., 2014, 2017). Most metrics claim their effectiveness by comparing their performance with competitive metrics using the most recent method for computing system-level correlation with human judgements. The meta-evaluation progress is generally documented in a metrics shared task overview (Callison-Burch et al., 2007). For example, Stanojević et al. (2015) highlighted the effectiveness of neural embedding based metrics; Ma et al. (2019) show that metrics struggle with segment-level performance despite achieving impressive system-level correlation; Mathur et al. (2020b) investigate how different metrics behave under different domains. In addition to the overview papers, Mathur et al. (2020a) show that meta-evaluation regimes were sensitive to outliers and that small changes in evaluation metrics were not sufficient to claim the effectiveness of any metric. Kocmi et al. (2021) conduct a comprehensive evaluation of metrics to identify which metric is best suited for pairwise ranking of MT systems. Guillou and Hardmeier (2018) look at a specific phenomenon: whether metrics are capable of evaluating translations involving pronominal anaphora.
All the above works draw their conclusions based on some comparison with human judgement. Our work focuses on the usability of the metrics, which is judged solely on their ability to work with downstream tasks where MT is used as an intermediate step. The emphasis of our meta-evaluation is also on segment-level performance. Task-based MT evaluation has been well studied in the literature (Jones and Galliers, 1996; Laoudi et al., 2006, inter alia). However, prior works focus on evaluating individual MT systems rather than investigating MT metrics. Closer to our work are Scarton et al. (2019) and Zouhar et al. (2021), who propose MT evaluation as ranking translations based on the time needed to post-edit them. Our work is the first to address the evaluation of MT metrics through extrinsic tasks that use MT as an intermediate step to facilitate cross-lingual language understanding.

Methodology
Our aim is to determine how reliable MT metrics are for predicting success on downstream tasks. In essence, we first translate a test sentence and then run a model such as dialogue state tracking on the translation. If the dialogue state is wrong (but was right on the reference), we know that the translation contained a material error. We then see if the metric could predict whether the translation was good or bad. We now describe our evaluation setup and the metrics that we evaluate. We refer to Figure 1 for an illustration.

Setup
For all the tasks described below, we first train a model for the respective task in the monolingual setup. We evaluate the source language on each task and capture the monolingual predictions of the model. We consider only the Translate-Test paradigm: we translate the test data from each target language into the source language on which the monolingual model has been trained. The target translations are given as input to the task-specific model. We use either (i) OPUS translation models (Tiedemann and Thottingal, 2020), (ii) M2M100 translation models (Fan et al., 2021), or (iii) translations provided by the authors of the respective datasets. Note that the data across the target languages are parallel. We obtain the predictions for the translated data to construct a binary classification benchmark for the metrics.
We consider only the examples in the target language that were predicted correctly in the monolingual source language, to avoid errors that arise from task complexity. Therefore, all the incorrect predictions for the target language in the end task arise from erroneous translations. This isolates the extrinsic task failure as the fault of the MT system alone. We use these predictions to build a binary classification benchmark: all the examples from the target language that are correctly predicted in the extrinsic task receive a positive label, while the incorrect predictions receive a negative label.
We consider the input from the target language as "source", the corresponding machine translation as "hypothesis" and the input from the source language as "reference". These triples are then scored by the respective metrics. After obtaining the segment-level scores for these triples, we define a threshold for the scores, thus turning the metrics into classifiers. The metrics are then evaluated on how well their predictions for a good/bad translation correlate with the success/failure of the end task for the target language.
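The sketch below illustrates how such a benchmark could be assembled. The object interfaces (task_model.predict, mt_model.translate) and field names are hypothetical placeholders and do not correspond to our actual implementation.

```python
def build_benchmark(examples, task_model, mt_model):
    """Construct the binary classification benchmark from extrinsic task outcomes.

    A sketch under assumed interfaces: `examples` is a list of parallel test items,
    `task_model` is the monolingual task-specific model, and `mt_model` translates
    from the target language into the task language.
    """
    benchmark = []
    for ex in examples:
        # Skip items the task model already gets wrong on the source-language input,
        # so that remaining failures can be attributed to the MT system alone.
        if task_model.predict(ex["source_lang_input"]) != ex["gold_output"]:
            continue
        hypothesis = mt_model.translate(ex["target_lang_input"])
        correct = task_model.predict(hypothesis) == ex["gold_output"]
        benchmark.append({
            "src": ex["target_lang_input"],    # input in the target language
            "mt": hypothesis,                  # its machine translation
            "ref": ex["source_lang_input"],    # task-language input used as reference
            "label": 1 if correct else 0,      # positive = downstream task succeeded
        })
    return benchmark
```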

Tasks
We evaluate the metrics on the following tasks.

Dialogue State Tracking
In the dialogue state tracking task, a model needs to map the user's goals and intents in a given conversation to a set of slots and values, known as a "dialogue state", based on a pre-defined ontology. We check whether the predicted dialogue state is exactly equal to the ground truth to provide labels for the binary classification task. The metric scores are produced for the current utterance spoken by the user.

Figure 1: Meta-evaluation pipeline. The predictions for the extrinsic task in the target language are obtained using the Translate-Test setup: the target language input is translated into the language of the extrinsic task before being passed to the task-specific model. Then, the input sentence and the corresponding translation are evaluated with a metric of interest. The metric is evaluated based on the correlation of its scores with the predictions of the end task.

Semantic Parsing
Semantic parsing transforms natural language utterances into logical forms to express utterance semantics in some machine-readable language. The original ATIS study (Hemphill et al., 1990) collected questions about flights in the USA with the corresponding SQL to answer the respective questions from a relational database. We use the MultiATIS++SQL dataset from Sherborne and Lapata (2022), comprising gold parallel utterances in English, French, Portuguese, Spanish, German and Chinese (from Xu et al. (2020)) paired with executable SQL output logical forms (from Iyer et al. (2017)). The model also follows Sherborne and Lapata (2022): an encoder-decoder Transformer based on mBART-50 (Tang et al., 2021). The parser generates valid SQL queries and performance is measured as exact-match "denotation accuracy": the proportion of output queries returning identical database results relative to the gold SQL queries.

Extractive Question Answering
The task of extractive question answering is predicting a span of words from a paragraph that answers a given question. We use the XQuAD dataset (Artetxe et al., 2020) for evaluating extractive question answering. The XQuAD dataset was obtained by professionally translating examples from the development set of the English SQuAD dataset (Rajpurkar et al., 2016) into ten languages: Spanish, German, Greek, Russian, Turkish, Arabic, Vietnamese, Thai, Chinese, and Hindi. We use a publicly available question answering model that fine-tunes RoBERTa (Liu et al., 2019) on the SQuAD training set. We use the "Exact Match" criterion, i.e., whether the model's predicted answer span exactly matches the gold-standard answer span, for the binary classification task. The metric scores are produced for both the question and the context. A translation is considered to be faulty if either of the scores falls below the threshold.
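A minimal sketch of this decision rule is shown below; the function and variable names are illustrative.

```python
def qa_translation_ok(question_score: float, context_score: float, threshold: float) -> bool:
    """Flag a QA example as having a faulty translation if either the translated
    question or the translated context scores below the chosen threshold."""
    return min(question_score, context_score) >= threshold
```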

Metrics
We describe the metrics based on their design principles: surface-level overlap, embedding similarity, and neural metrics trained using WMT data. We selected the following metrics as they are the most studied, frequently used, and display a good mix of design principles.

Surface level overlap
BLEU (Papineni et al., 2002) is a string-matching metric that compares the token-level n-grams of the hypothesis with the reference translation. BLEU is computed as a precision score weighted by a brevity penalty. chrF (Popović, 2017) operates on a character level by computing a character n-gram F-score based on the overlap between the hypothesis and the reference.
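For reference, segment-level BLEU and chrF scores can be computed with the sacrebleu library as sketched below; this is a minimal example rather than the exact configuration used in our experiments.

```python
from sacrebleu.metrics import BLEU, CHRF

hyp = "Does Cambridge have a good meal in Jamaica?"
ref = "Is there any good Jamaican food in Cambridge?"

# effective_order=True is recommended for sentence-level BLEU so that missing
# higher-order n-gram matches do not zero out the score.
bleu = BLEU(effective_order=True)
chrf = CHRF()

print(bleu.sentence_score(hyp, [ref]).score)  # 0-100
print(chrf.sentence_score(hyp, [ref]).score)  # 0-100
```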

Embedding based
BERTScore (Zhang et al., 2020) uses contextual embeddings from pre-trained language models to compute the similarity between the tokens in the reference and the generated translation using cosine similarity. The similarity matrix is used to compute precision, recall, and F1 scores.
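A minimal example with the bert_score package is given below; the default English checkpoint is assumed here, and the exact configuration used in our experiments is listed in Appendix A.

```python
from bert_score import score

hyps = ["Does Cambridge have a good meal in Jamaica?"]
refs = ["Is there any good Jamaican food in Cambridge?"]

# Returns precision, recall and F1 tensors with one entry per hypothesis-reference pair.
P, R, F1 = score(hyps, refs, lang="en")
print(F1[0].item())
```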

Trained on WMT Data
WMT organises an annual shared task on developing MT models for several categories of machine translation (Akhbardeh et al., 2021). Human evaluation of the translated outputs from the participating machine translation models is often used to determine the best-performing MT system. In recent years, this human evaluation has followed two protocols: (i) Direct Assessment (DA) (Graham et al., 2013), where the given translation is rated between 0 and 100 according to the perceived translation quality, and (ii) expert-based evaluation, where the translations are evaluated by professional translators with explicit error listing based on the Multidimensional Quality Metrics (MQM) ontology. The MQM ontology consists of a hierarchy of errors, and translations are penalised based on the severity of errors in this hierarchy. These human evaluations are then used as training data for building new MT metrics. These metrics fall into two subcategories, reference-based and reference-free, depending on whether the metric uses the ground truth during evaluation of the given input and its translation. COMET metrics: Cross-lingual Optimized Metric for Evaluation of Translation (COMET) (Rei et al., 2020) uses a cross-lingual encoder (XLM-R (Conneau et al., 2020)) and pooling operations to predict the score of a given translation.
Representations for the source, hypothesis, and reference are obtained using the encoder, which are then combined and passed through a feed-forward layer to predict a score. These metrics use a combination of WMT evaluation data across the years to produce different variants. In all the variants, the MQM scores and DA scores are converted to z-scores to reduce the effect of outlier annotations. COMET-DA uses direct assessments from 2017 to 2019 as its training data. COMET-MQM uses direct assessments from 2017 to 2021 as its training data; this metric is then fine-tuned with MQM data from Freitag et al.
UniTE (Unified Translation Evaluation; Wan et al., 2022) is another neural translation metric that proposes a multi-task setup covering three evaluation strategies, source-hypothesis, source-hypothesis-reference, and reference-hypothesis, in a single model. The pre-training stage involves training the model with synthetic data constructed from a subset of the WMT evaluation data. Fine-tuning uses novel attention mechanisms and aggregate loss functions to facilitate the multi-task setup.
All the above reference-based metrics have their corresponding reference-free versions which use the same training regimes but exclude encoding the reference. We refer to them as COMET-QE-DA, COMET-QE-MQM, and UniTE-QE respectively. COMET-QE-DA in this work uses DA scores from 2017 to 2020.
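For illustration, segment-level COMET scores can be obtained with the unbabel-comet package as sketched below. The checkpoint name is illustrative; the exact metric versions used in our experiments are listed in Appendix A.

```python
from comet import download_model, load_from_checkpoint

# Illustrative reference-based checkpoint; reference-free (QE) checkpoints take the
# same input without the "ref" field.
model_path = download_model("Unbabel/wmt22-comet-da")
model = load_from_checkpoint(model_path)

data = [{
    "src": "有牙加菜？",                                     # target-language input
    "mt": "Does Cambridge have a good meal in Jamaica?",     # machine translation
    "ref": "Is there any good Jamaican food in Cambridge?",  # source-language input
}]

output = model.predict(data, batch_size=8, gpus=0)
print(output)  # segment-level score(s); the exact return format varies across comet versions
```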
We list the GitHub repositories of these metrics in Appendix A.

Metric Evaluation
The meta-evaluation of the above metrics is carried out using the binary classification task. As the class distribution will change depending on the task and the language pair, we require an evaluation that is robust to class imbalance. We use macro-F1 and the Matthews Correlation Coefficient (MCC) (Matthews, 1975) on the classification labels. Macro-F1 ranges from 0 to 1 and gives equal weight to the positive and negative classes.
We include MCC to interpret the MT metric's standalone performance on the given extrinsic task. MCC ranges from -1 to 1. An MCC value around 0 indicates no correlation with the class distribution, a value between 0 and 0.3 indicates negligible correlation, and a value between 0.3 and 0.5 indicates low correlation.
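Both quantities can be computed with scikit-learn once the metric scores have been thresholded; the scores, labels, and threshold below are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, matthews_corrcoef

# Hypothetical segment-level metric scores and downstream-task labels
# (1 = task succeeded on the translation, 0 = task failed).
scores = np.array([0.42, -0.10, 0.51, 0.03, -0.35])
labels = np.array([1, 1, 0, 1, 0])

threshold = 0.0  # chosen on a development set (see Section 4.3)
preds = (scores >= threshold).astype(int)

print(f1_score(labels, preds, average="macro"))  # macro-F1, 0 to 1
print(matthews_corrcoef(labels, preds))          # MCC, -1 to 1
```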

Results
We report the aggregated results for dialogue state tracking, semantic parsing, and question answering in Table 1. We report fine-grained results in Appendix B. We use a random baseline for comparison, which assigns the positive and negative labels with equal probability.

Performance on Extrinsic Tasks
We find that almost all metrics perform better than the random baseline on the macro-F1 metric. We use MCC to identify if this increase in macro-F1 makes the metric usable in the end task. Evaluating on MCC, we find that all the metrics show negligible correlation under almost all settings, across all three tasks. Contrary to the trend where neural metrics are better than metrics based on surface overlap (Freitag et al., 2021b), we find this binary classification to be difficult across all the metrics. The decreasing order of performance across tasks is semantic parsing, question answering and dialogue state tracking.
Comparing the reference-based versions of the trained metrics (COMET-DA, COMET-MQM, UniTE) with their reference-free QE equivalents, we observe that the reference-based versions perform better than or are competitive with their reference-free counterparts for semantic parsing and question answering. However, this trend flips for dialogue state tracking, where the reference-free versions perform the same as or better than the reference-based metrics. We also note that references are unavailable when systems are in production, hence reference-based metrics are not suitable for realistic settings. We discuss alternative ways of obtaining references in Section 4.4.
Comparing the use of MQM scores and DA scores for fine-tuning the different COMET variants, we find that COMET-QE-DA and COMET-DA are better than COMET-QE-MQM and COMET-MQM, respectively, for question answering. There is no clear winner for dialogue state tracking or semantic parsing, as seen in Appendix B.
Looking at the per-language scores in Appendix B, we find that no specific language pairs stand out as easier or harder across the tasks. We cannot verify whether the performance of neural metrics generalises to languages not seen during training, as performance is poor across all languages.
We now conduct some further analyses on these results:

Case Study
For our case study, we look at the semantic parsing task where the parser was trained in English and the input language is Chinese, using COMET-DA, as this metric is well studied in the literature. The macro-F1 is 0.59 and the MCC value is 0.187. We report the number of correct and incorrect predictions made by COMET-DA across ten equally divided ranges of scores in Fig. 2. The ticks on the x-axis indicate the upper end-point of each interval, i.e., the second bar contains all the examples that were given scores between -1.00 and -0.74.
First, we highlight that the threshold is at -0.028, suggesting that even good translations received a negative score, which is counterintuitive. We expected the metric to fail in the bars around the threshold, as those represent the confusing cases. However, the metric makes mistakes throughout the bins. For example, "周日下午 阿密 往克利 夫" is correctly translated as "Sunday afternoon from Miami to Cleveland", yet the metric assigns it a score of -0.1. Similarly, "我需要一趟合航空下 周六的辛辛那提往市的航班" is translated as "I need to book a flight from Cincinnati to New York City next Saturday.", which loses the crucial information "United Airlines", yet is assigned a high score of 0.51. This demonstrates that the metric possesses a limited understanding of what constitutes a good/bad translation for the end task.
We suspect this behaviour is due to the current setup of MT evaluation. The development of machine translation metrics largely caters only to the intrinsic task of evaluating the quality of a translated text in the target language. The severity with which a translation error is penalised depends on the guidelines released by the organisers of the WMT metrics task.
Qualitative evaluation: To identify which translation errors are most critical for the respective extrinsic tasks, we conduct a qualitative evaluation of the MT outputs and task predictions. We annotate 100 examples for which the metric produced a wrong prediction of the end-task outcome. We annotate the MT errors (if present) in these examples based on the MQM ontology.
We observe that 54% of the classification prediction errors are due to the metric's inability to detect mistranslation, specifically the incorrect translation of named entities. The other undetected MT errors belong to addition (2%), omission (6%), and fluency (8%). The remaining 30% of the examples did not have any MT errors (none). These translations were paraphrases of the references which the metric failed to recognise as correct translations; see the red bars after the threshold in Fig. 2. We also observe that, even though the metric correctly detected the fluency errors in the translations, fluency errors have no impact on the parsing task.
We also find that approximately 20% of the errors made by the metric in the classification task arise from the task-specific model. For example, the MT model uses the alternative term "shuttle" instead of "round trip" while generating the translation for the reference "show me round trip flights from montreal to orlando". The semantic parser fails to generalise despite being trained with mBART.

Finding the Threshold
Interpreting system-level scores provided by automatic metrics requires additional context, such as the language pair of the machine translation model or another MT system for comparison. In this classification setup, we rely on interpreting the segment-level score to determine whether the translation is suitable for the downstream task. We observe that finding the right threshold to identify whether a translation needs correction is not straightforward. Our current method for obtaining a threshold relies on validating different thresholds on the development set and selecting the one with the best F1 score. These candidate thresholds are obtained by plotting a histogram over the scores with ten bins.
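A sketch of this threshold search is shown below, assuming development-set scores and labels are available; variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score

def find_threshold(dev_scores, dev_labels, n_bins=10):
    """Pick the candidate threshold (a histogram bin edge over the dev scores)
    that yields the best macro-F1 on the development set."""
    dev_scores = np.asarray(dev_scores)
    _, bin_edges = np.histogram(dev_scores, bins=n_bins)
    best_threshold, best_f1 = None, -1.0
    for t in bin_edges:
        preds = (dev_scores >= t).astype(int)
        f1 = f1_score(dev_labels, preds, average="macro")
        if f1 > best_f1:
            best_threshold, best_f1 = t, f1
    return best_threshold
```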
We report the mean and standard deviation of the best thresholds for every language pair and every metric in Table 2. Surprisingly, the thresholds are not consistent and are far from the midpoint in the case of the bounded metrics: BLEU (0-100), chrF (0-100), and BERTScore (0 to 1). The standard deviations across the table indicate that the threshold varies greatly across language pairs. We find that the thresholds of these metrics are also not transferable across tasks. The COMET metrics, except COMET-DA, have smaller standard deviations. By design, the range of the COMET metrics used in this work is unbounded. However, as discussed for the theoretical range of COMET metrics, the range for COMET-MQM is empirically found to lie between -0.2 and 0.2, which questions whether this small standard deviation is meaningful. We also suggest that producing negative scores for good translations for only certain language pairs is counter-intuitive, as most NLP metrics tend to produce positive scores. Thus, we find that neither the bounded nor the unbounded metrics discussed in this work provide segment-level scores whose range can be interpreted meaningfully across both tasks and different language pairs.

Reference-based Metrics in an Online Setting
In an online setting, we do not have access to references at test time. To test the effectiveness of reference-based methods in this setting, we consider translating the translation back into the source language. When evaluating the reference-based methods in this new setup, the language pair flips direction: src_new is mt_old, mt_new is the back-translation of mt_old, and ref_new is src_old. We generate these new translations using the mBART translation model (Tang et al., 2020). We report these results for the dialogue state tracking task in Table 3, which gives the MCC scores of reference-based metrics in this setup, simulating an online setting without gold-standard references. MCC scores improve in this setup over Table 5 for language pairs with high-quality machine translation (zh, de, ar), but not when the target language is ru. This is reassuring: as reference-based metrics improve, their deployment in a reference-less setting can still be useful when the quality of the translation outputs is high. However, their correlation coefficients remain within the range of negligible correlation. Using reference-based metrics in an online setting also comes with the overhead of producing another translation. Using ru as the input language gives the lowest performance on the downstream task, indicating that the quality of ru-en machine translation is inferior. The back-translation from en to ru is likely to add further errors to the already erroneous translation. This cascading of errors confuses the metric, which can mark a perfectly useful ru-en translation as "bad" due to errors introduced in the en-ru translation. For example, in the ru-en case, COMET-MQM has a false negative rate of 0.796 in the round-trip translation setup compared to 0.097 when the human reference is used instead. Thus, this setting is likely to fail when the machine translation models generate poor-quality outputs.

Recommendations
Our experiments suggest that evaluating MT metrics at the segment level for extrinsic tasks has considerable room for improvement. We propose the following recommendations based on our observations.
Prefer MQM for Human Evaluation of MT Outputs: We first reinforce the proposal of using the MQM scoring scheme with expert annotators for evaluating MT outputs, in line with Freitag et al. (2021a). As seen in Section 4.2, different tasks have varying tolerance to different MT errors. With explicit errors marked per MT output, future classifiers can be trained on only the subset of human evaluation data containing the errors most relevant to the downstream application.
MT Metrics Could Produce Labels over Scores: The observations from Sections 4.2 and 4.3 suggest that interpreting the quality of the produced translation based on a single number is unreliable and difficult. We recommend exploring whether segment-level MT evaluation can be approached as an error classification task instead of regression; specifically, whether the words in the source/hypothesis can be tagged with explicit error labels. Resorting to MQM-like human evaluation will result in a rich repository of human evaluation based on an ontology of errors, with erroneous spans marked across the source and hypothesis (Freitag et al., 2021a). Similarly, post-editing datasets (Scarton et al. (2019); Fomicheva et al. (2022), inter alia) also provide a starting point. An interesting exploration in this direction is the work by Perrella et al. (2022), which treats MT evaluation as a sequence-tagging problem, i.e., words in the candidate translation are tagged with "minor" or "major" error labels. Such metrics can also be used for intrinsic evaluation by assigning weights to the labels and producing a weighted score.
Use a Combination of MT metrics for the End Tasks: We do not find a "winning metric" across our tasks (See Section 4). We also find that neural metrics have performance similar to surface overlap metrics. Similar to Amrhein et al. (2022), we recommend using a combination of different families of metrics to judge the usability of the MT output for the downstream task.
Add Diverse References during Training: From Section 4.2, we find that both the neural metric and the task-specific model are not robust to paraphrases. Bawden et al. (2020) proposed automatic paraphrasing of references to improve the coverage for BLEU. We also recommend the inclusion of diverse references through automatic paraphrasing or data augmentation during the training of neural metrics.
We hope to incorporate these techniques in developing future evaluation regimes when using MT as an intermediate step in extrinsic tasks.

Conclusion
We propose a method for evaluating MT metrics that is reliable at the segment level and does not depend on human judgements with low inter-annotator agreement, by correlating MT metrics with the success of extrinsic downstream tasks.
We evaluated nine different metrics on their ability to detect errors in generated translations when machine translation is used as an intermediate step for three extrinsic tasks: dialogue state tracking, question answering, and semantic parsing. We found that the segment-level scores provided by all the metrics show negligible correlation with the success/failure outcomes of the end task across different language pairs. We attributed this result to the segment scores produced by these metrics being uninformative and to different extrinsic tasks demonstrating different levels of sensitivity to different MT errors. We made recommendations to predict error types instead of error scores to facilitate the use of MT metrics in downstream tasks.

Limitations
As seen in Section 4.2, the metrics are sometimes unnecessarily penalised due to errors made by the end-task models. Filtering these cases would require checking every example in every task manually. We hope our results provide conclusive trends to metric developers focusing on segment-level MT evaluation, rather than a fixation on the absolute numbers in the tables. We included three tasks to cover different types of errors in machine translations and different types of contexts in which an online MT metric is required. Naturally, this regime can be extended to other datasets, other tasks, and other languages (Ruder et al., 2021; Doddapaneni et al., 2022). Further, our tasks used strict evaluation metrics such as exact match. Incorporating information from partially correct outputs is not trivial and will hopefully be addressed in the future. We have covered 37 language pairs across the tasks, most of which use English as one of the languages. Most of the language pairs in this study involve high-resource languages. The choice of metrics in this work is not exhaustive and depends on the availability and ease of use of the metric implementations provided by the authors.

Ethics Statement
This work uses datasets, models, and metrics that are publicly available. Although the scope of this work does not allow us to have an in-depth discussion of the biases associated with metrics (Amrhein et al., 2022), we caution readers about drawbacks of metrics that cause unfair evaluation of marginalised subpopulations, whether already discovered or yet to be discovered.

Acknowledgements
We thank Barry Haddow for providing us with valuable feedback on setting up this work. We thank Arushi Goel and the attendees at the MT Marathon 2022 for discussions about this work. This work was supported in part by the UKRI Centre for Doctoral Training in Natural Language Processing, funded by the UKRI (grant EP/S022481/1) and the University of Edinburgh (Moghe) and UK Engineering and Physical Sciences Research Council grant EP/L016427/1 (Sherborne). We also thank Huawei for their support (Moghe).