Training and Meta-Evaluating Machine Translation Evaluation Metrics at the Paragraph Level

As research on machine translation moves to translating text beyond the sentence level, it remains unclear how effective automatic evaluation metrics are at scoring longer translations. In this work, we first propose a method for creating paragraph-level data for training and meta-evaluating metrics from existing sentence-level data. Then, we use these new datasets to benchmark existing sentence-level metrics as well as train learned metrics at the paragraph level. Interestingly, our experimental results demonstrate that using sentence-level metrics to score entire paragraphs is as effective as using a metric designed to work at the paragraph level. We speculate this result can be attributed to properties of the task of reference-based evaluation as well as limitations of our datasets with respect to capturing all types of phenomena that occur in paragraph-level translations.


Introduction
Automatic evaluation metrics have always been a critical component of progress in research on machine translation (MT). As the field of MT moves beyond translating individual sentences to translating full paragraphs, book chapters, or documents (Tu et al., 2018; Sun et al., 2022; Thai et al., 2022; Jiang et al., 2023; Post and Junczys-Dowmunt, 2023), automatic metrics need to be designed to work on these longer texts.
Currently, how well automatic metrics agree with human judgments of paragraph translation quality is an open question. Few studies have meta-evaluated metrics on longer texts, and those that have are focused on the literary domain and are limited in the size of the evaluation dataset (Jiang et al., 2022; Thai et al., 2022; Karpinska and Iyyer, 2023). In this work, we investigate training and meta-evaluating metrics for scoring paragraph translations using the benchmark Workshop on Machine Translation (WMT) datasets that are widely used for metric development (Freitag et al., 2022).
Due to the scarcity of human ratings of paragraph translations, we propose a method to create paragraph-level training and meta-evaluation datasets from the existing WMT sentence-level datasets (§3). Although these ratings are typically only used at the sentence level, they were collected on contiguous paragraphs and performed with document context, so they can be used as paragraph-level datasets. We repurpose these datasets to benchmark existing sentence-level metrics as well as train new paragraph-level metrics for scoring paragraph translations (§4).
Our experimental results are somewhat surprising. We find that there appears to be little evidence that training on paragraph-level data is beneficial, at least given the limitations of our experimental setup. Using metrics trained only on sentence-level data to directly score full paragraphs achieves comparable agreement with human ratings as metrics trained on paragraph-level data (§6.1). Sentence-level metrics appear to generalize well to inputs much longer than they were trained on (§6.2).
We hypothesize these observations can be explained by the nature of evaluating translations and characteristics of our paragraph-level dataset (§7). We speculate that long-range dependencies, which paragraph-level metrics can model but sentence-level metrics likely do not, may not be very important for achieving high agreement with human ratings. Further, because our training and evaluation datasets assume a sentence alignment between the reference and hypothesis paragraphs, certain translation phenomena that sentence-level metrics may struggle to handle, like sentence or information reordering, are not well represented in the dataset, limiting our ability to show the benefits of training on paragraph-level ratings.
The contributions of our work include (1) a method for constructing paragraph-level training and meta-evaluation datasets from sentence-level ratings, (2) an experimental study that demonstrates the comparable performance of sentence- and paragraph-level metrics, and (3) an analysis that aims to provide an explanation for our experimental observations.

Terminology
Throughout this paper, we use terms like segment, sentence, paragraph, and document to refer to different lengths of text. To the best of our knowledge, there are no agreed-upon definitions for these terms in the MT literature, so here we define how they are used for the rest of the paper.
We refer to the input text to an MT system or evaluation metric as a segment, irrespective of its length. Traditionally, segments in MT have been roughly equivalent to one sentence, although sometimes they can be short phrases or even longer than a single sentence. Regardless, we use sentence to refer to this unit of text since it accurately describes the most common text length that is widely used in MT.
Our work investigates evaluating paragraphs of text, which we define to be multi-sentence segments. We do not require that the paragraphs used in this work obey the traditional definition of a paragraph (i.e., a unit of text separated by a newline character). We refrain from calling this unit of text a document, which we consider to be all of the possible input text, since each document can be broken down into multiple paragraphs and the term paragraph more accurately describes the length of text we use.

Paragraph-Level Datasets
The two main sources for training and meta-evaluating MT metrics are the direct assessment (DA) and Multidimensional Quality Metrics (MQM; Lommel et al., 2014; Freitag et al., 2021a) datasets that the Workshop on Machine Translation (WMT) has collected as part of the yearly metrics shared task (Freitag et al., 2022). The DA ratings were done by non-expert raters who assigned a quality score in the range 0-100 to translated sentences. Because of differences in rater behavior, the DA scores are z-normalized per rater. In MQM, expert raters identify error spans in translated sentences and assign each error a category and severity level, which are used to calculate a score for that error. A sentence's MQM score is defined as the sum of the errors' scores.
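The MQM scoring described above can be sketched as follows. This is a minimal illustration, not the exact WMT implementation; the severity weights and error categories here are assumptions for the sake of the example.

```python
# A sketch of sentence-level MQM scoring: each annotated error span carries a
# category and a severity, and a sentence's MQM score is the sum of the error
# weights. The weights below are illustrative placeholders, not WMT's exact
# weighting scheme.
SEVERITY_WEIGHTS = {"minor": 1.0, "major": 5.0}

def mqm_score(errors):
    """errors: list of (category, severity) tuples annotated for one sentence."""
    return sum(SEVERITY_WEIGHTS[severity] for _, severity in errors)

# One major accuracy error plus one minor fluency error.
score = mqm_score([("accuracy/mistranslation", "major"),
                   ("fluency/grammar", "minor")])
```

A sentence with no annotated errors receives a score of 0 under this scheme, i.e., lower totals indicate better translations.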
Training and meta-evaluating metrics at the paragraph level requires a collection of translated paragraphs and paragraph-level quality scores. Luckily, the DA data since 2019 and the MQM data can be considered to be paragraph-level ratings. The ratings were performed on contiguous blocks of sentences that were translated by the same system (e.g., the first k sentences per document are rated for a system). Although the scores were collected at the sentence level, the ratings were done in context, meaning the raters had access to the document context for a sentence, so the scores should reflect paragraph- or document-level phenomena like discourse errors. Therefore, we use the sentence-level DA and MQM data to construct paragraph-level datasets as follows.
For each document translated by a system, we run a sliding window of size k sentences from the start to the end. If all k sentences in the window have been rated, those k sentences are concatenated together to become a paragraph instance and the window shifts by k. Otherwise, the sliding window shifts by 1 and the process repeats. To maintain consistency between the sentence scores within a paragraph, we additionally require that every sentence is scored by the same rater. Then, we define the paragraph-level scores to be the average DA z-score or sum of MQM scores for each sentence in the paragraph. The result is a dataset of rated paragraph translations of k sentences each.
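The sliding-window construction can be sketched as follows. The data layout (dicts with `text`, `score`, and `rater` fields) is an illustrative assumption; the actual WMT data format differs.

```python
# A sketch of the sliding-window paragraph construction from Section 3.
# `sentences` is one system's translation of one document, in order; each
# entry is a dict with illustrative keys "text", "score", and "rater".
def build_paragraphs(sentences, k, aggregate=sum):
    paragraphs = []
    i = 0
    while i + k <= len(sentences):
        window = sentences[i:i + k]
        all_rated = all(s["score"] is not None for s in window)
        same_rater = len({s["rater"] for s in window}) == 1
        if all_rated and same_rater:
            paragraphs.append({
                "text": " ".join(s["text"] for s in window),
                # Sum for MQM; pass a mean function for DA z-scores.
                "score": aggregate(s["score"] for s in window),
            })
            i += k  # the window shifts by k after emitting a paragraph
        else:
            i += 1  # otherwise it slides by one sentence and retries
    return paragraphs
```

Note that requiring a single rater per window means a rater change mid-document causes the window to slide by one until a homogeneous block is found.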
We apply this dataset construction approach to the DA and MQM data for k = 1, 2, . . ., 10 sentences per paragraph. The number of paragraphs is shown in Figure 1 and the distribution of the lengths of the new translated paragraphs is shown in Figure 2. As k increases, the number of paragraphs decreases because there are fewer candidate paragraphs, while the length of the paragraphs increases, roughly by an expected factor of k.
These paragraph-level DA and MQM datasets are used to train and meta-evaluate paragraph-level metrics for the rest of this paper.

Paragraph-Level Metrics
We explore two different methods for creating paragraph metrics: directly applying sentence-level metrics to paragraphs (§4.1) and training metrics on paragraph-level data (§4.2).

Applying Sentence-Level Metrics on Paragraphs
Although automatic metrics that have been used to evaluate sentence-level MT were not explicitly designed to evaluate paragraphs, they can be repurposed to score paragraphs in different ways.
First, the input paragraph can be treated as if it were one long segment and passed to the metric to calculate a score. For metrics that use bag-of-n-grams representations, like BLEU (Papineni et al., 2002), there is no input length limitation. However, some learned metrics, like BLEURT (Sellam et al., 2020), have a maximum possible sequence length due to restrictions related to neural network architectures. Therefore, the length of the input paragraph is restricted in some cases.

(Footnote 2: Summing MQM scores was done to generalize an MQM rating to paragraphs since a sentence's MQM score is the total error weight for that sentence. The choice of summing or averaging does not matter for metric meta-evaluation because the correlations are scale invariant.)
Then, if there is assumed to be an alignment between the source, reference, and hypothesis sentences within a paragraph (as is the case with our datasets), a paragraph score can be calculated by averaging the sentence-level metric's score for each of the k individual sentences. While this per-sentence averaging approach more closely aligns how the metrics are being used with how they were designed, we argue it is less than ideal because the 1:1 sentence alignment between the source and hypothesis translations will not always exist. However, this approach is useful for understanding and analyzing the behavior of metrics when they are used to score full paragraphs directly.
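The two ways of applying a sentence-level metric to a paragraph can be sketched as follows; `metric` stands in for any sentence-level scoring function (BLEU, a learned regression metric, etc.), and the whitespace join is a simplification.

```python
# Variant 1 (Section 4.1): concatenate and treat the paragraph as one long
# segment, making no assumption about a sentence alignment.
def score_directly(metric, hyp_sents, ref_sents):
    return metric(" ".join(hyp_sents), " ".join(ref_sents))

# Variant 2: assume a 1:1 sentence alignment between hypothesis and
# reference, score each pair, and average the k sentence-level scores.
def score_by_averaging(metric, hyp_sents, ref_sents):
    assert len(hyp_sents) == len(ref_sents), "requires a sentence alignment"
    scores = [metric(h, r) for h, r in zip(hyp_sents, ref_sents)]
    return sum(scores) / len(scores)
```

Section 6.2 compares these two variants empirically and finds that they produce very similar scores.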

Learning Paragraph-Level Metrics
While sentence-level metrics can be repurposed to score paragraphs, the lengths of the input paragraphs are significantly longer than the lengths of individual sentences (compare k = 1 to k > 1 in Figure 2) and there may be cross-sentence dependencies that are not learned by sentence-level metrics. Therefore, we explore creating a metric specifically for paragraph-level data.
To do so, we train a BLEURT-style regression model on the paragraph-level datasets: The reference and hypothesis paragraphs are tokenized and concatenated together (separated by a special token), then passed as input to a neural network. The network is then trained to predict the hypothesis paragraph's ground-truth quality score. Sections 5.2 and 5.4 contain more information about the model's architecture and implementation details.
It is desirable for the paragraph-level metric to be able to score paragraphs of any length, so we train the metric on paragraphs composed of k = 1, 2, . . ., 10 sentences. Because the number of paragraph instances decreases significantly as k increases (see Figure 1), longer paragraphs will rarely be seen during training. Therefore, we explore two different techniques for weighting the training data: one that selects paragraphs uniformly at random and one that performs a stratified sample so the training data is composed of an equal number of paragraphs for each value of k.
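The two weighting schemes can be sketched as below. The function names and the `data_by_k` layout (a mapping from k to that length's examples) are illustrative assumptions, not the actual training pipeline.

```python
import random

# PARA-UNIF: pool all paragraphs and sample uniformly, so longer paragraphs
# (which are rarer; see Figure 1) are rarely seen during training.
def sample_uniform(data_by_k, n, rng):
    pool = [ex for examples in data_by_k.values() for ex in examples]
    return rng.sample(pool, n)

# PARA-STRAT: sample (with replacement) an equal number of paragraphs for
# each value of k, upweighting the rare long paragraphs.
def sample_stratified(data_by_k, n, rng):
    per_k = n // len(data_by_k)
    batch = []
    for examples in data_by_k.values():
        batch.extend(rng.choices(examples, k=per_k))
    return batch
```

Under stratified sampling, a k = 10 paragraph is seen as often as a k = 1 sentence, even if there are orders of magnitude fewer of them.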
Next, we describe the experimental setup to evaluate the paragraph-level metrics.

Datasets
The paragraph-level datasets used in our experiments are described in Section 3. The WMT'19 (Ma et al., 2019) and WMT'20 (Mathur et al., 2020) paragraph-level DA data is used for training the metrics described in this work, and all metrics are evaluated on the WMT'21 (Freitag et al., 2021b) and WMT'22 (Freitag et al., 2022) paragraph-level MQM data. For both DA and MQM, we use k = 1, 2, . . ., 10 sentences per paragraph. The different paragraph lengths are combined during training but separated for evaluation.

Metrics
Paragraph-Level Metrics. We train two different paragraph-level metrics, one for each of the two weighting techniques, uniform and stratified sampling (see §4.2). We refer to these metrics as PARA-UNIF and PARA-STRAT.
Our metric uses the same architecture as the Metric-X WMT'22 metrics shared task submission (Freitag et al., 2022). The metric builds on the mT5 encoder-decoder language model (Xue et al., 2021), which was originally designed to be a sequence-to-sequence language model. We repurpose the model for our regression task as follows. The inputs to the encoder are the hypothesis and reference translations separated by a special token, and a single dummy token is passed as the first input to the decoder. We arbitrarily selected a reserved vocabulary token, then trained the model so that token's output logit in the first decoding step becomes the score for the input hypothesis translation. This modification of the sequence-to-sequence architecture for regression allows us to utilize all of the pre-trained weights from mT5.
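The score-extraction step can be sketched as follows. The logits array here stands in for the real mT5 forward pass, and `RESERVED_TOKEN_ID` and the vocabulary size are illustrative placeholders, not the actual values used.

```python
import numpy as np

# A sketch of repurposing a sequence-to-sequence model for regression
# (Section 5.2): feed a single dummy token to the decoder and read off the
# output logit of one arbitrarily chosen reserved vocabulary token at the
# first decoding step. That logit is trained to regress the quality score.
RESERVED_TOKEN_ID = 250_001  # illustrative placeholder, not the real token id
VOCAB_SIZE = 250_112         # illustrative vocabulary size

def predict_score(first_step_logits):
    """first_step_logits: [batch, vocab] logits from the first decoder step."""
    return first_step_logits[:, RESERVED_TOKEN_ID]

# Stand-in for a model forward pass over a batch of two examples.
logits = np.zeros((2, VOCAB_SIZE))
logits[0, RESERVED_TOKEN_ID] = -3.2
logits[1, RESERVED_TOKEN_ID] = 0.7
scores = predict_score(logits)
```

Because only one logit is read and no new parameters are introduced, all of the pre-trained mT5 weights (including the output embedding matrix) are reused as-is.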
The maximum input sequence length to our metric is 1024 SPM tokens (Kudo and Richardson, 2018). The inputs are truncated during training or inference if the input is longer than 1024 tokens. In the worst case, this happens up to 27% of the time on the MQM data for 10 sentences per paragraph (see Appendix A for specific statistics).

Sentence-Level Baseline. In addition to the paragraph-level metrics, we train a sentence-level version that is trained on the same DA data but only k = 1 sentences per paragraph. This baseline metric can be directly compared to the paragraph-level metrics that we train because the model architecture, training procedure, etc., are identical. The only difference is the training data. This metric is referred to as SENT-BASE.
Other Metrics. In addition to the metrics described in this paper, we evaluate BLEU (Papineni et al., 2002), COMET-22 (Rei et al., 2020, 2022), and PaLM-2 from Fernandes et al. (2023) as sentence-level metrics applied to paragraphs (i.e., §4.1), as well as the document-level metric BlonDE (Jiang et al., 2022). BLEU scores translations using lexical n-gram overlap, and COMET-22 is a learned regression metric that first embeds the input hypothesis, reference, and source, combines them into a joint representation, then finally predicts a score. The metric from Fernandes et al. (2023) is based on the PaLM-2 large language model (Anil et al., 2023). We evaluate both the zero-shot version, in which PaLM-2 is prompted to score a translation on a scale from 0 to 100, and the regression version that finetunes PaLM-2 on MQM ratings to predict a floating-point quality score, similar to COMET. Our analysis includes the Bison variant of PaLM-2.
BlonDE evaluates discourse phenomena in document translations via a set of automatically extracted features. It was designed to evaluate texts longer than paragraphs, like book chapters, but we compare against it in this work. BlonDE is available in English only.
We use the SacreBLEU (Post, 2018) implementation of BLEU and the Unbabel/wmt22-comet-da COMET-22 model that was trained on sentence-level WMT DA data from 2017-2020.

Meta-Evaluation Metrics
The metrics are meta-evaluated using pairwise accuracy at both the system and segment levels. System-level scores are calculated by averaging a metric's scores over paragraphs. The system-level pairwise accuracy then computes the proportion of all pairwise system comparisons on which the automatic metric and human ground-truth ratings agree (Kocmi et al., 2021).

Figure 3: As the number of sentences per paragraph increases, the accuracy scores of the metrics appear to either not decrease (system-level, left) or increase (segment-level, right). This suggests that accurately scoring a paragraph is an easier task than scoring an individual sentence, even for metrics that are not trained on paragraph-level examples. The results of the metrics trained in this work are an average of 5 different runs. Results for other language pairs follow the same trend and are included in Appendix B.
We follow Deutsch et al. (2023) and report segment-level pairwise accuracy using the group-by-item variant of the segment-level correlation in combination with the τ-optimization procedure. The group-by-item segment-level correlation calculates a pairwise accuracy score between all of the systems' translations for the same input source segment, then averages the accuracy across source segments. This version of pairwise accuracy gives credit to metrics that correctly predict when two translations have tied ground-truth scores. Because learned regression metrics almost never predict the same score for two non-identical translations, the τ-optimization procedure calibrates the metrics by automatically introducing ties into the metrics' scores. The segments used in this evaluation are paragraphs.
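The group-by-item pairwise accuracy can be sketched as follows. This is a simplified illustration: the `eps` threshold stands in for the ties that τ-optimization would introduce, and the actual search for the optimal threshold is omitted.

```python
import itertools

def sign(a, b, eps=0.0):
    # Three-way comparison: scores within eps of each other count as a tie.
    diff = a - b
    if abs(diff) <= eps:
        return 0
    return 1 if diff > 0 else -1

def pairwise_accuracy(human, metric, eps=0.0):
    """human, metric: dicts mapping system name -> score for ONE segment.
    A pair is correct when the metric's ordering (including ties) matches
    the human ordering; human ties use exact equality here."""
    systems = sorted(human)
    pairs = list(itertools.combinations(systems, 2))
    correct = sum(
        1 for s1, s2 in pairs
        if sign(human[s1], human[s2]) == sign(metric[s1], metric[s2], eps)
    )
    return correct / len(pairs)

def group_by_item_accuracy(per_segment_human, per_segment_metric, eps=0.0):
    # Average the per-segment pairwise accuracies across source segments
    # (here, paragraphs), as in the group-by-item variant.
    accs = [pairwise_accuracy(h, m, eps)
            for h, m in zip(per_segment_human, per_segment_metric)]
    return sum(accs) / len(accs)
```

In the full procedure, `eps` would be tuned (τ-optimization) so that a regression metric's near-identical scores can be credited as ties.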
Both system- and segment-level accuracy (with τ-optimization) are official meta-evaluation metrics for the WMT'23 metrics shared task. Results using Pearson's correlation follow similar trends to the accuracy results and are available in Appendix B.

Implementation Details
Our learned metrics are implemented with TensorFlow (Abadi et al., 2015) in the T5X library (Roberts et al., 2022). They are initialized with the XXL version of mT5, which contains 13B parameters, and trained for a maximum of 20k steps with a batch size of 128 using Adafactor (Shazeer and Stern, 2018) on 64 v3 TPUs. Checkpoint selection was done by selecting the step that has the highest average segment-level pairwise accuracy across language pairs and all values of k sentences per paragraph after calibration via τ-optimization. In general, we observed that the specific checkpoint selection strategy was not too important.

Results
First, we evaluate how well metrics perform when used to directly score paragraphs (§6.1), then we further examine the behavior of different paragraph-level metrics by analyzing their performance in the context of their sentence-level counterparts (§6.2).

Paragraph-Level Evaluation
Figure 3 plots the system- and segment-level correlation results for different numbers of k sentences per paragraph. Each metric is used to directly score a full paragraph even if the metric was not designed to do so (e.g., SENT-BASE or COMET-22). There are several interesting observations.

Paragraph-Level Performance. First, as the length of the paragraphs increases, the system-level correlations remain relatively steady or increase, and the segment-level correlations clearly improve for all metrics except PaLM-2 zero-shot. This is evidence that scoring paragraphs is an easier task than scoring individual sentences, a result that is counterintuitive; scoring more text should seemingly be a harder task. We hypothesize this result is explained by the fact that some noise in the human and metric scores is averaged away, leaving more reliable signals as the paragraphs get longer. If the metric scores are unbiased estimators, their agreement with human ratings should then increase. PaLM-2 zero-shot is an outlier in this case because it predicts a large number of ties between translations. Prompting large language models for MT evaluation is known to result in the model predicting a small number of unique scores, resulting in many ties (Kocmi and Federmann, 2023; Fernandes et al., 2023). As the length of the paragraph increases, the number of MQM ties decreases. Since pairwise accuracy penalizes incorrect tie predictions, the zero-shot model has worse performance on longer texts. See Figure 4 for a visualization of the number of ties in the PaLM-2 output and MQM scores.
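The noise-averaging hypothesis can be illustrated with a toy simulation. Everything here is synthetic and rests on the simplifying assumption that human and metric scores are independent, unbiased, noisy estimates of a latent per-paragraph quality; it is not derived from our actual data.

```python
import numpy as np

# A toy simulation of the hypothesis above: if human and metric scores are
# noisy, unbiased measurements of a latent quality, averaging over k
# sentences cancels noise, so metric-human correlation rises with paragraph
# length. All quantities are synthetic.
def correlation_at_k(k, n_paragraphs=2000, noise=1.0, seed=0):
    rng = np.random.default_rng(seed)
    quality = rng.normal(size=(n_paragraphs, 1))  # latent paragraph quality
    human = quality + noise * rng.normal(size=(n_paragraphs, k))
    metric = quality + noise * rng.normal(size=(n_paragraphs, k))
    # Paragraph scores are per-sentence averages, as in our datasets.
    return np.corrcoef(human.mean(axis=1), metric.mean(axis=1))[0, 1]

r1, r10 = correlation_at_k(1), correlation_at_k(10)
# With independent per-sentence noise, the correlation at k = 10 is
# substantially higher than at k = 1.
```

In this toy model the expected correlation is roughly 1 / (1 + noise^2 / k), which increases monotonically with k, mirroring the segment-level trend in Figure 3.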
Sentence vs. Paragraph Level. Then, there appears to be little evidence that training on paragraph-level examples results in better correlations with human ratings on paragraph-level test data. For instance, increasing the weight of the paragraph-level data during training does not help compared to uniformly sampling data (compare PARA-STRAT to PARA-UNIF). Further, the baseline metric SENT-BASE, which shares the same architecture as our paragraph-level metrics but is only trained on sentence-level data (k = 1), performs just as well as the paragraph-level metrics. This observation is additionally supported by COMET-22's results. The difference between the metrics we train and COMET is relatively constant for all values of k, demonstrating that COMET is not systematically worse on longer inputs.
The generalization of sentence-level metrics to paragraph-level data is rather surprising. The length of the inputs for scoring paragraphs is up to 10x longer than those for scoring sentences (see Table 1). Even though the length of the test data is out-of-distribution with respect to the training data, the sentence-level metrics predict reliable scores on the paragraph-level data. Next, we further analyze the sentence-level metrics to better understand their scores.

Understanding Sentence-Level Metrics
To further analyze the performance of the sentence-level metrics on paragraph-level data, we compare the two versions of applying a sentence-level metric to paragraphs discussed in §4.1. One version directly scores a full paragraph (thus making no assumption about an alignment between the hypothesis and reference), whereas the other averages the scores of evaluating the individual k hypothesis sentences against the corresponding reference sentences (thus assuming a sentence-level alignment exists).
Figure 5 shows that for two sentence-level metrics, the baseline trained in this work and BLEU, the performance of the two paragraph scoring variants is very similar. Then, Figure 6 shows that the Pearson correlation between the scores of those two variants is very high (≥ 0.85).
Together, these results indicate that there is little difference between these two methods. Directly scoring a paragraph and scoring individual sentences yield both similar scores and similar agreement with human ratings. The sentence-level metrics appear to be scoring full paragraphs in a desirable way, by calculating some average score across sentences.
This result is not obvious. As the length of the input increases, the bag-of-n-grams representations used by lexical matching metrics like BLEU have an increased potential for erroneous matches between the hypothesis and reference sentences, which could result in misleading scores. Learned metrics, like the ones trained in this work, have not been trained on a significant amount of very long data, so it is not clear that the scoring functions they learn would generalize well to longer inputs. Despite this, the sentence-level metrics appear to predict high-quality scores for paragraphs.
Next, we propose a hypothesis for why this is the case and why training on paragraph-level data does not appear to result in a better metric.

Discussion
In theory, training on paragraph-level data should have advantages compared to training on sentence-level data. The metric (1) should be able to handle longer input sequences, (2) should be able to capture long-range dependencies, and (3) should be able to model different paragraph-level phenomena like information or sentence reordering. However, we were not able to demonstrate these advantages in practice, and we theorize why as follows.

Figure 7: An English-to-Spanish translation example where the reference translation does not have enough information to correctly evaluate the hypothesis. Source Context: "Maria said no." Source: "She did not slap the green witch." Reference Context: "Maria dijo no." Reference: "No le dió una bofetada a la bruja verde." Gender in Spanish is marked on pronouns, and Spanish is a pro-drop language, which means the pronoun can be omitted if the context is clear. In this example, the pronoun is dropped from the reference, so determining whether the pronoun used in the hypothesis is correct requires taking into account the previous reference sentence. We suspect such examples are not frequent, and if they do exist, the information required to resolve the ambiguity is relatively local to the reference sentence.
First, the analysis in §6.2 shows that sentence-level metrics generalize well to significantly longer input, so advantage (1) may not be so relevant. We hypothesize that the scoring function learned by sentence-level metrics like SENT-BASE or COMET could score a token in the hypothesis based on some alignment to the reference using its relative position in the translation. This function would be agnostic with respect to the global positioning, and thus the scoring function would generalize well to longer inputs. If this were true, training on paragraph-level data would not be necessary to obtain good performance on long sequences.
Second, evaluating translation quality seems to be a very "local" problem in the sense that modeling long-range dependencies is not frequently necessary for evaluation. Often, the reference phrase that aligns to a hypothesis phrase has enough information to accurately evaluate the hypothesis. If it does not, the information is likely nearby, not several sentences away (see Figure 7). Although the sentence-level metrics were not trained on multiple sentences, we suspect they are able to capture nearby dependencies across sentences when evaluating paragraphs. In theory, a paragraph-level metric would have the ability to model long-range dependencies since it could observe them during training. However, if they are infrequent, advantage (2) over sentence-level metrics may be small.
Finally, the ability of our learned paragraph metrics to capture phenomena like sentence reordering is limited by our dataset construction method. Since the paragraphs in our training and test sets come from MT systems that translated one sentence at a time, there are no phenomena like sentence reordering present in the datasets. Therefore, the paragraph-level metric cannot learn to model such cases, and the metrics are never evaluated on them either. Thus, the limitations of the dataset mean that we cannot demonstrate advantage (3).
We believe that paragraph-level metrics are necessary for evaluating true paragraph translations, where MT systems can be more creative with how a full paragraph is translated, rather than paragraph translations that are created by translating individual sentences. We hypothesize that sentence-level metrics will not generalize well when there is no sentence alignment or there is significant information reordering. To accurately evaluate actual paragraph translations, metrics need to be trained on similar data. Future work should invest in collecting human ratings for paragraph-level translations so that new metrics can be trained and evaluated.

Related Work
The vast majority of research on MT evaluation has worked at the sentence level (Papineni et al., 2002; Banerjee and Lavie, 2005; Snover et al., 2006; Popović, 2015, 2017; Lo, 2019; Sellam et al., 2020; Rei et al., 2020, 2022; Thompson and Post, 2020; Wan et al., 2022), although there has been recent interest in moving beyond sentence-level evaluation. Vernikos et al. (2022) propose a method to incorporate document-level context into a sentence-level metric by using the additional context when computing the representations for the hypothesis and reference sentences. Although they use document context in their metric, it still scores single sentences one at a time, in contrast to the paragraph-level metrics in our work that predict a score for entire paragraphs at once. Then, Jiang et al. (2022) propose a document-level metric called BlonDE that targets evaluating discourse phenomena as opposed to overall translation quality (i.e., they do not model translation accuracy errors). To the best of our knowledge, ours is the first study aimed at training a learned metric that directly scores entire paragraphs.
Other studies that have evaluated sentence-level metrics beyond the sentence level have done so in the literary domain. Thai et al. (2022) show that automatic metrics prefer MT output over human translations, and Karpinska and Iyyer (2023) show that metrics prefer actual translations of paragraphs over sentence-by-sentence translations. Our work is complementary to theirs as we focus on the news domain, train metrics on paragraph-level data, and evaluate on a much larger set of human ratings. It is not clear whether conclusions reached about metrics in the news domain will apply to the literary domain or vice versa.
Some researchers have developed challenge sets that can be used to probe how well metrics capture discourse phenomena that appear when translating more than one sentence at a time (Bawden et al., 2018; Müller et al., 2018; Lopes et al., 2020). However, these challenge sets can be trivial for reference-based metrics because the reference often resolves the ambiguity in the translation. To the best of our knowledge, a challenge set that forces reference-based metrics to use context outside of a single reference sentence during evaluation (see Figure 7) does not exist.
Research on generating translations of text longer than single sentences directly uses sentence-level metrics to score translations (Tiedemann and Scherrer, 2017; Miculicich et al., 2018; Ma et al., 2020; Wu et al., 2023; Post and Junczys-Dowmunt, 2023). Our work can be viewed as a justification for doing so.

Conclusion
In this work, we proposed a method for constructing paragraph-level datasets for training and meta-evaluating MT evaluation metrics from sentence-level data. Our experimental results showed that metrics trained on paragraph-level data do not necessarily outperform those trained on sentence-level data, potentially due to the fact that sentence-level metrics seem to generalize well to longer inputs and limitations of our paragraph-level datasets. Future work should invest in collecting human judgments for paragraph translations generated by MT systems that directly translate full paragraphs instead of translating one sentence at a time. Such a dataset would be more likely to contain phenomena that do not exist at the sentence level, which we hypothesize would be more likely to require metrics designed to work at the paragraph level.

Limitations
There are a couple of limitations related to our dataset construction approach that are worth enumerating.
As discussed in Section 7, our ability to evaluate metrics' performance on all types of paragraph-level translations is limited by our dataset construction method. Our translated paragraphs are generated by MT systems that translate one sentence at a time, which results in sentence-aligned data. Therefore, we are unable to evaluate metrics on true paragraph-level translations that might have sentence or information reordering.
Then, the WMT data no longer contains information about the white space between the original source sentences. Therefore, the DA and MQM paragraph-level datasets do not contain the paragraph breaks that were in the original documents. Each of the k sentences is concatenated together and separated by a space in our work, so it is very likely that the artificially constructed paragraphs do not perfectly resemble real paragraphs.

A Dataset Statistics
The exact number of paragraph-level instances by WMT year and language pair that we generated from our dataset construction procedure (see §3) can be found in Table 2 for DA and Table 3 for MQM.

B Additional Results
Figure 9 contains the system- and segment-level accuracy results on the en-de and en-ru language pairs from WMT'22 MQM that were not presented in the main body of the paper. Figure 10 contains the correlations for all 3 language pairs but uses Pearson correlation instead of pairwise accuracy.
Figure 11 shows the correlation between the two ways to apply a segment-level metric to paragraph-level data, directly scoring the paragraph or averaging the k segment scores, on the en-ru and zh-en WMT'22 MQM datasets.

Figure 1: The number of contiguous paragraphs for the given number of sentences per paragraph, where each sentence is rated by the same rater. Actual values are included in Appendix A.

Figure 2: The distribution of the lengths of the new translated paragraphs for each number of sentences per paragraph.

Figure 4: There are fewer MQM ties as the number of sentences per paragraph increases. The finetuned PaLM-2 model outputs a very small number of ties, whereas the zero-shot model consistently predicts a large number of ties. Since the pairwise accuracy meta-evaluation metric penalizes metrics for incorrect tie predictions, the zero-shot model will have worse performance as the inputs get longer.

Figure 5: Metrics that score a paragraph directly (solid line) versus those that assume an alignment between the reference and hypothesis and calculate a score by averaging across the k sentence-level values (dashed line) perform very similarly.

Figure 6: The Pearson correlation between the scores of the two paragraph scoring variants, directly scoring the paragraph versus averaging the k sentence-level scores.

Figure 8: The distribution of the length of the hypothesis translations for the direct assessment (DA) and MQM datasets for a given number of sentences per paragraph.

Figure 9: System- and segment-level accuracy results for the en-de and en-ru language pairs on the paragraph-level WMT'22 MQM data for different numbers of k sentences per paragraph. In general, the system-level correlations are relatively flat and the segment-level correlations increase as the number of sentences per paragraph increases. BlonDE is not included because it only supports English.

Figure 10: The system- and segment-level correlation results when using Pearson correlation follow very similar trends to those that use pairwise accuracy. The segment-level Pearson uses the "no grouping" variant from Deutsch et al. (2023) to avoid the NaN problem that occurs with the "group-by-item" variant, which was used in combination with pairwise accuracy in the main body of the paper.

Figure 11: The correlation between the two ways to apply a segment-level metric to paragraph-level data, directly scoring the paragraph or averaging the k segment scores, on the en-ru and zh-en WMT'22 MQM datasets.
The exact counts are given in Table 2 for DA and Table 3 for MQM. Figure 8 visualizes the distribution of the lengths of the hypotheses in the paragraph-level datasets based on mT5 SPM tokens. Then, Table 4 contains the number of paragraph examples that are too long to fit into the 1024 SPM maximum context length that is used by the metrics trained in this work.

Table 2: The number of paragraphs with the given number of sentences per paragraph from the direct assessment data from WMT'19 and WMT'20. Each paragraph is required to be a contiguous block of sentences that are rated by the same rater.

Table 4: The number (and percent) of paragraphs for which the number of SPM tokens in the reference and hypothesis combined is larger than the maximum input length allowed by our metric, 1024. If the input is too long, it is truncated.