Evaluating the Evaluation Metrics for Style Transfer: A Case Study in Multilingual Formality Transfer

While the field of style transfer (ST) has been growing rapidly, it has been hampered by a lack of standardized practices for automatic evaluation. In this paper, we evaluate leading automatic metrics on the oft-researched task of formality style transfer. Unlike previous evaluations, which focus solely on English, we expand our focus to Brazilian-Portuguese, French, and Italian, making this work the first multilingual evaluation of metrics in ST. We outline best practices for automatic evaluation in (formality) style transfer and identify several models that correlate well with human judgments and are robust across languages. We hope that this work will help accelerate development in ST, where human evaluation is often challenging to collect.


Introduction
Textual style transfer (ST) is defined as a generation task where a text sequence is paraphrased while controlling one aspect of its style (Jin et al., 2020). For instance, the informal sentence in Italian "in bocca al lupo!" (i.e., "good luck") is rewritten to the formal version "Ti rivolgo un sincero augurio!" (i.e., "I send you a sincere wish!"). Despite the growing attention on ST in the NLP literature (Jin et al., 2020), progress is hampered by a lack of standardized and reliable automatic evaluation metrics. Standardizing the latter would allow for quicker development of new methods and comparison to prior art without relying on the time- and cost-intensive human evaluation that is currently employed by more than 70% of ST papers (Briakou et al., 2021a). ST is usually evaluated across three dimensions: style transfer (i.e., has the style of the generated output changed as intended?), meaning preservation (i.e., are the semantics of the input preserved?), and fluency (i.e., is the output well-formed?). As we will see, a wide range of automatic evaluation metrics and models has been used to quantify each of these dimensions. For example, prior work has employed as many as nine different automatic systems to rate formality alone (see Table 1). However, it is not clear how different automatic metrics compare to each other and how well they agree with human judgments. Furthermore, previous studies of automatic evaluation have exclusively focused on the English language (Yamshchikov et al., 2021; Pang, 2019; Pang and Gimpel, 2019; Tikhonov et al., 2019; Mir et al., 2019); yet, ST requires evaluation methods that generalize reliably beyond English.
We address these limitations by conducting a controlled empirical comparison of commonly used automatic evaluation metrics. Concretely, for all three evaluation dimensions, we compile a list of different automatic evaluation approaches used in prior ST work and study how well they correlate with human judgments. We choose to build on available resources as collecting human judgments across the evaluation dimensions is a costly process that requires recruiting fluent speakers in each language addressed in evaluation. While there are many stylistic transformations in ST, we conduct our study through the lens of formality style transfer (FoST), which is one of the most popular style dimensions considered by past ST work (Jin et al., 2020;Briakou et al., 2021a) and for which reference outputs and human judgments are available for four languages: English, Brazilian-Portuguese, French, and Italian.
• We contribute a meta-evaluation study that is not only the first large-scale comparison of automatic metrics for ST but is also the first work to investigate the robustness of these metrics in multilingual settings.
• We show that automatic evaluation approaches based on a formality regression model fine-tuned on XLM-R and the chrF metric correlate well with human judgments for style transfer and meaning preservation, respectively, and we propose that the field adopt them. These metrics are shown to work well across languages, not just in English.
• We show that framing style transfer evaluation as a binary classification task is problematic and propose that the field treat it as a regression task to better mirror human evaluation.
• Our analysis code and meta-evaluation files with system outputs are made public to facilitate further work in developing better automatic metrics for ST: https://github.com/Elbria/xformal-FoST-meta.

Limitations of Automatic Evaluation
Recent work highlights the need for research to improve evaluation practices for ST along multiple directions. Not only does ST lack standardized evaluation practices (Yamshchikov et al., 2021), but commonly used methods have major drawbacks which hamper progress in this field. Pang (2019) and Pang and Gimpel (2019) show that the most widely adopted automatic metric, BLEU, can be gamed. They observe that untransferred text achieves the highest BLEU score for the task of sentiment transfer, questioning complex models' ability to surpass this trivial baseline. Mir et al. (2019) discuss the inherent trade-off between ST evaluation aspects and propose that models are evaluated at specific points of their trade-off plots. Tikhonov et al. (2019) argue that, despite their cost, human-written references are needed for future experiments with style transfer. They also show that comparing models without reporting error margins can lead to incorrect conclusions as state-of-the-art models sometimes end up within error margins from one another.

Structured Review of ST Evaluation
We systematically review automatic evaluation practices in ST with formality as a case study. We select FoST for this work since it is one of the most frequently studied styles (Jin et al., 2020) and there is human-annotated data, including human references, available for these evaluations (Rao and Tetreault, 2018; Briakou et al., 2021b). Tables 1 and 2 summarize evaluation details for all FoST methods in papers from the ST survey by Jin et al. (2020). Most works employ automatic evaluation for style (87%) and meaning preservation (83%).
Fluency is the least frequently evaluated dimension (43%), while 74% of papers employ automatic metrics to assess the overall quality of system outputs, capturing all desirable aspects at once. Across dimensions, papers also frequently rely on human evaluation (55%, 58%, 60%, and 40% for style, meaning, fluency, and overall, respectively). However, human judgments and automatic metrics do not always agree on the best-performing system: in 60% of evaluations, the top-ranked system is the same according to human and automatic evaluation, while their rankings disagree in the remaining 40% (both cases are marked in Table 1). When there is a disagreement, human evaluation is trusted more and viewed as the standard. This highlights the need for a systematic evaluation of automatic evaluation metrics.
Finally, almost all papers (91%) consider FoST for English (EN), as summarized in Table 2. There are only two exceptions: Korotkova et al. (2019) study FoST for Latvian (LV) and Estonian (ET) in addition to EN, while Briakou et al. (2021b) study FoST for 3 Romance languages: Brazilian Portuguese (BR-PT), French (FR), and Italian (IT). The former provides system output samples as a means of evaluation, and the latter employs human evaluations, highlighting the challenges of automatic evaluation in multilingual settings.
Next, we review the automatic metrics used for each dimension of evaluation in FoST papers. As we will see, a wide range of approaches is used. Yet, it remains unclear how they compare to each other, what their respective strengths and weaknesses are, and how they might generalize to languages other than English.

Automatic Metrics for FoST
Formality Style transfer is often evaluated using model-based approaches. The most frequent method consists of training a binary classifier on human-written formal vs. informal pairs. The classifier is then used to predict the percentage of generated outputs that match the desired attribute for each evaluated system; the system with the highest percentage is considered the best performing with respect to style. Across methods, the corpus used to train the classifier is the GYAFC parallel corpus (Rao and Tetreault, 2018), consisting of 105K parallel informal-formal human-generated excerpts. This corpus is curated for FoST in EN.

Meaning Preservation Evaluation of this dimension is performed using a wider spectrum of approaches, as presented in the third column of Table 1. The most frequently used metric is reference-BLEU (r-BLEU), which is based on the n-gram precision of the system output compared to human rewrites of the desired formality. Other approaches include self-BLEU (s-BLEU), where the system output is compared to its input, measuring the semantic similarity between the system input and its output, or regression models (e.g., CNN, BERT) trained on data annotated for similarity-based tasks, such as the Semantic Textual Similarity (STS) task (Agirre et al., 2016).
Fluency Fluency is typically evaluated with model-based approaches (see the fourth column of Table 1). Among those, the most frequent method is computing perplexity (PPL) under a language model, which is either trained from scratch on the same corpus used to train the FoST models (i.e., GYAFC) with different underlying architectures (e.g., KenLM, LSTM) or taken to be a large pre-trained language model (e.g., GPT). A few other works instead train models on EN data annotated for grammaticality (Heilman et al., 2014) or linguistic acceptability (Warstadt et al., 2019).
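To make the perplexity-based protocol concrete, the snippet below scores a sentence with a KenLM n-gram language model, reading lower perplexity as higher fluency. It is only an illustrative sketch rather than the exact setup of any surveyed paper, and the model file name is hypothetical.

```python
# Sketch: fluency as perplexity under an n-gram language model (lower = more fluent).
# Requires the `kenlm` Python bindings and a pre-trained ARPA/binary model file.
import kenlm

lm = kenlm.Model("formality_lm.binary")  # hypothetical model trained on, e.g., GYAFC

def fluency_perplexity(sentence: str) -> float:
    """Return the language model perplexity of a single sentence."""
    return lm.perplexity(sentence)

print(fluency_perplexity("I do not know what you are talking about."))
```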
Overall Systems' overall quality (see the fifth column of Table 1) is mostly evaluated using r-BLEU or by combining independently computed metrics into a single score (e.g., geometric mean GM(.), harmonic mean HM(.), F1(.)). Moreover, 6 out of 8 approaches that rely on combined scores do not include fluency scores in their overall evaluation.
English Focus Since most of the current work on FoST and ST is in EN, prior work relies heavily on EN resources for designing automatic evaluation methods. For instance, resources for training stylistic classifiers or regression models are not available for other languages. For the same reason, it is unclear whether model-based approaches for measuring meaning preservation and fluency can be ported to multilingual settings. Furthermore, reference-based evaluations (e.g., r-BLEU) require human rewrites that are only available for EN, BR-PT, IT, and FR. Finally, even though perplexity does not rely on annotated data, without standardizing the data language models are trained on, we cannot make meaningful cross-system comparisons.

Summary
Reviewing the literature reveals the lack of standardized metrics for ST evaluation, which hampers comparisons across papers; the lack of agreement between human judgments and automatic metrics, which hampers system development; and the lack of portability to languages other than English, which severely limits the impact of the work. These issues motivate the controlled multilingual evaluation of evaluation metrics in our paper.

Evaluating Evaluation Metrics
We evaluate evaluation metrics (described in §3.2) for multilingual FoST, in four languages for which human evaluation judgments (described in §3.1) on FoST system outputs are available.

Human Judgments
We use human judgments collected in the prior work of Rao and Tetreault (2018) for EN and Briakou et al. (2021b) for BR-PT, FR, and IT. We include details on their annotation frameworks, the quality of the human judges, and the evaluated systems below.
Human Annotations We briefly describe the annotation frameworks employed by Rao and Tetreault (2018) and Briakou et al. (2021b) to collect human judgments for each evaluation aspect: 1. formality ratings are collected, for each system output, on a 7-point discrete scale ranging from −3 to +3, as per Lahiri (2015) (Very Informal, Informal, Somewhat Informal, Neutral, Somewhat Formal, Formal, Very Formal); 2. meaning preservation judgments adopt the Semantic Textual Similarity annotation scheme of Agirre et al. (2016), where an informal input and its corresponding formal system output are rated on a scale from 1 to 6 based on their similarity (Completely dissimilar, Not equivalent but on same topic, Not equivalent but share some details, Roughly equivalent, Mostly equivalent, Completely equivalent); 3. fluency judgments are collected for each system output on a discrete scale of 1 to 5, as per Heilman et al. (2014) (Other, Incomprehensible, Somewhat Comprehensible, Comprehensible, Perfect); 4. overall judgments are collected following a relative ranking approach: all system outputs are ranked in order of their formality, taking into account both meaning preservation and fluency.
Human Annotators Both studies recruited workers from the Amazon Mechanical Turk platform after employing quality control methods to exclude poor-quality workers (i.e., manual checks for EN, and qualification tests for BR-PT, FR, and IT). For all human evaluations and languages, Briakou et al. (2021b) report at least moderate inter-annotator agreement.
Evaluated Systems The evaluated system outputs were sampled from 5 FoST models for each language, spanning a range from simple baselines to neural architectures (Rao and Tetreault, 2018; Briakou et al., 2021b). We include detailed descriptions of them in Appendix C. For each evaluation dimension, 500 outputs are evaluated for EN and 100 outputs per system for BR-PT, FR, and IT.

Evaluation Metrics
For the FoST evaluation aspects described below, we cover a broad spectrum of approaches that range from dedicated models for the tasks at hand to more lightweight methods relying on unsupervised approaches and automated metrics.
Formality We benchmark model-based approaches that fine-tune multilingual pre-trained language models (i.e., XLM-R, mBERT), where the task of formality detection is modeled either as a binary classification task (i.e., formal vs. informal), or as a regression task that predicts different formality levels on an ordinal scale.
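As an illustration of the two formulations, the sketch below loads XLM-R with either a two-label classification head or a single-output regression head via HuggingFace Transformers. It is a minimal setup sketch under standard library defaults, not the exact training code used in the papers we survey.

```python
# Sketch: formality detection as binary classification vs. regression on XLM-R.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "xlm-roberta-base"  # mBERT alternative: "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Binary classification: predict formal vs. informal.
classifier = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Regression: with a single output, the head predicts a continuous formality score
# (e.g., the -3..+3 scale used in the human annotations) and is trained with MSE loss.
regressor = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
```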

Meaning Preservation
We evaluate the BLEU score (Papineni et al., 2002) of the system output compared to the reference rewrite (r-BLEU), since it is the dominant metric in prior work. Prior reviews of meaning preservation metrics for paraphrase and sentiment ST tasks in EN (Yamshchikov et al., 2021) cover n-gram metrics and embedding-based approaches. We consider three additional metric classes that compare system outputs with inputs, as human annotators do: 1. n-gram based metrics, sketched in the example below, include s-BLEU (self-BLEU, which compares system outputs with their inputs as opposed to references, i.e., r-BLEU), METEOR (Banerjee and Lavie, 2005), based on the harmonic mean of unigram precision and recall while accounting for synonym matches, and chrF (Popović, 2015), based on the character n-gram F-score; 2. embedding-based methods fall under the category of unsupervised evaluation approaches that rely on either contextual word representations extracted from pre-trained language models or non-contextual pre-trained word embeddings (e.g., word2vec (Mikolov et al., 2013); GloVe (Pennington et al., 2014)). For the former, we use BERT-score (Zhang et al., 2020a), which computes the similarity between each output token and each reference token based on BERT contextual embeddings. For the latter, we experiment with two similarity metrics: the first is the cosine distance between the sentence-level feature representations of the compared texts, extracted by averaging their word embeddings; the second is the Word Mover's Distance (WMD) metric of Kusner et al. (2015), which measures the dissimilarity between two texts as the minimum amount of distance that the embedded words of one text need to "travel" to reach the word embeddings of the other; 3. semantic textual similarity (STS) models constitute supervised methods that we implement by fine-tuning multilingual pre-trained language models (i.e., XLM-R, mBERT) to predict a semantic similarity score for a pair of texts on an ordinal scale.
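For illustration, the snippet below computes the self (input-based) and reference-based variants of BLEU and chrF with sacrebleu. The sentences are toy examples, and this is only a sketch of how such scores can be obtained, not the exact evaluation pipeline used here.

```python
# Sketch: n-gram meaning preservation metrics with sacrebleu (toy sentences).
import sacrebleu

src = "i dunno what ur talking about"                 # informal input
hyp = "I do not know what you are talking about."     # system output
ref = "I do not know what you are talking about."     # human formal rewrite

# self variants: compare the output against its input (no references needed)
s_bleu = sacrebleu.sentence_bleu(hyp, [src]).score
s_chrf = sacrebleu.sentence_chrf(hyp, [src]).score

# reference variants: compare the output against a human rewrite
r_bleu = sacrebleu.sentence_bleu(hyp, [ref]).score
r_chrf = sacrebleu.sentence_chrf(hyp, [ref]).score

print(f"s-BLEU={s_bleu:.1f}  s-chrF={s_chrf:.1f}  r-BLEU={r_bleu:.1f}  r-chrF={r_chrf:.1f}")
```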
Fluency We experiment with perplexity (PPL) and likelihood (LL) scores based on the probability scores of language models trained from scratch (e.g., KenLM (Heafield, 2011)), as well as pseudo-likelihood scores (PSEUDO-LL) extracted from pre-trained masked language models, similarly to Salazar et al. (2020), by masking sentence tokens one by one.
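A rough sketch of this scoring scheme is shown below: each token is masked in turn and the masked language model's log-probability of the original token is accumulated, in the spirit of Salazar et al. (2020). It is an unoptimized illustration (one forward pass per token), not the exact implementation used in our experiments.

```python
# Sketch: pseudo-log-likelihood of a sentence under a masked LM (XLM-R).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base").eval()

def pseudo_log_likelihood(sentence: str) -> float:
    ids = tok(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):          # skip the <s> and </s> special tokens
        masked = ids.clone()
        masked[i] = tok.mask_token_id          # mask the i-th token
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, i]
        # add the log-probability of the original token at the masked position
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

print(pseudo_log_likelihood("Ti rivolgo un sincero augurio!"))
```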

Experiment Settings
Supervised Metrics For all supervised model-based approaches, we experiment with fine-tuning two multilingual pre-trained language models: 1. multilingual BERT, dubbed mBERT (Devlin et al., 2019), a transformer-based model pre-trained with a masked language model objective on the concatenation of monolingual Wikipedia corpora from the 104 languages with the largest Wikipedias; 2. XLM-R (Conneau et al., 2020), a transformer-based masked language model trained on 100 languages using monolingual CommonCrawl data. All models are based on the HuggingFace Transformers library (Wolf et al., 2020). We fine-tune with the Adam optimizer (Kingma and Ba, 2015) and a batch size of 32; based on a grid search on held-out validation sets over learning rates (2e−3, 2e−4, 2e−5, and 5e−5) and numbers of epochs (3, 5, and 8), we use a learning rate of 5e−5 and train for 3 epochs for classification and 5 epochs for regression.
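The configuration below mirrors the reported fine-tuning settings (batch size 32, learning rate 5e-5, 3 or 5 epochs). It is only a sketch: the output path is hypothetical, and wiring the arguments to a Trainer with the actual GYAFC/XFORMAL data is left as a comment.

```python
# Sketch: fine-tuning configuration mirroring the settings reported above.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlmr-formality",       # hypothetical output path
    per_device_train_batch_size=32,
    learning_rate=5e-5,                # selected from {2e-3, 2e-4, 2e-5, 5e-5}
    num_train_epochs=5,                # 3 for classification, 5 for regression
)
# Trainer(model=regressor, args=training_args,
#         train_dataset=train_data, eval_dataset=dev_data).train()  # hypothetical datasets
```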
Cross-lingual Transfer For supervised model-based methods that rely on the availability of human-annotated instances to train dedicated models for specific tasks, we experiment with three standard cross-lingual transfer approaches (e.g., Hu et al. (2020)): ZERO-SHOT, where a model fine-tuned on EN annotations is applied directly to target-language inputs; TRANSLATE-TRAIN, where the EN training data are machine translated into the target language before fine-tuning; and TRANSLATE-TEST, where target-language test inputs are machine translated into EN and scored with the EN model (a brief schematic follows the training data details below).

Training Data Table 3 presents statistics on the training data used for supervised and unsupervised models across the 3 ST evaluation aspects. For datasets that are only available for EN, we use the already available machine-translated resources for the STS and formality datasets (Briakou et al., 2021b). The former employs the DeepL service (no information on translation quality is available), while the latter uses the AWS translation service (with reported BLEU scores of 37.16 (BR-PT), 33.79 (FR), and 32.67 (IT)). The KenLM models for all languages are trained on 1M randomly sampled sentences from the OpenSubtitles dataset (Lison and Tiedemann, 2016).
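The schematic below spells out how the three transfer settings route training and test data. It is purely illustrative: `fit`, `predict`, and `mt` are placeholder functions standing in for model fine-tuning, metric scoring, and machine translation, respectively.

```python
# Schematic of the three cross-lingual transfer settings (placeholder functions).
def zero_shot(train_en, test_xx, fit, predict):
    model = fit(train_en)                        # fine-tune on English annotations only
    return predict(model, test_xx)               # score target-language inputs directly

def translate_train(train_en, test_xx, fit, predict, mt):
    model = fit(mt(train_en, target_lang="xx"))  # machine-translate the training data
    return predict(model, test_xx)

def translate_test(train_en, test_xx, fit, predict, mt):
    model = fit(train_en)
    return predict(model, mt(test_xx, target_lang="en"))  # translate test inputs to English
```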

Experimental Results
We analyze the results of comparing the outputs of the several automatic metrics to their human-generated counterparts for formality style transfer (§5.1), meaning preservation (§5.2), and fluency (§5.3) by conducting segment-level analyses, and then turn to analyzing system-level rankings to evaluate overall task success (§5.4).

Formality Transfer Metrics
The field is divided on the best way to evaluate the style dimension, formality in our case. Practitioners use either a binary approach (is the new sentence formal or informal?) or a regression approach (how formal is the new sentence?). We discuss the first approach and its limitations in §5.1.1, before moving to regression in §5.1.2.

Evaluating Binary Classifiers
As discussed in §2, the vast majority of FoST works evaluate style transfer based on the accuracy of a binary classifier trained to predict whether human-written segments are formal or informal. Yet, as Table 1 indicates, this approach fails to identify the best system in this dimension 59% of the time.
To better understand this issue, we evaluate these classifiers on human-written texts versus ST system outputs.

Human-Written Texts Table 4 presents F1 scores when testing the binary formality classifiers on the task they are trained on: predicting whether human-written sentences from GYAFC and XFORMAL are formal or informal. First, the last column (i.e., δ(XLM-R, mBERT)) shows that XLM-R is a better model than mBERT for this task across languages, with the largest improvements in the ZERO-SHOT setting, where XLM-R beats mBERT by +3, +2, and +1 F1 for BR-PT, FR, and IT, respectively. Second, ZERO-SHOT is surprisingly the best strategy to port EN models to other languages. TRANSLATE-TRAIN and TRANSLATE-TEST hurt F1 by 3 and 9 points on average compared to ZERO-SHOT, despite exploiting more resources in the form of machine translation systems and their training data. However, transfer accuracy is likely affected by regular translation errors (as suggested by larger F1 drops for languages with lower MT BLEU scores) and by formality-specific errors. Machine translation has been found to produce outputs that are more formal than its inputs (Briakou et al., 2021b), which yields noisy training signals for TRANSLATE-TRAIN and alters the formality of test samples for TRANSLATE-TEST.

Table 4: F1 scores of binary formality classifiers under different cross-lingual transfer settings. Numbers in parentheses indicate performance drops over ZERO-SHOT. ZERO-SHOT yields the highest scores across languages and pre-trained language models. XLM-R yields improvements over mBERT across most settings (δ(XLM-R, mBERT)).

System Outputs We now evaluate the best-performing binary classifier (i.e., XLM-R in the ZERO-SHOT setting) on real system outputs, a setup in line with automatic evaluation frameworks. Figure 1 presents a breakdown of the number of formal vs. informal predictions of the classifier binned by human-rated formality levels. Across languages, the performance of the classifier deteriorates as we move away from extreme formality ratings (i.e., very informal (−3) and very formal (+3)). This lack of sensitivity to different formality levels is problematic since system outputs across languages are concentrated around neutral formality values. In addition, when testing on BR-PT, FR, and IT (ZERO-SHOT settings), the classifier is more biased towards the formal class, which leads one to question its ability to correctly evaluate more formal outputs in multilingual settings. Taken together, these results suggest that validating the classifiers against human rewrites rather than system outputs is unrealistic and potentially misleading.

Evaluating Regression Models

Table 5 presents Spearman's ρ correlation of regression models' predictions with human judgments. Again, XLM-R with ZERO-SHOT transfer yields the highest correlation across languages. More specifically, the trends across different transfer approaches and different pre-trained language models are similar to the ones observed in the evaluation of binary classifiers: XLM-R outperforms mBERT for almost all settings, while ZERO-SHOT is the most successful transfer approach, followed by TRANSLATE-TRAIN, with TRANSLATE-TEST yielding the lowest correlations across languages. Interestingly, regression models highlight the differences between the generalization abilities of XLM-R and mBERT more clearly than the previous analysis on binary predictions: ZERO-SHOT transfer on XLM-R yields 8%, 8%, and 10% higher correlations than mBERT for BR-PT, FR, and IT, while both models yield similar correlations for EN.

Table 5: Spearman's ρ correlation (%) of formality regression models. Numbers in parentheses indicate performance drops over ZERO-SHOT. ZERO-SHOT yields the highest scores across languages and pre-trained language models. XLM-R yields improvements over mBERT across most settings (δ(XLM-R, mBERT)).

Meaning Preservation Metrics

Among the metrics for this dimension, chrF correlates best with human judgments across all four languages. The STS model fine-tuned on XLM-R with ZERO-SHOT transfer is a close second to chrF, consistent with this model's top-ranking behavior as a formality transfer metric. However, chrF outperforms the remaining more complex and expensive metrics, including BERT-score and mBERT models. In contrast to Yamshchikov et al. (2021), embedding-based methods (i.e., cosine, WMD) show no advantage over n-gram metrics, perhaps due to differences in word embedding quality across languages. Finally, it should be noted that r-BLEU is the worst performing metric across languages, and its correlation with human scores is particularly poor for languages other than English. This is remarkable because it has been used in 75% of automatic evaluations for FoST meaning preservation (as seen in Table 1).
We, therefore, recommend discontinuing its use.

Fluency Metrics

Table 7 presents Spearman's ρ correlation of various fluency metrics with human judgments. Pseudo-likelihood (PSEUDO-LL) scores obtained from XLM-R correlate best with human fluency ratings across languages. Their correlations are strong across languages, while other methods yield only weak (i.e., KenLM, mBERT) to moderate (i.e., KenLM-PPL) correlations for IT. We, therefore, recommend evaluating fluency using pseudo-likelihood scores derived from XLM-R to help standardize fluency evaluation across languages.

System-level Rankings
Finally, we turn to predicting the overall ranking of systems by focusing on how many pairwise system comparisons each metric gets correct. For each language, there are 5 systems, which means there are 10 pairwise comparisons, for a total of 40 given the 4 languages. We analyze corpus-level r-BLEU, commonly used for this dimension, along with the leading metrics from the other dimensions: XLM-R formality regression models, chrF, and XLM-R pseudo-likelihood. r-BLEU gets 30 out of 40 comparisons correct, while the other metrics get 25, 22, and 19, respectively. This indicates that r-BLEU correlates with human judgments better at the corpus level than at the sentence level, as in machine translation evaluation (Mathur et al., 2020). We caution that these results are not definitive but rather suggestive of the best performing metric, since the ideal evaluation would require a larger number of systems with which to compute a rank correlation. The complete analysis for each language is in Appendix B.
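The pairwise-comparison count used here can be made concrete with the short sketch below. The metric and human scores are made-up numbers; the function simply checks whether each pair of systems is ordered the same way by the metric and by the human judgments.

```python
# Sketch: counting correct pairwise system comparisons for one language.
from itertools import combinations

def correct_pairs(metric_scores, human_scores):
    """Count system pairs ranked in the same order by the metric and by humans."""
    correct = 0
    for a, b in combinations(range(len(metric_scores)), 2):
        metric_diff = metric_scores[a] - metric_scores[b]
        human_diff = human_scores[a] - human_scores[b]
        if metric_diff * human_diff > 0:   # same sign = same ordering
            correct += 1
    return correct

# 5 systems -> 10 pairwise comparisons per language (illustrative numbers)
metric = [32.1, 28.4, 35.0, 30.2, 27.8]   # e.g., corpus-level r-BLEU per system
human  = [3.1, 2.4, 3.4, 2.9, 2.6]        # e.g., mean overall human ranking per system
print(correct_pairs(metric, human), "of 10 comparisons correct")
```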

Conclusions
Automatic (and human) evaluation processes are well-known problems for the field of Natural Language Generation (Howcroft et al., 2020; Clinciu et al., 2021), and the burgeoning subfield of ST is not immune. ST, in particular, has suffered from a lack of standardization of automatic metrics, a lack of agreement between human judgments and automatic metrics, as well as a blind spot in developing metrics for languages other than English. We address these issues by conducting the first controlled multilingual evaluation of leading ST metrics with a focus on formality, covering metrics for 3 evaluation dimensions and overall ranking for 4 languages. Given our findings, we recommend the formality style transfer community adopt the following best practices: 1. Formality XLM-R formality regression models in the ZERO-SHOT cross-lingual transfer setting are clearly the best metric across all four languages, as they correlate very well with human judgments. However, the commonly used binary classifiers do not generalize across languages (they misleadingly over-predict formal labels). We propose that the field use regression models instead, since they are designed to capture a wide spectrum of formality levels.

2. Meaning Preservation We recommend using chrF, as it exhibits strong correlations with human judgments for all four languages. We caution against using BLEU for this dimension, despite its overwhelming use in prior work, as neither its reference nor its self variant correlates as strongly as other, more recent metrics.
3. Fluency XLM-R is again the best metric (in particular for French). However, it does not correlate as well with human judgments as the best metrics for the other two dimensions do.
4. System-level Ranking chrF and XLM-R are the best metrics under a pairwise comparison evaluation. However, an ideal evaluation would require a larger number of systems with which to draw reliable correlations.

5. Cross-lingual Transfer Our results support using ZERO-SHOT transfer instead of machine translation to port metrics from English to other languages for formality transfer tasks.
We view this work as a strong point of departure for future investigations of ST evaluation. Our work first calls for exploring how these evaluation metrics generalize to other styles and languages. Across the different ways of defining style evaluation (either automatic or human), prior work has mostly focused on the three main dimensions covered in our study. As a result, although our meta-evaluation of ST metrics focuses on formality as a case study, it can inform the evaluation of other style definitions (e.g., politeness, sentiment, gender, etc.). However, more empirical evidence is needed to test the applicability of the best performing metrics for evaluating style transfer beyond formality. Our work suggests that the top metrics based on XLM-R and chrF are robust across the 4 languages in our study; yet, our conclusions and recommendations are currently limited to this set of languages. We hope that future work in multilingual style transfer will allow for testing their generalization to a broader spectrum of languages and style definitions. Furthermore, our study highlights that more research is needed on automatically ranking systems. For example, one could combine the outputs of the metrics for the three dimensions into a single score, or develop a single overall metric. In line with Briakou et al. (2021a), our study also calls for releasing more human evaluations and more system outputs to enable robust evaluation. Finally, there is still room for improvement in assessing how fluent a rewrite is. Our study provides a framework to address these questions systematically and calls for ST papers to standardize and release data to support larger-scale evaluations.

A Cross-metric Correlation Analysis
Correlations across meaning preservation metrics Figure 3 presents a cross-metric correlation-based analysis of the different approaches for measuring meaning preservation. We observe consistent trends across languages: methods that are similar in nature correlate well with each other. Concretely, across settings, n-gram based methods (i.e., BLEU, METEOR, and chrF) yield 0.8−0.9 correlation scores. The same holds when looking at correlations within the group of embedding-based methods (cosine and WMD) and within the group of STS approaches for EN, FR, and IT, while for BR-PT we observe that the correlation between XLM-R and mBERT based approaches is smaller (0.7 vs. 0.8 for other languages). Finally, n-gram approaches correlate better with STS methods (with correlations in the range of 0.7−0.8) across languages, while the lowest correlations (0.5−0.6) are observed between the embedding-based methods (i.e., cosine, WMD) and each of the remaining metrics.
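As an illustration of how such cross-metric correlations can be computed, the snippet below calculates Spearman's ρ between the segment-level scores of two metrics; the score lists are placeholders rather than actual values from our experiments.

```python
# Sketch: Spearman correlation between two metrics' segment-level scores.
from scipy.stats import spearmanr

chrf_scores   = [48.2, 61.0, 35.7, 70.3, 52.4]   # hypothetical segment-level chrF scores
meteor_scores = [0.41, 0.55, 0.30, 0.66, 0.47]   # hypothetical METEOR scores, same segments

rho, pval = spearmanr(chrf_scores, meteor_scores)
print(f"Spearman rho = {rho:.2f} (p = {pval:.3f})")
```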
Correlations within and across formality and fluency metrics Figure 4 presents results of cross-metric correlations for the studied approaches that capture formality transfer and fluency. For formality, each of the translation-based settings (i.e., TRANSLATE-TRAIN and TRANSLATE-TEST) yields high correlations (0.8−0.9) between models that fine-tune XLM-R vs. mBERT, while their correlations decrease (0.7) for IT and BR-PT in the ZERO-SHOT setting. Finally, the pseudo-perplexity metrics extracted from XLM-R, which constitute the metric best correlated with human judgments for fluency, yield positive correlations with all formality metrics.

B System-level Ranking Analysis

Table 8 presents the number of correct system-level pairwise comparisons of automatic metrics based on human judgments. For STS, chrF, F.REG*, F.CLASS*, and PSEUDO-LKL*, system-level scores are extracted by averaging sentence-level scores. For s-BLEU and r-BLEU, the system scores are extracted at the corpus level. The total number of pairwise comparisons for each language is 10 (given access to 5 systems). Among the meaning preservation metrics (i.e., STS, s-BLEU, and chrF), chrF yields the highest number of correct comparisons (i.e., 37 out of 40 across all languages). The formality regression models (i.e., F.REG*) result in correct rankings more frequently than the formality classifiers (i.e., F.CLASS*), yielding 35 out of 40 correct comparisons. Reference-BLEU (i.e., r-BLEU) is compared with the overall ranking judgments; it gets 8 out of 10 pairwise comparisons correct for EN, FR, and BR-PT and only 6 for IT. Finally, perplexity (i.e., PPL) results in the fewest correct rankings at the system level (i.e., 22 out of 40), despite correlating well with human judgments at the segment level. Additionally, in Figure 2 we visualize the differences between relative rankings induced by human judgments and the best segment-level correlated metrics for each dimension, averaged per system.

C Evaluated Systems Details
For each of BR-PT, IT, and FR, outputs are sampled from: 1. Rule-based systems consisting of handcrafted transformations (e.g., fixing casing, normalizing punctuation, expanding contractions, etc.); 2. Round-trip translation models that pivot to EN and backtranslate to the original language; 3. Bi-directional neural machine translation (NMT) models that employ side constraints to perform style transfer in both directions of formality (i.e., informal↔formal), trained on (machine) translated informal-formal pairs of an English parallel corpus (i.e., GYAFC); 4. Bi-directional NMT models that augment the training data of 3. via backtranslation of informal sentences; 5. A multi-task variant of 3. that augments the training data with parallel sentences from bilingual resources (i.e., OpenSubtitles) and learns to translate jointly between and across languages.
For EN, the outputs were sampled from: 1. A rule-based system of similar transformations to the ones for BR-PT, FR, and IT; 2. A phrase-based machine translation model trained on informal-formal pairs of GYAFC; 3. An NMT model trained on GYAFC to perform style transfer uni-directionally; 4. A variant of 3. that incorporates a copy-enriched mechanism enabling direct copying of words from the input; 5. A variant of 4. trained on additional backtranslated data of target-style sentences using 2.

Figure 2: Difference in relative ranking between human judgments and automatic metrics across systems (represented by different markers) for different evaluation dimensions. STS, s-BLEU, and chrF are compared with meaning rankings, r-BLEU (reference-BLEU) with overall, XLM-R classifiers (*F.CLASS) and regression (*F.REG) models with formality, and XLM-R pseudo-perplexity (*PPL) with fluency.
In general, neural models performed best for all languages according to overall human judgments, while the simpler baselines perform closer to the more advanced neural models for BR-PT, FR, and IT. For each evaluation dimension 500 outputs are evaluated for EN and 100 outputs per system for BR-PT, FR, and IT.

D Meaning Preservation Metrics (reference-based)

Table 9 presents supplemental results on meaning preservation metrics in reference-based settings.

Table 9: Spearman's ρ correlation of meaning preservation metrics for reference-based meaning preservation. Model-based metrics marked with * use XLM-R, while those marked with ∼ use mBERT as the base pre-trained language model. F.REG refers to formality regression models, PPL to perplexity, and LL to likelihood.