Understanding Translationese in Cross-Lingual Summarization

Given a document in a source language, cross-lingual summarization (CLS) aims at generating a concise summary in a different target language. Unlike monolingual summarization (MS), naturally occurring source-language documents paired with target-language summaries are rare. To collect large-scale CLS data, existing datasets therefore typically involve translation in their creation. However, translated text is distinguishable from text originally written in that language, a phenomenon known as translationese. In this paper, we first confirm that different approaches to constructing CLS datasets lead to different degrees of translationese. We then systematically investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. Specifically, we find that (1) translationese in the documents or summaries of test sets can lead to a discrepancy between human judgment and automatic evaluation; (2) translationese in training sets harms model performance in real-world applications; and (3) although machine-translated documents involve translationese, they are very useful for building CLS systems for low-resource languages under specific training strategies. Lastly, we give suggestions for future CLS research, including dataset and model development. We hope our work draws researchers' attention to the phenomenon of translationese in CLS so that it can be taken into account in the future.


Introduction
Cross-lingual summarization (CLS) aims to generate a summary in a target language from a given document in a different source language. Against the background of globalization, this task helps people efficiently grasp the gist of foreign documents, and it has attracted wide research attention from the computational linguistics community (Leuski et al., 2003; Wan et al., 2010; Yao et al., 2015; Zhu et al., 2019; Ouyang et al., 2019; Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021; Liang et al., 2022).
As pointed out in previous literature (Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021; Wang et al., 2022b), one of the key challenges in CLS is data scarcity. In detail, naturally occurring documents in a source language paired with the corresponding summaries in a target language are rare (Perez-Beltrachini and Lapata, 2021), making it difficult to collect large-scale and high-quality CLS datasets. For example, it is costly and labor-intensive to employ bilingual annotators to create target-language summaries for given source-language documents (Chen et al., 2022). Generally, to alleviate data scarcity while controlling costs, the source documents or the target summaries in existing large-scale CLS datasets (Zhu et al., 2019; Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021; Bai et al., 2021; Wang et al., 2022a; Feng et al., 2022) are (automatically or manually) translated from other languages rather than being text originally written in that language (Section 2.1).
Distinguished from text originally written in a language, translated text in the same language might involve artifacts referred to as "translationese" (Gellerstam, 1986). These artifacts include the usage of simpler, more standardized and more explicit words and grammar (Baker et al., 1993; Scarpa, 2006) as well as lexical and word order choices that are influenced by the source language (Gellerstam, 1996; Toury, 2012). It has been observed that translationese in data can mislead model training since its special style deviates from native usage (Selinker, 1972; Volansky et al., 2013; Bizzoni et al., 2020; Yu et al., 2022). Nevertheless, translationese has been neglected by previous CLS work, leading to unknown impacts and potential risks.
Given that current large-scale CLS datasets are typically collected via human or machine translation, in this paper we investigate the effects of translationese when translations appear in target summaries (Section 3) or source documents (Section 4), respectively. We first confirm that different translation methods (i.e., human translation or machine translation) lead to different degrees of translationese. In detail, for CLS datasets whose source documents (or target summaries) are human-translated texts, we collect their corresponding machine-translated documents (or summaries). The collected documents (or summaries) contain the same semantics as the original ones, but are produced by a different translation method. Then, we utilize automatic metrics covering various aspects to quantify translationese, and show the different degrees of translationese between the original and collected documents (or summaries).
Second, we investigate how translationese affects CLS model evaluation and performance. To this end, we train and evaluate CLS models with the original and the collected data, respectively, and analyze model performance via both automatic and human evaluation. We find that (1) translationese in the documents or summaries of test sets might lead to a discrepancy between human judgment and automatic evaluation (i.e., ROUGE and BERTScore). Thus, the test sets of CLS datasets should carefully control their translationese and avoid directly adopting machine-translated documents or summaries. (2) Translationese in training sets can harm model performance in real-world applications where translationese should be avoided. For example, a CLS model trained with machine-translated documents or summaries shows limited ability to generate informative and fluent summaries. (3) Though it is sub-optimal to train a CLS model using only machine-translated documents as source documents, such documents are very useful for building CLS systems for low-resource languages under specific training strategies. Lastly, since translationese affects model evaluation and performance, we give suggestions for future CLS data and model development, especially for low-resource languages.

Contributions.
(1) To our knowledge, we are the first to investigate the influence of translationese on CLS. We confirm that different translation methods used in creating CLS datasets lead to different degrees of translationese. (2) We conduct systematic experiments to show the effects of translationese in source documents and target summaries, respectively. (3) Based on our findings, we discuss and give suggestions for future research.

Translations in CLS Datasets
To provide a deeper understanding of the translations in CLS datasets, we comprehensively review previous datasets and introduce the origin of their documents and summaries, respectively. Zhu et al. (2019) utilize a machine translation (MT) service to translate the summaries of two English monolingual summarization (MS) datasets (i.e., CNN/Dailymail (Nallapati et al., 2016) and MSMO (Zhu et al., 2018)) into Chinese. The translated Chinese summaries together with the original English documents form the En2ZhSum dataset. Later, Zh2EnSum (Zhu et al., 2019) and En2DeSum (Bai et al., 2021) were also constructed in this manner. The source documents of these CLS datasets are originally written in those languages (named natural text), while the target summaries are automatically translated from other languages (named MT text). Feng et al. (2022) utilize the Google MT service to translate both the documents and summaries of an English MS dataset (SAMSum (Gliwa et al., 2019)) into five other languages. The translated data together with the original data forms the MSAMSum dataset. Thus, MSAMSum contains six source languages as well as six target languages; only the English documents and summaries are natural text, while the others are MT text. Since the translations provided by MT services might contain flaws, the above datasets further use a round-trip translation strategy (§ 2.2) to filter out low-quality samples.
In addition to MT text, human-translated text (HT text) is also adopted in current CLS datasets. WikiLingua (Ladhak et al., 2020) collects document-summary pairs in 18 languages (including English) from WikiHow. In this dataset, only the English documents/summaries are natural text, while all those in other languages are translated from the corresponding English versions by WikiHow's human writers (Ladhak et al., 2020). XSAMSum and XMediaSum (Wang et al., 2022a) are constructed by manually translating the summaries of SAMSum (Gliwa et al., 2019) and MediaSum (Zhu et al., 2021), respectively; their target summaries are thus HT text.
XWikis (Perez-Beltrachini and Lapata, 2021) collects document-summary pairs in 4 languages (i.e., English, French, German and Czech) from Wikipedia. Each document-summary pair is extracted from a Wikipedia page. To align parallel pages (which are relevant to the same topic but in different languages), Wikipedia provides interlanguage links. When creating a new Wikipedia page, it is more convenient for Wikipedians to translate from one of its parallel pages (if any) than to edit from scratch, leading to a large amount of HT text in XWikis. Thus, XWikis is formed of both natural text and HT text. Note that though translations commonly appear in XWikis, we cannot distinguish which documents/summaries are natural text and which are HT text, because we are not provided with the translation relations among the parallel contents. For example, some documents in XWikis might be translated from their parallel documents, while others might be natural text serving as the origins from which their parallel documents were created.
Table 1 summarizes the origin of the source documents and target summaries in current datasets. We can conclude that, when performing CLS from a source language to a different target language, translated text (MT and HT text) is extremely common in these datasets, and it appears more commonly in target summaries than in source documents.

Round-Trip Translation
The round-trip translation (RTT) strategy is used to filter out low-quality CLS samples built with MT services. In detail, for a given text t that needs to be translated, this strategy first translates t into the target language, obtaining t̂, and then translates t̂ back into the original language, obtaining t′. Next, t̂ is considered a high-quality translation if the ROUGE scores (Lin, 2004) between t and t′ exceed a pre-defined threshold. Accordingly, a CLS sample is discarded if the translations in it are not high-quality.
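The RTT filter above can be sketched as follows. This is a minimal illustration, not the datasets' exact pipeline: the `translate` function stands in for a hypothetical MT service, and ROUGE-1 F1 is approximated with a simple unigram-overlap score rather than the full ROUGE implementation.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """Approximate ROUGE-1 F1 via unigram-overlap of whitespace tokens."""
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    if not ref or not cand:
        return 0.0
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rtt_filter(text, translate, src, tgt, threshold=0.6):
    """Keep `text` only if its round-trip translation scores above `threshold`.

    `translate(text, src_lang, tgt_lang)` is an assumed MT interface.
    """
    forward = translate(text, src, tgt)   # t -> t_hat (target language)
    back = translate(forward, tgt, src)   # t_hat -> t' (back to source)
    return rouge1_f1(text, back) >= threshold
```

With an identity "translator" the round trip is perfect and the sample is kept; with a translator that loses the content, the sample is discarded.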

Translationese Metrics
To quantify translationese, we follow Toral (2019) and adopt automatic metrics from three aspects, i.e., simplification, normalization and interference.

Simplification. Compared with natural text, translations tend to be simpler, e.g., using a lower number of unique words (Farrell, 2018) or content words (i.e., nouns, verbs, adjectives and adverbs) (Scarpa, 2006). The following metrics are adopted:
• Type-Token Ratio (TTR) evaluates lexical diversity (Templin, 1957), calculated by dividing the number of types (i.e., unique tokens) by the total number of tokens in the text.
• Vocabulary Size (VS) counts the total number of different words in the text.
• Lexical Density (LD) measures the information carried by the text as the ratio between the number of content words and the total number of words (Toral, 2019).

Normalization. The lexical choices in translated text tend to be normalized (Baker et al., 1993). We use entropy to measure this characteristic:
• Entropy of distinct n-grams (Ent-n) in the text.
• Entropy of content words (Ent-cw) in the text.

Interference. The structure of translated text tends to be similar to that of its source text (Gellerstam, 1996):
• Syntactic Variation (SV) is calculated as the normalized tree edit distance (Zhang and Shasha, 1989) between the constituency parse trees of the translated text and the source text.
• Part-of-Speech Variation (PSV) is computed as the normalized edit distance between the part-of-speech sequences of the translated text and the source text.
It is worth noting that, ideally, the less translationese the translations contain, the higher all the above metrics will be.

Translationese in Target Summaries
In this section, we investigate how translationese affects CLS evaluation and training when it appears in the target summaries. For CLS datasets whose source documents are natural text and whose target summaries are HT text, we collect a second set of summaries (in MT text) via Google MT. In this manner, each document is paired with two summaries that contain the same semantics, but one is HT text and the other is MT text, so the translationese in these two types of summaries can be quantified. Subsequently, we use the HT and MT summaries as references, respectively, to train CLS models and analyze the influence of translationese on model performance.

Experimental Setup
Dataset Selection.
First, we choose CLS datasets whose source documents are natural text and whose target summaries are HT text. Considering the diversity of languages, scales and domains, we decide on XSAMSum (En⇒Zh) and WikiLingua (En⇒Ru/Ar/Cs).

Summaries Collection.
The original Chinese (Zh) summaries in XSAMSum, as well as the Russian (Ru), Arabic (Ar) and Czech (Cs) summaries in WikiLingua, are HT text. Besides, XSAMSum and WikiLingua also provide the corresponding English summaries in natural text. Therefore, in addition to the original target summaries (in HT text), we can automatically translate the English summaries into the target languages via the Google MT service to collect a second set of summaries (in MT text).
The RTT strategy (cf. Section 2.2) is further adopted to remove low-quality translated summaries. As a result, the number of translated summaries is smaller than the number of original summaries. To ensure comparability in subsequent experiments, we also discard an original summary if the corresponding translated one is removed. Lastly, the remaining original and translated summaries, together with the source documents, form the final data we use.
Thanks to MSAMSum (Feng et al., 2022), which has already translated the English summaries of SAMSum into Chinese via the Google MT service, we directly reuse its translation results for XSAMSum. Since a CLS dataset might contain multiple source and target languages, we use "X⇒Y" to indicate that the source language and target language are X and Y, respectively; language nomenclature is based on ISO 639-1 codes.

Translationese Analysis
We analyze the translationese in the target summaries of the preprocessed datasets. As shown in Table 2, the scores (measured by the metrics described in Section 2.3) of the HT summaries are generally higher than those of the MT summaries, indicating that the HT summaries contain more diverse words and richer semantics, and that their sentence structures are less influenced by the source text (i.e., the English summaries). Thus, the degree of translationese in HT summaries is lower than that in MT summaries, which also verifies that different methods of collecting target-language summaries can lead to different degrees of translationese.

Translationese's Impact on Evaluation
For each dataset, we train two models with the same input documents but different target summaries. Specifically, one uses HT summaries as references (denoted as mBART-HT), while the other uses MT summaries (denoted as mBART-MT).
Table 3 gives the experimental results in terms of ROUGE-1/2/L (R1/R2/R-L) (Lin, 2004) and BERTScore (B-S) (Zhang et al., 2020). Note that there are two ground-truth summaries (HT and MT) in the test sets. Thus, for model performance on each dataset, we report two results, using the HT and MT summaries as references, respectively, to evaluate the CLS models. It is apparent that when using MT summaries as references, mBART-MT performs better than mBART-HT, but when using HT summaries as references, mBART-MT works worse. This is because a model performs better when the distributions of the training data and the test data are more consistent. Though straightforward, this finding indicates that if a CLS model achieves higher automatic scores on a test set whose summaries are MT text, it does not mean that the model will perform better in real applications, where translationese should be avoided.
To confirm the above point, we further conduct a human evaluation on the output summaries of mBART-HT and mBART-MT. Specifically, we randomly select 100 samples from the test set of XSAMSum, and employ five graduate students as evaluators to score the generated summaries of mBART-HT and mBART-MT, as well as the ground-truth HT summaries, in terms of informativeness, fluency and overall quality on a 3-point scale. During scoring, the evaluators are not told the source of each summary. More details about the human evaluation are given in Appendix C. Table 4 shows the results. The Fleiss' Kappa scores (Fleiss, 1971) for informativeness, fluency and overall quality are 0.46, 0.37 and 0.52, respectively, indicating good inter-annotator agreement. mBART-HT outperforms mBART-MT on all metrics, and thus the human judgment is in line with the automatic metrics when adopting HT summaries (rather than MT summaries) as references. Based on this finding, we argue that when building CLS datasets, the translationese in the target summaries of test sets should be carefully controlled.

Translationese's Impact on Training
Compared with HT summaries, when using MT summaries as references to train a CLS model, it is easier for the model to learn the mapping from the source documents to the simpler and more standardized summaries. In this manner, the generated summaries tend to have good lexical overlap with the MT references, since both translationese texts contain normalized lexical usage. However, such summaries may not satisfy users in real-world scenarios (cf. our human evaluation in Table 4). Thus, translationese in the target summaries during training has a negative impact on CLS model performance. Furthermore, we find that mBART-HT exhibits the following inconsistent phenomenon: its generated summaries achieve a higher similarity with the HT references than with the MT references on the WikiLingua (En⇒Ru, Ar and Cs) datasets (e.g., 24.6 vs. 23.9, 23.6 vs. 23.0 and 16.5 vs. 15.4 R1, respectively), but are more similar to the MT references on XSAMSum (e.g., 40.2 vs. 39.1 R1). We conjecture that this inconsistent performance is caused by a trade-off between the following factors: (i) mBART-HT is trained with the HT references rather than the MT references, and (ii) both the generated summaries and the MT references are translationese texts containing normalized lexical usage. Factor (i) tends to steer the generated summaries closer to the HT references, while factor (ii) makes them closer to the MT references. When the CLS model has fully learned the mapping from the source documents to the HT summaries during training, factor (i) will dominate the generated summaries.

Translationese in Source Documents

Translationese Analysis
We analyze the translationese in the preprocessed documents. Table 5 shows that most scores of the HT documents are higher than those of the MT documents, indicating a lower degree of translationese in HT documents. Thus, different methods of collecting source documents can also result in different degrees of translationese.

Translationese's Impact on Evaluation
For each direction in the WikiLingua dataset, we train two mBART models with the same output summaries but different input documents. In detail, one uses HT documents as inputs (denoted as mBART-iHT), while the other uses MT documents (denoted as mBART-iMT). Table 6 lists the experimental results in terms of ROUGE-1/2/L (R-1/2/L) and BERTScore (B-S). Note that there are two types of input documents (HT and MT) in the test sets. For each model, we report two results, using HT and MT documents as inputs to generate summaries, respectively. Compared with using HT documents as inputs, both mBART-iHT and mBART-iMT achieve higher automatic scores when using MT documents as inputs. For example, mBART-iHT achieves 32.7 and 33.7 R1 using HT documents and MT documents as inputs in WikiLingua (Ru⇒En), respectively. The counterparts for mBART-iMT are 32.4 and 34.8 R1. In addition to the above automatic evaluation, we conduct a human evaluation on these four types of generated summaries (mBART-iHT/iMT with HT/MT documents as inputs). In detail, we randomly select 100 samples from the test set of WikiLingua (Ar⇒En). Five graduate students are employed as evaluators to assess the generated summaries in a similar way to Section 3.3. To facilitate evaluation, the evaluators are also shown the parallel documents in their mother tongue. As shown in Table 7, though using MT documents leads to better results in terms of automatic metrics, human evaluators prefer the summaries generated with HT documents as inputs. Thus, automatic metrics such as ROUGE and BERTScore cannot capture human preferences when the input documents are machine-translated. Consequently, the translationese in source documents should also be controlled in test sets.

Translationese's Impact on Training
When using HT documents as inputs, mBART-iHT outperforms mBART-iMT in both automatic and human evaluation (Table 6 and Table 7). Thus, the translationese in source documents during training also has a negative impact on CLS model performance. However, different from the translationese in summaries, the translationese in documents does not affect the training objectives. Consequently, we wonder whether it is possible to train a CLS model with both MT and HT documents and further improve model performance. In this manner, MT documents can also be utilized to build CLS models, benefiting research on low-resource languages. We attempt the following strategies: (1) mBART-CL heuristically adopts a curriculum learning (Bengio et al., 2009) strategy to train a mBART model from ⟨MT document, summary⟩ samples to ⟨HT document, summary⟩ samples in each training epoch.
(2) mBART-TT adopts the tagged training strategy (Caswell et al., 2019; Marie et al., 2020) to train a mBART model. This strategy has been studied in machine translation to improve MT performance on low-resource source languages. In detail, source inputs with a high level of translationese (i.e., MT documents in our scenario) are prepended with a special token [TT], while inputs with a low level of translationese (i.e., HT documents) remain unchanged. The special token thus explicitly distinguishes the two types of inputs for the model.
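The data preparation for tagged training can be sketched as follows. This is a minimal illustration under assumed field names (`document`, `summary`, `source`); the actual pipeline would operate on tokenized mBART inputs.

```python
def tag_for_training(examples):
    """Prepend the [TT] tag to machine-translated source documents.

    `examples` is a list of dicts with 'document', 'summary', and 'source'
    fields, where 'source' is "mt" (machine-translated) or "ht"
    (human-translated). These field names are assumptions for this sketch.
    """
    tagged = []
    for ex in examples:
        doc = ex["document"]
        if ex["source"] == "mt":
            doc = "[TT] " + doc  # explicit translationese marker for the model
        tagged.append({"document": doc, "summary": ex["summary"]})
    return tagged
```

At inference time on natural or HT inputs, no tag is prepended, so the model is steered toward the low-translationese input distribution it saw untagged during training.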
As shown in Table 6, both mBART-CL and mBART-TT outperform mBART-iHT in all three directions (following the conclusion of our human evaluation, we only use HT documents as inputs to evaluate mBART-CL and mBART-TT). Besides, mBART-TT outperforms mBART-CL, confirming the superiority of tagged training in CLS. To give a deeper analysis of the usefulness of MT documents, we use a portion (10%, 30%, 50% and 70%, respectively) of the HT documents (paired with summaries) together with all MT documents (paired with summaries) to jointly train the mBART-TT model. Besides, we use the same portion of HT documents to train a mBART-iHT model for comparison. Table 8 gives the experimental results. With the help of MT documents, mBART-TT needs only 50% of the HT documents to achieve results competitive with mBART-iHT. Note that, compared with HT documents, MT documents are much easier to obtain; thus the strategy is friendly to low-resource source languages.

Discussion and Suggestions
Based on the above investigations and findings, we conclude this work by presenting concrete suggestions for both dataset and model development.
Controlling translationese in test sets. As discussed in Section 3.3 and Section 4.3, translationese in source documents or target summaries can lead to inconsistency between automatic evaluation and human judgment. Therefore, one should avoid directly adopting machine-translated documents or summaries in the test sets of CLS datasets. To make machine-translated documents or summaries suitable for evaluating model performance, post-processing strategies should be applied to reduce translationese.
Prior work (Zhu et al., 2019) adopts a post-editing strategy to manually correct the machine-translated summaries in their test sets. Though post-editing increases productivity and decreases errors compared to translation from scratch (Green et al., 2013), Toral (2019) finds that post-edited machine translation also has a special style that differs from native usage, i.e., post-editese. Thus, post-editese should also be taken into account when post-edited translations are used in test sets.

Related Work
Cross-Lingual Summarization. Cross-lingual summarization (CLS) aims to summarize source-language documents into a different target language. Due to data scarcity, early work typically focuses on pipeline methods (Leuski et al., 2003; Wan et al., 2010; Wan, 2011; Yao et al., 2015), i.e., translation and then summarization, or summarization and then translation. Recently, many large-scale CLS datasets have been proposed. According to an extensive survey on CLS (Wang et al., 2022b), they can be divided into synthetic datasets and multi-lingual website datasets. Synthetic datasets (Zhu et al., 2019; Bai et al., 2021; Feng et al., 2022; Wang et al., 2022a) are constructed by translating monolingual summarization (MS) datasets. Multi-lingual website datasets (Ladhak et al., 2020; Perez-Beltrachini and Lapata, 2021) are collected from online resources. Based on these large-scale datasets, many researchers explore various ways to build CLS systems, including multi-task learning strategies (Cao et al., 2020; Liang et al., 2022), knowledge distillation methods (Nguyen and Luu, 2022; Liang et al., 2023a), resource-enhanced frameworks (Zhu et al., 2020) and pre-training techniques (Xu et al., 2020; Wang et al., 2022a, 2023b; Liang et al., 2023b). More recently, Wang et al. (2023a) explore zero-shot CLS by prompting large language models. Different from them, we are the first to investigate the influence of translationese on CLS.
Translationese. Translated texts are known to have special features, referred to as "translationese" (Gellerstam, 1986). The phenomenon of translationese has been widely studied in machine translation (MT). Some researchers explore the influence of translationese on MT evaluation (Lembersky et al., 2012; Zhang and Toral, 2019; Graham et al., 2020; Edunov et al., 2020). To control the effect of translationese on MT models, tagged training (Caswell et al., 2019; Marie et al., 2020) has been proposed to explicitly tell MT models whether the given data is translated text. Besides, Artetxe et al. (2020) and Yu et al. (2022) mitigate the effect of translationese in cross-lingual transfer learning.

Conclusion
In this paper, we investigate the influence of translationese on CLS. We design systematic experiments to investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. Based on our findings, we also give suggestions for future dataset and model development.

A The Inconsistent Performance of mBART-HT

According to our conjecture, XSAMSum (En⇒Zh) should be more difficult than WikiLingua (En⇒Ru/Ar/Cs) for CLS models. Consequently, factor (i) dominates in WikiLingua, while factor (ii) dominates in XSAMSum, leading to the inconsistent performance. To support this conjecture, we illustrate the difficulty of each CLS dataset from the following aspects: (1) Scale counts the number of CLS samples in each dataset. Generally, the more samples used to train a CLS model, the easier it is for the model to learn CLS. (2) Coverage measures the overlap rate between documents and summaries, defined as the average proportion of copied bigrams in the summaries of each dataset. The higher the coverage of a dataset, the less

Table 8 :
Experimental results (R1 / R2 / R-L / B-S). "HT" and "MT" indicate the percentages of ⟨HT document, summary⟩ and ⟨MT document, summary⟩ pairs used to train each CLS model, respectively. Bold and underlined scores denote the best and second-best results, respectively.

Table 9 :
The scales and coverage of CLS data.

Table 10 :
Results of mBART-HT (R1 / R2 / R-L) on the hard and simple test subsets of WikiLingua (En⇒Cs).

Junnan Zhu, Qian Wang, Yining Wang, Yu Zhou, Jiajun Zhang, Shaonan Wang, and Chengqing Zong. 2019. NCLS: Neural cross-lingual summarization. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3054-3064, Hong Kong, China. Association for Computational Linguistics.

Junnan Zhu, Yu Zhou, Jiajun Zhang, and Chengqing Zong. 2020. Attend, translate and summarize: An efficient method for neural cross-lingual summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1309-1321, Online. Association for Computational Linguistics.