Zero-Shot Cross-Lingual Summarization via Large Language Models

Given a document in a source language, cross-lingual summarization (CLS) aims to generate a summary in a different target language. Recently, the emergence of Large Language Models (LLMs), such as GPT-3.5, ChatGPT and GPT-4, has attracted wide attention from the computational linguistics community. However, the performance of LLMs on CLS is not yet known. In this report, we empirically use various prompts to guide LLMs to perform zero-shot CLS under different paradigms (i.e., end-to-end and pipeline), and provide a preliminary evaluation of the generated summaries. We find that ChatGPT and GPT-4 originally tend to produce lengthy summaries with detailed information. With the help of an interactive prompt, these two LLMs can further balance informativeness and conciseness, significantly improving their CLS performance. Experimental results on three widely-used CLS datasets show that GPT-4 achieves state-of-the-art zero-shot CLS performance and performs competitively compared with the fine-tuned mBART-50. Moreover, we find that some multi-lingual and bilingual LLMs (i.e., BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan) have limited zero-shot CLS ability. Due to the composite nature of CLS, which requires models to perform summarization and translation simultaneously, accomplishing this task in a zero-shot manner remains a challenge even for LLMs. We therefore recommend that future LLM research use CLS as a testbed.

In recent years, a number of powerful multilingual pre-trained generative models have been proposed one after another, such as mBART (Liu et al., 2020), mBART-50 (Tang et al., 2021), mT5 (Xue et al., 2021) and BLOOM (Scao et al., 2022). The parameters of these models have gradually increased from the million level (e.g., 580M in mT5-base and 610M in mBART-Large) to the billion level (e.g., 3.7B in mT5-XL, 13B in mT5-XXL and 176B in BLOOM), facilitating various research topics (e.g., machine translation and CLS) in the multi-lingual world. Besides, large language models (LLMs) have shown strong performance when transferring to new tasks by simply conditioning on a few input-label pairs (in-context learning) (Dong et al., 2022; Min et al., 2022) or short sentences describing crucial reasoning steps (chain-of-thought) (Fu et al., 2022; Zhang et al., 2022).
In this report, we present a preliminary evaluation of LLMs' zero-shot CLS performance, including GPT-3.5, ChatGPT, GPT-4, BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan. In detail, we design various prompts to guide LLMs to perform CLS in an end-to-end manner with or without chain-of-thought (CoT). Figure 1 gives an example of prompting ChatGPT to perform zero-shot CLS. To further exploit the interaction capability of conversational LLMs (e.g., ChatGPT and GPT-4), we leverage an interactive prompt to let them produce more concise summaries. Moreover, to provide a deeper analysis of LLMs' zero-shot CLS performance, we compare them with fine-tuned mBART-50 (Tang et al., 2021), which has shown its superiority in many previous CLS works (Wang et al., 2022a; Feng et al., 2022; Perez-Beltrachini and Lapata, 2021). Experimental results on three CLS datasets, covering three domains (news, how-to guides and dialogue) and two cross-lingual directions (En⇒Zh and En⇒De), show that GPT-4 achieves the best zero-shot performance but is still worse than the fine-tuned mBART-50 model in terms of ROUGE scores and BERTScore. We also conduct case studies to show that ChatGPT and GPT-4 can absorb the core idea of the given source-language documents and generate fluent and concise target-language summaries.
In addition, we find that the current open-source LLMs (i.e., BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan) achieve limited zero-shot CLS performance, significantly worse than that of GPT-4. This leads us to conclude that the composite end-to-end CLS prompts are difficult for them to follow, and that performing zero-shot CLS in an end-to-end manner, which requires simultaneously carrying out translation and summarization, remains a challenge for LLMs. Based on this finding, we suggest that future multi-lingual or bilingual LLM research use CLS as a testbed to evaluate LLMs' capabilities to follow composite instructions as well as combine their different abilities.
Our main contributions are as follows:
• To the best of our knowledge, we are the first to explore the zero-shot CLS performance of LLMs. To achieve this, we design various prompts to guide LLMs to perform CLS in an end-to-end manner with or without chain-of-thought.
• Experimental results on three widely-used CLS benchmark datasets, covering various domains and languages, show that several LLMs (especially ChatGPT and GPT-4) achieve competitive results compared with a strong fine-tuned baseline.
• We also find that the current open-source LLMs generally achieve limited zero-shot CLS performance, suggesting that CLS could be used as a testbed for future LLM research due to its challenges.

Cross-Lingual Summarization Prompts
We heuristically design three prompts to guide LLMs to perform zero-shot CLS in an end-to-end manner, shown below with an English document and a Chinese summary as an example:
• The direct (Dir) prompt guides LLMs to straightforwardly output the corresponding target-language summary without chain-of-thought (CoT): Please summarize the following text in Chinese: [English Doc] where [English Doc] indicates a given English document.
• The translate-then-summarize (TS) CoT prompt makes LLMs first translate the given document from the source language to the target language, and then summarize the translated document to perform CLS: Please first translate the following text to Chinese and then summarize the translated text in Chinese: [English Doc]
• The summarize-then-translate (ST) CoT prompt lets LLMs first summarize the given document and then translate the output summary to the target language: Please first summarize the following text and then translate the summary to Chinese: [English Doc]
Note that though the TS and ST CoT prompts guide LLMs to perform CLS step by step, the behavior remains end-to-end since the target-language summaries are generated within a single turn.
To further exploit the potential of conversational LLMs, inspired by Bang et al. (2023), after prompting with the Dir, TS or ST prompt, we adopt an interactive (Inta) prompt to make the preliminarily generated summary more concise: Please make the Chinese summary shorter. The whole process is denoted as "Dir+Inta", "TS+Inta" or "ST+Inta"; a code sketch of all these prompts is given below.
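As a minimal sketch, the three end-to-end prompts and the interactive follow-up could be assembled as follows; the template strings mirror the prompts above, while the function and variable names are our own illustrative choices.

```python
def build_cls_prompt(document: str, mode: str = "dir",
                     target_lang: str = "Chinese") -> str:
    """Assemble an end-to-end CLS prompt for a source-language document."""
    templates = {
        # Direct (Dir) prompt: no chain-of-thought.
        "dir": f"Please summarize the following text in {target_lang}: ",
        # Translate-then-summarize (TS) CoT prompt.
        "ts": (f"Please first translate the following text to {target_lang} "
               f"and then summarize the translated text in {target_lang}: "),
        # Summarize-then-translate (ST) CoT prompt.
        "st": (f"Please first summarize the following text and then "
               f"translate the summary to {target_lang}: "),
    }
    return templates[mode] + document

# Interactive (Inta) prompt, sent as a second conversational turn to
# conversational LLMs such as ChatGPT and GPT-4.
INTA_PROMPT = "Please make the Chinese summary shorter."
```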

Large Language Models
We explore the CLS ability of the following LLMs:
• Davinci-003 is the most advanced GPT-3.5 model, with 175B parameters. We evaluate its performance by requesting the official API provided by OpenAI with default settings (a sketch of such a request is given after this list).
• ChatGPT is created by fine-tuning a GPT-3.5 series model via reinforcement learning from human feedback (RLHF) (Christiano et al., 2017). We conduct experiments on the ChatGPT platform between February 17 and February 19, 2023.
• GPT-4, a multi-modal LLM that can accept image and text inputs and produce text outputs, exhibits human-level performance on various benchmark datasets (OpenAI, 2023). We assess GPT-4 on the ChatGPT platform between March 15 and March 19, 2023.
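For reference, here is a minimal sketch of querying Davinci-003 through the legacy (pre-1.0) openai Python package; the model name follows the text above, while the key placeholder and the max_tokens cap are our own assumptions (the report relies on the API's default settings).

```python
import openai  # legacy (pre-1.0) interface of the openai package

openai.api_key = "sk-..."  # placeholder API key

def davinci_cls(document: str) -> str:
    """Query text-davinci-003 with the direct (Dir) CLS prompt."""
    prompt = "Please summarize the following text in Chinese: " + document
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=512,  # assumed cap; the report uses default settings
    )
    return response["choices"][0]["text"].strip()
```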

Experimental Setup
Datasets. We evaluate LLMs on the following three CLS datasets: CrossSum (En⇒Zh) (Hasan et al., 2021), WikiLingua (En⇒Zh/De) (Ladhak et al., 2020) and XSAMSum (En⇒Zh/De) (Wang et al., 2022a). CrossSum, collected from the BBC news website, contains 3,981 English news reports paired with Chinese summaries. WikiLingua involves 18,887 English how-to guides paired with Chinese summaries, and 58,375 English how-to guides paired with German summaries. Note that both CrossSum and WikiLingua also provide CLS samples in other cross-lingual directions; we only utilize the En⇒Zh and En⇒De samples in this work. XSAMSum contains 16,369 English dialogues paired with both Chinese and German summaries. The detailed statistics of these datasets are listed in Table 1. Since ChatGPT, GPT-4 and Vicuna-13B could only be interacted with manually when we conducted our experiments, evaluating their performance is time-consuming. Thus, we randomly sample 50 documents from the test set of each CLS dataset for evaluation.
Metrics. We adopt ROUGE-1/2/L (R-1/2/L) (Lin, 2004) and BERTScore (B-S) (Zhang et al., 2020) in our experiments. The ROUGE scores measure the lexical overlap between the generated summaries and the corresponding references based on unigrams, bigrams and the longest common subsequence, while BERTScore measures semantic similarity. For ROUGE, we use the multi-lingual rouge toolkit. For BERTScore, we use the bert-score toolkit, with scores calculated based on the bert-base-multilingual-cased model.
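For concreteness, a minimal sketch of this evaluation using the standard rouge-score and bert-score packages follows; since the report uses a multi-lingual ROUGE fork, the ROUGE part here is an approximation that would additionally need language-specific tokenization for Chinese.

```python
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def evaluate(candidates, references):
    # ROUGE-1/2/L per candidate-reference pair (English-style tokenization).
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
    rouge = [scorer.score(ref, cand)
             for ref, cand in zip(references, candidates)]
    # BERTScore with the multilingual BERT backbone named above.
    P, R, F1 = bert_score(candidates, references,
                          model_type="bert-base-multilingual-cased")
    return rouge, F1.mean().item()
```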
Baselines. We also compare the zero-shot LLMs with fine-tuned mBART-50 (Tang et al., 2021) to provide a deeper analysis. mBART-50 is a multi-lingual version of BART (Lewis et al., 2020) with the vanilla Transformer encoder-decoder architecture (Vaswani et al., 2017). This model has been pre-trained on large-scale multi-lingual unlabeled corpora with BART-like denoising objectives.
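A minimal sketch of En⇒Zh generation with mBART-50 in Hugging Face Transformers is shown below; the public many-to-many checkpoint merely stands in for the fine-tuned weights used in the report, and the decoding parameters are assumptions.

```python
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model_name = "facebook/mbart-large-50-many-to-many-mmt"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name, src_lang="en_XX")
model = MBartForConditionalGeneration.from_pretrained(model_name)

def summarize_en_to_zh(document: str) -> str:
    inputs = tokenizer(document, return_tensors="pt", truncation=True)
    # Force the decoder to start generating in Chinese.
    output_ids = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.lang_code_to_id["zh_CN"],
        max_length=128,  # assumed summary-length cap
        num_beams=4,     # assumed beam size
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```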

Implementation Details
For ChatGPT, GPT-4 and Vicuna-13B, we manually evaluate their results via the corresponding platform and demo websites. Among them, the demo website of Vicuna-13B cannot support long input sequences and automatically truncates them, so we only evaluate Vicuna-13B on XSAMSum (En⇒Zh/De). For Davinci-003, we use the official API with default settings.

For BLOOMZ, ChatGLM-6B and ChatYuan, we download the corresponding checkpoints and evaluate their performance following the officially released code. The 176B BLOOMZ requires 5×80GB GPUs to load in FP16 precision. We use a sampling decoding strategy and set the temperature to 0.7. We only evaluate ChatGLM-6B and ChatYuan in the En⇒Zh direction due to their bilingualism (i.e., Chinese and English).
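A minimal sketch of this setup for BLOOMZ with Hugging Face Transformers follows; the FP16 precision, multi-GPU sharding and temperature of 0.7 follow the text above, while the token cap is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloomz")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz",
    torch_dtype=torch.float16,  # load in FP16 precision
    device_map="auto",          # shard the 176B model across available GPUs
)

def generate_summary(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        do_sample=True,      # sampling decoding strategy
        temperature=0.7,     # as described in the report
        max_new_tokens=256,  # assumed cap
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```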

Main Results
Table 2 lists the experimental results. As we can see, Davinci-003, ChatGPT and GPT-4 achieve competitive results with the fine-tuned mBART-50.
The Effect of Each CLS Prompt. Among the three end-to-end prompts (i.e., Dir, ST and TS), the CoT prompts lead to better performance than the direct prompt, indicating the effectiveness of CoT. It also indicates that it is still challenging for a single model to directly perform CLS without being given any crucial or helpful instructions.
The Effect of the Interactive Prompt. Further, with the help of the interactive prompt, the performance of ChatGPT and GPT-4 significantly improves and even outperforms mBART-50 on several automatic metrics. As shown in Table 3, more concise summaries are generated after inputting the interactive prompt, e.g., 183.7 tokens generated by ChatGPT (Dir) on CrossSum, while the counterpart of ChatGPT (Dir+Inta) is 66.4 tokens. Figure 3 also shows an example English document with the summaries generated by ChatGPT and GPT-4 via different prompts. These results suggest that conversational LLMs prefer to generate lengthy summaries, probably due to the RLHF training process, and that the interactive prompt helps them balance informativeness and conciseness, significantly improving their zero-shot CLS ability.
Best Zero-Shot LLM vs. Fine-Tuned mBART. GPT-4 achieves state-of-the-art zero-shot CLS performance among all LLMs, justifying its superiority. Nevertheless, the model is still slightly worse than the fine-tuned mBART-50 in terms of automatic evaluation metrics. One possible reason is that zero-shot LLMs are not aware of the text style of the golden summaries when performing zero-shot CLS on each dataset. However, lower automatic scores do not necessarily indicate worse performance. For example, as discussed by Goyal et al. (2022), the news summaries generated by GPT-3 achieve lower ROUGE scores than those of fine-tuned models but higher scores in human evaluation. Thus, the comparison between LLMs and fine-tuned mBART-50 on CLS needs human evaluation, which we leave for future work.

Limited Performance of Open-Source LLMs.
The open-source LLMs, i.e., BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan-7.7B, perform poorly on the CrossSum and WikiLingua datasets, whose documents typically contain lengthier content than those of XSAMSum. Although they perform decently on XSAMSum, there is still a large gap compared to GPT-4. Thus, we conclude that zero-shot CLS is challenging for LLMs due to its composite nature, which requires models to perform summarization and translation simultaneously. In this situation, we suggest that future bilingual or multi-lingual LLM research adopt CLS as a testbed to evaluate LLMs' capabilities to follow composite instructions as well as combine their different abilities.

LLM-based Evaluation
It is worth noting that conducting human evaluation on the summaries generated by both LLMs and fine-tuned models is not trivial, since human evaluators can easily tell which summaries were generated by LLMs and which by fine-tuned models. As a result, the evaluators may be biased when scoring each summary. To ensure the fairness of human judgment, Stiennon et al. (2020) only retain generated summaries whose length falls within a certain range, and then collect human judgments on these summaries to minimize the potential evaluation bias caused by summary length. In our setting, the text styles of LLMs and fine-tuned models are quite different, which might also lead to bias. Thus, a human evaluation comparing zero-shot LLMs and fine-tuned models on CLS needs to be more carefully designed.
As an alternative to human evaluation, recent studies (Liu et al., 2023; Kocmi and Federmann, 2023; Wang et al., 2023) show that natural language generation (NLG) results evaluated by LLMs can achieve better correlations with humans. Following Wang et al. (2023), we utilize ChatGPT to score the generated summaries in a reference-free manner on four aspects, i.e., coherence, relevance, consistency and fluency. An example prompt is shown in Figure 4; please refer to Wang et al. (2023) for the prompts of all aspects. The instruction for each aspect (marked in purple in Figure 4) is inspired by SummEval, a widely-used summarization meta-evaluation benchmark (Fabbri et al., 2021). For instance, the fluency prompt reads: Score the following dialogue summarization given the corresponding dialogue with respect to fluency on a continuous scale from 0 to 100, where a score of zero means "disfluency" and a score of one hundred means "perfect fluency". Note that fluency measures the quality of individual sentences, are they well-written and grammatically correct. Consider the quality of individual sentences. Dialogue: [a given dialogue] Summary: [one generated summary] Scores:
We utilize the official APIs provided by OpenAI to conduct the experiments with the gpt-3.5-turbo model, and set the temperature to 0 to eliminate the randomness of the evaluation results.
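A minimal sketch of this scoring step, assuming the legacy (pre-1.0) openai ChatCompletion interface, is given below; the fluency instruction mirrors the prompt above, and the model name and temperature of 0 follow the report.

```python
import openai  # legacy (pre-1.0) interface of the openai package

FLUENCY_PROMPT = (
    "Score the following dialogue summarization given the corresponding "
    "dialogue with respect to fluency on a continuous scale from 0 to 100, "
    'where a score of zero means "disfluency" and a score of one hundred '
    'means "perfect fluency". Note that fluency measures the quality of '
    "individual sentences, are they well-written and grammatically correct. "
    "Consider the quality of individual sentences.\n\n"
    "Dialogue: {dialogue}\nSummary: {summary}\nScores:"
)

def score_fluency(dialogue: str, summary: str) -> str:
    """Ask gpt-3.5-turbo to score one summary for fluency."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=0,  # eliminate randomness of the evaluation results
        messages=[{"role": "user",
                   "content": FLUENCY_PROMPT.format(dialogue=dialogue,
                                                    summary=summary)}],
    )
    return response["choices"][0]["message"]["content"]
```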
Moreover, while we show that the interactive prompt can improve the performance of zero-shot LLMs in terms of ROUGE scores and BERTScore, we do not find the same trend in the LLM-based evaluation results. In some cases, the interactive prompt even leads to worse LLM-based scores. We conjecture that the interactive prompt forces zero-shot LLMs to reduce the length of the generated summaries, and the models cannot make a good trade-off between conciseness and the other aspects, which is also hard for humans.

Conclusion and Future Work
In this technical report, we evaluate the zero-shot performance of mainstream bilingual and multi-lingual LLMs on cross-lingual summarization. We find that Davinci-003, ChatGPT and GPT-4 can combine the abilities to summarize and translate to perform zero-shot CLS, and achieve competitive results compared with the fine-tuned baseline (i.e., mBART-50). In addition, the current open-source LLMs (i.e., BLOOMZ, ChatGLM-6B, Vicuna-13B and ChatYuan) generally show limited ability to perform CLS in an end-to-end manner, indicating that zero-shot CLS remains challenging.
In the future, we would like to unleash the potential of LLMs and leverage them to perform CLS in a few-shot manner.

Limitations
While we evaluate the performance of LLMs on the cross-lingual summarization task, there are some limitations worth noting: (1) We only evaluate a lower bound of these models' CLS performance. Prompts are important for guiding LLMs to perform specific tasks, and future work could explore better prompts to obtain better results. (2) This report only uses two cross-lingual directions (En⇒Zh and En⇒De) in experiments, and all the languages involved are considered high-resource. The performance of LLMs on low-resource languages still needs to be explored. According to Jiao et al. (2023), the machine translation ability of ChatGPT is limited on low-resource languages; we conjecture that the same situation might exist in CLS. (3) Though the general trend of the evaluation results should be correct, the comparisons between LLMs are not rigorous because the decoding strategies of these models are not the same. This is one of the major reasons for the limited soundness of this work. (4) In the future, we would like to conduct human evaluation to provide more analyses.

Figure 2: An illustration of all prompts used to guide LLMs to perform zero-shot CLS (taking ChatGPT as an example).

Figure 3: Example dialogue document in XSAMSum and summaries generated by ChatGPT and GPT-4.

Figure 4: An example prompt used to guide ChatGPT to score the summarization results in the aspect of fluency. Purple indicates the detailed instruction for the corresponding aspect.
(Figure 2 body: the Dir and TS prompts applied to an example English dialogue between Annie and Christine from XSAMSum.)

Table 1: Statistics of the CLS datasets used in experiments. "Src Lang." and "Trg Lang." denote the source and target languages. "Doc. Length" and "Sum. Length" show the average length of source documents and target summaries (token level). "Example" lists the number of samples in each dataset w.r.t. training, validation and test sets.

Table 2: Experimental results on CrossSum, WikiLingua and XSAMSum. Pink denotes the fine-tuned baseline. Light blue and blue denote the zero-shot performance of open-source and non-open-source LLMs, respectively. Green indicates that the zero-shot result is better than that of the fine-tuned baseline. "-" denotes that the model cannot be evaluated on the corresponding dataset.

Table 3: The average length (token level) of the generated summaries on the test set of each CLS dataset. Light green indicates the length of the golden summaries.

Table 4: Evaluation results (judged by ChatGPT) on XSAMSum. Pink denotes the fine-tuned baseline. Light blue and blue denote the zero-shot performance of open-source and non-open-source LLMs, respectively. "-" denotes that the model cannot be evaluated on the corresponding dataset.