CrossSum: Beyond English-Centric Cross-Lingual Summarization for 1,500+ Language Pairs

We present CrossSum, a large-scale cross-lingual summarization dataset comprising 1.68 million article-summary samples in 1,500+ language pairs. We create CrossSum by aligning parallel articles written in different languages via cross-lingual retrieval from a multilingual abstractive summarization dataset, and we perform a controlled human evaluation to validate its quality. We propose a multistage data sampling algorithm to effectively train a cross-lingual summarization model capable of summarizing an article in any target language. We also introduce LaSE, an embedding-based metric for automatically evaluating model-generated summaries. LaSE is strongly correlated with ROUGE and, unlike ROUGE, can be reliably measured even in the absence of references in the target language. Results on ROUGE and LaSE indicate that our proposed model consistently outperforms baseline models. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first that is not English-centric. We are releasing the dataset, training and evaluation scripts, and models to spur future research on cross-lingual summarization. The resources can be found at https://github.com/csebuetnlp/CrossSum
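As a rough illustration of the retrieval step described above, the sketch below aligns summaries across two languages by mutual nearest-neighbor search over language-agnostic sentence embeddings. This is a minimal sketch, assuming LaBSE embeddings via the sentence-transformers library; the `align` helper and the 0.8 similarity threshold are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Language-agnostic sentence embeddings; LaBSE is an assumption here.
model = SentenceTransformer("sentence-transformers/LaBSE")

def align(summaries_l1, summaries_l2, threshold=0.8):
    """Return (i, j) index pairs of summaries that likely describe the same event."""
    # Normalized embeddings make the dot product equal to cosine similarity.
    emb1 = model.encode(summaries_l1, normalize_embeddings=True)
    emb2 = model.encode(summaries_l2, normalize_embeddings=True)
    sim = emb1 @ emb2.T  # pairwise cosine similarities
    pairs = []
    for i in range(sim.shape[0]):
        j = int(np.argmax(sim[i]))
        # Keep only mutual nearest neighbors above the threshold to cut false matches.
        if sim[i, j] >= threshold and int(np.argmax(sim[:, j])) == i:
            pairs.append((i, j))
    return pairs
```

Mutual nearest-neighbor filtering is a common way to suppress false matches in bitext mining; the released alignment pipeline may apply different or additional filters. LaSE relies on the same kind of language-agnostic embeddings to compare a generated summary against a reference.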

[...] another, degrading the overall performance.

Input Article: [...] (Dexamethasone was tested as part of a global clinical trial to test the effectiveness of various existing therapies against the new coronavirus.) [...] (As a result, the case fatality rate of critically ill patients who require a ventilator is reduced by 30%.) [...] (British Prime Minister Boris Johnson welcomed "the great achievements of the British scientific community".) [...] ("And this is a medicine available all over the world".) [...] (but a very cheap steroid that has been used for a long time.)

Summary: বিজ্ঞানীরা বলছেন ডেক্সামেথাসোন নামে সস্তা ও সহজলভ্য একটি ওষুধ করোনাভাইরাসে গুরুতর অসুস্থ রোগীদের জীবন রক্ষা করতে সাহায্য করবে। (Scientists say a cheap and readily available drug called dexamethasone will help save the lives of critically ill patients with coronavirus.)

Figure 1: A sample article-summary pair from CrossSum; the article is written in Japanese, and the summary is in Bengali. We additionally translate the texts to English for better understanding. Words and phrases of the article relevant to the summary are color-coded.

Figure 2: Training on the dataset respecting the original XL-Sum splits causes unusually high ROUGE scores (marked red) in many-to-one models due to implicit data leakage. Therefore, we redid the splits taking the issue into account, and consequently, models trained on the new splits (marked blue) do not exhibit any unusual spike.
[...] of another, the leakage is not observed anymore.

[...] l1 and l2 are from the previously selected languages that are not English.

We hired bilingually proficient expert annotators adept in the language of interest and English. Two [...]

[...] Table 2 in the Appendix.
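Presumably, the leakage noted in Figure 2 arises because XL-Sum's splits were drawn independently per language, so an article in one language's test set can be aligned with a summary sitting in another language's training set. One way to redo the splits, sketched below under that assumption, is to cluster mutually aligned articles and assign each whole cluster to a single split; `make_splits` and its inputs are hypothetical names, not the released scripts' API.

```python
import random

def make_splits(article_ids, aligned_pairs, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole clusters of aligned articles to train/dev/test."""
    parent = {a: a for a in article_ids}

    def find(x):
        # Union-find root lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in aligned_pairs:
        parent[find(a)] = find(b)  # merge aligned articles into one cluster

    clusters = {}
    for a in article_ids:
        clusters.setdefault(find(a), []).append(a)

    names = ["train", "dev", "test"]
    target = {n: r * len(article_ids) for n, r in zip(names, ratios)}
    splits = {n: [] for n in names}
    groups = list(clusters.values())
    random.Random(seed).shuffle(groups)
    for g in groups:
        # Put the whole cluster into the split currently furthest below its quota,
        # so no aligned pair ever straddles two splits.
        name = min(names, key=lambda n: len(splits[n]) - target[n])
        splits[name].extend(g)
    return splits
```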

At the same time, many would not be sampled during training for lack of enough training steps (due [...]).

[...] also done to reduce annotation costs.
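Although the surrounding text is fragmentary, the upsampling idea it refers to is standard. The sketch below shows exponent-smoothed sampling over language pairs, as popularized for multilingual pretraining, rather than the paper's exact multistage algorithm: with alpha < 1, low-resource pairs receive a larger share of training steps and are less likely to go unsampled.

```python
import random

def pair_probabilities(pair_sizes, alpha=0.5):
    """pair_sizes maps (src_lang, tgt_lang) -> number of training samples."""
    weights = {p: n ** alpha for p, n in pair_sizes.items()}
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}

def sample_pair(probs, rng=random):
    """Draw one language pair according to the smoothed distribution."""
    pairs, ps = zip(*probs.items())
    return rng.choices(pairs, weights=ps, k=1)[0]

# Example: a high-resource and a low-resource pair (sizes are illustrative).
probs = pair_probabilities({("en", "bn"): 10_000, ("ja", "bn"): 100})
# Raw frequency would give ("ja", "bn") ~1% of batches; alpha=0.5 lifts it to ~9%.
```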

The pipeline first performs in-language summarization (the language of the summary is the same as that of its input article) and then translates the summary into the target language; a sketch of this baseline follows the figure captions below.

Figure 7: Training on the dataset respecting the original XL-Sum splits causes absurdly high ROUGE scores (marked red) in many-to-one models due to implicit data leakage. Therefore, we redid the splits taking the issue into account, and consequently, models trained on the new splits (marked blue) do not exhibit any unusual spike in ROUGE-2.

Figure 9: ROUGE-2 and LaSE scores for Hindi, Arabic, and Russian as source pivots as the target languages vary. Just like Figure 5, the m2m model significantly outperforms the o2m models and the s. + t. baseline on most languages.

Figure 10: Zero-shot ROUGE-2 scores for the different target languages as the source languages vary. The zero-shot models are trained with only the in-language samples of the pivot. Though their results are clearly behind those of the fully supervised models, the zero-shot models are able to generate non-trivial summaries for many language pairs.

Figure 11: Zero-shot LaSE scores for the different source languages as the target languages vary. The zero-shot models are trained with only the in-language samples of the pivot. Though their results are clearly behind those of the fully supervised models, the zero-shot models are able to generate non-trivial summaries for many language pairs.
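A minimal sketch of such a summarize-then-translate (s. + t.) baseline is given below, assuming off-the-shelf Hugging Face pipelines for a Japanese-to-English pair; the checkpoint names are placeholders rather than the paper's exact models.

```python
from transformers import pipeline

# Placeholder checkpoints: an in-language multilingual summarizer and a
# Japanese-to-English translator (for a ja -> en source/target pair).
summarize = pipeline("summarization", model="csebuetnlp/mT5_multilingual_XLSum")
translate = pipeline("translation", model="Helsinki-NLP/opus-mt-ja-en")

def summarize_then_translate(article_ja: str) -> str:
    # Step 1: summarize the article in its own language (here, Japanese).
    summary_ja = summarize(article_ja, truncation=True)[0]["summary_text"]
    # Step 2: translate that summary into the target language (here, English).
    return translate(summary_ja)[0]["translation_text"]
```

Because errors from the summarizer propagate into the translator, this two-step baseline tends to lag behind a single model trained end-to-end on cross-lingual pairs, which is consistent with the m2m model's advantage reported in Figure 9.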