LR-Sum: Summarization for Less-Resourced Languages

This preprint describes work in progress on LR-Sum, a new permissively-licensed dataset created with the goal of enabling further research in automatic summarization for less-resourced languages. LR-Sum contains human-written summaries for 40 languages, many of which are less-resourced. We describe our process for extracting and filtering the dataset from the Multilingual Open Text corpus (Palen-Michel et al., 2022). The source data is public domain newswire collected from Voice of America websites, and LR-Sum is released under a Creative Commons license (CC BY 4.0), making it one of the most openly-licensed multilingual summarization datasets. We describe how we plan to use the data for modeling experiments and discuss limitations of the dataset.


Introduction
Datasets for automatic summarization have historically focused largely on English, and while there has recently been a greater focus on datasets that include other languages (Cao et al., 2020; Giannakopoulos et al., 2015, 2017; Hasan et al., 2021; Scialom et al., 2020), there still remains a need for high-quality summarization data for less-resourced languages. Datasets with human-written summaries are important both for training statistical summarization models and for automatic evaluation of them. While there has recently been a growing number of multilingual summarization datasets, many are relatively small, have limited language coverage, have restrictive licenses, or suffer a combination of these drawbacks.
In this paper, we present LR-Sum, a new 40-language summarization dataset with a focus on less-resourced languages. There is no definitive definition of less-resourced (Liu et al., 2022); we take the view that less-resourced status can depend on the intersection of many factors (Lignos et al., 2022), including the task a dataset is created for. We created LR-Sum with the goal of providing high-quality, human-written summaries in as many languages as possible. The collection of curated and filtered summaries that comprise LR-Sum is licensed under a Creative Commons Attribution license (CC BY 4.0), and the articles from which it was collected are in the public domain. This allows LR-Sum to be distributed freely and annotated without restriction, unlike many summarization datasets that use copyrighted material, often redistributed without appropriate licensing. For many of the languages in LR-Sum, this is the largest collection of summarization data with such a permissive license.
Tables 1 and 2 show example article-summary pairs from LR-Sum and highlight that the summary content is not merely simple extraction from the article text. Results of the experiments described in Section 4 show that for many less-resourced languages, the task of producing summaries remains challenging, enabling LR-Sum to serve as a benchmark of progress. LR-Sum is released via GitHub at https://github.com/bltlab/lr-sum.

Summary:
First-ever aerial census will be conducted simultaneously across five states to determine elephant migration patterns and numbers

Article: Five southern African countries, with more than half the continent's elephants, are conducting a first-ever aerial census to determine the elephant population and how to protect it. Light aircraft will fly simultaneously across the plains of Angola, Botswana, Namibia, Zambia and Zimbabwe - in a conservation area known as the Kavango-Zambezi Trans-frontier Conservation Area (KAZA) - in an exercise that will run until October 20. [...] "We hope to see what the results come up with," Ives said. "What we will be interested in seeing is not only how many elephants there are but the distribution, therefore, and what the likelihood of those elephants moving between countries is."

Related Work
In this section, we briefly list existing English and multilingual summarization datasets and discuss work in dataset creation for less-resourced languages more generally.
The NYT Annotated Corpus (Sandhaus, 2008) is a corpus of New York Times articles and 600k summaries written by library scientists.
CNN/Daily Mail (Hermann et al., 2015) was originally created for question answering, but Nallapati et al. (2016) adapt this dataset for summarization.
XSum (Narayan et al., 2018a) uses the first sentence of a BBC article (its "story body introduction" tag) as the summary and the remainder of the text as the article; the authors show that XSum favors abstractive summaries.

Multilingual Summarization Datasets
MLSUM (Scialom et al., 2020) is an extension of the CNN/Daily Mail dataset for five languages: French, German, Spanish, Turkish, and Russian.
MultiLing (Giannakopoulos et al., 2015, 2017) is a shared task focused on multilingual summarization covering upwards of 40 languages, but the dataset size is somewhat limited, with training sets totaling only around 10,000 articles.
XL-Sum (Hasan et al., 2021) includes 44 languages, many of which are less-resourced, obtained by scraping BBC News and making use of bullet points as summaries. XL-Sum has a more restrictive license than LR-Sum.
MassiveSumm (Varab and Schluter, 2021) is a very large web-scraped summarization corpus that covers the majority of the languages in both our dataset, LR-Sum, and XL-Sum, and it does so in larger quantities. However, MassiveSumm cannot be easily redistributed due to copyright, having been scraped from various news sites. MassiveSumm's GitHub README contains the disclaimer "The data is noisy and recall-oriented."

MultiSumm (Cao et al., 2020) creates summaries from titles for Bosnian and Croatian.

Data for Less-Resourced Languages
A number of other text corpora have been created for less-resourced languages for summarization and other tasks (e.g., Abdulrahman et al., 2019). Multilingual Open Text (MOT) (Palen-Michel et al., 2022) is a corpus collected from the websites of Voice of America, an international news service funded by the U.S. Government that provides news articles and other short snippets such as audio and image descriptions. Our work creates a summarization dataset for the majority of the languages within MOT. MOT has a permissive license (CC BY 4.0), and the original source articles are in the public domain. By comparison, many of the multilingual datasets derived from privately funded news sources like CNN or BBC News were collected from copyrighted data without the copyright owner's permission, limiting legal distribution. XL-Sum's license is CC BY-NC-SA 4.0, which restricts commercial usage. MOT contains news text for many less-resourced languages, some of which overlap with XL-Sum and some of which are complementary. We discuss which languages are present in LR-Sum vs. XL-Sum in more detail in Section 3.2.

Methodology
The approach for creating LR-Sum is to leverage the coverage of less-resourced languages in MOT to construct a summarization dataset. MOT (Palen-Michel et al., 2022) semi-regularly releases new versions of the dataset as new articles are published on Voice of America's website. We use MOT release v1.6 from October 1, 2022 for the creation of LR-Sum. Only the content type "article" is included in LR-Sum, since the photo, audio, and other categories tend to be short snippets describing content, which are typically too short to make useful article-summary pairs.
While bold text or bullet points are used to derive summaries in some other summarization datasets (Hasan et al., 2021; Hermann et al., 2015; Narayan et al., 2018a), these ways of extracting summaries are not available for VOA articles. Instead, VOA articles have a description field. This field can be noisy: while it is generally used to give a brief summary of the article contents, there are numerous instances where it contains the first few lines of the article, information about the authors, or general information about what VOA is.
A number of filtering steps are taken to ensure high-quality summaries. First, we filter to ensure that the description field has content and that it is at least 10 tokens long. Then, we filter out any articles that do not have a minimum of 10 sentences. We also filter by total number of tokens to remove outlier articles with fewer than 30 or more than 6,000 tokens. When an article does not have a human-written summary, the description field simply contains the first few sentences. Because ellipses can signal that the description is just a copy of the first few sentences of the article, we filter out all descriptions that end with an ellipsis. We further remove these instances by limiting token overlap between the description and the first 3 sentences of the article to 85%. With the goal of keeping LR-Sum from being purely extractive, we also exclude descriptions where an oracle extractive approach selecting the best sentence in the article produces a ROUGE-2 score above 95.
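The filtering steps above can be sketched as a single predicate. This is a simplified illustration, not the released pipeline: the function name is hypothetical, and the overlap ratio and oracle ROUGE-2 score are assumed to be precomputed by helpers outside this sketch.

```python
def keep_example(description_tokens, article_sentences, article_tokens,
                 description_text, lead_overlap_ratio, oracle_rouge2):
    """Apply the LR-Sum filtering heuristics (simplified sketch).

    lead_overlap_ratio: token overlap between the description and the
        article's first 3 sentences (0.0-1.0), precomputed elsewhere.
    oracle_rouge2: ROUGE-2 (0-100) of the best single extracted
        sentence against the description, precomputed elsewhere.
    """
    if len(description_tokens) < 10:            # description too short
        return False
    if len(article_sentences) < 10:             # too few sentences
        return False
    if not 30 <= len(article_tokens) <= 6000:   # outlier article length
        return False
    if description_text.rstrip().endswith(("...", "…")):
        return False                            # likely a truncated lead
    if lead_overlap_ratio > 0.85:               # copies the article lead
        return False
    if oracle_rouge2 > 95:                      # near-purely extractive
        return False
    return True
```

Each check maps directly to one of the thresholds described above; an article-summary pair enters the dataset only if all checks pass.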
We manually created a list of 254 sentences to remove from summaries, based on strings that appear most frequently in the description field. Examples include "Amerikan basınında haftaiçi hergün öne çıkan başlıkları Amerika'nın Sesi'nde bulubilirsiniz" ("You can find the highlights of the American press every weekday on Voice of America" in Turkish) and "Këtë javë në Uashington" ("This week in Washington" in Albanian). While MOT includes data in the Lingala and Oromo languages, we do not include them in LR-Sum since fewer than 100 articles made it through our filtering process: Lingala had only 3 articles, and Oromo 29. MOT also includes data in Bambara, but it contains so few articles that none made it through the filtering process.

Dataset Description
LR-Sum includes 40 languages in total. We show various statistics of the dataset in Table 3. We measure mean length of articles and summaries in tokens. Compression is 1 minus the ratio between summary length and article length, as used by Bommasani and Cardie (2020) and Hasan et al. (2021). Mean novelty is the proportion of tokens in the summary that do not occur in the article. LR-Sum's measures are comparable with those of MLSUM (Scialom et al., 2020) and XL-Sum (Hasan et al., 2021) for languages shared between datasets. The overall mean article length for LR-Sum is 520.7 tokens and the overall mean summary length is 36.5. For comparison, MLSUM's English section has a mean article length of 709.2 and mean summary length of 55.6, while its Turkish section has a mean article length of 309.1 and mean summary length of 22.8.
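These two statistics are straightforward to compute from tokenized article-summary pairs. A minimal sketch (the function names are ours; the paper computes these over MOT's tokenization, whereas the usage below splits on whitespace purely for illustration):

```python
def compression(article_tokens, summary_tokens):
    # Compression = 1 - (summary length / article length),
    # following Bommasani and Cardie (2020).
    return 1 - len(summary_tokens) / len(article_tokens)

def novelty(article_tokens, summary_tokens):
    # Proportion of summary tokens that never appear in the article.
    article_vocab = set(article_tokens)
    novel = sum(1 for tok in summary_tokens if tok not in article_vocab)
    return novel / len(summary_tokens)
```

For example, a 10-token article with a 4-token summary containing 2 unseen tokens has compression 0.6 and novelty 0.5.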
LR-Sum includes fourteen languages that are not covered by XL-Sum. However, Dari Persian and Kinyarwanda are quite close to Persian Farsi and Kirundi, which are contained in XL-Sum. Seven of the remaining twelve languages have more than 1,000 article-summary pairs for training: Albanian, Bosnian, Khmer, Sorani Kurdish, Lao, Macedonian, and Northern Ndebele. Armenian, Georgian, Haitian Creole, and Shona have fewer than 1,000 training examples. Tibetan and Greek have fewer than 1,000 article-summary pairs overall, which is not enough for training and test splits. Instead, the Tibetan and Greek data could still be useful as test sets for automatic evaluation of models built for those languages, or for few-shot training.
LR-Sum includes languages that can be complementary to existing resources. For example, LR-Sum includes almost twice as many articles in Burmese as XL-Sum. For many languages (e.g., Turkish, Azerbaijani, Persian, Korean), adding LR-Sum to XL-Sum more than doubles the amount of data available in XL-Sum alone.
LR-Sum also has some unique subdivisions and special focuses for certain languages. Its English section can be subdivided into Zimbabwe- and Cambodia-focused sections. Similarly, the French and Portuguese found in LR-Sum tend to be news focused on Africa. Chinese is divided into simplified and traditional varieties. Kurdish is subdivided into the Kurmanji and Sorani dialects. LR-Sum treats Farsi and Dari as separate languages based on their provenance from separate VOA sites, despite their being largely mutually intelligible.

Dataset Splits
We report the size of the dataset splits for LR-Sum in Appendix 9. Splits are 80% train, 10% validation, and 10% test, except for languages where the number of examples was quite small. To ensure enough test and validation data when possible, in cases where the total was below 4,000 examples, we took 500 examples each for validation and test and left the rest for training. For languages where the total number of examples was fewer than 1,000, we only created test sets and did not create training or validation data (Amharic, Bangla, Greek, Hausa, Kinyarwanda, Somali, Swahili, Tibetan, and Tigrinya).
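The split rules above can be sketched as a small function. This is an illustration of the thresholds as described, not the released split code (which lives in the LR-Sum repository); the function name is ours.

```python
def split_sizes(n_examples):
    """Return (train, validation, test) sizes per the rules above."""
    if n_examples < 1000:
        # Too small for splits: the language becomes test-only.
        return 0, 0, n_examples
    if n_examples < 4000:
        # Fixed 500-example validation and test sets; rest is training.
        return n_examples - 1000, 500, 500
    # Standard 80/10/10 split.
    n_val = n_test = n_examples // 10
    return n_examples - n_val - n_test, n_val, n_test
```

For instance, a language with 2,000 examples gets a 1,000/500/500 split, while one with 800 examples contributes only a test set.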

Experiments
We conduct three experiments to demonstrate the usefulness of LR-Sum and establish baseline performance on the dataset. For all abstractive models trained, we use mT5 (Xue et al., 2021) as the base model. We report ROUGE-1, ROUGE-2 (R1, R2), and ROUGE-L (RL; Lin, 2004) scores.
1. We train individual baseline models for 12 less-resourced languages that are unique to LR-Sum and not present in XL-Sum.
2. We conduct a series of experiments with extractive models for the less-resourced languages unique to LR-Sum.
3. We train a multilingual model using the concatenation of LR-Sum and XL-Sum training data and compare it with a multilingual model checkpoint trained on XL-Sum alone. For this experiment, we evaluate both models on LR-Sum's test sets for all less-resourced languages.
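For reference, ROUGE-N measures n-gram overlap between a candidate and a reference summary. The paper uses standard ROUGE tooling; the sketch below is a naive ROUGE-1/2 F1 implementation with whitespace tokenization, included only to make the reported metrics concrete.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1 with clipped n-gram overlap, on a 0-100 scale."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n])
                       for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped matching counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100 * 2 * precision * recall / (precision + recall)
```

ROUGE-L, also reported in the paper, instead scores the longest common subsequence and is omitted from this sketch.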

Individual Models
We fine-tune 12 models, one for each of the less-resourced languages not present in XL-Sum. We use mT5 (Xue et al., 2021) as the base model. For these experiments, we use the same training script as Hasan et al. (2021), which is a modified version of a script from the Hugging Face Transformers library (Wolf et al., 2020), along with the same hyperparameter settings as Hasan et al. (2021). Details of the hyperparameters can be found in Appendix A.1.

Extractive Baselines
We conducted experiments to determine whether extractive approaches provide a strong summarization baseline. To demonstrate the strongest possible extractive performance, we also report the oracle, which here simply selects the single sentence in the article that produces the highest ROUGE score. We additionally report results for the LexRank (Erkan and Radev, 2004) and Luhn (Luhn, 1958) extractive methods, using the implementations in sumy (Belica, 2013). The sentence segmentation and tokenization from the MOT corpus were used for the extractive approaches requiring them.
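The lead-3 baseline and the single-sentence oracle can be sketched as follows. This is a simplified illustration: `score` stands in for a sentence-level ROUGE scorer supplied by the caller, and the function names are ours rather than from the released code.

```python
def lead_3(sentences):
    # Lead-3 baseline: the first three sentences of the article.
    return " ".join(sentences[:3])

def oracle_sentence(sentences, reference, score):
    # Oracle: the single article sentence with the highest ROUGE
    # score against the reference summary (an extractive upper bound).
    return max(sentences, key=lambda s: score(s, reference))
```

Because the oracle peeks at the reference summary, it is reported only as an upper bound on extractive performance, never as a usable system.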

Results and Discussion
Overall, we find that for some languages abstractive models fail to beat extractive ones, while for others extractive models and even the lead-3 baseline remain competitive. That simple baselines and extractive approaches can still outperform abstractive neural models demonstrates the potential of this corpus for further research on improving abstractive summarization in less-resourced settings.
Results comparing the different approaches for 12 languages are shown in Table 4. The multilingual models tend to produce higher scores, likely due to positive transfer between languages. However, the advantage is often only a few points beyond individual or extractive models. The results of combining datasets (Table 7) show how LR-Sum can be combined with existing summarization datasets like XL-Sum to improve the coverage of multilingual summarization models. The additional data from the concatenation of LR-Sum and XL-Sum shows an expected advantage for languages not seen by the XL-Sum-only multilingual model.

Individual Model Results
The results of training the individual models are shown in Tables 4 and 5. The scores are generally slightly lower than those of the multilingual model, with the exception of Albanian, Lao, and Northern Ndebele. Differences in training set size do not appear to be a factor in performance, potentially because all the training sets for these less-resourced languages are small compared to the hundreds of thousands of examples typical of datasets like MLSUM (Scialom et al., 2020). A language's presence in mT5's pre-training also does not appear to be indicative of better performance.

Extractive Results
The results for extractive models can be found in Table 6. The oracle gives a sense of the upper bound achievable with extractive models. The oracle's scores are higher than those of both the individual and multilingual abstractive models, which suggests there is plenty of room for improvement over the abstractive baselines.
For all the languages we evaluated, LexRank had higher ROUGE-1 scores than Luhn, though Luhn was slightly higher in ROUGE-2 and ROUGE-L for Haitian Creole, Bosnian, and Albanian. Lead-3 proves to be a strong baseline, scoring higher than the extractive models for RL and frequently for R1 and R2. In terms of R1, LexRank outperforms the individual abstractive models for Khmer, Georgian, Bosnian, Northern Ndebele, and Shona, but ROUGE-L scores tend to be higher for the individual abstractive models. The multilingual model still beats the lead-3 baseline except for Northern Ndebele and Khmer, as shown in Table 4.

Multilingual Model Results
Table 7 shows the results of mT5 (Xue et al., 2021) trained on the concatenation of training data from LR-Sum and XL-Sum, compared with the checkpoint of mT5 trained on XL-Sum only. As expected, languages not present in XL-Sum performed much better with the model trained on both datasets. Dari Persian did not improve, likely because Farsi is already represented in XL-Sum and the two languages are very similar. Scores for Greek and Tibetan were effectively zero, as LR-Sum contains only enough data in those languages for a test set, leaving no training data.
The results of adding training data for languages present in both datasets are more mixed. Despite both datasets being news data, differences in dialect, topic, or standardization may account for the differences. We discuss the performance of the two multilingual models evaluated on the XL-Sum test set in Appendix B.

Conclusions and Future Work
We have presented LR-Sum, a permissively-licensed summarization dataset for less-resourced languages based on news data. We have demonstrated LR-Sum's usefulness in augmenting the training data of other multilingual summarization models and shown its potential for further research in summarization for less-resourced languages. Even with the best-performing model, the results are only slightly higher than the lead-3 baseline, which indicates ample room for improvement and future research directions.
In future work, we plan to experiment with leveraging additional training data, such as the remaining portions of the MOT data that were not suitable for extracting summaries but may still be useful for fine-tuning a multilingual language model to perform better on certain less-resourced languages. LR-Sum also presents opportunities for few- and zero-shot experimentation for languages where there are not enough examples for training data but where the data that does exist may be useful as a test set. We look forward to collaborating with speakers of the languages included in LR-Sum to further increase the quality and quantity of summarization data for less-resourced languages.

Limitations
A limitation of this work is that the dataset has not yet been thoroughly vetted by native speakers of the languages it contains. We acknowledge the importance of working with native speakers and manually reviewing datasets in greater detail, as argued for by Kreutzer et al. (2022) and Lignos et al. (2022). We hope to do more manual review of LR-Sum and other summarization datasets in the near future.

Ethics Statement
Our work provides a dataset for further research on summarization for less-resourced languages. Automatic summarization has the potential to assist users in digesting information. It is our intention that providing a summarization dataset covering less-resourced languages will benefit speakers of languages that might otherwise not have had access to this technology. However, there is also cause for caution. Our results rely on automatic evaluation metrics, and the generated summaries have not yet been subjected to more rigorous human review. Even based on automated metrics alone, it is clear there is still room for improvement, as the models tend to score lower than their higher-resourced counterparts on similar tasks. Therefore, the models presented in this work should be considered baselines for further work. The dataset and models presented here are meant to support further research in summarization of less-resourced languages and are not intended for immediate deployment in applications.
In particular, abstractive summarization models, like most text generation models, can make factual errors, which have the potential to mislead or misinform. Additionally, both extractive and abstractive models may lack adequate context or miss important information. As mentioned in the Limitations section, this dataset, like most news summarization datasets, has not been fully manually reviewed and so may contain some erroneous summaries despite our best efforts.

Figure 1 provides a histogram of article lengths, and Figure 2 provides a histogram of summary lengths.

Figure 1: Histogram of article lengths in tokens.

Figure 2: Histogram of summary lengths in tokens.
Extractive approaches might work better given the small training set sizes: previous work by Nallapati et al. (2017), Narayan et al. (2018b), Zhang et al. (2018), and Scialom et al. (2020), among others, has shown that lead-3 (the first three sentences of the article) is a strong baseline.

Following Hasan et al. (2021)'s reported better performance with multilingual training, we train a multilingual model, but on the concatenation of the training sets of LR-Sum and XL-Sum. In this experiment we also use the same modified Hugging Face script (Wolf et al., 2020) that Hasan et al. (2021) use for training, along with the same hyperparameters they used for multilingual training. Hyperparameter settings can be found in Appendix A.2.

Table 2: Example summary and article pair from English. Colors mark approximate content equivalence between the summary and a portion of the article.

Table 3: Metrics across languages in the LR-Sum dataset. Compression ratio is the ratio of article length to summary length. Mean novelty is the mean proportion of tokens in the summary that do not occur in the article. Vocabulary is the number of unique tokens (types). All measures are computed over tokens.

Table 4: Comparison of different summarization approaches. Best scores in bold, excluding the oracle.

Table 5: Results of abstractive models for less-resourced languages of LR-Sum not also in XL-Sum, from fine-tuning mT5 on LR-Sum data. Whether a language is present in mT5 pre-training is marked with a check.

Table 6: Extractive model results on less-resourced languages not covered in XL-Sum. Results in bold are the best for R1, R2, and RL across approaches, excluding the oracle.

Table 7: Results from a multilingual model trained on both LR-Sum and XL-Sum data, compared with a multilingual model trained only on XL-Sum. We omit Tibetan and Greek from the results as they have only enough data for test sets. Higher-resourced languages are also omitted.