IndicNLG Benchmark: Multilingual Datasets for Diverse NLG Tasks in Indic Languages

Natural Language Generation (NLG) for non-English languages is hampered by the scarcity of datasets in these languages. We present the IndicNLG Benchmark, a collection of datasets for benchmarking NLG for 11 Indic languages. We focus on five diverse tasks, namely, biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, paraphrase generation, and question generation. We describe the created datasets and use them to benchmark the performance of several monolingual and multilingual baselines that leverage pre-trained sequence-to-sequence models. Our results demonstrate the strong performance of multilingual language-specific pre-trained models, and the utility of models trained on our datasets for other related NLG tasks. Our dataset creation methods can be easily applied to modest-resource languages as they involve simple steps such as scraping news articles and Wikipedia infoboxes, light cleaning, and pivoting through machine translation data. To the best of our knowledge, the IndicNLG Benchmark is the first NLG benchmark for Indic languages and the most diverse multilingual NLG dataset, with approximately 8M examples across 5 tasks and 11 languages. The datasets and models will be publicly available.


Introduction
NLG is the process of generating textual output (Gatt and Krahmer, 2018). Initial work on NLG focused on tabular input (Reiter and Dale, 1997), but in a general setting the input can also belong to one or more modalities such as text, images, videos and audio. NLG progress has long been hindered by data scarcity across tasks and languages, but recently the increasing availability of large-scale datasets (Narayan et al., 2018; Wiseman et al., 2017; Lebret et al., 2016), along with advances in neural networks pre-trained on large amounts of text (Lewis et al., 2020a; Raffel et al., 2020), has led to substantial progress in NLG.
Most of the aforementioned progress is for European languages, and especially for English, mainly because it is the lingua franca, making it easy to obtain data (Bender, 2019). However, English is not the native language of the vast majority of the world's population, who tend to use their country's or region's native languages on a daily basis. India, with its population of 1.4 billion people (18% of the world population), is a quintessential example: only 10% of the population speaks English, whereas a significant portion of the remaining 90% speaks one or more of the 22 'scheduled' Indian languages listed in the Constitution of India. It is not surprising that most people in India tend to consume literature and media in Indian languages rather than English. Therefore, we believe that it is important to focus on Indic NLG, which lacks datasets for diverse NLG tasks.
Table 1: A summary of the 5 tasks and 11 languages (L) covered by the IndicNLG Benchmark, where L = {as, bn, gu, hi, kn, ml, mr, or, pa, ta, te}. The communicative intent, inputs and total corpora sizes are given.

Given that there are no existing or widely used datasets for diverse Indic NLG tasks, this paper aims to fill this gap via the IndicNLG Benchmark,
where we create new datasets for 11 Indic languages. The 11 languages belong to two language families: Indo-Aryan and Dravidian. Dravidian languages are agglutinative, while Indo-Aryan languages are mostly not, and the two families differ in many other aspects such as gender agreement and core vocabularies; the word order (SOV), however, is the same. While the two families have distinct core vocabularies, they share vocabulary on account of borrowings. Many other similarities have arisen through convergence of properties over time, which is why the two language families are considered part of the Indian subcontinent linguistic area (Emeneau, 1956). The IndicNLG Benchmark spans five NLG tasks: biography generation using Wikipedia infoboxes, news headline generation, sentence summarization, question generation and paraphrase generation. We also train a variety of models focusing on pre-training and multilingualism to establish strong baselines for the benchmark. Our main contributions are:
1. We create the IndicNLG Benchmark, a collection of NLG datasets for five diverse NLG tasks spanning 11 languages from the two major language families (Indo-Aryan and Dravidian) in India. Table 1 summarizes key characteristics of the benchmark.
2. It is the largest and linguistically most diverse multilingual NLG dataset, comprising a total of 8.5M examples across 11 languages and 5 tasks (∼55K to ∼5.57M examples per task-language pair), opening up possibilities for multilingual NLG research.
3. We provide strong baselines for all tasks and languages by leveraging multilingual pre-trained models for multilingual fine-tuning, which show clear evidence of the advantage of language group-specific pre-trained models over language-agnostic ones.
4. We show the utility of our models built using mined datasets for improving performance on related NLG tasks via transfer learning.
The rest of the paper is organized as follows: Section 2 describes related work. Section 3 describes the IndicNLG Benchmark, where the tasks, datasets and their creation are explained along with important quantitative and qualitative statistics. This is followed by Section 5, where we describe the experimental settings for benchmarking the performance of our NLG models on the various tasks. Section 6 contains results and analyses. We end the paper with general and task-specific summaries in Section 7 and outline several future directions we plan to pursue. The appendices (Sections A, B.1, B.2, B.3, B.4 and B.5) contain additional results and analyses for interested readers.

Related Work
This paper focuses on data creation, modeling and benchmarking for NLG using pre-trained models and multilingualism.

NLG Benchmarks
Gehrmann et al. (2021) create a benchmark, GEM, for NLG tasks such as extreme summarization (Hasan et al., 2021; Scialom et al., 2020; Narayan et al., 2018), data-to-text generation (Gardent et al., 2017; Parikh et al., 2020; Nan et al., 2021; Dušek et al., 2020), and cross-lingual summarization (Ladhak et al., 2020). In addition, they aim to establish baseline models along with automatic and human evaluations. However, the focus of the GEM benchmark is predominantly English (7 out of 11 tasks). Cahyawijaya et al. (2021) propose an NLG benchmark for three Indonesian languages. Concurrent with our work, Guan et al. (2022) propose an NLG benchmark for Chinese long text NLG comprising two tasks. In addition, Chen et al. (2022) propose an NLG benchmark for three languages (fr, de and es) along with en, covering three tasks: story generation, headline generation and question generation. In contrast, our IndicNLG Benchmark covers 11 Indic languages and five tasks, making it the first for Indic languages as well as, to the best of our knowledge, the most linguistically diverse NLG benchmark. The IndicNLG Benchmark complements the IndicGLUE benchmark (Kakwani et al., 2020) for Indic natural language understanding (NLU).

Pre-trained models and multilingualism
The availability of pre-trained models, typically trained using unsupervised approaches and monolingual data, helps reduce the requirement of large amounts of (supervised) fine-tuning data for a given downstream task. In this context, T5 (Raffel et al., 2020), mT5 (Xue et al., 2021), BART (Lewis et al., 2020a), mBART-25 (Liu et al., 2020) and mBART-50 (Tang et al., 2021) are commonly used for fine-tuning. More recently, Dabre et al. (2022) introduce a pre-trained sequence-to-sequence model for Indic languages, which we use in this paper. Previous research in machine translation and summarization has shown that multilingual models fine-tuned from pre-trained models tend to yield the best results (Hasan et al., 2021; Ramesh et al., 2022), a direction we also follow.

IndicNLG Benchmark
The IndicNLG Benchmark is a collection of datasets which we use to benchmark NLG performance for 5 NLG tasks spanning 11 Indic languages. In this section, we describe the datasets and their sizes.

Tasks and Languages Choice Criteria
Task choice depends on language coverage, task coverage and practical applications. Regarding language choice, our priority is to include as many languages as possible; we are currently limited to 11.

IndicNLG Tasks
We focus on biography generation (BG) using Wikipedia infoboxes (WikiBio), news headline generation (HG), sentence summarization (SS), paraphrase generation (PG) and question generation (QG). Dataset sizes in number of examples for each task and language are given in Table 2. Except for WikiBio, datasets are available for all 11 Indian languages of interest, namely, Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Odia (or), Punjabi (pa), Tamil (ta) and Telugu (te). All sizes reported are after deduplication. Due to lack of space, we only give the important details, and we encourage readers to check Appendices B.1, B.2, B.3, B.4 and B.5 for the BG, HG, SS, PG and QG tasks, respectively, for further details regarding the dataset construction and cleaning process, quantitative and qualitative statistics, and examples.

Biography Generation (WikiBio)
The WikiBio task was first proposed for English: given the Wikipedia infobox of a person, the objective is to generate the first sentence of their Wikipedia page (Lebret et al., 2016). An infobox is a table containing facts in a key-value format, and the target output is a summary of the infobox. To create the datasets, we crawl the Wikipedia pages of the aforementioned languages, except Marathi and Gujarati, and preprocess and filter them to ensure high quality. The "BG" column in Table 2 gives the statistics of the final corpora. We extracted a total of 57,426 examples, with Assamese and Tamil having the fewest and most examples, respectively. The English WikiBio dataset contains 728,321 examples, which shows that our dataset, at ∼6% of that size, is very low-resource.

Headline Generation
Headline generation is the task in which, given an article, the objective is to generate an appropriate sentence, a title, that accurately depicts the article (Banko et al., 2000). The headline should be able to draw the reader's attention while compressing information from several hundreds of words into a single sentence. The raw data for Hindi is crawled from HTML web pages of various domains like Dainik Bhaskar, Naidunia, NDTV, Business Standard and IndiaTV to ensure content diversity. We extract document and headline pairs and filter noisy examples. For other languages, we used the Headline Prediction dataset from the IndicGLUE benchmark (Kakwani et al., 2020), where we chose the document as the input and the correct headline as the output.
The column "HG" in Table 2 gives the statistics of the final corpora. There are a total of 1.31M examples, with Hindi containing the most (297K) and Malayalam the least (20K). The corresponding Indic portion of the XL-Sum dataset (Hasan et al., 2021) provides a point of comparison.

Sentence Summarization
Sentence summarization involves compressing the information of a reasonably long sentence into a shorter, compact sentence (Rush et al., 2015; Chopra et al., 2016). Following Rush et al. (2015), we create a sentence summarization dataset where the input is the first sentence of a news article and the output is its headline. The intuition is that the first sentence in a news article often expands upon the information in the headline, which makes the headline a summary of the first sentence. We simply re-process the headline generation dataset by extracting the first sentence and headline pairs to create our sentence summarization dataset. However, not all first sentences were valid document summaries, and we discard such examples to ensure high quality.
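The re-processing step described above can be sketched as follows. This is an illustrative stand-in, not the authors' exact pipeline: the function name is hypothetical, the sentence splitter is a simplified regex (Indic scripts also use the danda, U+0964, as a terminator), and the filtering of invalid first-sentence/headline pairs is omitted.

```python
import re

def make_sentence_summarization_pair(article, headline):
    """Derive a (first sentence, headline) pair from a headline-generation
    example. Splits on whitespace that follows a sentence terminator
    ('.', '?', '!' or the danda)."""
    sentences = [s for s in re.split(r"(?<=[।.?!])\s+", article.strip()) if s]
    if not sentences:
        return None
    return sentences[0], headline

pair = make_sentence_summarization_pair(
    "The government announced a new policy today. Details will follow tomorrow.",
    "Government announces new policy",
)
```

In the actual dataset, pairs where the headline is not a valid summary of the first sentence would additionally be discarded.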
The column "SS" in Table 2 gives the corpora statistics, where we have a total of 431K examples. The count of examples in the training set ranges from 5.9K for Malayalam (least) to 112K for Hindi (most). Although this dataset is derived from the headline generation dataset, the number of examples is far fewer than the latter, which has 1.31M examples. Nevertheless, there are more examples than in the XL-Sum counterpart, which contains 167K examples. However, the Gigaword corpus for English sentence summarization (Rush et al., 2015), with over 4M examples, is almost an order of magnitude larger.

Paraphrase Generation
Paraphrase generation, or paraphrasing (McKeown, 1983; Barzilay and Lee, 2003), is the task of transforming a sentence into a different sentence in the same language while preserving its meaning and semantics. A paraphrasing system is important as it enables the generation of alternatives for a given sentence. Following Zhao et al. (2008), we use the pivoting approach to extract paraphrases from a parallel corpus. The intuition is that sentences are paraphrases if they have the same translation. To this end, we use the Samanantar corpus (Ramesh et al., 2022), which contains parallel corpora between English and all 11 Indic languages of interest. Using English as the pivot language, we extract paraphrases for each language. Since this approach can lead to multiple paraphrases with the same meaning, we choose one as the input and then retain up to 5 paraphrases ordered from lexically dissimilar to similar. We hope this will enable learning paraphrasing systems that can generate diverse paraphrases depending on the user's needs.
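The pivoting idea can be sketched as below. This is an illustrative sketch rather than the authors' exact procedure: the function name is hypothetical, and the lexical-dissimilarity ordering is approximated here with a simple token-overlap score.

```python
from collections import defaultdict

def mine_paraphrases(parallel_pairs, max_refs=5):
    """Group Indic sentences that share the same English translation (the
    pivot). Returns {input_sentence: [reference paraphrases]}, with
    references ordered from most to least lexically dissimilar."""
    by_pivot = defaultdict(set)
    for english, indic in parallel_pairs:
        by_pivot[english].add(indic)

    examples = {}
    for variants in by_pivot.values():
        variants = sorted(variants)          # deterministic input choice
        if len(variants) < 2:
            continue                         # no paraphrase available
        src, rest = variants[0], variants[1:]
        src_tokens = set(src.split())
        # most dissimilar (lowest token overlap with the input) first
        rest.sort(key=lambda s: len(src_tokens & set(s.split())))
        examples[src] = rest[:max_refs]
    return examples
```

For example, two Indic sentences both aligned to the same English sentence in Samanantar would end up as an input/reference paraphrase pair.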
In Table 2, the "PP" column gives the corpora statistics, indicating a total of 5.57M examples, making this the most resource-rich task in this benchmark. Each example is a tuple of an input sentence and its reference paraphrases. Sizes vary strongly between languages: Hindi has 950K paraphrases while Assamese has 8,840. Compared with the OpusParcus corpus (Creutz, 2018) for 6 European languages, our dataset is almost 3 orders of magnitude larger and has up to 5 references per example, where the latter has only 1. However, OpusParcus has been fully manually checked, whereas ours has not.

Question Generation
Question generation is the task of generating a question given some context and the answer (Du et al., 2017; Zhou et al., 2017). Question generation can be extremely useful to teachers in designing examination questions given some fixed answer. Unlike the previous tasks, creating data for this task is quite expensive, and thus we rely on machine translation. Following earlier work (Du et al., 2017; Zhou et al., 2017; Dong et al., 2019, inter alia), we start with the SQuAD training and development question answering sets (Rajpurkar et al., 2016) and repurpose them to serve as a question generation dataset. Specifically, we extract the question, its answer and the sentence containing the answer. We designate the input to question generation as the sentence along with the answer; the question serves as the output. We then translate the data into Indic languages using the IndicTrans (Ramesh et al., 2022) English-to-Indic model. We end up with 98K examples per language and 1M in total across all languages. Previous work has used IndicTrans to create test sets and found the translation to be of good quality (IndicXNLI; Aggarwal et al., 2022).
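The repurposing of a SQuAD record (before translation) can be sketched as follows. The function name, the sentence-splitting heuristic and the "[SEP]" input format are assumptions for illustration; the paper does not specify the exact input encoding.

```python
import re

def squad_to_qg_example(context, question, answer_text, answer_start):
    """Turn a SQuAD-style QA record into a question-generation example:
    input = the sentence containing the answer plus the answer,
    output = the question. answer_start is a character offset into context."""
    # Compute sentence spans and find the one covering answer_start.
    bounds, start = [], 0
    for m in re.finditer(r"(?<=[.?!])\s+", context):
        bounds.append((start, m.start()))
        start = m.end()
    bounds.append((start, len(context)))
    sentence = next(context[b:e] for b, e in bounds if b <= answer_start < e)
    return {"input": f"{sentence} [SEP] {answer_text}", "output": question}
```

Each such example would then be machine-translated into the 11 Indic languages with IndicTrans.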

Summary of Datasets
In summary, the IndicNLG Benchmark contains diverse NLG tasks for 11 Indic languages that vary in their linguistic characteristics and resource availability. While the corpora are smaller than their English counterparts in some cases, they are of reasonable size for building a benchmark dataset. Moreover, the relatedness of Indic languages opens up the possibility of training multilingual generation models. In the appendices, we also report extensive metrics to quantify characteristics of the datasets. The metrics show that the datasets are as challenging (if not more so) as standard English datasets for these tasks, as measured by n-gram novelty and simple baseline approaches (see Tables 10, 15, 20, 24 and 30 for individual dataset qualitative metrics).

Dataset Quality
We study the quality of our automatically created datasets by conducting a human evaluation exercise. We choose two languages from each of the Indo-Aryan and Dravidian language families as representatives: hi and bn from Indo-Aryan, and ml and kn from Dravidian. Due to a limited annotation budget, we conduct this study for only three tasks (WikiBio, Headline Generation and Paraphrasing).
We annotate 250 examples each for WikiBio and Paraphrasing, and 100 examples for Headline Generation (the examples are longer here, so we annotate fewer to save on annotation cost). For each language, we hire two native-language annotators and pay them wages above the minimum hourly wage. Table 3 contains the results of our human evaluation exercise, the details of which are described below.
For the annotation of WikiBio and Headline Generation, we follow the human evaluation setup of Hasan et al. (2021). Specifically, we ask raters to annotate for three properties. Property A is Yes if the output and input pair are aligned. Property B is Yes if the output contains information inconsistent with the input. Property C is Yes if the output contains extra information that cannot be inferred from the input.
The values reported for Properties A, B and C are the ratios of 'Yes' answers given by the annotators. It is desirable that values of Property A be high, and values of Properties B and C be low. We find that Property A is greater than 90%, indicating a high match between the input and output example pairs. Property C values are greater than 10% in WikiBio, indicating extra information present in the output, an observation also made in Hasan et al. (2021). We find that the amount of extra information present in the output of the Indic WikiBio dataset is similar in nature to abstractive summarization datasets such as XSum (71.7% from Table 3 in Hasan et al. 2021). In addition, as discussed in Hasan et al. (2021), large pre-trained models are able to make use of external information from the texts these models were trained on. So, we believe that the extra information present in the outputs will not have a large adverse effect on the quality of the datasets. Property C values are lower than 10% in Headline Generation. Property B is relatively low (around 5%), as desired, for Headline Generation; however, it is higher for the WikiBio dataset.
For the annotation of the Paraphrasing data, we follow the setup of Cer et al. (2017) and adopt fine-grained labels for the similarity between input and output pairs, with a value of 5 indicating a perfect match, down to 0 indicating unrelated pairs (see Table 1 of Cer et al. (2017) for a detailed description of the labels). We see that across languages, the input and output share good similarity (at least 50% of pairs rated ≥ 3); however, we do note relatively more noise for kn and ml (with high values of Label 0).
We release the annotated data along with the datasets and the trained models.

Experimental Settings
We establish strong baseline models using pretrained models and multilingualism.We describe the experimental settings for generating benchmark scores for all the tasks.

Datasets
We use the aforementioned datasets we created for our experiments and split them into 3 parts, roughly 80%, 10% and 10% for training, development and testing, respectively (with some exceptions). Details of the splits are in Appendix B.5 and Appendix Tables 10, 13, 18 and 23.
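The 80/10/10 split can be sketched as below; the shuffling, seed and exact boundary handling are illustrative assumptions (the paper notes some exceptions to these proportions).

```python
import random

def split_dataset(examples, seed=0):
    """Split examples into roughly 80% train, 10% dev, 10% test."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    return (examples[:n_train],
            examples[n_train:n_train + n_dev],
            examples[n_train + n_dev:])
```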

Models Compared
We compare monolingual and multilingual fine-tuning of multilingual pre-trained models; these strategies are important for low-resource languages. By monolingual models, we mean models fine-tuned on data for one language. To specify the language, we prefix each example with a language code. A multilingual model is fine-tuned on the dataset obtained by combining all the languages' data.
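The two fine-tuning regimes can be sketched as follows. The function name and the exact tag format (here "<2xx>") are assumptions for illustration; the point is that every example carries its language code, and the multilingual corpus simply pools all languages.

```python
def build_finetuning_corpus(datasets, multilingual=True, language=None):
    """`datasets` maps a language code (e.g. "hi") to (input, target)
    pairs. Each input is prefixed with a language-code tag; a monolingual
    corpus keeps one language, a multilingual corpus pools all of them."""
    selected = datasets if multilingual else {language: datasets[language]}
    corpus = []
    for code, pairs in sorted(selected.items()):
        for src, tgt in pairs:
            corpus.append((f"<2{code}> {src}", tgt))
    return corpus
```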

Pre-trained Models Used
Our experiments involve fine-tuning pre-trained encoder-decoder Transformer (Vaswani et al., 2017) models. We compare IndicBART (Dabre et al., 2021), a language group-specific model pre-trained specifically for Indic languages, with mT5, a general pre-trained model for 100+ languages. We used mT5 instead of mBART (Liu et al., 2020) since it covers all languages in our dataset.
IndicBART: IndicBART (Dabre et al., 2021) is a pre-trained model that covers all 11 Indic languages in this paper, trained in the same way as mBART (Liu et al., 2020). It has two versions, one using the Devanagari script for all languages and another using each language's original script.
mT5: mT5 (Xue et al., 2021) is a multilingual model covering 101 languages, trained using the span-prediction denoising approach. We choose the mT5-small model, containing 300M parameters, for a fair comparison with the IndicBART model, which contains 244M parameters.

Training Settings
As much as possible, we tune hyperparameters when we fine-tune models, referring to settings in Dabre et al. (2021) and Xue et al. (2021).We give details about hyperparameter settings and training convergence in Appendix A.

Evaluation Metrics
For all tasks except paraphrasing, we report the Rouge-L F1 score (Lin, 2004). To compute Rouge scores for the decoded test sets, we use the multilingual Rouge scoring implementation (Hasan et al., 2021), which supports segmentation, stemming and punctuation normalization for various languages. For paraphrasing, we compute iBLEU (Sun and Zhou, 2012) following Hosking and Lapata (2021) using the equation: iBLEU = α · BLEU(O, R) − (1 − α) · BLEU(O, I), where O is the output, R the references, I the input and α = 0.7. BLEU is calculated using sacreBLEU (Post, 2018). Higher iBLEU implies better paraphrasing.
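The iBLEU combination itself is a simple weighted difference. In the sketch below the two BLEU scores are passed in directly; in practice they would come from sacreBLEU, computing BLEU(output, references) and BLEU(output, input) separately.

```python
def ibleu(bleu_output_refs, bleu_output_input, alpha=0.7):
    """iBLEU (Sun and Zhou, 2012): reward similarity to the references
    while penalizing trivial copying of the input."""
    return alpha * bleu_output_refs - (1 - alpha) * bleu_output_input
```

For instance, a system scoring BLEU 40 against the references but BLEU 20 against its own input gets iBLEU 0.7 · 40 − 0.3 · 20 = 22.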

Results and Analysis
In this section, we present the results obtained using models trained for a variety of tasks and analyze them from various perspectives. We compare models fine-tuned on IndicBART (IB), separate script IndicBART (SSIB) and mT5 in monolingual and multilingual settings.

Research Questions
In the analysis presented below, we try to answer the following research questions:
Impact of multilingualism: How do monolingual models compare with multilingual models?
Impact of language family: Are language family-specific pre-trained models (IndicBART) better than universal pre-trained models (mT5)?
Impact of task nature on performance: What determines task performance? For this, we compare across tasks and languages and provide insights.

Main Results
Table 4: iBLEU scores for paraphrasing and Rouge-L scores for biography generation (WikiBio), headline generation, sentence summarization and question generation. We report scores for 5 languages: Assamese (as), Hindi (hi), Oriya (or), Tamil (ta) and Telugu (te). We compare models fine-tuned on mT5, separate script IndicBART (SSIB) and single script IndicBART (IB) in monolingual and multilingual settings.

We report the Rouge-L scores for all tasks in Table 4. To save space, we report on 5 languages: Assamese (as), Hindi (hi), Oriya (or), Tamil (ta) and Telugu (te), and give the rest in the Appendix, which also shows the impact of pre-training by comparing against models trained from scratch.
Impact of multilingualism: Multilingual models are superior to monolingual models, regardless of whether they are trained from scratch or fine-tuned. This shows that multilingualism enables transfer learning, giving stronger baselines. The only exception is headline generation; we suspect that this is due to poor hyperparameter choices.
Impact of language family: In monolingual settings, with a few exceptions, fine-tuning IndicBART gives substantially better results than fine-tuning mT5. In multilingual settings, the gap narrows and, in several cases, fine-tuned mT5 outperforms fine-tuned IndicBART. Note that mT5 contains 300M parameters, whereas IndicBART has 244M. In monolingual settings, a language family-specific pre-trained model thus appears more beneficial, in addition to being more cost-effective, compared to a generic pre-trained model. However, in multilingual settings, a larger model might be better, regardless of its generic nature, as the additional parameters can be better utilized to learn from the increased data.
Impact of task nature on performance: Paraphrasing is a challenging task, as the model should learn to produce diverse paraphrases and not just make trivial changes to the input. Question generation also proves to be a challenging task, as evidenced by the relatively low Rouge-L scores, close to 26.
Additional evaluation of QG models: We also evaluate our fine-tuned QG models on publicly available manually created test sets: chaii (Google, 2021), MLQA (Lewis et al., 2020b), TyDi QA (Clark et al., 2020) and XQuAD (Artetxe et al., 2020) (results in Table 5). Note that these test sets cover only 4 languages. Results on the translated and manual test sets indicate broadly similar relative rankings of models, suggesting that translated test sets can serve as a reasonable proxy when manual test sets are unavailable. However, score differences between mT5 and IndicBART models are significantly larger on the manual test sets.

Summary of IndicNLG Benchmarking
Table 6 gives an overview of average performance across all languages for each task in monolingual and multilingual settings. This helps us answer the research questions we posed earlier. Compared to the fine-grained view in Table 4, the observations still hold on average, which shows that multilingual fine-tuning of IndicBART models is highly useful regardless of language or task. Headline generation is an exception to this rule, but even in this task, IndicBART models achieve higher scores. The overall competitiveness or superiority of IndicBART indicates the importance of language group-specific pre-trained models for NLG.
We also observe that task performance varies with language as well as task difficulty. According to scores alone, the perceived difficulty of the tasks, from easiest to hardest, is: paraphrasing, question generation, headline generation, biography generation and sentence summarization.

Transfer Learning across tasks
We study whether our models for one NLG task can benefit another. As a case study, we explore whether our fine-tuned headline generation models can be further fine-tuned to improve extreme document summarization. Extreme summarization is the task of generating a short, often one-sentence, summary of a news article (Narayan et al., 2018; Hasan et al., 2021). Headline generation also compresses an article into a few words and can be seen as a complementary task to extreme summarization. Specifically, for this study we focus on the Indic languages (bn, gu, hi, mr, pa, ta, te) in the XL-Sum dataset (Hasan et al., 2021).
Table 7 shows the results for (a) the zero-shot evaluation of multilingual headline generation models on XL-Sum summarization (IB → HG) and (b) supervised results using the XL-Sum training data for multilingual fine-tuning of IndicBART (IB → XL) and of the headline generation models (IB → HG → XL). We see that it is possible to obtain reasonable performance on summarization using a headline generation model as-is, although it is better to fine-tune IndicBART on the summarization data to get a boost of around 3 Rouge-L. However, fine-tuning the headline generation model significantly improves summarization (an average improvement of 5 Rouge-L), showing that our headline generation models may be used as pre-trained models for other downstream tasks.

Conclusion and Future Work
We present the IndicNLG Benchmark, a collection of datasets for 5 diverse NLG tasks in 11 Indic languages with 8M examples, with the aim of creating much-required standard benchmarks to drive Indic NLG research. To the best of our knowledge, this is the most diverse multilingual NLG dataset. Our methods are simple enough to create similar datasets for modest-resource languages.
We trained a variety of monolingual and multilingual models and showed the impact of the combination of multilingualism and pre-trained models.
In general, multilingual models outperform their monolingual counterparts. Language family-specific pre-trained models like IndicBART are valuable as they are competitive with large multilingual models like mT5 despite needing only 14% of the compute (147B vs 1T training tokens), while having smaller vocabularies (64K vs 250K) and fewer parameters (244M vs 300M). Given these observations, we recommend that future baselines consider multilingual fine-tuning of language family-specific models as the starting point. Future work will also involve extending the datasets to additional Indic languages and new tasks.

Limitations
This paper describes methods for data creation for 11 Indic languages for the purposes of natural language generation, along with modeling recommendations. The following are the limitations: 1. Data creation relies on resources like parallel corpora, monolingual corpora and Wikipedia of reasonable sizes. Our approach may not apply to languages where such resources are scarce.
2. For researchers with limited computational resources, training multilingual models may be time-consuming, especially given the sizes of datasets for paraphrasing and headline generation.
3. Evaluation for NLG is still not deeply explored and the metrics we use might not be the best ones, although they are widely used in existing literature.
4. The question generation dataset is generated using machine translation, and it might consist of translationese.

A Hyperparameter Tuning
We provide details of the model hyperparameters used for training. All our models were fine-tuned on single A100 GPUs, regardless of monolingual or multilingual fine-tuning. For fine-tuning IndicBART, we followed the settings recommended by Dabre et al. (2022), and for fine-tuning mT5-small we followed those recommended by Xue et al. (2021). For IndicBART we train until convergence of the validation set scores, which are computed via greedy decoding every 1,000 batches. On the other hand, we train mT5 for 10 epochs and choose the checkpoint with the highest validation set scores. In the case of multilingual models, convergence is determined by the average of validation set scores across all languages.
For decoding the test sets, we use beam search. The following are the task-specific hyperparameters we determined to be optimal:
Biography Generation (WikiBio): For IndicBART fine-tuning we use batch sizes of 4,096 tokens, dropout of 0.1, label smoothing of 0.1, a learning rate of 0.0003 and weight decay of 0.00001 with the ADAM optimizer. For mT5, we used most of the default hyperparameters in the fine-tuning script, with exceptions: batch sizes of 32 examples and a learning rate of 0.00005 with the ADAM optimizer. We use maximum source and target lengths of 512 and 64, respectively, for both models. For decoding the test sets, we use beam search with a beam of size 4 and a length penalty of 1.0, and we penalize translations in the beam where 4-grams are repeated.
Headline Generation: Settings for IndicBART are the same as for WikiBio. We train monolingual mT5 models for 10 epochs with a learning rate of 5e-4 and weight decay of 0.01. Multilingual mT5 models were trained for 15 epochs as our headline generation data is very large. For decoding the test sets, we use beam search with a beam of size 5 and a length penalty of 1.0, and we penalize translations in the beam where 4-grams are repeated.
Sentence Summarization: The settings are the same as in WikiBio.For decoding the test sets, we use beam search with a beam of size 5, length penalty of 0.8 and penalize translations in the beam where 3-grams are repeated.
Paraphrase Generation: We use the same settings as for WikiBio.For decoding the test sets, we use beam search with a beam of size 4, length penalty of 1.0, and we do not use repetition penalties.
Question Generation: Other than using maximum source and target lengths of 256 and 32, respectively, we use the same settings as for WikiBio. For decoding the test sets, we use beam search with a beam size of 4 and a length penalty of 1.0, and we do not use repetition penalties.
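For reference, the per-task decoding settings above can be collected into a small lookup helper. This is an illustrative sketch assuming a Hugging Face transformers-style `generate()` keyword interface; the helper name and dictionary structure are ours, not from the paper.

```python
# Per-task beam-search settings reported above, expressed as kwargs for a
# transformers-style model.generate() call (hypothetical helper).
DECODING_CONFIGS = {
    "wikibio":       {"num_beams": 4, "length_penalty": 1.0, "no_repeat_ngram_size": 4},
    "headline":      {"num_beams": 5, "length_penalty": 1.0, "no_repeat_ngram_size": 4},
    "summarization": {"num_beams": 5, "length_penalty": 0.8, "no_repeat_ngram_size": 3},
    "paraphrase":    {"num_beams": 4, "length_penalty": 1.0, "no_repeat_ngram_size": 0},
    "question_gen":  {"num_beams": 4, "length_penalty": 1.0, "no_repeat_ngram_size": 0},
}

def decoding_kwargs(task: str) -> dict:
    """Return the beam-search kwargs for a task; a size of 0 means
    no repetition penalty is applied, so the key is dropped."""
    cfg = dict(DECODING_CONFIGS[task])
    if cfg["no_repeat_ngram_size"] == 0:
        del cfg["no_repeat_ngram_size"]
    return cfg
```

These kwargs could then be passed as `model.generate(**decoding_kwargs("headline"), ...)` in a transformers-based setup.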

B Dataset and Experiment Details
In Appendices B.1 to B.5, we provide additional details on the creation of the datasets, various dataset statistics, detailed results for all tasks, and sample outputs.

B.1 WikiBio
Dataset Creation The dataset creation process involves the collection of raw data, noise removal, serialization, filtering, cross-lingual overlap removal, and splitting.

Raw Data Collection:
We download the Wikipedia dumps for each language, parse them, and save the information (metadata) of pages that have infoboxes. Next, we use the wptools API to extract the infoboxes and the first lines of pages about people. The first sentence is expected to be a simple biography of the person the page is about.
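The paper extracts infoboxes with the wptools API; as a rough, self-contained illustration of what that extraction step recovers, a minimal parser for simple `| key = value` infobox wikitext might look like the following (the function and the sample snippet are hypothetical):

```python
import re

def parse_infobox(wikitext: str) -> dict:
    """Very rough parser for simple '| key = value' infobox lines.
    Illustrative only; the actual pipeline uses the wptools API."""
    fields = {}
    for line in wikitext.splitlines():
        m = re.match(r"\|\s*([^=|]+?)\s*=\s*(.+)", line)
        if m:
            # strip [[...]] link brackets, as is also done during cleaning
            value = re.sub(r"\[\[|\]\]", "", m.group(2)).strip()
            fields[m.group(1).strip()] = value
    return fields

# Hypothetical wikitext snippet for illustration:
sample = """{{Infobox person
| name = Amitabh Bachchan
| spouse = [[Jaya Bachchan]]
| father = [[Harivansh Rai Bachchan]]
}}"""
```

Real infobox markup contains nested templates and references that this sketch does not handle.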

Field | Value | Transliteration
name | अमिताभ बच्चन | Amitabh Bachchan
spouse | जया बच्चन | Jaya Bachchan
father | हरिवंश राय बच्चन | Harivansh Rai Bachchan

Table 8: A WikiBio infobox snippet for the Indian actor Amitabh Bachchan. The complete infobox contains several facts as key-value pairs.
Dataset Preprocessing: The extracted data is rather noisy: it contains spurious newline characters and special characters, and the values of some infobox fields are enclosed in double square brackets ([[ ]]). This necessitates cleaning of the infoboxes and the single sentences we extract. Following cleaning, we serialize each infobox to convert it into a text sequence, thereby transforming data-to-text generation into a text-to-text generation setup (Kale and Rastogi, 2020; Puduppully and Lapata, 2021). We separate attribute names from values by enclosing the attribute names within special tokens. For example, the infobox snippet in Table 8 (excluding the transliteration column) is converted into "<TAG> name </TAG> अमिताभ बच्चन <TAG> spouse </TAG> जया बच्चन <TAG> father </TAG> हरिवंश राय बच्चन", where "<TAG>" and "</TAG>" are tokens indicating that the content enclosed between them is a field name.
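A minimal sketch of this linearization step, using the <TAG> tokens described above (the helper name and reliance on dictionary field order are our own choices):

```python
def serialize_infobox(fields: dict) -> str:
    """Enclose each attribute name in <TAG> ... </TAG> special tokens
    and concatenate the attribute-value pairs into one text sequence."""
    return " ".join(f"<TAG> {key} </TAG> {value}" for key, value in fields.items())

# The Table 8 snippet (transliteration column excluded):
box = {"name": "अमिताभ बच्चन", "spouse": "जया बच्चन", "father": "हरिवंश राय बच्चन"}
```

Calling `serialize_infobox(box)` yields the serialized sequence shown in the text above.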
We perform spelling normalization of the words in the infoboxes and the sentences describing them. Normalizing text written in Indic scripts helps handle text that exhibits quirky behavior on account of varying input methods, multiple representations of the same character, etc. Canonicalizing the text representation lets NLP applications handle the data consistently.

Dataset cleaning and splitting: We discard examples that satisfy any of the following criteria: the output sentence contains fewer than 5 tokens, the name field is missing from the input infobox, the example is a duplicate, or the person's name is in English. Furthermore, we clean the dataset to ensure there is no leakage when training models in a multilingual setting; otherwise, an example in one language in the training set may have its equivalent in another language in the validation or test set.

Dataset Statistics The dataset splits for each language are given in Table 9. We compare the example counts with those of the English WikiBio dataset (Lebret et al., 2016). We see that the Indic WikiBio datasets are low-resource compared to English WikiBio, with the total number of examples being ∼6% of the size of English WikiBio.
In Table 10, we present some quantitative statistics, where we can see that the average word counts of the input and output, the count of attribute-value pairs, the number of common words, and the overlap percentage are comparable between the Indic and English WikiBio datasets.

Results: We report the ROUGE-L scores on the test set for all 9 languages in Table 11.

Table 10: Quantitative statistics of the WikiBio dataset for English and 9 Indian languages. We report the average number of words in the input (W^I_avg), the average number of words in the output (W^O_avg), the average number of field-value pairs (W^FV_avg), the average number of words common between input and output (W^Common_avg), and the percentage of words common between input and output (% overlap). The average statistics (last row) do not include English.

B.2 Headline Generation
Dataset Creation: As mentioned in Section 3.2.2, the raw data for the Hindi dataset is crawled from web pages spanning different domains.
For the other languages, we use the IndicGLUE headline classification dataset. Extracting the article-title pair from the IndicGLUE dataset is straightforward: the correct headline option is the title, and the article is used as-is. For the Hindi dataset, however, we manually inspected each domain to derive the logic for extracting the article-title pair: in some domains the title is the first sentence of the body field, whereas in others it appears somewhere in the middle of the body field. The news articles themselves are likewise embedded in these body fields. Hence, extracting the actual dataset from these crawls is an involved step.
Dataset Cleaning: The data contains noise such as publisher names and information, date and update times, advertisement links, and "read more" links.
To clean the data, we mainly perform regex pattern matching and keyword searching to find and remove the unwanted noise. Sometimes a sample contains more than one language, or a domain has news in multiple languages. To separate these languages, we use language detection tools such as gcld3 (https://pypi.org/project/gcld3/), langdetect (https://pypi.org/project/langdetect/) and langid (https://pypi.org/project/langid/). The dataset created using the above process sometimes contains noise in the form of a document matched with an incorrect headline. We notice that such examples have a low percentage of words in common between the title and the document. To remove such examples, we compute the overlap between the document content and the title as an "overlapping ratio". Let D and T represent the sets of words in the document and title, respectively. The overlapping ratio is computed as |D ∩ T| / |T|. We exclude examples below a certain threshold of overlapping ratio.

Dataset statistics: Table 13 gives the dataset splits. We calculate two types of dataset statistics, quantitative and qualitative.

Quantitative Analysis: Table 14 shows the percentage of novel n-grams, i.e., the percentage of n-grams in the title that are not present in the document. This is a measure of the abstractive nature of the task, as the model is required to predict words not present in the input. We find that the Indic headline generation dataset is comparable with the XL-Sum dataset in terms of novel n-grams.

Qualitative Analysis: We use the LEAD-1 and EXT-ORACLE ROUGE-L (R-L) scores as baselines, which also serve as an indicator of task difficulty. LEAD-1 ROUGE-L scores are calculated using the first sentence of the document as the system summary and the title as the reference summary. EXT-ORACLE scores are computed by selecting the sentence from the document that gives the highest ROUGE score with respect to the reference summary. Table 15 shows that the LEAD-1 ROUGE-L scores are very low for all the languages, including XL-Sum, indicating that the first sentence does not contain sufficient information to produce a title. The EXT-ORACLE scores, on the other hand, are higher. These scores clearly indicate that the task of headline generation is not a trivial one. Note that the LEAD-1 and EXT-ORACLE scores for the document-headline pairs from XL-Sum are substantially lower than for our datasets, despite the low morphological complexity of English.

Results: Table 16 shows the ROUGE-L scores for the headline generation test set across all eleven languages and eight models. Monolingual IB gives the highest ROUGE-L score for almost all the languages.

Output: Table 17 shows the output generated by each model we trained. Along with the native-language example (Hindi here), we show the English translation and transliteration of the text for better understanding. We show only the first few sentences of the input, as the actual input article is longer than ten sentences. The multilingual IB and SSIB models give a title that relates more to the article's first sentence. In contrast, monolingual IB and SSIB provide an overall one-line summary as the title, which correlates with the target summary. No PT behaves the same in both monolingual and multilingual settings, except that the multilingual output contains more details.

The English Gigaword corpus (Rush et al., 2015; https://huggingface.co/datasets/gigaword) is an order of magnitude larger than the total number of examples in our sentence summarization dataset. Table 19 shows some quantitative statistics for the sentence summarization dataset. The word counts of titles and sentences are comparable to those of the English Gigaword corpus, while the vocabulary sizes of sentences and summaries are smaller than those of English Gigaword.
Results: We report the ROUGE-L scores on the decoded test sets for all the models and languages in Table 21.
Example Outputs: Table 22 contains the outputs generated by all the models we trained. In the first example, except for monolingual No PT, the outputs generated by all the models are comparable. Comparing the outputs with the reference, we see that the meaning is largely conveyed. The outputs generated by the No PT and mT5 models are noticeably longer than the reference.

Table 20: Qualitative statistics for sentence summarization, focusing on n-gram overlaps between the sentence-summary pairs. Baseline scores using the first "k" words of the sentence as a summary are also computed.

B.4 Paraphrasing
Initial paraphrase extraction: We use the pivoting approach to extract the initial set of paraphrases. Prior to pivoting, we normalize and tokenize the English sentences using Sacremoses and remove white spaces, to ensure that otherwise-identical sentences differing only in spacing become identical. A single paraphrase example is a tuple of M sentences which are all considered to be paraphrases of each other.

Dataset cleaning:
After paraphrase extraction, we clean the data to remove noise. First, we remove sentences within an example that are duplicates differing only in tokenization or spelling. To do this, we first normalize and tokenize the sentences using the IndicNLP library (Kunchukuttan, 2020), and then remove punctuation and white spaces from each sentence. If this leaves only a single sentence in the example, the example is discarded.
We then randomly select one paraphrase from each group as the input and compute the n-gram overlap for n = 1, 2, 3, and 4 between it and the remaining sentences, which are treated as references. We eliminate paraphrases with an n-gram overlap ratio greater than 0.8 to ensure that the paraphrases are not overly similar. The ratio is calculated as:

ratio = (1/4) Σ_{n=1..4} 2 · a_n · b_n / (a_n + b_n), with a_n = O_n / I_n and b_n = O_n / R_n,

where O_n is the n-gram overlap between reference and input, I_n is the total number of n-grams in the input, and R_n is the total number of n-grams in the reference. This formula averages the 1-, 2-, 3-, and 4-gram overlaps, each computed as the harmonic mean of the overlap relative to the input (a_n) and relative to the reference (b_n). This ensures that the overlap information is not biased towards either the input or the reference.
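A sketch of this overlap score, under the assumption that O_n counts distinct overlapping n-gram types (the text does not specify type vs. token counting):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def overlap_score(inp: str, ref: str, max_n: int = 4) -> float:
    """Average over n=1..4 of the harmonic mean of the n-gram overlap
    relative to the input (a_n) and to the reference (b_n)."""
    inp_toks, ref_toks = inp.split(), ref.split()
    scores = []
    for n in range(1, max_n + 1):
        I, R = ngrams(inp_toks, n), ngrams(ref_toks, n)
        if not I or not R:
            scores.append(0.0)
            continue
        O = len(set(I) & set(R))       # overlapping n-grams O_n
        a, b = O / len(I), O / len(R)  # a_n, b_n
        scores.append(2 * a * b / (a + b) if a + b > 0 else 0.0)
    return sum(scores) / max_n
```

Reference sentences with `overlap_score(...) > 0.8` against the input would be discarded under the filter described above.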
Next, for each input, we select up to five references. We first sort the references by their n-gram overlap scores with respect to the input, then divide the score range into 5 equal intervals, and finally select the reference corresponding to the first score in each interval. If a particular input has fewer than 5 references, we keep all of them.

Dataset splitting: We split the collection of examples into train, validation and test sets. We first sort the examples in descending order of the number of paraphrases they contain; the top 10,000 go into the test set, the next 10,000 into the validation set, and the remainder into the training set. Except for Assamese, all languages have 10,000 examples with 5 references per input in the validation and test sets; the training set has anywhere between 1 and 5 references per input. Assamese, due to its low-resource nature, could only be split into validation and test sets of 4,420 examples each.
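One possible reading of this interval-based reference selection, sketched as code (the function name and the exact bucketing are our interpretation, not the paper's implementation):

```python
def select_references(refs_with_scores, k=5):
    """Pick up to k references spread across the score range: sort by
    overlap score, split the range into k equal intervals, and take the
    first reference falling in each interval."""
    refs = sorted(refs_with_scores, key=lambda x: x[1])
    if len(refs) <= k:
        return [r for r, _ in refs]          # fewer than k: keep all
    lo, hi = refs[0][1], refs[-1][1]
    width = (hi - lo) / k or 1.0             # guard against all-equal scores
    chosen, taken = [], set()
    for ref, score in refs:
        bucket = min(int((score - lo) / width), k - 1)
        if bucket not in taken:
            taken.add(bucket)
            chosen.append(ref)
    return chosen
```

For ten references with evenly spaced scores, this keeps one reference from each fifth of the score range.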
The per-language statistics are given in Table 23. We compare our statistics with the English-language portion of the OpusParcus corpus (Creutz, 2018).

Dataset statistics: Table 30 gives the percentage of novel n-grams between question and context. We can see that the Indic languages have higher percentages of novel n-grams compared to the English dataset.
Results: We report the ROUGE-L scores on the test set for all 11 languages in Table 31.
Output: Table 32 shows the outputs generated by all eight models. We show the example in the native language (Hindi here), along with its English translation and transliteration for better understanding. The target output concerns who won the match; the outputs generated by monolingual IB, multilingual mT5, multilingual IB and multilingual No PT are very close to the target. In contrast, the other outputs are more about who lost the match. Nevertheless, all of the outputs are related to the given context and answer.

Table 2: Dataset sizes in number of examples for the 5 tasks of WikiBio biography generation (BG), headline generation (HG), sentence summarization (SS), paraphrase generation (PG) and question generation (QG), spanning 11 languages in the IndicNLG Benchmark.
Compared to XL-Sum, which can also be used for headline generation, we have more examples for each language; for example, we have 297K examples for Hindi, whereas XL-Sum has only 88K.

Table 6: Summary of results on the IndicNLG Benchmark across biography generation (BG), headline generation (HG), sentence summarization (SS), paraphrase generation (PG) and question generation (QG). The table shows average scores across all 11 languages (iBLEU for paraphrase generation and ROUGE-L for other tasks). gually fine-tuned IndicBART gives the best performance, but the iBLEU scores are less than 20.

Table 7: XL-Sum test set evaluation on different experiments, showing that the headline generation dataset helps in a transfer learning setup. IB is IndicBART, XL is XL-Sum, HG is headline generation.

Example Outputs: We present an example and its output generated by all the different models in Table 12.

Table 9: Number of examples in the WikiBio dataset for 9 Indian languages. The total number of examples (last row) does not include the English examples, the statistics for which are obtained from Lebret et al. (2016).

Table 13: Train, test and validation set sizes for headline generation in number of samples. The 'total' row is the sum of the respective sets over all languages.

Table 14: Quantitative statistics of the headline generation dataset, focusing on document and title lengths and vocabulary sizes.

Table 15: Qualitative statistics for the headline generation dataset, focusing on n-gram overlaps between document and title. Standard baseline scores such as LEAD-1 and EXT-ORACLE are also included.

Table 17: Model-generated output for news headline generation.

B.3 Sentence Summarization
Dataset Creation and Cleaning: Since this is a sentence summarization dataset, we simply reuse the headline dataset, taking the first sentence of each article as the input and the title as its summary. However, not all such pairs are well correlated. Hence, we compute the overlapping ratio between each sentence-summary pair, similar to the headline dataset overlapping ratio described in Appendix B.2. This helps us filter out the least overlapping samples from the dataset. Table 18 contains the splits of the sentence summarization dataset.

Table 18: Sizes of the train, test and validation sets in number of samples for sentence summarization. The total size of the dataset, excluding English, is also included.

Table 19: Quantitative statistics for sentence summarization, focusing on the lengths of the sentence-title pairs in words, as well as the vocabulary sizes.

Table 22: Model-generated output for sentence summarization.

Table 24: Percentage of novel n-grams in the references that are not present in the input, for paraphrase generation.

Table 25: BLEU scores for the paraphrase generation test sets. We compare models without pre-training (no PT), IndicBART (IB), separate-script IndicBART (SSIB) and mT5 in monolingual and multilingual settings.

Table 26: Self-BLEU scores for the paraphrase generation test sets. We compare models without pre-training (no PT), IndicBART (IB), separate-script IndicBART (SSIB) and mT5 in monolingual and multilingual settings.

Table 27: iBLEU scores for the paraphrase generation test sets. We compare models without pre-training (no PT), IndicBART (IB), separate-script IndicBART (SSIB) and mT5 in monolingual and multilingual settings.

B.5 Question Generation
Dataset extraction from the SQuAD dataset: Table 29 shows an original sample, in which 'Context' is a paragraph associated with multiple question-answer pairs. Each 'Answer' has an 'Answer Text', which is the actual answer, and an 'Answer Start Index', which is the index of the first character of 'Answer Text' in 'Context'. The bold text in the Context row of the original English sample is the sentence containing the answer to the question. This sentence is extracted with the help of the 'Answer Start Index' for each question-answer pair and is treated as the reference context. Finally, each sample of the English dataset has <Context, Answer, Question, ID> and is translated using IndicTrans into the required Indic language. Examples of the translated dataset in Hindi and Marathi are shown in the same table. We use the SQuAD-v1 train set for the training and validation sets, with an 80-20 split, and the SQuAD-v1 dev set directly as the test set. The train, validation and test splits contain 69,979, 17,495, and 10,553 examples, respectively.

Table 29: Translation example for question generation. Here R stands for Romanization, i.e., the English transliteration of the native text.

Original English sample
Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France, where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary. (The extracted sentence containing the answer: "It is a replica of the grotto at Lourdes, France, where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858.")
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes, France?
Answer: Saint Bernadette Soubirous
ID: 5733be284776f41900661182

Hindi translated sample
Context: यह लूर्देस, फ्रांस में स्थित ग्रॉटो की प्रतिकृति है, जहां 1858 में सेंट बर्नाडेट सौबिरस को वर्जिन मैरी दिखाई दी थी। (R: yah loordes, phraans mein sthit groto kee pratikrti hai, jahaan 1858 mein sent barnaadet saubiras ko varjin mairee dikhaee dee thee.)
Answer: संत बर्नाडेट साबिरोस (R: sant barnaadet saabiros)
Question: सन् 1858 में लूर्डस फ्रांस में कुँवारी मरियम कथित तौर पर किसके सामने प्रकट हुई? (R: san 1858 mein loordas phraans mein kunvaaree mariyam kathit taur par kisake saamane prakat huee?)
ID: 5733be284776f41900661182

Table 30: Percentage of novel n-grams for question generation. We give the statistics of novel n-grams in the question string compared to the context string.