PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

This paper introduces PMIndiaSum, a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus covers four language families and 14 languages, and with 196 language pairs it is the largest such collection to date. We detail our construction workflow, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization via fine-tuning, prompting, and translate-and-summarize. Experimental results confirm the crucial role of our data in aiding summarization between Indian languages. Our dataset is publicly available and can be freely modified and re-distributed.


Introduction
The era of deep learning has witnessed great advancements in various natural language processing (NLP) tasks. Yet prevalent solutions usually rely on large datasets, which are limited to high-resource languages. This is particularly pronounced in India, where languages spoken by a large population have been historically overlooked in research and under-resourced (Kumar et al., 2022).
In the case of text summarization, where a system generates a brief description of a longer text, the availability of datasets for Indian languages1 is restricted in terms of both language coverage and size (Sinha and Jha, 2022). Moreover, some existing datasets are not easily accessible or have been criticized for their quality (Urlana et al., 2022b). Given the multilingual nature of India, which has 122 major languages and 22 official ones,2 the development of cross-lingual summarization is desirable for information access; however, it is challenging due to the absence of reliable resources.

* Equal contribution. We release our corpus under the CC-BY-4.0 license: https://github.com/ashokurlana/PMIndiaSum
1 We refer to the languages widely spoken in India as "languages in India" or "Indian languages", but they are spread around the globe and do not necessarily belong only to India.
2 https://en.wikipedia.org/wiki/Languages_of_India
To address these challenges, we present PMIndiaSum, a multilingual and cross-lingual summarization dataset sourced from a governmental website: the Prime Minister of India.3 This website publishes political news, usually available in several languages covering the same content. In accordance with established methods (Napoles et al., 2012; Rush et al., 2015, inter alia), we use a news article as the document and its headline as the summary. Beyond monolingual data pairs, as Figure 1 depicts, the website's multilingualism enables cross-lingual document-headline alignment with high confidence in quality.
Our effort leads to extensive coverage of 196 language directions from 14 languages across four families, making the corpus the current widest collection of Indian language pairs for headline summarization. There are 76,680 monolingual document-headline pairs and 620,336 cross-lingual pairs in total. We display language family and code information in Table 1. Of particular note is the inclusion of Manipuri,4 an often neglected language in present-day datasets. Besides the corpus release, we obtain results on multiple metrics for popular summarization paradigms, namely fine-tuning, two-step summarization-translation, and prompting, to serve as a reference for future research.

Table 1: Language families and codes.
Dravidian: Kannada (kn), Malayalam (ml), Tamil (ta), Telugu (te)
Indo-Aryan: Assamese (as), Bengali (bn), Gujarati (gu), Hindi (hi), Marathi (mr), Odia (or), Punjabi (pa), Urdu (ur)
Indo-European: English (en)
Tibeto-Burman: Manipuri (mni)

This paper gives a comprehensive account of how we tackle the issues raised by Sinha and Jha (2022) with regard to developing datasets for Indian language summarization. Explicitly, we 1) respect copyright permissions and ensure transparency in data processing; 2) carry out multifaceted quality checks; 3) offer a broad range of language directions; 4) provide reasonably sized data to facilitate system building. Following Gebru et al. (2021)'s advocacy, we include a datasheet in Appendix G. We believe that our work can aid research on summarization and dataset construction.

Data acquisition
The Prime Minister of India website posts articles that consist of a headline and a news body. These articles are available in multiple languages, with English being the default option. The HTML structure of each article includes a language indicator and a pointer to the English version. We gather website data for all articles in all available languages, sourced from two channels:
1. We ingest readily crawled HTML data from the PMIndia parallel corpus release (Haddow and Kirefu, 2020), which corresponds to articles published between 2014 and 2019.
2. We newly crawl more articles, up to early 2023, using the PMIndia crawler.5
We specifically design a parser to retrieve article bodies and headlines from the HTML. Overall, we harvest 94,036 article-headline pairs across all languages, which we preliminarily regard as monolingual document-summary data. Figure 2 outlines the origins of English articles across different years.

Data processing
To create a high-quality dataset, the collected headline-body pairs undergo rule-based processing in the context of monolingual summarization. We describe the rules below and list them, per step, in Table 2. The cleaned version contains 76,680 monolingual instances across all languages, which is 81.5% of the raw size.
Language mismatch. Despite confidence in the website's efforts towards language correctness, we apply our own language filtering. We discard an entire data instance if either the document or the headline contains text outside the Unicode range of its designated language.6 We notice that a large number of the removed samples, especially from bn and en, are code-mixed.
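As an illustration, the script check can be sketched as below; the exact Unicode ranges and the treatment of non-letter characters are our assumptions, since the paper does not list the ranges it used.

```python
# Sketch of the language-mismatch filter. The Unicode ranges below are
# illustrative (a subset of the languages); non-letter characters such as
# digits, punctuation, and whitespace are always allowed.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "bn": (0x0980, 0x09FF),  # Bengali
    "ta": (0x0B80, 0x0BFF),  # Tamil
}

def is_in_language(text: str, lang: str) -> bool:
    lo, hi = SCRIPT_RANGES[lang]
    # Reject as soon as any letter falls outside the designated script.
    return all(lo <= ord(ch) <= hi for ch in text if ch.isalpha())

def keep_instance(document: str, headline: str, lang: str) -> bool:
    # The whole pair is discarded if either side fails the check.
    return is_in_language(document, lang) and is_in_language(headline, lang)
```

Under this rule, a Hindi article containing Latin-script words (code-mixed text) would be dropped in its entirety.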
Duplicates and empty. To maintain a dataset with only unique document-summary pairs, we remove all duplicates. We also eliminate samples that have identical summaries to steer clear of any headline-article mismatch errors from the website. In addition, we take out instances if either their document or summary is empty.
Prefix. To enforce that PMIndiaSum is abstractive in nature, we remove all samples where the summary is repeated as the initial or first few sentences in the document.
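A minimal sketch of this prefix rule, assuming whitespace-normalised, case-folded matching (the exact matching criterion is our assumption):

```python
def summary_is_prefix(document: str, summary: str) -> bool:
    # Flag a pair whose summary reappears verbatim at the start of the
    # document, i.e. the headline is merely the lead text repeated.
    norm = lambda s: " ".join(s.split()).lower()
    return norm(document).startswith(norm(summary))
```

Pairs for which this returns True are removed, keeping the corpus abstractive.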

Monolingual statistics
Length and compression. Our average document length is 27 sentences or 518 tokens, with ur being the longest and ml the shortest. Summaries are on average 12 tokens, with nearly all being a single sentence. We then compute the compression ratio, which quantifies how concise a summary S is given its document D as 1 − len(S)/len(D), following Bommasani and Cardie (2020). For each language, we display the average over all samples. High compression of around 90% implies that the headlines are extremely abstractive and condensed.
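The compression ratio follows directly from token counts; tokenization is left to the caller here.

```python
def compression_ratio(document_tokens: list, summary_tokens: list) -> float:
    # 1 - len(S)/len(D): higher means the summary condenses more.
    return 1.0 - len(summary_tokens) / len(document_tokens)
```

For instance, an average 518-token document with a 12-token headline compresses at roughly 0.98, consistent with the ~90% figures reported.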
Density. Grusky et al. (2018) introduced extractive fragments F(D, S), the set of shared sequences between document D and summary S. We use their density metric, which reflects the degree to which the summary can be composed of texts from the document: density = (1/len(S)) × Σ_{f ∈ F(D,S)} len(f)².

Novelty. To evaluate the originality of summaries in our corpus, we compute the percentage of n-grams that are present in a summary but not in its corresponding document. We report 1-to-4-gram novelty scores averaged across all samples in each language. Here, unigram novelty is equivalent to 1 − coverage in Grusky et al. (2018).
Redundancy. Hasan et al. (2021) described redundancy as the amount of repeated information in a summary. It is calculated as the ratio of repetitive n-grams to all n-grams in the summary. We measure the average redundancy for unigrams and bigrams, where lower is more desirable.
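The three statistics can be sketched over pre-tokenized text as follows; the greedy fragment matching follows Grusky et al. (2018), while tokenization is left to the caller.

```python
def extractive_fragments(doc_tokens, sum_tokens):
    # Greedy matching of shared sequences (Grusky et al., 2018): at each
    # summary position, take the longest match anywhere in the document.
    fragments, i = [], 0
    while i < len(sum_tokens):
        best, j = [], 0
        while j < len(doc_tokens):
            if sum_tokens[i] == doc_tokens[j]:
                i2, j2 = i, j
                while (i2 < len(sum_tokens) and j2 < len(doc_tokens)
                       and sum_tokens[i2] == doc_tokens[j2]):
                    i2, j2 = i2 + 1, j2 + 1
                if i2 - i > len(best):
                    best = sum_tokens[i:i2]
                j = j2
            else:
                j += 1
        if best:
            fragments.append(best)
            i += len(best)
        else:
            i += 1
    return fragments

def density(doc_tokens, sum_tokens):
    # (1/len(S)) * sum of squared fragment lengths.
    frags = extractive_fragments(doc_tokens, sum_tokens)
    return sum(len(f) ** 2 for f in frags) / len(sum_tokens)

def ngram_novelty(doc_tokens, sum_tokens, n):
    # Fraction of summary n-grams absent from the document.
    s = {tuple(sum_tokens[k:k + n]) for k in range(len(sum_tokens) - n + 1)}
    d = {tuple(doc_tokens[k:k + n]) for k in range(len(doc_tokens) - n + 1)}
    return len(s - d) / len(s)

def redundancy(sum_tokens, n):
    # Fraction of repeated n-grams within the summary itself.
    grams = [tuple(sum_tokens[k:k + n]) for k in range(len(sum_tokens) - n + 1)]
    return 1 - len(set(grams)) / len(grams)
```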

Multilingualism and parallelism
On top of monolingual data, PMIndiaSum features massive parallelism because the source website publishes a single article in multiple language versions to cater to its audience. This means that document-headline pairs in various languages derived from the same article enable cross-lingual summarization. As depicted in Figure 3, most of the articles are available in at least two languages, and 232 articles are available in all languages. This allows for the creation of cross-lingual and multilingual summarization data pairs. Technically, every document is paired with the summaries in other languages from the same article, and vice versa. Matching is done via the default English pointer in the HTML. Such multi-way parallelism results in 14 × 13 = 182 cross-lingual pairs in addition to monolingual data. The average data size is 5,477 pairs for monolingual and 3,408 for cross-lingual summarization. Appendix A Table 11 details the sizes for all 196 language directions.
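The pairing logic can be sketched as below, where each article maps to per-language (document, headline) versions; this data layout is our assumption, not the released format.

```python
from itertools import permutations

def build_pairs(articles: dict):
    """articles: {article_id: {lang: (document, headline)}}"""
    mono, cross = [], []
    for versions in articles.values():
        # Same-language document-headline pairs.
        for lang, (doc, head) in versions.items():
            mono.append((lang, lang, doc, head))
        # Pair each document with the headline of every other language
        # version of the same article, in both directions.
        for src, tgt in permutations(versions, 2):
            cross.append((src, tgt, versions[src][0], versions[tgt][1]))
    return mono, cross
```

An article available in all 14 languages thus contributes 14 monolingual and 14 × 13 = 182 cross-lingual pairs.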

Data split
To prevent test data leakage in multilingual models, where a model sees an article in one language and is tested on the same article in a different language, we isolate the 232 articles that are available in all languages for validation and testing. We divide the data equally into validation and test sets, resulting in 116 instances for each language pair in each set. This approach provides consistent validation and test splits for each target summary language, regardless of the language directions. All other data are in the training split.
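A sketch of the split, assuming the same article-to-languages mapping as before; the shuffling and seed are illustrative details.

```python
import random

def split_data(articles: dict, all_langs: list, seed: int = 0):
    # Articles available in every language form the validation/test pool,
    # halved between the two; everything else goes to training. Isolating
    # the pool by article id prevents cross-lingual test leakage.
    full = sorted(a for a, v in articles.items() if set(v) == set(all_langs))
    random.Random(seed).shuffle(full)
    half = len(full) // 2
    valid, test = set(full[:half]), set(full[half:])
    train = [a for a in articles if a not in valid and a not in test]
    return train, sorted(valid), sorted(test)
```

Applied to the 232 fully parallel articles, this yields 116 instances per language pair in each of validation and test.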

Quality considerations
We offer multifaceted discussions on writing quality, summary quality, and parallelism, to demonstrate that our design can lead to a reliable corpus.
Text quality. The PMIndiaSum corpus comprises text extracted from a governmental news website, where the articles are composed or translated by native speakers and reflect the government's stance. As such, we expect the writing to be formal, factual, and informative. We also carefully remove extraneous segments, such as HTML tags and embedded elements, to maintain text quality. Moreover, the popularity of the PMIndia parallel corpus adds to our dataset's credibility.
Summary choices. We take Rush et al. (2015)'s approach of treating the headline as an article's summary, whereas some works opt for the first sentence instead, e.g. XSum and XL-Sum (Narayan et al., 2018; Hasan et al., 2021). Although the lead sentence can be an overview of an article, this paradigm has received criticism (Zhao et al., 2020; Urlana et al., 2022b):
1. The first sentence can be part of multi-sentence writing, which is difficult to isolate.
2. The remainder of the document, from the second sentence onwards, may not contain all the information in the first sentence.
We conduct an empirical study on the suitability of headlines versus first sentences as summaries. For each of English, Hindi, and Telugu, we invite three native speakers to manually inspect 50 random samples. Given an article, we ask whether the first sentence is a summary and whether the headline is a summary. We record majority votes in Table 4, which suggest that headlines are consistently regarded as summaries; on the contrary, first sentences do not always function as a summary, and such cases are hard to identify automatically. In Appendix C, we outline the evaluation protocol and supply a few examples of problematic first sentences. Based on this outcome, we determine that it is more appropriate to use headlines as summaries in our corpus. We argue that, given the high parallelism between articles, findings from three languages can generalize to the whole corpus.
Parallelism. Finally, we assert the validity of cross-lingual alignments. We measure the degree of parallelism by calculating cosine similarity between neural representations of texts in two languages (LaBSE; Feng et al., 2022). We compute LaBSE scores between summaries, as well as between entire documents, from the same article for each language pair. The average cross-lingual LaBSE score is 0.86 for summaries and 0.88 for documents. These scores indicate high parallelism between summaries and between documents in different languages; they also notably exceed the 0.74 summary-summary threshold Bhattacharjee et al. (2022) used to extract CrossSum from XL-Sum. Hence, given monolingual document-summary pairs, our choice of substituting the summary (or the document) with one in another language maintains the integrity of cross-lingual pairs. We enclose all pairwise LaBSE scores in Appendix B Table 12 for reference.
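The parallelism score itself is a plain cosine similarity over sentence embeddings; in practice the vectors would come from a LaBSE encoder (e.g. via the sentence-transformers package), which the lists below merely stand in for.

```python
import math

def cosine(u, v) -> float:
    # Cosine similarity between two embedding vectors, as used for the
    # LaBSE-based parallelism scores.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```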

Task and evaluation
Formally, given a source document D, summarization should produce a target summary S that is shorter, yet conveys the most important message in D. We explore three types of models defined by language directions:
1. Monolingual: document D_L and summary S_L are in the same language L.
2. Cross-lingual: document D_L and summary S_L′ are in different languages L and L′.
3. Multilingual: monolingual and cross-lingual summarization from D_{L1,L2,...,Ln} to S_{L1,L2,...,Ln} within a single model.
In our context, cross-lingual models summarize from one single language to another. Monolingual and cross-lingual models could be more accurate as they concentrate on one language (pair), whereas multilingual models can significantly save storage and computational resources, and likely transfer knowledge across languages.

Methodology
We intend to provide a benchmark for four conventional headline summarization approaches. We introduce the paradigms below and label the language settings tested with each method in Table 5.
Extractive. We employ two training-free baselines: 1) selecting the lead sentence, and 2) scoring each sentence in the document against the reference and picking the best in an oracle way.
Fine-tuning. Recent advances in fine-tuning pre-trained language models (PLMs) have shown promising progress in summarization for Indian languages (Taunk and Varma, 2022;Urlana et al., 2022a). We load a PLM and further train it for summarization with our data. In monolingual and cross-lingual settings, a PLM is only fine-tuned for a single language direction. On the other hand, in the multilingual setting, we simply mix and shuffle data for all language pairs, because the data sizes are of the same magnitude across all directions.
Summarization-and-translation. In cross-lingual settings, it is practical to leverage a translation system for language conversion. Two common pipelines are 1) summarization-translation, where a document is summarized in its original language and the summary is then translated into the target language, and 2) translation-summarization, where the document is first translated into the target language and then summarized in that language.
Zero-shot. We fine-tune a PLM on all monolingual data only. This allows training on same-language document-headline pairs, but not on any cross-lingual pairs. We then perform cross-lingual summarization on the test set with this model to measure its zero-shot capability.
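The two summarization-and-translation pipelines differ only in the order of operations, as the sketch below shows; `summarize` and `translate` are hypothetical stand-ins for a fine-tuned PLM and an external MT engine.

```python
def summarize_then_translate(doc, src, tgt, summarize, translate):
    # Summarize in the source language, then translate the short summary.
    return translate(summarize(doc, lang=src), src, tgt)

def translate_then_summarize(doc, src, tgt, summarize, translate):
    # Translate the whole document first, then summarize in the target
    # language (more translation work, but summarization stays in-language).
    return summarize(translate(doc, src, tgt), lang=tgt)
```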
Other techniques. Lastly, we provide a preliminary discussion on methods involving prompting large language models and adapters in Appendix E.

Systems
Our fine-tuning paradigm is tested on two PLMs: IndicBART (Dabre et al., 2022) and mBART-50 (Liu et al., 2020).9,10 We follow each PLM's convention of adding language identification tokens to inform the PLM of the source and target languages. The PLMs support various languages,11 yet neither supports Manipuri; we therefore randomly initialize an embedding entry as the mni language token in both IndicBART and mBART.
In the summarization-and-translation workflow, monolingual summarization is done using our monolingual fine-tuned PLMs introduced above. The translation step is delegated to a public engine that supports all the involved languages.12 Training configurations are detailed in Appendix D. All trainable experiments are conducted three times to obtain mean and standard deviation statistics. We report average scores and attach standard deviations in Appendix F.

Monolingual
We list monolingual results in Table 6. Comparing the two extractive baselines, oracle performance is better than using the first sentence, yet both are nowhere near perfect scores. This indicates that the summary information is scattered across a document and the headline is abstractive in nature.

Our PLM fine-tuning yields significantly higher numbers than the extractive baselines, implying a non-trivial headline summarization task. Generally, mBART is ahead of IndicBART, but it supports fewer languages. In terms of automatic metrics, most languages show reasonable performance, with Telugu slightly lagging behind.

Cross-lingual
Fine-tuning all 182 cross-lingual directions is infeasible given our resource constraints. Thus, we shortlist language pairs to cover, as far as possible: 1) high and low data availability; 2) combinations of language families as in Table 1; and 3) languages supported by both IndicBART and mBART for comparison.
According to results in Table 7, we observe that summarization-and-translation outperforms fine-tuning, but the gaps are not wide. It is worth noting that the comparison is not strictly fair, because the summarization-and-translation pipeline uses an external translation model which presumably ingests much more data. Specifically, the order of translation and summarization matters in higher-resource scenarios: for Hindi and English, translation-summarization achieves 10 more points on all three metrics than summarization-translation.
The zero-shot columns reveal that both PLMs are unable to perform cross-lingual summarization even though they have been fine-tuned on monolingual data in all languages. This observation indicates the necessity of our truly parallel data for practical cross-lingual headline summarization between languages in India.

Multilingual
In this setting, both PLMs are fine-tuned on all data, subject to the languages they support. Results are documented in Table 8 for IndicBART and Table 9 for mBART. Regarding cross-lingual directions, multilingual IndicBART seems to only produce reasonable numbers for summarization into hi, pa, and en; mBART performs remarkably better than IndicBART.
Referring to the differences in results (∆) in the monolingual result Table 6, multilingual models are inferior to monolingual models for both IndicBART and mBART. In clear contrast, for 11 out of 12 cross-lingual directions in Table 7, multilingual mBART surpasses separate mBART models fine-tuned for each direction, implying that the availability of data in multiple language pairs helps cross-lingual summarization.

Languages and PLMs
Resource availability. In comparison with Indian languages, we see superior summarization scores for English in the monolingual case, probably due to a larger training size. Multilingual models, too, tend to perform better for languages that have more monolingual data on the target side, for instance, en, hi, and gu.
Unseen. Although Manipuri is not seen during the pre-training of either PLM, its results are close to those of other languages in monolingual summarization, as well as with multilingual mBART.
Language family. In agreement with previous work (Hasan et al., 2021), both IndicBART and mBART work better for Indo-Aryan languages compared to other families.
IndicBART versus mBART. In monolingual and cross-lingual settings, IndicBART is only slightly behind mBART, despite being only one-third of mBART's size. However, for multilingual summarization, mBART is clearly preferred in our context.

Table 7: Cross-lingual benchmarks: separate models for each language pair. Bold indicates the best result; (∆) refers to the result minus the corresponding multilingual result in Table 8 or Table 9.

Related work

Document-summary pairs are the key components of a summarization dataset. A typical acquisition workflow is to utilize publicly available articles, especially in the news domain, given their high availability and language coverage. To form a document-summary pair, an article can be paired with either human-written summaries or accompanying texts such as the headline or first sentence.
The CNN/DM corpus utilized English news articles as documents and highlights as summaries (Hermann et al., 2015; Nallapati et al., 2016). Exploiting public resources usually results in same-language document-summary pairs. To create cross-lingual data, one can subject existing datasets to secondary processing. Zhu et al. (2019) translated summaries in monolingual datasets into another language, while ensuring that the round-trip translation of a translated summary did not deviate from the original. The CrossSum corpus (Bhattacharjee et al., 2022) was derived from XL-Sum by thresholding similarity between cross-lingual summaries and pairing a summary with the document of a parallel summary. WikiLingua aligned summaries and documents in different languages using image pivots in the how-to articles on the WikiHow website (Ladhak et al., 2020).
Our PMIndiaSum falls in the category of using public article-headline pairs. The data itself is massively cross-lingual due to the multilingualism of the published news. We greatly benefit from the tools released by the PMIndia corpus (Haddow and Kirefu, 2020), which is a machine translation dataset between English and Indian languages, extracted from the same website.

Coverage for languages in India
The current provision for Indian languages is still limited in summarization research. TeSum is monolingual, ILSUM covers three, and M3LS covers ten Indian languages. XL-Sum supports eight; however, using a document's first sentence as a summary may not be ideal, as mentioned earlier.
Covering the largest number of Indian languages are MassiveSumm and Vārta, but these works release URLs and processing scripts instead of the data; copyright holders' conditions and licenses remain unknown. Such practices transfer the risk and responsibility to dataset users.
Moreover, the above multilingual datasets do not support cross-lingual research for Indian languages. WikiLingua only has one Indian language, and to the best of our knowledge, before ours, CrossSum was the largest, offering 56 pairs from 8 languages in XL-Sum. In comparison, even excluding English, our PMIndiaSum contains 13 Indian languages and 156 cross-lingual pairs.
In a zero-shot fashion, one can train on multilingual datasets for cross-lingual capability. Nonetheless, as we show, the performance may be inferior. Importantly, while these datasets can be used for monolingual and multilingual purposes, they do not provide a benchmark to evaluate cross-lingual research. Our work fills the gap by providing both training data and a trustable test suite for a vast number of Indian language pairs.

Conclusion and Future Work
We have presented PMIndiaSum, a headline summarization corpus focused on 14 languages in India, supporting monolingual, cross-lingual, and multilingual summarization in 196 directions. We outlined the processing steps, explained data statistics, discussed quality considerations, and made the dataset publicly accessible. We also published benchmarks for extractive baselines, summarization-and-translation, and fine-tuning involving two pre-trained language models. Experiment results emphasize the value of our dataset for summarization between languages in India.
A natural extension is to continuously integrate new articles from the source website to update the dataset. However, one must make sure that experimental comparisons remain fair by maintaining consistent data sizes. Another direction is to ingest more websites to expand the size, domain coverage, language availability, and modality.

Limitations
Our articles are scraped from a governmental website, leading to a domain bias towards political news, and a style bias towards headline-like summaries. Also, we selected the articles available in all languages to be validation and testing instances, to prevent information leakage. While we believe this to be a sensible decision, those articles might carry a distributional drift, such as being more important for all readers or easier to translate.

Ethics Statement
We place trust in the Indian governmental website to eliminate inappropriate content, but we could not inspect each data instance ourselves. On the other hand, the values promoted by a governmental agency might not align with every potential user of the data. Also, the provision of data in widely spoken languages might further edge out lesser-used languages in research.

C Human evaluation and samples
We carry out a human evaluation to compare headlines and first sentences as summaries. For each of English, Hindi, and Telugu, we ask three native speakers to consider the accuracy, informativeness, and quality of the potential summaries from 50 samples based on the following guideline.

In this task, you will look at a set of articles and summaries. Each article will be presented with two (2) potential summaries. The aim of the evaluation is to identify whether the presented text is a summary. Here are the steps you need to take.
1. Read the article to understand the context.
2. Read both summaries carefully. Consider the accuracy, informativeness, and quality of the summaries. We provide the definition of these criteria below.
• Accuracy: how closely the summary reflects the factual content of a news article. An accurate summary should not include any false information or misrepresentations of the content.
• Informativeness: how much information the summary provides about the main points of the news article. An informative summary should provide enough detail to give the reader a good understanding of the article's content.
• Quality: the overall standard of the summary. A high-quality summary should be fluent, well-written, free of errors, and easy to read.
3. Provide your binary decision on whether each text is a summary. Base your decision solely on the quality, accuracy, and informativeness of the content, without being influenced by factors such as writing style, personal preferences, etc. Please note that an incomplete sentence is acceptable as long as it does not compromise accuracy, informativeness, or quality.

For each sample document, we record the majority vote (yes/no) from three annotators in Section 2.6 Table 4. Alongside the results, in Table 10, we compute the inter-rater reliability using the Intraclass Correlation Coefficient (Koo and Li, 2016), following the method described by Shrout and Fleiss (1979). We observe that the annotators are in moderate to good agreement for all languages and both potential summaries. We also show samples from the dataset illustrating potential problems with using first sentences as summaries in Table 15. Sometimes the first sentence is part of a speech expressing gratitude.

D Experimental setup
IndicBART and mBART-large-50 have 244M and 610M parameters respectively. For fine-tuning, we always set a training budget of 100 epochs and apply early stopping after three consecutive non-improving validation cross-entropy evaluations. We use an effective batch size of 96 by combining different batch sizes, numbers of GPUs, and gradient accumulation. For documents and summaries, we set the maximum lengths to 1024 and 64 tokens respectively. All other configurations follow the defaults of the Hugging Face Trainer.13

13 https://huggingface.co/docs/transformers/main_classes/trainer

E.1 Prompting

We explore prompting instruction-tuned large language models, namely Alpaca and Vicuna. We query an LLM with the prompt "Article: ${article}. Summarize the article above into a headline in ${language} language. Summary:", and parse the model's completion as the candidate summary. Our discovery is that although the LLMs can generate some Indian languages, the quality is subpar, potentially due to the lack of exposure to these languages. Monolingual English summarization performance is shown in Table 13, along with fine-tuning baselines. These indicate that, despite Vicuna achieving higher scores than Alpaca, prompting underperforms fine-tuning.
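The prompt template used for these queries can be reproduced as a small helper:

```python
def build_prompt(article: str, language: str) -> str:
    # Template used to query the LLMs; the model's completion is parsed
    # as the candidate summary.
    return (f"Article: {article}. Summarize the article above into "
            f"a headline in {language} language. Summary:")
```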
We suggest that our benchmark cannot be trivially solved by prompting LLMs, because of language inadequacy and domain specificity. Further research is required to overcome the two problems.

E.2 Adapters
Adapters are small trainable modules inserted into a giant PLM; during PLM fine-tuning, the entire model can be frozen except for the adapters (Houlsby et al., 2019). This paradigm achieves great parameter efficiency as only the adapter weights need to be updated and saved. Adapters are relevant to summarization research, but we omit experiments on these because Zhao and Chen (2022) show that with a few thousand training data, the performance of adapters is not comparable to fine-tuning a PLM. We, nevertheless, encourage future research to try this out.

F Standard deviations
Standard deviations are listed in Table 14 for monolingual summarization, Table 16 for cross-lingual summarization, Table 17 for multilingual IndicBART, and Table 18 for multilingual mBART. These correspond to the mean ROUGE and BLEU scores reported in Section 3.4. We do not notice any peculiar numbers.

Table 15: Two examples of PMIndiaSum document-headline pairs in English, Hindi, and Telugu. The documents' first sentences could be part of a speech, which poses a problem when used as a summary, whereas the headline is a more appropriate summary.

G Datasheet

Q: Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? A: The dataset consists of all instances derived from the raw data we gathered and processed.
Q: Is any information missing from individual instances? If so, please provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g., redacted text. A: No.
Q: Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? If so, please describe how these relationships are made explicit. A: Yes. Data instances are mostly independent of each other. Document-headline pairs in different languages derived from the same article convey the same content as the original news article.
Q: Are there recommended data splits (e.g., training, development/validation, testing)? If so, please provide a description of these splits, explaining the rationale behind them. A: Yes. Refer to Section 2.5 for an explanation. We have also provided the data preparation script in the GitHub repository footnoted in the main content.
Q: Are there any errors, sources of noise, or redundancies in the dataset? A: Our expectation is that the articles are professionally written with no errors, and we take measures to ensure that there are no redundancies by removing duplicated data. However, it is possible that extraneous HTML elements may introduce errors that remain unfiltered from the raw crawled data. It is not feasible to manually inspect all data instances or automatically identify this type of noise.
Q: Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? (If it links to or relies on external resources, a) are there guarantees that they will exist, and remain constant, over time; b) are there official archival versions of the complete dataset (i.e., including the external resources as they existed at the time the dataset was created); c) are there any restrictions (e.g., licenses, fees) associated with any of the external resources that might apply to a dataset consumer? Please provide descriptions of all external resources and any restrictions associated with them, as well as links or other access points, as appropriate.) A: The dataset is self-contained. It can be downloaded, used, adapted, and re-distributed without restrictions.
Q: Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor-patient confidentiality, data that includes the content of individuals' non-public communications)? If so, please provide a description. A: No, as all articles in the dataset are publicly available from the PMIndia's website.
Q: Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why. A: Unlikely, but we cannot completely rule this out because the data contains news articles from a governmental website.
Q: Does the dataset relate to people? (If not, you may skip the remaining questions in this section.) A: Yes, the majority of the articles are about the Prime Minister of India who is a public figure. The data also contains news about other real people.
Q: Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset. A: While the dataset does not explicitly identify any subpopulations based on factors such as age or gender, it is possible that such information may be mentioned in the news articles themselves, such as when referring to specific individuals. As news articles are likely to include details such as gender, age, occupation, etc., it is possible that these factors may be indirectly associated with the data instances in the dataset. However, we do not have any explicit information on the distributions of these subpopulations within the dataset.
Q: Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how. A: Yes, there are individuals' names present in the data.
Q: Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description. A: While it is unlikely that the dataset contains sensitive information that is not already public, there is a possibility that certain news articles in the dataset may contain details that could be considered sensitive. For example, news articles may refer to individuals' personal information. However, as our dataset is derived from a public governmental news website, any sensitive information that may be present in the dataset is likely to have already been publicly disclosed.
Q: Any other comments? A: No.
G.3 Collection process Q: How was the data associated with each instance acquired? (Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, modelbased guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.) A: The data is crawled from the Prime Minister of India website followed by processing. It is observable under the "news-update" section on the website. The data is reported directly by the website.
Q: What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? (How were these mechanisms or procedures validated?)