Ashok Urlana


2023

pdf bib
PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India
Ashok Urlana | Pinzhen Chen | Zheng Zhao | Shay Cohen | Manish Shrivastava | Barry Haddow
Findings of the Association for Computational Linguistics: EMNLP 2023

This paper introduces PMIndiaSum, a multilingual and massively parallel summarization corpus focused on languages in India. Our corpus provides a training and testing ground for four language families, 14 languages, and the largest to date with 196 language pairs. We detail our construction workflow including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization by fine-tuning, prompting, as well as translate-and-summarize. Experimental results confirm the crucial role of our data in aiding summarization between Indian languages. Our dataset is publicly available and can be freely modified and re-distributed.

2022

pdf bib
TeSum: Human-Generated Abstractive Summarization Corpus for Telugu
Ashok Urlana | Nirmal Surange | Pavan Baswani | Priyanka Ravva | Manish Shrivastava
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Expert human annotation for summarization is definitely an expensive task, and can not be done on huge scales. But with this work, we show that even with a crowd sourced summary generation approach, quality can be controlled by aggressive expert informed filtering and sampling-based human evaluation. We propose a pipeline that crowd-sources summarization data and then aggressively filters the content via: automatic and partial expert evaluation. Using this pipeline we create a high-quality Telugu Abstractive Summarization dataset (TeSum) which we validate with sampling-based human evaluation. We also provide baseline numbers for various models commonly used for summarization. A number of recently released datasets for summarization, scraped the web-content relying on the assumption that summary is made available with the article by the publishers. While this assumption holds for multiple resources (or news-sites) in English, it should not be generalised across languages without thorough analysis and verification. Our analysis clearly shows that this assumption does not hold true for most Indian language news resources. We show that our proposed filtration pipeline can even be applied to these large-scale scraped datasets to extract better quality article-summary pairs.

pdf bib
LTRC @MuP 2022: Multi-Perspective Scientific Document Summarization Using Pre-trained Generation Models
Ashok Urlana | Nirmal Surange | Manish Shrivastava
Proceedings of the Third Workshop on Scholarly Document Processing

The MuP-2022 shared task focuses on multiperspective scientific document summarization. Given a scientific document, with multiple reference summaries, our goal was to develop a model that can produce a generic summary covering as many aspects of the document as covered by all of its reference summaries. This paper describes our best official model, a finetuned BART-large, along with a discussion on the challenges of this task and some of our unofficial models including SOTA generation models. Our submitted model out performedthe given, MuP 2022 shared task, baselines on ROUGE-2, ROUGE-L and average ROUGE F1-scores. Code of our submission can be ac- cessed here.