Daya Sagar Baral
2025
Abstractive Summarization of Low resourced Nepali language using Multilingual Transformers
Prakash Dhakal
|
Daya Sagar Baral
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Nepali, one of the prominent languages of South Asia, remains underrepresented in natural language processing (NLP) research, particularly in the domain of abstractive summarization. While significant progress has been made in extractive summarization, the complexity of generating coherent, human-like summaries from low-resource languages like Nepali is still largely unexplored. This paper introduces the first comprehensive study on applying multilingual transformer-based models, specifically mBART and mT5, to the task of generating headlines for Nepali news articles through abstractive summarization. Given the absence of large-scale datasets for this task, a new Nepali news headline summarization corpus was created by scraping data from multiple online news portals. The models were fine-tuned with this novel dataset using Low-Rank Adaptation (LoRA) and quantization techniques, allowing for more computationally efficient training while preserving performance. The models’ effectiveness was evaluated using ROUGE scores and a human evaluation approach that focused on relevance, fluency, conciseness, informativeness, factual accuracy, and coverage. The findings demonstrate that a 4-bit quantized mBART model achieves superior performance, offering significant potential for improving digital content summarization for Nepali. This study highlights key challenges in processing Nepali, particularly its orthographic and resource limitations, while providing a path forward for advancing NLP tools for South Asian languages.