Lokesh Madasu


2024

SemRel2024: A Collection of Semantic Textual Relatedness Datasets for 13 Languages
Nedjma Ousidhoum | Shamsuddeen Muhammad | Mohamed Abdalla | Idris Abdulmumin | Ibrahim Ahmad | Sanchit Ahuja | Alham Aji | Vladimir Araujo | Abinew Ayele | Pavan Baswani | Meriem Beloucif | Chris Biemann | Sofia Bourhim | Christine Kock | Genet Dekebo | Oumaima Hourrane | Gopichand Kanumolu | Lokesh Madasu | Samuel Rutunda | Manish Shrivastava | Thamar Solorio | Nirmal Surange | Hailegnaw Tilaye | Krishnapriya Vishnubhotla | Genta Winata | Seid Yimam | Saif Mohammad
Findings of the Association for Computational Linguistics: ACL 2024

Exploring and quantifying semantic relatedness is central to representing language and holds significant implications across various NLP tasks. While earlier NLP research primarily focused on semantic similarity, often within the English language context, we instead investigate the broader phenomenon of semantic relatedness. In this paper, we present SemRel, a new semantic relatedness dataset collection annotated by native speakers across 13 languages: Afrikaans, Algerian Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi, Moroccan Arabic, Modern Standard Arabic, Spanish, and Telugu. These languages originate from five distinct language families and are predominantly spoken in Africa and Asia – regions characterised by a relatively limited availability of NLP resources. Each instance in the SemRel datasets is a sentence pair associated with a score that represents the degree of semantic textual relatedness between the two sentences. The scores are obtained using a comparative annotation framework. We describe the data collection and annotation processes, challenges when building the datasets, baseline experiments, and their impact and utility in NLP.

TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu
Gopichand Kanumolu | Lokesh Madasu | Nirmal Surange | Manish Shrivastava
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

News headline generation is a crucial task in increasing productivity for both the readers and producers of news. This task can easily be aided by automated news headline-generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first-ever human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models using the TeClass dataset. Headlines generated by models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset as well as the annotation guidelines will be made publicly available.

2023

Mukhyansh: A Headline Generation Dataset for Indic Languages
Lokesh Madasu | Gopichand Kanumolu | Nirmal Surange | Manish Shrivastava
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation