TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu

Gopichand Kanumolu; Lokesh Madasu; Nirmal Surange; Manish Shrivastava

Abstract

News headline generation is a crucial task for increasing the productivity of both readers and producers of news, and it can readily be aided by automated headline generation models. However, the presence of irrelevant headlines in scraped news articles results in sub-optimal performance of generation models. We propose that relevance-based headline classification can greatly aid the task of generating relevant headlines. Relevance-based headline classification involves categorizing news headlines based on their relevance to the corresponding news articles. While this task is well-established in English, it remains under-explored in low-resource languages like Telugu due to a lack of annotated data. To address this gap, we present TeClass, the first human-annotated Telugu news headline classification dataset, containing 78,534 annotations across 26,178 article-headline pairs. We experiment with various baseline models and provide a comprehensive analysis of their results. We further demonstrate the impact of this work by fine-tuning various headline generation models on the TeClass dataset. Headlines generated by models fine-tuned on highly relevant article-headline pairs showed an increase of about 5 points in ROUGE-L scores. To encourage future research, the annotated dataset and the annotation guidelines will be made publicly available.
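
For readers unfamiliar with the metric mentioned in the abstract, ROUGE-L scores a generated headline by the longest common subsequence (LCS) of tokens it shares with a reference headline. The following is a minimal sketch only, not the evaluation code used in the paper: it assumes plain whitespace tokenization and the balanced F1 variant, and the function names `lcs_length` and `rouge_l_f1` are hypothetical. It returns a value in [0, 1]; reported scores are often this value multiplied by 100.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)  # LCS lengths for the previous row of the DP table
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumption: whitespace tokenization)."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not ref_tokens or not cand_tokens:
        return 0.0
    lcs = lcs_length(ref_tokens, cand_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)  # share of candidate tokens in the LCS
    recall = lcs / len(ref_tokens)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Because the sketch only splits on whitespace, it works on Telugu text as-is, but a real evaluation would likely need language-appropriate tokenization and normalization.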

Anthology ID: 2024.lrec-main.1364
Volume: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month: May
Year: 2024
Address: Torino, Italia
Editors: Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues: LREC | COLING
Publisher: ELRA and ICCL
Pages: 15711–15720
URL: https://aclanthology.org/2024.lrec-main.1364/
Cite (ACL): Gopichand Kanumolu, Lokesh Madasu, Nirmal Surange, and Manish Shrivastava. 2024. TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 15711–15720, Torino, Italia. ELRA and ICCL.
Cite (Informal): TeClass: A Human-Annotated Relevance-based Headline Classification and Generation Dataset for Telugu (Kanumolu et al., LREC-COLING 2024)
PDF: https://aclanthology.org/2024.lrec-main.1364.pdf
