@inproceedings{marilign-alemu-2025-amharic,
title = "{A}mharic News Topic Classification: Dataset and Transformer-Based Model Benchmarks",
author = "Marilign, Dagnachew Mekonnen and
Alemu, Eyob Nigussie",
editor = "Zhang, Chen and
Allaway, Emily and
Shen, Hua and
Miculicich, Lesly and
Li, Yinqiao and
M'hamdi, Meryem and
Limkonchotiwat, Peerat and
Bai, Richard He and
T.y.s.s., Santosh and
Han, Sophia Simeng and
Thapa, Surendrabikram and
Rim, Wiem Ben",
booktitle = "Proceedings of the 9th Widening NLP Workshop",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.winlp-main.23/",
pages = "130--135",
ISBN = "979-8-89176-351-7",
abstract = "News classification is a downstream task in Natural Language Processing (NLP) that involves the automatic categorization of news articles into predefined thematic categories. Although notable advancements have been made for high-resource languages, low-resource languages such as Amharic continue to encounter significant challenges, largely due to the scarcity of annotated corpora and the limited availability of language-specific, state-of-the-art model adaptations. To address these limitations, this study significantly expands an existing Amharic news dataset, increasing its size from 50,000 to 144,000 articles, thus enriching the linguistic and topical diversity available for the model training and evaluation. Using this expanded dataset, we systematically evaluated the performance of five transformer-based models: mBERT, XLM-R, DistilBERT, AfriBERTa, and AfroXLM in the context of Amharic news classification. Among these, AfriBERTa and XLM-R achieved the highest F1-scores of 90.25{\%} and 90.11{\%}, respectively, establishing a new performance baseline for the task. These findings underscore the efficacy of advanced multilingual and Africa-centric transformer architectures when applied to under-resourced languages, and further emphasize the critical importance of large-scale, high-quality datasets in enabling robust model generalization. This study offers a robust empirical foundation for advancing NLP research in low-resource languages, which remain underrepresented in current NLP resources and methodologies."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="marilign-alemu-2025-amharic">
<titleInfo>
<title>Amharic News Topic Classification: Dataset and Transformer-Based Model Benchmarks</title>
</titleInfo>
<name type="personal">
<namePart type="given">Dagnachew</namePart>
<namePart type="given">Mekonnen</namePart>
<namePart type="family">Marilign</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Eyob</namePart>
<namePart type="given">Nigussie</namePart>
<namePart type="family">Alemu</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2025-11</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 9th Widening NLP Workshop</title>
</titleInfo>
<name type="personal">
<namePart type="given">Chen</namePart>
<namePart type="family">Zhang</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Emily</namePart>
<namePart type="family">Allaway</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hua</namePart>
<namePart type="family">Shen</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Lesly</namePart>
<namePart type="family">Miculicich</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Yinqiao</namePart>
<namePart type="family">Li</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Meryem</namePart>
<namePart type="family">M’hamdi</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Peerat</namePart>
<namePart type="family">Limkonchotiwat</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Richard</namePart>
<namePart type="given">He</namePart>
<namePart type="family">Bai</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Santosh</namePart>
<namePart type="family">T.y.s.s.</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sophia</namePart>
<namePart type="given">Simeng</namePart>
<namePart type="family">Han</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Surendrabikram</namePart>
<namePart type="family">Thapa</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Wiem</namePart>
<namePart type="given">Ben</namePart>
<namePart type="family">Rim</namePart>
<role>
<roleTerm authority="marcrelator" type="text">editor</roleTerm>
</role>
</name>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Suzhou, China</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
<identifier type="isbn">979-8-89176-351-7</identifier>
</relatedItem>
<abstract>News classification is a downstream task in Natural Language Processing (NLP) that involves the automatic categorization of news articles into predefined thematic categories. Although notable advances have been made for high-resource languages, low-resource languages such as Amharic continue to face significant challenges, largely due to the scarcity of annotated corpora and the limited availability of language-specific, state-of-the-art model adaptations. To address these limitations, this study substantially expands an existing Amharic news dataset, increasing its size from 50,000 to 144,000 articles and thereby enriching the linguistic and topical diversity available for model training and evaluation. Using this expanded dataset, we systematically evaluated five transformer-based models (mBERT, XLM-R, DistilBERT, AfriBERTa, and AfroXLM) on Amharic news classification. Among these, AfriBERTa and XLM-R achieved the highest F1-scores, 90.25% and 90.11% respectively, establishing a new performance baseline for the task. These findings underscore the efficacy of multilingual and Africa-centric transformer architectures when applied to under-resourced languages, and emphasize the critical importance of large-scale, high-quality datasets in enabling robust model generalization. This study offers a strong empirical foundation for advancing NLP research in low-resource languages, which remain underrepresented in current NLP resources and methodologies.</abstract>
<identifier type="citekey">marilign-alemu-2025-amharic</identifier>
<location>
<url>https://aclanthology.org/2025.winlp-main.23/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>130</start>
<end>135</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T Amharic News Topic Classification: Dataset and Transformer-Based Model Benchmarks
%A Marilign, Dagnachew Mekonnen
%A Alemu, Eyob Nigussie
%Y Zhang, Chen
%Y Allaway, Emily
%Y Shen, Hua
%Y Miculicich, Lesly
%Y Li, Yinqiao
%Y M’hamdi, Meryem
%Y Limkonchotiwat, Peerat
%Y Bai, Richard He
%Y T.y.s.s., Santosh
%Y Han, Sophia Simeng
%Y Thapa, Surendrabikram
%Y Rim, Wiem Ben
%S Proceedings of the 9th Widening NLP Workshop
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-351-7
%F marilign-alemu-2025-amharic
%X News classification is a downstream task in Natural Language Processing (NLP) that involves the automatic categorization of news articles into predefined thematic categories. Although notable advances have been made for high-resource languages, low-resource languages such as Amharic continue to face significant challenges, largely due to the scarcity of annotated corpora and the limited availability of language-specific, state-of-the-art model adaptations. To address these limitations, this study substantially expands an existing Amharic news dataset, increasing its size from 50,000 to 144,000 articles and thereby enriching the linguistic and topical diversity available for model training and evaluation. Using this expanded dataset, we systematically evaluated five transformer-based models (mBERT, XLM-R, DistilBERT, AfriBERTa, and AfroXLM) on Amharic news classification. Among these, AfriBERTa and XLM-R achieved the highest F1-scores, 90.25% and 90.11% respectively, establishing a new performance baseline for the task. These findings underscore the efficacy of multilingual and Africa-centric transformer architectures when applied to under-resourced languages, and emphasize the critical importance of large-scale, high-quality datasets in enabling robust model generalization. This study offers a strong empirical foundation for advancing NLP research in low-resource languages, which remain underrepresented in current NLP resources and methodologies.
%U https://aclanthology.org/2025.winlp-main.23/
%P 130-135
Markdown (Informal)
[Amharic News Topic Classification: Dataset and Transformer-Based Model Benchmarks](https://aclanthology.org/2025.winlp-main.23/) (Marilign & Alemu, WiNLP 2025)
ACL
Dagnachew Mekonnen Marilign and Eyob Nigussie Alemu. 2025. [Amharic News Topic Classification: Dataset and Transformer-Based Model Benchmarks](https://aclanthology.org/2025.winlp-main.23/). In *Proceedings of the 9th Widening NLP Workshop*, pages 130–135, Suzhou, China. Association for Computational Linguistics.
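
For readers who want a concrete starting point, below is a minimal sketch of the kind of fine-tuning setup the abstract describes. It uses `xlm-roberta-base` (XLM-R, one of the five compared models) via Hugging Face Transformers; the label set and toy Amharic headlines are illustrative assumptions, not the authors' released code or the 144,000-article dataset.

```python
# Illustrative only: one fine-tuning step for transformer-based Amharic news
# topic classification. Model ID, labels, and data are stand-ins, not the
# paper's actual configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "xlm-roberta-base"                # XLM-R, one of the five benchmarked models
LABELS = ["sports", "business", "politics"]  # hypothetical topic labels

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS)         # fresh classification head over topics
)

# Two toy Amharic headlines standing in for the expanded news corpus.
texts = ["የእግር ኳስ ቡድኑ ታላቅ ድል አስመዘገበ", "የኢኮኖሚ እድገት ሪፖርት ይፋ ሆነ"]
gold = torch.tensor([0, 1])                  # sports, business

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
out = model(**batch, labels=gold)            # cross-entropy loss over topic classes
out.loss.backward()
optimizer.step()
print(f"loss after one step: {out.loss.item():.4f}")
```

In practice one would train for several epochs over the full corpus and report F1 on a held-out split, mirroring the evaluation the abstract reports for all five models.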