ShahiEmotion: A Benchmark Dataset for Punjabi Shahmukhi Emotion Detection

Usman Nawaz; Muhammad Junaid Iqbal; Tahir Alyas; Muhammad Asaf; Shumayla Yaqoob; Usman Ahmed Raza; Muhammad Amin Nadim; Aftab Rafique; Faisal Rehman

Abstract

Emotion detection is an important text classification task with applications in sentiment analysis, social media monitoring, human-computer interaction, and affective language understanding. However, Punjabi written in the Shahmukhi script remains severely under-resourced for emotion detection, with limited benchmark-style resources available for supervised evaluation. This paper introduces ShahiEmotion, a new Punjabi Shahmukhi emotion detection dataset containing 30,379 sentence-level instances annotated with seven emotion categories: sadness, surprise, happiness, anger, neutral, fear, and disgust. The dataset is designed to support research in a low-resource setting characterized by script-specific challenges, lexical variation, and substantial class imbalance. We establish baseline results using several pretrained transformer-based models and formulate emotion detection as a sentence-level classification task. In particular, we fine-tune multilingual BERT, multilingual DistilBERT, XLM-RoBERTa, and Urdu RoBERTa under the same training and evaluation setting using standard cross-entropy loss. Experimental results show that XLM-RoBERTa provides the strongest overall performance among the compared models. The best model achieves 77.95% accuracy, 58.47% macro-F1, and 77.60% weighted-F1 on the test set. The dataset, evaluation protocol, and baseline results introduced in this work are intended to support future research on Punjabi Shahmukhi emotion analysis and low-resource NLP.
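The gap between the reported macro-F1 and weighted-F1 is what one would expect under class imbalance: macro-F1 averages per-class F1 with equal weight, while weighted-F1 weights each class by its support. The following is a minimal sketch of the three metrics in plain Python. The function name `f1_scores` and the toy labels are illustrative assumptions; they are not taken from the dataset or from the authors' evaluation code.

```python
from collections import Counter

def f1_scores(y_true, y_pred):
    """Return (accuracy, macro-F1, weighted-F1) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of gold instances per class
    per_class = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro: every class counts equally, so rare classes pull the score down.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: each class contributes in proportion to its support.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, macro, weighted

# Hypothetical imbalanced toy example (not real ShahiEmotion data):
# one frequent class and one rare class that is misclassified half the time.
y_true = ["neutral"] * 8 + ["fear"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "fear"]

accuracy, macro_f1, weighted_f1 = f1_scores(y_true, y_pred)
print(f"accuracy={accuracy:.4f} macro-F1={macro_f1:.4f} weighted-F1={weighted_f1:.4f}")
```

In this toy case the rare class lowers macro-F1 more than weighted-F1, which mirrors the direction of the gap reported in the abstract.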

Anthology ID: 2026.mellm-1.20
Volume: Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Month: July
Year: 2026
Address: San Diego, United States
Editors: Kaiyu Huang, Fengran Mo, Pinzhen Chen, Meng Jiang
Venues: MeLLM | WS
Publisher: Association for Computational Linguistics
Pages: 211–220
URL: https://aclanthology.org/2026.mellm-1.20/
Cite (ACL): Usman Nawaz, Muhammad Junaid Iqbal, Tahir Alyas, Muhammad Asaf, Shumayla Yaqoob, Usman Ahmed Raza, Muhammad Amin Nadim, Aftab Rafique, and Faisal Rehman. 2026. ShahiEmotion: A Benchmark Dataset for Punjabi Shahmukhi Emotion Detection. In Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026), pages 211–220, San Diego, United States. Association for Computational Linguistics.
Cite (Informal): ShahiEmotion: A Benchmark Dataset for Punjabi Shahmukhi Emotion Detection (Nawaz et al., MeLLM 2026)
PDF: https://aclanthology.org/2026.mellm-1.20.pdf
