Enhancing Arabic NLP Tasks through Character-Level Models and Data Augmentation

Mohanad Mohamed, Sadam Al-Azani


Abstract
This study introduces a character-level approach designed specifically for Arabic NLP tasks, offering a novel and highly effective solution to the challenges particular to Arabic language processing. It presents a thorough comparative study of character-level models, including Convolutional Neural Networks (CNNs), a pre-trained transformer (CANINE), and Bidirectional Long Short-Term Memory networks (BiLSTMs), assessing their performance and examining how different data augmentation techniques affect it. It also introduces two Arabic-specific data augmentation methods, vowel deletion and style transfer, and rigorously evaluates their effectiveness. The proposed approach was evaluated on an Arabic privacy policy classification task as a case study, where it achieved a micro-averaged F1-score of 93.8%, surpassing state-of-the-art models.
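The abstract names vowel deletion as an augmentation method but does not say how it is carried out. The Python sketch below shows one plausible reading, not the authors' implementation: short-vowel diacritics are always stripped, and long-vowel letters (alef, waw, ya) inside a word are dropped with some probability. The function name `delete_vowels` and the parameters `p` and `seed` are hypothetical, chosen for illustration only.

```python
import random

# Assumption: "vowels" covers tanwin and short-vowel marks (U+064B..U+0650)
# plus the long-vowel letters alef, waw and ya. The paper may define it differently.
SHORT_VOWELS = {chr(c) for c in range(0x064B, 0x0651)}
LONG_VOWELS = {"\u0627", "\u0648", "\u064A"}


def delete_vowels(text, p=0.3, seed=None):
    """Return an augmented copy of `text` with vowels removed.

    Short-vowel diacritics are always stripped. Each long-vowel letter that is
    neither the first nor the last character of its word is dropped with
    probability `p`, so word boundaries stay recognisable.
    """
    rng = random.Random(seed)
    augmented_words = []
    for word in text.split(" "):
        # Step 1: remove short-vowel diacritics.
        bare = [ch for ch in word if ch not in SHORT_VOWELS]
        # Step 2: randomly drop word-internal long vowels.
        kept = []
        for i, ch in enumerate(bare):
            interior = 0 < i < len(bare) - 1
            if interior and ch in LONG_VOWELS and rng.random() < p:
                continue
            kept.append(ch)
        augmented_words.append("".join(kept))
    return " ".join(augmented_words)
```

Calling `delete_vowels(sentence, p=0.3, seed=0)` on each training sentence would yield one extra, slightly perturbed copy per call; a character-level model sees the perturbed string directly, with no tokenizer vocabulary to update.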
Anthology ID:
2025.coling-main.186
Volume:
Proceedings of the 31st International Conference on Computational Linguistics
Month:
January
Year:
2025
Address:
Abu Dhabi, UAE
Editors:
Owen Rambow, Leo Wanner, Marianna Apidianaki, Hend Al-Khalifa, Barbara Di Eugenio, Steven Schockaert
Venue:
COLING
Publisher:
Association for Computational Linguistics
Pages:
2744–2757
URL:
https://aclanthology.org/2025.coling-main.186/
Cite (ACL):
Mohanad Mohamed and Sadam Al-Azani. 2025. Enhancing Arabic NLP Tasks through Character-Level Models and Data Augmentation. In Proceedings of the 31st International Conference on Computational Linguistics, pages 2744–2757, Abu Dhabi, UAE. Association for Computational Linguistics.
Cite (Informal):
Enhancing Arabic NLP Tasks through Character-Level Models and Data Augmentation (Mohamed & Al-Azani, COLING 2025)
PDF:
https://aclanthology.org/2025.coling-main.186.pdf