Fatimah Mohamed Emad Eldin
2026
Tashkees-AI at AbjadMed 2026: Flat vs. Hierarchical Classification for Fine-Grained Arabic Medical QA
Fatimah Mohamed Emad Eldin
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
This paper describes Tashkees-AI, a system developed for the AbjadMed 2026 Shared Task on Arabic Medical Question Classification. A comprehensive empirical study was conducted across 82 fine-grained categories, investigating three paradigms: fine-tuned encoder models, hierarchical classification, and ensemble methods. Leveraging a dataset of 27k Arabic medical question-answer pairs, extensive ablation studies were conducted comparing MARBERTv2, CAMeLBERT, two-stage hierarchical classifiers, and RAG-based approaches. The findings reveal that fine-tuned MARBERTv2 with data cleaning yields the best performance, achieving a macro F1-score of 0.3659 on the blind test set. In contrast, hierarchical methods surprisingly underperformed (macro F1 of 0.332) due to error propagation. The system ranked 26th on the official leaderboard.
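The error-propagation effect the abstract attributes to the hierarchical system can be illustrated with a toy calculation (this is not the paper's code, and all numbers below are hypothetical): in a two-stage pipeline, a stage-1 routing error can never be recovered by stage 2, so end-to-end accuracy is capped by the product of the stage accuracies.

```python
# Toy illustration of error propagation in two-stage hierarchical
# classification. If stage 1 (coarse category) is correct with probability
# p1, and stage 2 (fine label, given the correct coarse branch) with
# probability p2, and the stages' errors are independent, the end-to-end
# accuracy is p1 * p2 -- a stage-1 mistake cannot be undone downstream.

def hierarchical_accuracy_ceiling(p_stage1: float, p_stage2: float) -> float:
    """End-to-end accuracy of a pipeline of two independent stages."""
    return p_stage1 * p_stage2

# Hypothetical numbers: a coarse router at 80% and a fine classifier at
# 50% cap the pipeline at 0.40, which a flat classifier at 0.45 beats
# even though neither of its components looks weaker in isolation.
flat_accuracy = 0.45
pipeline_accuracy = hierarchical_accuracy_ceiling(0.80, 0.50)
print(pipeline_accuracy)                  # 0.4
print(flat_accuracy > pipeline_accuracy)  # True
```

This product bound is one common explanation for flat classifiers outperforming hierarchical ones on fine-grained label sets, consistent with the 0.3659 vs. 0.332 gap reported above.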
Kashif-AI at AbjadGenEval Shared Task: A Transformer-based Approach for Arabic AI-Generated Text Detection
Fatimah Mohamed Emad Eldin
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
As Large Language Models (LLMs) become increasingly proficient at generating human-like text, distinguishing between human-written and machine-generated content has become a critical challenge for information integrity. This paper presents Kashif-AI, a system developed for the AbjadGenEval Task 1: AI-Generated Arabic Text Detection. The approach leverages fine-tuned Arabic Pre-trained Language Models (PLMs), specifically MARBERT and CAMeLBERT, to classify news articles. A rigorous ablation study was conducted to evaluate the impact of data augmentation, comparing models trained on the official shared task data against those trained on a combined corpus of over 47,000 samples. While near-perfect performance was observed during validation, the blind test set evaluation revealed a significant generalization gap. Contrary to expectations, data augmentation degraded performance due to domain shift. The best-performing configuration, CAMeLBERT-Mix trained on the original dataset, achieved an F1-score of 66.29% and an accuracy of 70.5% on the blind test set.