Mohamad Bilal Zbib

2026

AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Mohamad Bilal Zbib | Hasan Abed Al Kader Hammoud | Ammar Mohanna | Nadine Rizk | Fatima Karnib | Sina Moukaled | Bernard Ghanem
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

We present AraLingBench, a fully human-annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories (grammar, morphology, spelling, reading comprehension, and syntax) through 150 expert-designed multiple-choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface-level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The benchmark and evaluation code are available on Hugging Face and GitHub.


Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Hasan Abed Al Kader Hammoud | Mohamad Bilal Zbib | Bernard Ghanem
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script

We present HALA, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model, LFM2-1.2B, is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train HALA models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, HALA achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7–9B) categories, outperforming their bases. We are committed to releasing models, data, evaluation, and recipes to accelerate research in Arabic NLP.

