@inproceedings{zbib-etal-2026-aralingbench,
title = "{A}ra{L}ing{B}ench: A Human-Annotated Benchmark for Evaluating {A}rabic Linguistic Capabilities of Large Language Models",
author = "Zbib, Mohamad Bilal and
Hammoud, Hasan Abed Al Kader and
Mohanna, Ammar and
Rizk, Nadine and
Karnib, Fatima and
Moukaled, Sina and
Ghanem, Bernard",
booktitle = "Proceedings of the 2nd Workshop on {NLP} for Languages Using {A}rabic Script",
month = mar,
year = "2026",
address = "Rabat, Morocco",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2026.abjadnlp-1.45/",
pages = "385--393",
abstract = "We present AraLingBench, a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The benchmark and evaluation code are available on Hugging Face and GitHub."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="zbib-etal-2026-aralingbench">
<titleInfo>
<title>AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models</title>
</titleInfo>
<name type="personal">
<namePart type="given">Mohamad</namePart>
<namePart type="given">Bilal</namePart>
<namePart type="family">Zbib</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Hasan</namePart>
<namePart type="given">Abed</namePart>
<namePart type="given">Al</namePart>
<namePart type="given">Kader</namePart>
<namePart type="family">Hammoud</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Ammar</namePart>
<namePart type="family">Mohanna</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Nadine</namePart>
<namePart type="family">Rizk</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Fatima</namePart>
<namePart type="family">Karnib</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Sina</namePart>
<namePart type="family">Moukaled</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<name type="personal">
<namePart type="given">Bernard</namePart>
<namePart type="family">Ghanem</namePart>
<role>
<roleTerm authority="marcrelator" type="text">author</roleTerm>
</role>
</name>
<originInfo>
<dateIssued>2026-03</dateIssued>
</originInfo>
<typeOfResource>text</typeOfResource>
<relatedItem type="host">
<titleInfo>
<title>Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script</title>
</titleInfo>
<originInfo>
<publisher>Association for Computational Linguistics</publisher>
<place>
<placeTerm type="text">Rabat, Morocco</placeTerm>
</place>
</originInfo>
<genre authority="marcgt">conference publication</genre>
</relatedItem>
<abstract>We present AraLingBench, a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The benchmark and evaluation code are available on Hugging Face and GitHub.</abstract>
<identifier type="citekey">zbib-etal-2026-aralingbench</identifier>
<location>
<url>https://aclanthology.org/2026.abjadnlp-1.45/</url>
</location>
<part>
<date>2026-03</date>
<extent unit="page">
<start>385</start>
<end>393</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
%A Zbib, Mohamad Bilal
%A Hammoud, Hasan Abed Al Kader
%A Mohanna, Ammar
%A Rizk, Nadine
%A Karnib, Fatima
%A Moukaled, Sina
%A Ghanem, Bernard
%S Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
%D 2026
%8 March
%I Association for Computational Linguistics
%C Rabat, Morocco
%F zbib-etal-2026-aralingbench
%X We present AraLingBench, a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The benchmark and evaluation code are available on Hugging Face and GitHub.
%U https://aclanthology.org/2026.abjadnlp-1.45/
%P 385-393
Markdown (Informal)
[AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models](https://aclanthology.org/2026.abjadnlp-1.45/) (Zbib et al., AbjadNLP 2026)
ACL