@inproceedings{althnian-etal-2025-araeval,
title = "{A}ra{E}val: An {A}rabic Multi-Task Evaluation Suite for Large Language Models",
author = "Althnian, Alhanoof and
Alzahrani, Norah A. and
Alsubaie, Shaykhah Z. and
Albilali, Eman and
Abdelali, Ahmed and
Alotaibi, Nouf M. and
Bari, M Saiful and
Alnumay, Yazeed and
Alothaimen, Abdulhamed and
Saif, Maryam and
Alzaidi, Shahad D. and
Mirza, Faisal Abdulrahman and
Almushayqih, Yousef and
Al Saleem, Mohammed and
Alabduljabbar, Ghadah and
Al-Thubaity, Abdulmohsen and
Alowisheq, Areeb and
Al-Twairesh, Nora",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.emnlp-main.1679/",
pages = "33025--33049",
ISBN = "979-8-89176-332-6",
abstract = "The rapid advancements of Large Language Models (LLMs) necessitate robust benchmarks. In this paper, we present AraEval, a pioneering and comprehensive evaluation suite specifically developed to assess the advanced knowledge, reasoning, truthfulness, and instruction-following capabilities of foundation models in the Arabic context. AraEval includes a diverse set of evaluation tasks that test various dimensions of knowledge and reasoning, with a total of 24,378 samples. These tasks cover areas such as linguistic understanding, factual recall, logical inference, commonsense reasoning, mathematical problem-solving, and domain-specific expertise, ensuring that the evaluation goes beyond basic language comprehension. It covers multiple domains of knowledge, such as science, history, religion, and literature, ensuring that the LLMs are tested on a broad spectrum of topics relevant to Arabic-speaking contexts. AraEval is designed to facilitate comparisons across different foundation models, enabling LLM developers and users to benchmark performance effectively. In addition, it provides diagnostic insights to identify specific areas where models excel or struggle, guiding further development. AraEval datasets can be found at https://huggingface.co/collections/humain-ai/araeval-datasets-687760e04b12a7afb429a4a0."
}

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
  <mods ID="althnian-etal-2025-araeval">
    <titleInfo>
      <title>AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models</title>
    </titleInfo>
    <name type="personal">
      <namePart type="given">Alhanoof</namePart>
      <namePart type="family">Althnian</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Norah</namePart>
      <namePart type="given">A</namePart>
      <namePart type="family">Alzahrani</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Shaykhah</namePart>
      <namePart type="given">Z</namePart>
      <namePart type="family">Alsubaie</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Eman</namePart>
      <namePart type="family">Albilali</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Ahmed</namePart>
      <namePart type="family">Abdelali</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Nouf</namePart>
      <namePart type="given">M</namePart>
      <namePart type="family">Alotaibi</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">M</namePart>
      <namePart type="given">Saiful</namePart>
      <namePart type="family">Bari</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Yazeed</namePart>
      <namePart type="family">Alnumay</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Abdulhamed</namePart>
      <namePart type="family">Alothaimen</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Maryam</namePart>
      <namePart type="family">Saif</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Shahad</namePart>
      <namePart type="given">D</namePart>
      <namePart type="family">Alzaidi</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Faisal</namePart>
      <namePart type="given">Abdulrahman</namePart>
      <namePart type="family">Mirza</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Yousef</namePart>
      <namePart type="family">Almushayqih</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Mohammed</namePart>
      <namePart type="family">Al Saleem</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Ghadah</namePart>
      <namePart type="family">Alabduljabbar</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Abdulmohsen</namePart>
      <namePart type="family">Al-Thubaity</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Areeb</namePart>
      <namePart type="family">Alowisheq</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <name type="personal">
      <namePart type="given">Nora</namePart>
      <namePart type="family">Al-Twairesh</namePart>
      <role>
        <roleTerm authority="marcrelator" type="text">author</roleTerm>
      </role>
    </name>
    <originInfo>
      <dateIssued>2025-11</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
      <titleInfo>
        <title>Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing</title>
      </titleInfo>
      <name type="personal">
        <namePart type="given">Christos</namePart>
        <namePart type="family">Christodoulopoulos</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Tanmoy</namePart>
        <namePart type="family">Chakraborty</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Carolyn</namePart>
        <namePart type="family">Rose</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <name type="personal">
        <namePart type="given">Violet</namePart>
        <namePart type="family">Peng</namePart>
        <role>
          <roleTerm authority="marcrelator" type="text">editor</roleTerm>
        </role>
      </name>
      <originInfo>
        <publisher>Association for Computational Linguistics</publisher>
        <place>
          <placeTerm type="text">Suzhou, China</placeTerm>
        </place>
      </originInfo>
      <genre authority="marcgt">conference publication</genre>
      <identifier type="isbn">979-8-89176-332-6</identifier>
    </relatedItem>
    <abstract>The rapid advancements of Large Language Models (LLMs) necessitate robust benchmarks. In this paper, we present AraEval, a pioneering and comprehensive evaluation suite specifically developed to assess the advanced knowledge, reasoning, truthfulness, and instruction-following capabilities of foundation models in the Arabic context. AraEval includes a diverse set of evaluation tasks that test various dimensions of knowledge and reasoning, with a total of 24,378 samples. These tasks cover areas such as linguistic understanding, factual recall, logical inference, commonsense reasoning, mathematical problem-solving, and domain-specific expertise, ensuring that the evaluation goes beyond basic language comprehension. It covers multiple domains of knowledge, such as science, history, religion, and literature, ensuring that the LLMs are tested on a broad spectrum of topics relevant to Arabic-speaking contexts. AraEval is designed to facilitate comparisons across different foundation models, enabling LLM developers and users to benchmark performance effectively. In addition, it provides diagnostic insights to identify specific areas where models excel or struggle, guiding further development. AraEval datasets can be found at https://huggingface.co/collections/humain-ai/araeval-datasets-687760e04b12a7afb429a4a0.</abstract>
<identifier type="citekey">althnian-etal-2025-araeval</identifier>
<location>
<url>https://aclanthology.org/2025.emnlp-main.1679/</url>
</location>
<part>
<date>2025-11</date>
<extent unit="page">
<start>33025</start>
<end>33049</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models
%A Althnian, Alhanoof
%A Alzahrani, Norah A.
%A Alsubaie, Shaykhah Z.
%A Albilali, Eman
%A Abdelali, Ahmed
%A Alotaibi, Nouf M.
%A Bari, M. Saiful
%A Alnumay, Yazeed
%A Alothaimen, Abdulhamed
%A Saif, Maryam
%A Alzaidi, Shahad D.
%A Mirza, Faisal Abdulrahman
%A Almushayqih, Yousef
%A Al Saleem, Mohammed
%A Alabduljabbar, Ghadah
%A Al-Thubaity, Abdulmohsen
%A Alowisheq, Areeb
%A Al-Twairesh, Nora
%Y Christodoulopoulos, Christos
%Y Chakraborty, Tanmoy
%Y Rose, Carolyn
%Y Peng, Violet
%S Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
%D 2025
%8 November
%I Association for Computational Linguistics
%C Suzhou, China
%@ 979-8-89176-332-6
%F althnian-etal-2025-araeval
%X The rapid advancements of Large Language Models (LLMs) necessitate robust benchmarks. In this paper, we present AraEval, a pioneering and comprehensive evaluation suite specifically developed to assess the advanced knowledge, reasoning, truthfulness, and instruction-following capabilities of foundation models in the Arabic context. AraEval includes a diverse set of evaluation tasks that test various dimensions of knowledge and reasoning, with a total of 24,378 samples. These tasks cover areas such as linguistic understanding, factual recall, logical inference, commonsense reasoning, mathematical problem-solving, and domain-specific expertise, ensuring that the evaluation goes beyond basic language comprehension. It covers multiple domains of knowledge, such as science, history, religion, and literature, ensuring that the LLMs are tested on a broad spectrum of topics relevant to Arabic-speaking contexts. AraEval is designed to facilitate comparisons across different foundation models, enabling LLM developers and users to benchmark performance effectively. In addition, it provides diagnostic insights to identify specific areas where models excel or struggle, guiding further development. AraEval datasets can be found at https://huggingface.co/collections/humain-ai/araeval-datasets-687760e04b12a7afb429a4a0.
%U https://aclanthology.org/2025.emnlp-main.1679/
%P 33025-33049

Markdown (Informal)
[AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models](https://aclanthology.org/2025.emnlp-main.1679/) (Althnian et al., EMNLP 2025)

ACL
Alhanoof Althnian, Norah A. Alzahrani, Shaykhah Z. Alsubaie, Eman Albilali, Ahmed Abdelali, Nouf M. Alotaibi, M Saiful Bari, Yazeed Alnumay, Abdulhamed Alothaimen, Maryam Saif, Shahad D. Alzaidi, Faisal Abdulrahman Mirza, Yousef Almushayqih, Mohammed Al Saleem, Ghadah Alabduljabbar, Abdulmohsen Al-Thubaity, Areeb Alowisheq, and Nora Al-Twairesh. 2025. AraEval: An Arabic Multi-Task Evaluation Suite for Large Language Models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33025–33049, Suzhou, China. Association for Computational Linguistics.
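
The abstract points to the released evaluation sets in the Hugging Face collection linked above. As a minimal sketch of pulling one of them with the `datasets` library (the repo id below is a hypothetical placeholder; the actual dataset names are listed only on the collection page):

```python
from datasets import load_dataset

# Hypothetical repo id -- substitute a real dataset name from the AraEval collection:
# https://huggingface.co/collections/humain-ai/araeval-datasets-687760e04b12a7afb429a4a0
repo_id = "humain-ai/araeval-example-task"  # placeholder, not a confirmed dataset name

ds = load_dataset(repo_id)  # downloads and caches the dataset from the Hub
print(ds)                   # shows the available splits and column names
```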