Dolphin: A Challenging and Diverse Benchmark for Arabic NLG

We present Dolphin, a novel benchmark that addresses the need for a natural language generation (NLG) evaluation framework dedicated to the wide collection of Arabic languages and varieties. The proposed benchmark encompasses a broad range of 13 different NLG tasks, including dialogue generation, question answering, machine translation, and summarization, among others. Dolphin comprises a substantial corpus of 40 diverse and representative public datasets across 50 test splits, carefully curated to reflect real-world scenarios and the linguistic richness of Arabic. It sets a new standard for evaluating the performance and generalization capabilities of Arabic and multilingual models, promising to enable researchers to push the boundaries of current methodologies. We provide an extensive analysis of Dolphin, highlighting its diversity and identifying gaps in current Arabic NLG research. We also offer a public leaderboard that is both interactive and modular, and we evaluate several models on our benchmark, allowing us to set strong baselines against which researchers can compare.


Introduction
Natural language generation (NLG) systems attempt to produce coherent, contextually appropriate, and linguistically accurate human-like language. These systems have a wide range of applications in everyday life, including in recreation, education, and health. The recent rise of generative models has transformed NLG systems, making them more relevant and engaging than before. Crucial to measuring the performance of NLG systems are high-quality benchmarks. In particular, they provide standardized frameworks for comparing and quantitatively assessing different algorithms, models, and techniques. For NLG, benchmarks define specific criteria and metrics for evaluating performance, allowing for objectively gauging the strengths and limitations of different approaches and encouraging healthy competition. NLG benchmarks can also facilitate reproducibility and promote transparency across different studies, acting as a catalyst for advancement in the field.
Despite this significance, efforts to develop nuanced NLG benchmarks that allow us to track and guide performance on particular languages remain limited. For Arabic, a wide collection of languages and diverse varieties, there is currently no sizeable benchmark that caters to the needs of the community. In this work, we present a large benchmark for Arabic, dubbed Dolphin, to bridge this gap. Our novel benchmark is carefully curated to represent real-world usage of Arabic at scale. Dolphin covers Classical Arabic (CA), a premodern standardized form of Arabic used for old poetry and religious discourse that continues to be employed for literary expression and oration; Modern Standard Arabic (MSA), a modern descendant of CA used in formal settings and pan-Arab media; and dialectal Arabic (DA), i.e., the varieties used in everyday communication across the different Arab countries. Dolphin also encompasses text written in both Arabic and Latin scripts, the latter usually referred to as Arabizi. The benchmark comprises 13 different generation tasks based on 40 different datasets across 50 test splits, making it by far the largest Arabic NLG benchmark to date and among the largest for any group of languages.
We build Dolphin exclusively on top of public datasets, adding a number of newly developed datasets of our own creation. This makes Dolphin accessible and easy to use. Our benchmark is accompanied by a modular leaderboard with a unified evaluation metric, i.e., a Dolphin score. The leaderboard is designed to serve as a central hub for tracking and showcasing the performance of NLG systems. It functions as a dynamic and transparent platform where users can submit their models and compare their results against state-of-the-art approaches. It also encourages a culture of transparency and detailed model description.
Overall, we make the following contributions: (1) We introduce a novel benchmark for Arabic NLG that is large, public, diverse, and inclusive.
(2) We develop a dynamic leaderboard built around a rich array of design best practices to facilitate the measurement of progress in the field. (3) We evaluate a wide host of Arabic and multilingual models on our benchmark, offering strong baselines. (4) We analyze our benchmark to identify gaps in existing work, hoping to help guide future directions. The rest of the paper is organized as follows: In Section 2, we provide an overview of related work. Section 3 introduces Dolphin's design principles and task clusters. In Section 4, we present evaluations of pretrained models on Dolphin and discuss the results we acquire. We conclude in Section 5.

Related Work
Existing NLG benchmarks can be classified into three distinct categories: Arabic-specific, X-specific (where X refers to a language other than Arabic, such as English or Chinese), and multilingual benchmarks. In this section, we provide a brief overview of each category, highlighting its characteristics and scope; we offer more details about the individual benchmarks in the appendix. X-Specific NLG Benchmarks. Liu et al. (2021) propose GLGE, a generation benchmark for English covering eight datasets across four tasks. CUGE (Yao et al., 2021) and LOT (Guan et al., 2022) are two Chinese benchmarks that cover both language understanding and generation tasks. BanglaNLG (Bhattacharjee et al., 2023) is a generation benchmark designed for Bangla, comprising seven datasets across six tasks. Guntara et al. (2020) and Doan et al. (2021) present two MT benchmarks for the Bahasa Indonesia and Vietnamese languages, respectively. Multilingual NLG Benchmarks. The generation evaluation and metrics benchmark GEM v1 (Gehrmann et al., 2021) is a multilingual benchmark environment for NLG, featuring 18 languages across 13 datasets spanning five tasks. Gehrmann et al. (2022) propose a second version, GEM v2, with a new set of datasets and more challenging tasks. Other multilingual benchmarks include CLSE (Chuklin et al., 2022), IndoNLG (Cahyawijaya et al., 2021), IndicNLG (Kumar et al., 2022), and MTG (Chen et al., 2022). As Figure 2 shows, compared to these benchmarks, Dolphin is the largest in terms of both the number of tasks and the number of datasets. We now introduce Dolphin.

Dolphin Benchmark
Our objective is to provide a comprehensive and challenging benchmark for natural language generation that enables the assessment of language models and the tracking of progress in Arabic. To attain this objective, we develop Dolphin according to several design principles, which we now elucidate.

Design Principles
Wide, diverse coverage. As our goal is to offer a demanding and diverse benchmark, we incorporate as many datasets from as many tasks as is feasible. Standard evaluation metrics. Most generation tasks can be evaluated using traditional automated metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004), both of which evaluate the n-gram overlap between a reference text and the generated text. Nevertheless, in many tasks (e.g., question generation, open-domain generation, title generation) there are multiple valid ways to produce a given text. In our benchmark, in addition to F1, BLEU, and ROUGE, we therefore use several other evaluation metrics, such as MaxMatch (M2) (Dahlmeier and Ng, 2012) for grammatical error correction and Character Error Rate (CER) (Morris et al., 2004) for diacritization.
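To make this concrete, below is a minimal sketch of how such automatic metrics can be computed with common Python libraries (sacrebleu, rouge_score, jiwer); the specific packages and settings are illustrative assumptions, not necessarily the benchmark's exact implementation.

```python
# A minimal sketch of the kinds of automatic metrics used in Dolphin.
# Library choices (sacrebleu, rouge_score, jiwer) are assumptions here.
import sacrebleu
from rouge_score import rouge_scorer
import jiwer

references = ["النص المرجعي هنا"]   # gold reference(s)
hypotheses = ["النص المولد هنا"]    # model output(s)

# BLEU: n-gram overlap between generated and reference text.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print("BLEU:", bleu.score)

# ROUGE-L: longest-common-subsequence overlap, common for summarization.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)
rouge = scorer.score(references[0], hypotheses[0])
print("ROUGE-L F1:", rouge["rougeL"].fmeasure)

# CER: character error rate, used for diacritization (lower is better).
print("CER:", jiwer.cer(references[0], hypotheses[0]))
```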
Modular, interactive leaderboard. To support future research, we develop a public leaderboard that enables the evaluation of multilingual and Arabic LMs on Dolphin. Our leaderboard is interactive and provides detailed metadata about the corpora, such as size; training, development, and test splits; data sources (e.g., URL, GitHub); and citations to publications. The leaderboard also offers details of the language models assessed, such as the number of parameters, epochs to convergence, pretraining and finetuning information, etc. We provide a screenshot from our leaderboard in Figure D.1. We now introduce each of the task clusters in Dolphin.
Table 2: Descriptive statistics of the linguistic diversity in Dolphin across the different data splits.

Machine Translation
The MT cluster is built around three tasks: (1) X → MSA. In this task, we test the ability of models to translate from six foreign languages into MSA. We use the UN parallel corpus (Ziemski et al., 2016), a dataset covering the six official UN languages (i.e., Arabic, Chinese, English, French, Russian, and Spanish). The UN corpus consists of development and test sets only. For training, we randomly select 50K X-Arabic parallel sentences from the multilingual corpus MultiUN (Eisele and Chen, 2010), where X is one of the six official languages.
(2) Arabizi → X. The goal of this task is to translate Arabizi dialectal text into one of two foreign languages, French or English. For this, we use Darija (Outchakoucht and Es-Samaali, 2021) and NArabizi (Seddah et al., 2020).
(3) Dialects → English. For this task, we focus on MT from Arabic dialects into English using the MDP corpus (Bouamor et al., 2014). MDP is a human-translated collection of 1K sentences in Egyptian, Tunisian, Jordanian, Palestinian, and Syrian Arabic, in addition to English. For training, we use the 10K MSA-English manually translated sentences proposed by Bouamor et al. (2018), making this a zero-shot condition for the dialects.
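For illustration, the following is a minimal sketch of how any of these MT directions can be cast as text-to-text generation with a sequence-to-sequence LM. The checkpoint name and the prompt prefix are placeholder assumptions rather than Dolphin's exact configuration; a real run would first finetune the model on the training pairs described above.

```python
# A sketch of casting an MT task (e.g., English -> MSA) as text-to-text
# generation. "google/mt5-small" is a stand-in checkpoint; Dolphin evaluates
# models such as mT5, mT0, AraT5, and AraBART after task finetuning.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Encode the source sentence with a task prefix (assumed format).
inputs = tokenizer("translate English to Arabic: How are you today?",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```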

Code-Switching
The purpose of the code-switching (CS) task cluster is to translate Arabic dialect text that includes code-switching with a foreign language into that foreign language. For this, we create six new human-written (natural) code-switched parallel test datasets, under two tasks: (1) DIA-FR → FR. This consists of 300 code-switched Arabic-French tweets collected from Algerian, Moroccan, and Tunisian Twitter.
(2) DIA-EN → EN. This is collected from Egyptian, Jordanian, and Palestinian Twitter and consists of 300 code-switched Arabic-English posts. For both the DIA-FR and DIA-EN tasks, human translations are produced by one native speaker of each dialect with semi-native English/French fluency. For these two tasks, we perform experiments under a zero-shot setting; that is, we use no actual code-switched training data. Rather, we extract 50K MSA-English and MSA-French sentences from AraOPUS-20 (Nagoudi et al., 2022b), which we use for monolingual training. We then extract 50 pairs from each code-switched dialect pair for development and test on the remaining 250 sentences.
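A small sketch of the 50/250 development/test construction just described follows, under the assumption that each dialect pair is a list of 300 (code-switched post, translation) tuples; the seed and the synthetic data are hypothetical.

```python
# A sketch of splitting each 300-pair code-switched dataset into
# 50 development pairs and 250 held-out test pairs.
import random

def split_pairs(pairs, dev_size=50, seed=42):
    """Shuffle the code-switched pairs and split into dev/test."""
    rng = random.Random(seed)
    pairs = pairs[:]            # avoid mutating the caller's list
    rng.shuffle(pairs)
    return pairs[:dev_size], pairs[dev_size:]

# e.g., pairs = [(cs_post, translation), ...]; synthetic placeholders here
pairs = [(f"cs_{i}", f"tr_{i}") for i in range(300)]
dev, test = split_pairs(pairs)
assert len(dev) == 50 and len(test) == 250
```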

News Title Generation
The news title generation (NTG) task is about producing a suitable title for a given news article. That is, a title generation model is required to output a short grammatical sequence of words appropriate for the content of the article. For this, we use two datasets: (1) Arabic NTG (Nagoudi et al., 2022a) and (2) XLSum (Hasan et al., 2021).

Question Answering
For the QA cluster, we use seven publicly available QA datasets across four tasks. A summary of the QA cluster is in Appendix Table B.2. We also provide brief information about each task here.
Retrieval QA. For this task, we use (5) LAReQA (Roy et al., 2020), a cross-lingual retrieval QA dataset built by converting the extractive QA dataset XQuAD (Artetxe et al., 2020) into a retrieval task, XQuAD-R. In our benchmark, we focus on the Arabic part of XQuAD-R (AraQuAD-R).
Open-Domain QA. In this task, the goal is to answer fact-based questions in natural language. We add (6) DAWQAS, an Arabic "why" QA dataset (Ismail and Nabhan Homsi, 2018), to our QA cluster.
Multi-choice QA. We also use (7) EXAMS (Hardalov et al., 2020), a cross-lingual multi-choice QA dataset that covers 26 languages (including Arabic). Since only a test set is available for Arabic, we follow Hardalov et al. (2020) in evaluating the models on EXAMS under a zero-shot setting.

Question Generation
The question generation (QG) cluster involves generating a question for a given passage (Gehrmann et al., 2021). The model is trained to generate simple questions relevant to passages, given the passages along with their answers. For this cluster, we use (passage, answer, question) triplets from five of the seven QA datasets described in the question answering cluster above.
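As a sketch, one common way to serialize such triplets for a text-to-text QG model is shown below; the "context:"/"answer:" field markers are an assumed template, not necessarily the one used in Dolphin.

```python
# A sketch of serializing (passage, answer, question) triplets for
# text-to-text question generation. The template is an assumption.
def qg_example(passage: str, answer: str, question: str) -> dict:
    """Build one text-to-text training example for question generation."""
    return {
        "input": f"context: {passage} answer: {answer}",
        "target": question,
    }

ex = qg_example(
    passage="تقع القاهرة على ضفاف نهر النيل.",  # "Cairo lies on the banks of the Nile."
    answer="نهر النيل",                          # "the Nile"
    question="على ضفاف أي نهر تقع القاهرة؟",    # "On the banks of which river does Cairo lie?"
)
print(ex["input"], "->", ex["target"])
```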

Transliteration
The task of transliteration (TS) is about converting a word or text from one writing system to another while preserving the pronunciation and sound of the original language. We create our TS component using three word-level datasets, as follows: (1) ANETA, an English-

Text Rewriting
The text rewriting (TR) cluster is about generating a text in the target style while preserving the content of the source input text. The TR cluster contains two tasks: (1) DIA → MSA. This task involves converting a text written in an Arabic dialect into MSA. For this, we use Dial2MSA (Mubarak, 2018), a parallel dialectal Arabic corpus for converting Egyptian, Maghrebi, Levantine, and Gulf dialects into MSA.
(2) Gender Rewriting. We use the Arabic parallel gender corpus (APGC) proposed by Alhafni et al. (2022), where the task is to take an input sentence written in one gender (e.g., masculine) and produce a target sentence that has the same meaning but employs the opposite gender (e.g., feminine).

Diacritization
Arabic text diacritization (ATD) is the computational process of restoring missing diacritics (vowels) to an orthographic word or a sequence of words (i.e., a sentence or a whole text). For this task, we use the Arabic diacritization dataset proposed by Fadel et al. (2019).

Dialogue Response Generation
Dialogue response generation (DRG) is a human-computer interaction task with the goal of automatically producing a human-like response given a dialogue context. In this cluster, we have two tasks: (1) MSA DRG. For this task, we use the Arabic empathetic chatbot (AEC) dataset (Naous et al., 2020). It contains open-domain utterances with their corresponding empathetic responses, machine translated from English into MSA.
(2) Dialectal DRG. We add the open-domain response generation dataset in Arabic dialects proposed by Naous et al. (2023), in which three native translators from the Levantine, Egyptian, and Gulf areas were asked to translate 1K utterance-response pairs from the English open-domain dialogue dataset DailyDialog (Li et al., 2017).

Grammatical Error Correction
The task of grammatical error correction (GEC) focuses on analyzing written text and automatically identifying and correcting a variety of grammatical errors. In this cluster, we use three GEC datasets: (1-2) QALB. We use two datasets extracted from the QALB shared tasks of 2014 (Mohit et al., 2014) and 2015 (Rozovskaya et al., 2015). Both datasets are manually corrected collections of Arabic texts originating from online commentaries on Aljazeera articles written by native Arabic speakers (L1), as well as texts produced by learners of Arabic as a second language (L2).
(3) ZAEBUC. A corpus focusing on bilingual writers, presented by Habash and Palfreyman (2022). It matches comparable texts in different languages written by the same writer on different occasions. The corpus is enhanced with multiple layers of annotation, including manually corrected versions of the raw text, allowing us to use it for GEC.

Data2Text
The Data2Text (DT) task involves converting structured data (e.g., tables) into natural-sounding descriptive text that faithfully represents the content of the data. For the DT task cluster, we use the Arabic subset of the multilingual dataset MD2T, proposed by Mille et al. (2020) during the third multilingual surface realization shared task.
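As an illustration, the sketch below linearizes a structured record into a flat input string for a seq2seq model; the attribute-value format is an assumption for exposition, since MD2T defines its own surface-realization input representation.

```python
# A sketch of the data-to-text setup: a structured record is linearized
# into a flat string a seq2seq model can consume; the target is the
# fluent, human-written description. The linearization scheme is assumed.
def linearize(record: dict) -> str:
    """Flatten an attribute-value record into a single input string."""
    return " | ".join(f"{k} = {v}" for k, v in record.items())

record = {"name": "نادي الأهلي", "founded": "1907", "city": "القاهرة"}
source = linearize(record)
# -> "name = نادي الأهلي | founded = 1907 | city = القاهرة"
# target (human-written): "تأسس نادي الأهلي في القاهرة عام 1907."
print(source)
```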
Table 3 shows examples from each task included in Dolphin. We now introduce our strong baselines exploiting our benchmark.

Comparative Analysis with ARGEN
Compared to ARGEN, the previous largest Arabic NLG benchmark, Dolphin is substantially larger and is built exclusively on publicly available data. As such, Dolphin avoids issues ARGEN suffers from, such as challenges with (i) public distribution of the data and (ii) ease of evaluation.
Interactivity. Dolphin uniquely offers a benchmark leaderboard, a feature absent in ARGEN, providing real-time performance tracking and a dynamic evaluation environment.

Model Evaluation on Dolphin
In order to establish a conducive environment for meaningful comparisons on Dolphin, we offer a number of strong baselines for both finetuning and k-shot settings as described next.

Finetuned Models
For finetuning, we benchmark five different Arabic and multilingual models on Dolphin; we describe these models in the appendix. Table 4 presents the results of all pretrained models on each task cluster of Dolphin independently, using the relevant metric.
Discussion. As Table 4 shows, models dedicated to Arabic outperform multilingual models on tasks where higher is better (Dolphin H). We also note that AraT5 v2, the new model we build on top of AraT5 (Nagoudi et al., 2022a), achieves the best Dolphin H and Dolphin L, at 27.82 and 11.67, respectively. It is followed by AraBART with a Dolphin H of 26.44, where a higher score indicates better performance. Conversely, on Dolphin L, where a lower score is better, mT5 follows with 12.42. We also note that AraT5 v2 achieves the best results on 30 individual tasks out of 50, followed by AraBART and mT0, which excel on 11 and 8 individual tasks, respectively.
Model Computational Costs. We assess the computational efficiency of the Arabic and multilingual models we finetune. Figure 3 shows, for each model, the total time needed for convergence (under our 20-epoch constraint with a patience of 5) and the convergence epoch. AraBART is the fastest (2.07 hours), with an average of 10.58 epochs to convergence, followed by mT5, AraT5 v2, mT0, and finally AraT5.
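For clarity, a small sketch of how the two aggregate scores used in Table 4 can be computed as macro-averages over per-task metrics follows; the task names and values are placeholders, not results from the paper.

```python
# A sketch of the Dolphin score aggregation: Dolphin_H macro-averages tasks
# where higher is better; Dolphin_L macro-averages tasks where lower is
# better (e.g., CER). The mapping and values below are placeholders.
from statistics import mean

HIGHER_IS_BETTER = {"mt_bleu": 24.1, "summarization_rouge": 21.3, "qa_f1": 61.0}
LOWER_IS_BETTER = {"diacritization_cer": 2.3, "transliteration_cer": 19.2}

dolphin_h = mean(HIGHER_IS_BETTER.values())   # higher ↑ is better
dolphin_l = mean(LOWER_IS_BETTER.values())    # lower ↓ is better
print(f"Dolphin_H = {dolphin_h:.2f}, Dolphin_L = {dolphin_l:.2f}")
```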
K-Shot Evaluation

We also carry out k-shot evaluations of both BLOOMZ (7.1B) (Muennighoff et al., 2022) and ChatGPT (gpt-3.5-turbo) on 12 different NLG tasks across 16 test sets extracted from Dolphin. To keep the cost manageable, we randomly sample a set of 200 examples from the test set of each task for evaluation. We then evaluate both models under 0-, 5-, and 10-shot settings. For all experiments, we set the temperature to zero to generate deterministic and reproducible results. We compare both models' performance to our best fully finetuned model, AraT5 v2, blind-tested on the same sampled 200 examples. Discussion. Table 5 shows that ChatGPT outperforms BLOOMZ on all 16 NLG tasks under the 0-, 5-, and 10-shot settings, with the single exception of the text rewriting task in the 0-shot setting. It is worth mentioning that AraT5 v2 outperforms both ChatGPT and BLOOMZ on 14 out of the 16 tasks. However, ChatGPT (10-shot) achieves the highest score on both code-switching tasks, perhaps due to its multilingual pretraining data.
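A minimal sketch of this k-shot protocol with the OpenAI Python client is shown below; the prompt template and demonstration format are assumptions, while the model name, temperature 0, and k in {0, 5, 10} follow the setup described above.

```python
# A sketch of the k-shot evaluation protocol. Prompt wording and the
# demonstration format are assumptions; only the model, temperature, and
# shot counts are specified in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def k_shot_prompt(instruction, demos, test_input):
    """Prepend k demonstrations (possibly zero) to the test input."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    return f"{instruction}\n\n{shots}\n\nInput: {test_input}\nOutput:"

demos = [("مرحبا", "Hello")]  # k demonstrations drawn from the training split
prompt = k_shot_prompt("Translate from Arabic to English.", demos, "شكرا جزيلا")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # deterministic, reproducible outputs as in the paper
)
print(response.choices[0].message.content)
```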

Conclusion
We presented Dolphin, a large and diverse benchmark for Arabic NLG composed of 40 datasets arranged in 13 tasks. Dolphin is designed to facilitate meaningful comparisons and encourage healthy competition in Arabic NLG. We also provide an interactive leaderboard with a range of useful tools and detailed metadata to help situate future research in a rich context of information sharing. Dolphin datasets are all publicly available, which should facilitate the adoption and further development of the benchmark. In the future, we intend to build on top of Dolphin by extending it to more tasks and Arabic varieties.

Limitations
In spite of the diversity, wide coverage, high quality, accessibility, and challenging nature of Dolphin's datasets, it is not without limitations. In particular, we identify the following limitations.
1. Coverage of Arabic Varieties. While we make efforts to incorporate tasks from all Arabic varieties, it is important to note that there is a lack of available downstream datasets from countries such as Djibouti, Mauritania, and Yemen. Consequently, these varieties are not currently included in Dolphin. We hope that the community will develop resources representing all Arab countries, including these, across the various tasks. We also hope that future versions of our benchmark will have extended dialectal coverage in ways that enhance its representation of the Arabic language and help foster technological inclusion.

2. Machine-Translated Datasets. Dolphin includes two machine-translated datasets: AEC (Naous et al., 2021) and AraPara (Nagoudi et al., 2022a). While these datasets increase task coverage in Dolphin, the MT process may inadvertently introduce some biases. For example, MT can result in a narrow representation of language patterns and structures, leading to a limited understanding of the complexities and nuances of different languages. Additionally, benchmark datasets may not adequately capture the wide range of domains, genres, and styles that exist in real-world translation scenarios. This can limit the generalizability of models trained on such data, as they may struggle to handle unfamiliar or specialized content. We hope that future versions of Dolphin will involve real-world data that further complement (or even substitute) these translated datasets.
3. Automated Evaluation. Although all NLP depends heavily on automated evaluation to speed up model development, automated methods have their limitations, especially for some tasks. That is, in addition to automated evaluation, some tasks may need human evaluation. In particular, we believe human evaluation can play a crucial role in NLG tasks such as open-domain dialogue generation. For example, it can capture nuanced aspects of dialogue quality, such as coherence, relevance, and appropriateness. In addition, human evaluation allows for a comprehensive assessment of the generated dialogues, taking into account contextual understanding, fluency, and overall user experience. This feedback is invaluable in refining and improving dialogue generation models, ensuring that they meet the high standards of human-like conversation.

Ethics Statement
Data Collection and Release. Dolphin is based on publicly available datasets that would not be possible without the hard work of a large number of researchers over the years. We are grateful for the efforts invested by these pioneering colleagues. One downside of benchmarking could be that the original authors of the different datasets are not sufficiently acknowledged. In our work, we make sure that all publications of resources we use are properly cited, both by referencing them in this paper (Section 3) and by highlighting them in our GitHub and leaderboard website.
1. Data Privacy. Regarding the data involved in Dolphin, we develop the benchmark using publicly available data. For this reason, we do not have significant privacy concerns. In addition, the new datasets we develop and release for code-switched machine translation have undergone manual inspection to ensure there is no unintended leak of private information in any of the samples.
2. Intended Use. We believe our work will spur further research on studying language models on Arabic NLG benchmarks. We create a publicly available leaderboard and benchmark several multilingual and Arabic-dedicated SOTA models on Dolphin. The benchmark will facilitate a unified evaluation and pave the way for healthy competition that could push the SOTA on Arabic language generation.
3. Potential Misuse and Bias. The datasets we collect to create Dolphin may contain potentially harmful content. Additionally, the models we evaluate might be exposed to bias and, as a result, may generate unintended content. Therefore, we recommend that these datasets and models not be used in applications without careful prior consideration of potential misuse and bias.

Figure 2: Comparison of the number of datasets and tasks supported by Arabic (including Dolphin), X-specific, and multilingual NLG benchmarks.

Figure 3: Finetuning time (in hours) and number of epochs to convergence. We report the average of three runs across all tasks.

Table 3: Examples from datasets included in Dolphin.


Table 4: Average of three runs of finetuned Arabic and multilingual models on Dolphin test sets. Dolphin L score: the macro-average of tasks where a lower score ↓ is better. Dolphin H score: the macro-average of tasks where a higher score ↑ is better. We report CER for diacritization and transliteration, ROUGE for summarization, F0.5 (M2) for GEC, and F1 for QA; all other tasks are reported in BLEU.

Table 5: K-shot results with BLOOMZ and ChatGPT, compared to the best finetuned model (AraT5 v2).

IndoNLG. The first benchmark for low-resource languages widely spoken in Indonesia: Indonesian, Javanese, and Sundanese (Cahyawijaya et al., 2021). It consists of ten distinct datasets encompassing four tasks: summarization, question answering, chit-chat, and machine translation.

CLSE. The Corpus of Linguistically Significant Entities (Chuklin et al., 2022) is a multilingual named-entity corpus that covers 34 languages, 74 semantic classes, and 222 distinguishable linguistic signatures. The authors also developed an expanded version of the Schema-Guided Dialog Dataset (SG-CLSE) to illustrate one of the potential uses of CLSE in three languages: French, Marathi, and Russian.

GEM v1. The Generation Evaluation and Metrics benchmark (Gehrmann et al., 2021) is a multilingual benchmark environment for NLG. GEM v1 features 18 languages across 13 datasets spanning five NLG tasks: data-to-text, dialogue response generation, reasoning, summarization, and simplification (two of the datasets do not include English at all).

GEM v2. Gehrmann et al. (2022) propose a second version, GEM v2, styled after GEM v1 with a new set of datasets and more challenging tasks. This new version supports 40 documented datasets in 51 languages. It introduces a modular infrastructure for datasets and models, with an online evaluation process that collects model outputs and computes metrics for all datasets. GEM v2 is built around nine NLG tasks: data-to-text, dialogue response generation, paraphrasing, generative question answering, question generation, reasoning, slide generation, simplification, and summarization.

IndicNLG. The first benchmark for Indic languages (Kumar et al., 2022) covers 11 Indic languages belonging to two language families: Indo-Aryan and Dravidian. IndicNLG involves the following five tasks: biography generation, news headline generation, sentence summarization, paraphrase generation, and question generation.

MTG. Chen et al. (2022) introduce the Multilingual Text Generation benchmark to promote knowledge transfer and cross-lingual generation between arbitrary language pairs. MTG contains 400K human-annotated data samples in five languages, covering four generation tasks: story generation, question generation, title generation, and text summarization.

Pretrained Language Models. In this section, we list the Arabic and multilingual sequence-to-sequence (S2S) pretrained LMs we finetune on Dolphin.

AraT5 (Nagoudi et al., 2022a) is an adaptation of the T5 model specifically designed for the Arabic language. It is pretrained on a large (248GB of Arabic text) and diverse (MSA and Arabic dialects) dataset to effectively handle different Arabic tasks. In addition to Arabic, AraT5's vocabulary covers 11 other languages. In this work, we evaluate a new in-house version of AraT5, dubbed AraT5 v2.

AraT5 v2. Our analysis shows that AraT5 requires a large number of epochs to converge, making it an expensive model. For this reason, we pretrain a new version of the model from scratch, exploiting a larger (∼400GB) and more diverse pretraining dataset than that used by Nagoudi et al. (2022a). As we show in our results, the new model converges faster than AraT5 and achieves better results under our cap of 20 epochs for finetuning across all models.

AraBART (Eddine et al., 2022) is a model based on the encoder-decoder BART-base architecture (Lewis et al., 2020), featuring six encoder and six decoder layers. It is pretrained on the same corpus as AraBERT (Antoun et al., 2020), with reversed preprocessing for more natural text generation.