Hamdy Mubarak - ACL Anthology

Hamdy Mubarak

2026

Nahw: A Comprehensive Benchmark of Arabic Grammar Understanding, Error Detection, Correction, and Explanation
Hamdy Mubarak | Majd Hawasly | Abubakr Mohamed
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Grammar comprehension is a critical capability for large language models (LLMs) to achieve fluency in a target language. In low-resource settings, such as the case with Arabic, limited availability of high-quality data can lead to significant gaps in grammatical understanding, making systematic evaluation essential. We introduce Nahw, a comprehensive benchmark for Arabic grammar that covers both theoretical knowledge and practical applications, including grammatical error detection, correction, and explanation. We evaluate a range of LLMs on these tasks and find that many models still exhibit substantial deficiencies in Arabic grammar comprehension, with GPT-4o achieving a score of 67% on average over all tasks, while the best performing Arabic model in our experiment (ALLaM-7B) achieves 42%. Our experiments also demonstrate that while fine-tuning with synthetic data can improve performance, it does not match the effectiveness of training on natural, high-quality data.

2025

DialG2P: Dialectal Grapheme-to-Phoneme. Arabic as a Case Study
Majd Hawasly | Hamdy Mubarak | Ahmed Abdelali | Ahmed Ali
Proceedings of The Third Arabic Natural Language Processing Conference

Grapheme-to-phoneme (G2P) models are essential components in text-to-speech (TTS) and pronunciation assessment applications. While standard forms of languages have gained attention in that regard, dialectal speech, which often serves as the primary means of spoken communication for many communities, as is the case for Arabic, has not received the same level of focus. In this paper, we introduce an end-to-end dialectal G2P for Egyptian Arabic, a dialect without standard orthography. Our novel architecture accomplishes three tasks: (i) restores short vowels of the diacritical marks for the dialectal text; (ii) maps certain characters that occur only in the spoken version of the dialectal Arabic to their dialect-specific character transcriptions; and finally (iii) converts the previous step output to the corresponding phoneme sequence. We benchmark G2P on a modular cascaded system, a large language model, and our multi-task end-to-end architecture.

IslamicEval 2025: The First Shared Task of Capturing LLMs Hallucination in Islamic Content
Hamdy Mubarak | Rana Malhas | Watheq Mansour | Abubakr Mohamed | Mahmoud Fawzi | Majd Hawasly | Tamer Elsayed | Kareem Mohamed Darwish | Walid Magdy
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

Hallucination in Large Language Models (LLMs) remains a significant challenge and continues to draw substantial research attention. The problem becomes especially critical when hallucinations arise in sensitive domains, such as religious discourse. To address this gap, we introduce IslamicEval 2025—the first shared task specifically focused on evaluating and detecting hallucinations in Islamic content. The task consists of two subtasks: (1) Hallucination Detection and Correction of quoted verses (Ayahs) from the Holy Quran and quoted Hadiths; and (2) Qur’an and Hadith Question Answering, which assesses retrieval models and LLMs by requiring answers to be retrieved from grounded, authoritative sources. Thirteen teams participated in the final phase of the shared task, employing a range of pipelines and frameworks. Their diverse approaches underscore both the complexity of the task and the importance of effectively managing hallucinations in Islamic discourse.

Advancing Arabic Diacritization: Improved Datasets, Benchmarking, and State-of-the-Art Models
Abubakr Mohamed | Hamdy Mubarak
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Arabic diacritics, similar to short vowels in English, provide phonetic and grammatical information but are typically omitted in written Arabic, leading to ambiguity. Diacritization (aka diacritic restoration or vowelization) is essential for natural language processing. This paper advances Arabic diacritization through the following contributions: first, we propose a methodology to analyze and refine a large diacritized corpus to improve training quality. Second, we introduce WikiNews-2024, a multi-reference evaluation methodology with an updated version of the standard benchmark “WikiNews-2014”. In addition, we explore various model architectures and propose a BiLSTM-based model that achieves state-of-the-art results with 3.12% and 2.70% WER on WikiNews-2014 and WikiNews-2024, respectively. Moreover, we develop a model that preserves user-specified diacritics while maintaining accuracy. Lastly, we demonstrate that augmenting training data enhances performance in low-resource settings.
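As a rough illustration of the word error rate (WER) reported in the abstract above, the Python sketch below counts a word as an error when its diacritized form matches none of the available references. The function name, the word-for-word alignment, and the "match any reference" rule are assumptions made for illustration; they are not taken from the paper's evaluation scripts.

```python
def diacritization_wer(predicted, references):
    """Percentage of words whose diacritized form matches no reference.

    predicted:  list of diacritized words for one sentence.
    references: list of reference sentences, each a list of words with
                the same length as `predicted` (hypothetical
                multi-reference setup).
    """
    if not predicted:
        return 0.0
    for ref in references:
        if len(ref) != len(predicted):
            raise ValueError("references must align word-for-word")
    errors = 0
    for i, word in enumerate(predicted):
        # Assumption: a word is correct if ANY reference agrees with it.
        if all(ref[i] != word for ref in references):
            errors += 1
    return 100.0 * errors / len(predicted)
```

With a single reference this reduces to the usual word-level error rate for diacritization, where a word is wrong if any of its diacritics differs.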

BALSAM: A Platform for Benchmarking Arabic Large Language Models
Rawan Al-Matham | Kareem Darwish | Raghad Al-Rasheed | Waad Alshammari | Muneera Alhoshan | Amal Almazrua | Asma Al Wazrah | Mais Alheraki | Firoj Alam | Preslav Nakov | Norah Alzahrani | Eman AlBilali | Nizar Habash | Abdelrahman El-Sheikh | Muhammad Elmallah | Haonan Li | Hamdy Mubarak | Mohamed Anwar | Zaid Alyafeai | Ahmed Abdelali | Nora Altwairesh | Maram Hasanain | Abdulmohsen Al Thubaity | Shady Shehata | Bashar Alhafni | Injy Hamed | Go Inoue | Khalid Elmadani | Ossama Obeid | Fatima Haouari | Tamer Elsayed | Emad Alghamdi | Khalid Almubarak | Saied Alshahrani | Ola Aljarrah | Safa Alajlan | Areej Alshaqarawi | Maryam Alshihri | Sultana Alghurabi | Atikah Alzeghayer | Afrah Altamimi | Abdullah Alfaifi | Abdulrahman AlOsaimy
Proceedings of The Third Arabic Natural Language Processing Conference

The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture
Fakhraddin Alwajih | Abdellah El Mekki | Hamdy Mubarak | Majd Hawasly | Abubakr Mohamed | Muhammad Abdul-Mageed
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

Large Language Models (LLMs) inherently reflect the vast data distributions they encounter during their pre-training phase. As this data is predominantly sourced from the web, there is a high chance it will be skewed towards high-resourced languages and cultures, such as those of the West. Consequently, LLMs often exhibit a diminished understanding of certain communities, a gap that is particularly evident in their knowledge of Arabic and Islamic cultures. This issue becomes even more pronounced with increasingly under-represented topics. To address this critical challenge, we introduce PalmX 2025, the first shared task designed to benchmark the cultural competence of LLMs in these specific domains. The task is composed of two subtasks featuring multiple-choice questions (MCQs) in Modern Standard Arabic (MSA): General Arabic Culture and General Islamic Culture. These subtasks cover a wide range of topics, including traditions, food, history, religious practices, and language expressions from across 22 Arab countries. The initiative drew considerable interest, with 26 teams registering for Subtask 1 and 19 for Subtask 2, culminating in nine and six valid submissions, respectively. Our findings reveal that task-specific fine-tuning substantially boosts performance over baseline models. The top-performing systems achieved an accuracy of 72.15% on cultural questions and 84.22% on Islamic knowledge. Parameter-efficient fine-tuning emerged as the predominant and most effective approach among participants, while the utility of data augmentation was found to be domain-dependent. Ultimately, this benchmark provides a crucial, standardized framework to guide the development of more culturally grounded and competent Arabic LLMs. Results of the shared task demonstrate that general cultural and general religious knowledge remain challenging to LLMs, motivating us to continue to offer the shared task in the future.

ArabicWeb-Edu: Educational Quality Data for Arabic LLM Training
Majd Hawasly | Tasnim Mohiuddin | Hamdy Mubarak | Sabri Boughorbel
Proceedings of The Third Arabic Natural Language Processing Conference

The quality of training data plays a critical role in the performance of large language models (LLMs). This is especially true for low-resource languages where high-quality content is relatively scarce. Inspired by the success of FineWeb-Edu for English, we construct a native Arabic educational-quality dataset using similar methodological principles. We begin by sampling 1 million Arabic web documents from Common Crawl and labeling them into six quality classes (0–5) with the Qwen-2.5-72B-Instruct model using a classification prompt adapted from FineWeb-Edu. These labeled examples are used to train a robust classifier capable of distinguishing educational content from general web text. We train a classification head on top of a multilingual 300M encoder model, then use this classifier to filter a large Arabic web corpus, discarding documents with low educational value. To evaluate the impact of this curation, we pretrain from scratch two bilingual English-Arabic 7B LLMs on 800 billion tokens using the filtered and unfiltered data and compare their performance across a suite of benchmarks. Our results show a significant improvement when using the filtered educational dataset, validating the effectiveness of quality filtering as a component in a balanced data mixture for Arabic LLM development. This work addresses the scarcity of high-quality Arabic training data and offers a scalable methodology for curating educational quality content in low-resource languages.
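The filtering step described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration only: `filter_by_quality`, the `score_fn` callable standing in for the trained classifier, and the cutoff of 3.0 on the 0–5 scale are all assumptions, since the abstract does not state the threshold or interface actually used.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(docs: Iterable[str],
                      score_fn: Callable[[str], float],
                      threshold: float = 3.0) -> Iterator[str]:
    """Yield only documents whose educational score meets the threshold.

    score_fn is a placeholder for a trained quality classifier that
    returns a score on the 0-5 scale; the default threshold is a
    hypothetical choice, not a value reported in the paper.
    """
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc
```

Writing the filter as a generator keeps memory use flat, which matters when the input is a web-scale corpus streamed from disk.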

AraSafe: Benchmarking Safety in Arabic LLMs
Hamdy Mubarak | Abubakr Mohamed | Majd Hawasly
Findings of the Association for Computational Linguistics: EMNLP 2025

We introduce AraSafe, the first large-scale native Arabic safety benchmark for large language models (LLMs), addressing the pressing need for culturally and linguistically representative evaluation resources. The dataset comprises 12K naturally occurring, human-written Arabic prompts containing both harmful and non-harmful content across diverse domains, including linguistics, social studies, and science. Each prompt was independently annotated by two experts into one of nine fine-grained safety categories, including ‘Safe/Not Harmful’, ‘Illegal Activities’, ‘Violence or Harm’, ‘Privacy Violation’, and ‘Hate Speech’. Additionally, to support training classifiers for harmful content and due to the imbalanced representation of harmful content in the natural dataset, we create a synthetic dataset of an additional 12K harmful prompts generated by GPT-4o via carefully designed prompt engineering techniques. We benchmark a number of Arabic-centric and multilingual models in the 7 to 13B parameter range, including Jais, AceGPT, Allam, Fanar, Llama-3, Gemma-2, and Qwen3, as well as BERT-based fine-tuned classifier models on detecting harmful prompts. GPT-4o was used as an upper-bound reference baseline. Our evaluation reveals critical safety blind spots in Arabic LLMs and underscores the necessity of localized, culturally grounded benchmarks for building responsible AI systems.

2024

So Hateful! Building a Multi-Label Hate Speech Annotated Arabic Dataset
Wajdi Zaghouani | Hamdy Mubarak | Md. Rafiul Biswas
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Social media enables widespread propagation of hate speech targeting groups based on ethnicity, religion, or other characteristics. With manual content moderation being infeasible given the volume, automatic hate speech detection is essential. This paper analyzes 70,000 Arabic tweets, from which 15,965 tweets were selected and annotated, to identify hate speech patterns and train classification models. Annotators labeled the Arabic tweets for offensive content, hate speech, emotion intensity and type, effect on readers, humor, factuality, and spam. Key findings reveal 15% of tweets contain offensive language while 6% have hate speech, mostly targeted towards groups with common ideological or political affiliations. Annotations capture diverse emotions, and sarcasm is more prevalent than humor. Additionally, 10% of tweets provide verifiable factual claims, and 7% are deemed important. For hate speech detection, deep learning models like AraBERT outperform classical machine learning approaches. By providing insights into hate speech characteristics, this work enables improved content moderation and reduced exposure to online hate. The annotated dataset advances Arabic natural language processing research and resources.

Beyond Orthography: Automatic Recovery of Short Vowels and Dialectal Sounds in Arabic
Yassine El Kheir | Hamdy Mubarak | Ahmed Ali | Shammur Chowdhury
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper presents a novel Dialectal Sound and Vowelization Recovery framework, designed to recognize borrowed and dialectal sounds within phonologically diverse and dialect-rich languages, that extend beyond their standard orthographic sound sets. The proposed framework utilizes a quantized sequence of input, with or without continuous pretrained self-supervised representations. We show the efficacy of the pipeline using limited data for Arabic, a dialect-rich language containing more than 22 major dialects. Phonetically correct transcribed speech resources for dialectal Arabic are scarce. Therefore, we introduce ArabVoice15, a first-of-its-kind curated test set featuring 5 hours of dialectal speech across 15 Arab countries, with phonetically accurate transcriptions, including borrowed and dialect-specific sounds. We describe in detail the annotation guidelines along with an analysis of the dialectal confusion pairs. Our extensive evaluation includes both subjective human perception tests and objective measures. Our empirical results, reported with three test sets, show that with only one and a half hours of training data, our model improves character error rate by ≈7% on ArabVoice15 compared to the baseline.
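For readers unfamiliar with the character error rate (CER) mentioned in the abstract above, the Python sketch below computes it in the standard way: the Levenshtein edit distance between reference and hypothesis, divided by the reference length. The function name and the choice to raise on an empty reference are illustrative assumptions, and the paper's own scoring pipeline may normalize text differently.

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between the strings / reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(reference)
```

Because the denominator is the reference length, CER can exceed 1.0 when the hypothesis contains many insertions.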

Halwasa: Quantify and Analyze Hallucinations in Large Language Models: Arabic as a Case Study
Hamdy Mubarak | Hend Al-Khalifa | Khaloud Suliman Alkhalefah
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) have shown superb abilities to generate texts that are indistinguishable from human-generated texts in many cases. However, sometimes they generate false, incorrect, or misleading content, which is often described as “hallucinations”. Quantifying and analyzing hallucination in LLMs can increase their reliability and usage. While hallucination is being actively studied for English and other languages, and different benchmarking datasets have been created, this area is not studied at all for Arabic. In our paper, we create the first Arabic dataset that contains 10K sentences generated by LLMs and annotate it for factuality and correctness. We provide a detailed analysis of the dataset to analyze factual and linguistic errors. We found that 25% of the generated sentences are factually incorrect. We share the dataset with the research community.

Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024
Hend Al-Khalifa | Kareem Darwish | Hamdy Mubarak | Mona Ali | Tamer Elsayed
Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

Fact-Checking the Output of Large Language Models via Token-Level Uncertainty Quantification
Ekaterina Fadeeva | Aleksandr Rubashevskii | Artem Shelmanov | Sergey Petrakov | Haonan Li | Hamdy Mubarak | Evgenii Tsymbalov | Gleb Kuzmin | Alexander Panchenko | Timothy Baldwin | Preslav Nakov | Maxim Panov
Findings of the Association for Computational Linguistics: ACL 2024

Large language models (LLMs) are notorious for hallucinating, i.e., producing erroneous claims in their output. Such hallucinations can be dangerous, as occasional factual inaccuracies in the generated text might be obscured by the rest of the output being generally factually correct, making it extremely hard for the users to spot them. Current services that leverage LLMs usually do not provide any means for detecting unreliable generations. Here, we aim to bridge this gap. In particular, we propose a novel fact-checking and hallucination detection pipeline based on token-level uncertainty quantification. Uncertainty scores leverage information encapsulated in the output of a neural network or its layers to detect unreliable predictions, and we show that they can be used to fact-check the atomic claims in the LLM output. Moreover, we present a novel token-level uncertainty quantification method that removes the impact of uncertainty about what claim to generate on the current step and what surface form to use. Our method Claim Conditioned Probability (CCP) measures only the uncertainty of a particular claim value expressed by the model. Experiments on the task of biography generation demonstrate strong improvements for CCP compared to the baselines for seven different LLMs and four languages. Human evaluation reveals that the fact-checking pipeline based on uncertainty quantification is competitive with a fact-checking tool that leverages external knowledge.

LLMeBench: A Flexible Framework for Accelerating LLMs Benchmarking
Fahim Dalvi | Maram Hasanain | Sabri Boughorbel | Basel Mousi | Samir Abdaljalil | Nizi Nazar | Ahmed Abdelali | Shammur Absar Chowdhury | Hamdy Mubarak | Ahmed Ali | Majd Hawasly | Nadir Durrani | Firoj Alam
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

The recent development and success of Large Language Models (LLMs) necessitate an evaluation of their performance across diverse NLP tasks in different languages. Although several frameworks have been developed and made publicly available, their customization capabilities for specific tasks and datasets are often complex for different users. In this study, we introduce the LLMeBench framework, which can be seamlessly customized to evaluate LLMs for any NLP task, regardless of language. The framework features generic dataset loaders, several model providers, and pre-implements most standard evaluation metrics. It supports in-context learning with zero- and few-shot settings. A specific dataset and task can be evaluated for a given LLM in less than 20 lines of code while allowing full flexibility to extend the framework for custom datasets, models, or tasks. The framework has been tested on 31 unique NLP tasks using 53 publicly available datasets within 90 experimental setups, involving approximately 296K data points. We open-sourced LLMeBench for the community (https://github.com/qcri/LLMeBench/) and a video demonstrating the framework is available online (https://youtu.be/9cC2m_abk3A).

Recent advancements in Large Language Models (LLMs) have significantly influenced the landscape of language and speech research. Despite this progress, these models lack specific benchmarking against state-of-the-art (SOTA) models tailored to particular languages and tasks. LAraBench addresses this gap for Arabic Natural Language Processing (NLP) and Speech Processing tasks, including sequence tagging and content classification across different domains. We utilized models such as GPT-3.5-turbo, GPT-4, BLOOMZ, Jais-13b-chat, Whisper, and USM, employing zero and few-shot learning techniques to tackle 33 distinct tasks across 61 publicly available datasets. This involved 98 experimental setups, encompassing ~296K data points, ~46 hours of speech, and 30 sentences for Text-to-Speech (TTS). This effort resulted in 330+ sets of experiments. Our analysis focused on measuring the performance gap between SOTA models and LLMs. The overarching trend observed was that SOTA models generally outperformed LLMs in zero-shot learning, with a few exceptions. Notably, larger computational models with few-shot learning techniques managed to reduce these performance gaps. Our findings provide valuable insights into the applicability of LLMs for Arabic NLP and speech processing tasks.

Wikidata as a Source of Demographic Information
Samir Abdaljalil | Hamdy Mubarak
Proceedings of the Second Arabic Natural Language Processing Conference

Names carry important information about our identities and demographics such as gender, nationality, ethnicity, etc. We investigate the use of an individual’s name, in both Arabic and English, to predict important attributes, namely country, region, gender, and language. We extract data from Wikidata, and normalize it, to build a comprehensive dataset consisting of more than 1 million entities and their normalized attributes. We experiment with a Linear SVM approach, as well as two Transformers approaches consisting of BERT model fine-tuning and Transformers pipeline. Our results indicate that we can predict the gender, language and region using the name only with a confidence over 0.65. The country attribute can be predicted with less accuracy. The Linear SVM approach outperforms the other approaches for all the attributes. The best performing approach was also evaluated on another dataset that consists of 1,500 names from 15 countries (covering different regions) extracted from Twitter, and yields similar results.

2023

ArAIEval Shared Task: Persuasion Techniques and Disinformation Detection in Arabic Text
Maram Hasanain | Firoj Alam | Hamdy Mubarak | Samir Abdaljalil | Wajdi Zaghouani | Preslav Nakov | Giovanni Da San Martino | Abed Freihat
Proceedings of ArabicNLP 2023

We present an overview of the ArAIEval shared task, organized as part of the first ArabicNLP 2023 conference co-located with EMNLP 2023. ArAIEval offers two tasks over Arabic text: (1) persuasion technique detection, focusing on identifying persuasion techniques in tweets and news articles, and (2) disinformation detection in binary and multiclass setups over tweets. A total of 20 teams participated in the final evaluation phase, with 14 and 16 teams participating in Task 1 and Task 2, respectively. Across both tasks, we observe that fine-tuning transformer models such as AraBERT is at the core of the majority of participating systems. We provide a description of the task setup, including a description of dataset construction and the evaluation setup. We also provide a brief overview of the participating systems. All datasets and evaluation scripts from the shared task are released to the research community. We hope this will enable further research on such important tasks within the Arabic NLP community.

2022

Overview of the WANLP 2022 Shared Task on Propaganda Detection in Arabic
Firoj Alam | Hamdy Mubarak | Wajdi Zaghouani | Giovanni Da San Martino | Preslav Nakov
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Propaganda is defined as an expression of opinion or action by individuals or groups deliberately designed to influence opinions or actions of other individuals or groups with reference to predetermined ends; this is achieved by means of well-defined rhetorical and psychological devices. Currently, propaganda (or persuasion) techniques are commonly used on social media to manipulate or mislead social media users. Automatic detection of propaganda techniques from textual, visual, or multimodal content has been studied recently; however, the majority of such efforts focus on English-language content. In this paper, we propose a shared task on detecting propaganda techniques for Arabic textual content. We have done a pilot annotation of 200 Arabic tweets, which we plan to extend to 2,000 tweets, covering diverse topics. We hope that the shared task will help in building a community for Arabic propaganda detection. The dataset will be made publicly available, which can help in future studies.

ArCovidVac: Analyzing Arabic Tweets About COVID-19 Vaccination
Hamdy Mubarak | Sabit Hassan | Shammur Absar Chowdhury | Firoj Alam
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The emergence of the COVID-19 pandemic and the first global infodemic have changed our lives in many different ways. We relied on social media to get the latest information about the COVID-19 pandemic and at the same time to disseminate information. The content in social media consisted not only of health-related advice, plans, and informative news from policymakers, but also of conspiracies and rumors. It became important to identify such information as soon as it is posted to make an actionable decision (e.g., debunking rumors, or taking certain measures for traveling). To address this challenge, we develop and publicly release the first and largest manually annotated Arabic tweet dataset, ArCovidVac, for the COVID-19 vaccination campaign, covering many countries in the Arab region. The dataset is enriched with different layers of annotation, including (i) informativeness (more vs. less importance of the tweets); (ii) fine-grained tweet content types (e.g., advice, rumors, restriction, authentic news/information); and (iii) stance towards vaccination (pro-vaccination, neutral, anti-vaccination). Further, we performed an in-depth analysis of the data, exploring the popularity of different vaccines, trending hashtags, topics, and presence of offensiveness in the tweets. We studied the data for individual types of tweets and temporal changes in stance towards vaccines. We benchmarked the ArCovidVac dataset using transformer architectures for informativeness, content types, and stance detection.

ArabGend: Gender Analysis and Inference on Arabic Twitter
Hamdy Mubarak | Shammur Absar Chowdhury | Firoj Alam
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Gender analysis of Twitter can reveal important socio-cultural differences between male and female users. There has been a significant effort to analyze and automatically infer gender in the past for content in the most widely spoken languages; however, to our knowledge very limited work has been done for Arabic. In this paper, we perform an extensive analysis of differences between male and female users on the Arabic Twitter-sphere. We study differences in user engagement, topics of interest, and the gender gap in professions. Along with gender analysis, we also propose a method to infer gender by utilizing usernames, profile pictures, tweets, and networks of friends. In order to do so, we manually annotated gender and locations for ~166K Twitter accounts associated with ~92K user locations, which we plan to make publicly available. Our proposed gender inference method achieves an F1 score of 82.1% (47.3% higher than the majority baseline). We also developed a demo and made it publicly available.

NatiQ: An End-to-end Text-to-Speech System for Arabic
Ahmed Abdelali | Nadir Durrani | Cenk Demiroglu | Fahim Dalvi | Hamdy Mubarak | Kareem Darwish
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

NatiQ is an end-to-end text-to-speech system for Arabic. Our speech synthesizer uses an encoder-decoder architecture with attention. We used both Tacotron-based models (Tacotron-1 and Tacotron-2) and the faster transformer model for generating mel-spectrograms from characters. We concatenated Tacotron-1 with the WaveRNN vocoder, Tacotron-2 with the WaveGlow vocoder, and the ESPnet transformer with the Parallel WaveGAN vocoder to synthesize waveforms from the spectrograms. To train our models, we used in-house speech data for two voices: 1) neutral male “Hamza”, narrating general content and news, and 2) expressive female “Amina”, narrating children's story books. Our best systems achieve an average Mean Opinion Score (MOS) of 4.21 and 4.40 for Amina and Hamza, respectively. The objective evaluation of the systems using word and character error rate (WER and CER), as well as the response time measured by real-time factor, favored the end-to-end architecture ESPnet. The NatiQ demo is available online at https://tts.qcri.org.

Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Hend Al-Khalifa | Tamer Elsayed | Hamdy Mubarak | Abdulmohsen Al-Thubaity | Walid Magdy | Kareem Darwish
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

Overview of OSACT5 Shared Task on Arabic Offensive Language and Hate Speech Detection
Hamdy Mubarak | Hend Al-Khalifa | Abdulmohsen Al-Thubaity
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

This paper provides an overview of the shared task on detecting offensive language, hate speech, and fine-grained hate speech at the fifth workshop on Open-Source Arabic Corpora and Processing Tools (OSACT5). The shared task comprised three subtasks: Subtask A, involving the detection of offensive language, which contains socially unacceptable or impolite content including any kind of explicit or implicit insults or attacks against individuals or groups; Subtask B, involving the detection of hate speech, which contains offensive language targeting individuals or groups based on common characteristics such as race, religion, gender, etc.; and Subtask C, involving the detection of the fine-grained type of hate speech, which takes one value from the following types: (i) race/ethnicity/nationality, (ii) religion/belief, (iii) ideology, (iv) disability/disease, (v) social class, and (vi) gender. In total, 40 teams signed up to participate in Subtask A, and 17 of them submitted test runs. For Subtask B, 26 teams signed up to participate and 12 of them submitted runs. For Subtask C, 23 teams signed up to participate and 10 of them submitted runs. Ten teams submitted papers describing their participation in one or more subtasks, and 8 papers were accepted. We present and analyze all submissions in this paper.

2021

UL2C: Mapping User Locations to Countries on Arabic Twitter
Hamdy Mubarak | Sabit Hassan
Proceedings of the Sixth Arabic Natural Language Processing Workshop

Mapping user locations to countries can be useful for many applications such as dialect identification, author profiling, recommendation systems, etc. Twitter allows users to declare their locations as free text, and these user-declared locations are often noisy and hard to decipher automatically. In this paper, we present the largest manually labeled dataset for mapping user locations on Arabic Twitter to their corresponding countries. We build effective machine learning models that can automate this mapping with significantly better efficiency compared to libraries such as geopy. We also show that our dataset is more effective than data extracted from the GeoNames geographical database in this task, as the latter covers only locations written in formal ways.

Adult Content Detection on Arabic Twitter: Analysis and Experiments
Hamdy Mubarak | Sabit Hassan | Ahmed Abdelali
Proceedings of the Sixth Arabic Natural Language Processing Workshop

With Twitter being one of the most popular social media platforms in the Arab region, it is not surprising to find accounts that post adult content in Arabic tweets, despite the fact that these platforms dissuade users from such content. In this paper, we present a dataset of Twitter accounts that post adult content. We perform an in-depth analysis of the nature of this data and contrast it with normal tweet content. Additionally, we present extensive experiments with traditional machine learning models, deep neural networks and contextual embeddings to identify such accounts. We show that from user information alone, we can identify such accounts with an F1 score of 94.7% (macro average). With the addition of only one tweet as input, the F1 score rises to 96.8%.

Arabic Offensive Language on Twitter: Analysis and Experiments
Hamdy Mubarak | Ammar Rashed | Kareem Darwish | Younes Samih | Ahmed Abdelali
Proceedings of the Sixth Arabic Natural Language Processing Workshop

Detecting offensive language on Twitter has many applications ranging from detecting/predicting bullying to measuring polarization. In this paper, we focus on building a large Arabic offensive tweet dataset. We introduce a method for building a dataset that is not biased by topic, dialect, or target. We produce the largest Arabic dataset to date with special tags for vulgarity and hate speech. We thoroughly analyze the dataset to determine which topics, dialects, and gender are most associated with offensive tweets and how Arabic speakers use offensive language. Lastly, we conduct many experiments to produce strong results (F1 = 83.2) on the dataset using SOTA techniques.

ArCorona: Analyzing Arabic Tweets in the Early Days of Coronavirus (COVID-19) Pandemic
Hamdy Mubarak | Sabit Hassan
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

Over the past few months, there were huge numbers of circulating tweets and discussions about Coronavirus (COVID-19) in the Arab region. It is important for policy makers and many people to identify types of shared tweets to better understand public behavior, topics of interest, requests from governments, sources of tweets, etc. It is also crucial to prevent spreading of rumors and misinformation about the virus or bad cures. To this end, we present the largest manually annotated dataset of Arabic tweets related to COVID-19. We describe annotation guidelines, analyze our dataset and build effective machine learning and transformer based models for classification.

ASAD: Arabic Social media Analytics and unDerstanding
Sabit Hassan | Hamdy Mubarak | Ahmed Abdelali | Kareem Darwish
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

This system demonstration paper describes ASAD: Arabic Social media Analysis and unDerstanding, a suite of seven individual modules that allows users to determine dialects, sentiment, news category, offensiveness, hate speech, adult content, and spam in Arabic tweets. The suite is made available through a web API and a web interface where users can enter text or upload files.

With the emergence of the COVID-19 pandemic, the political and the medical aspects of disinformation merged as the problem got elevated to a whole new level to become the first global infodemic. Fighting this infodemic has been declared one of the most important focus areas of the World Health Organization, with dangers ranging from promoting fake cures, rumors, and conspiracy theories to spreading xenophobia and panic. Addressing the issue requires solving a number of challenging problems such as identifying messages containing claims, determining their check-worthiness and factuality, and their potential to do harm as well as the nature of that harm, to mention just a few. To address this gap, we release a large dataset of 16K manually annotated tweets for fine-grained disinformation analysis that (i) focuses on COVID-19, (ii) combines the perspectives and the interests of journalists, fact-checkers, social media platforms, policy makers, and society, and (iii) covers Arabic, Bulgarian, Dutch, and English. Finally, we show strong evaluation results using pretrained Transformers, thus confirming the practical utility of the dataset in monolingual vs. multilingual, and single task vs. multitask settings.

QASR: QCRI Aljazeera Speech Resource A Large Scale Annotated Arabic Speech Corpus
Hamdy Mubarak | Amir Hussein | Shammur Absar Chowdhury | Ahmed Ali
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, speaker information among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics- based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcript. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.

QADI: Arabic Dialect Identification in the Wild
Ahmed Abdelali | Hamdy Mubarak | Younes Samih | Sabit Hassan | Kareem Darwish
Proceedings of the Sixth Arabic Natural Language Processing Workshop

Proper dialect identification is important for a variety of Arabic NLP applications. In this paper, we present a method for rapidly constructing a tweet dataset containing a wide range of country-level Arabic dialects, covering 18 different countries in the Middle East and North Africa region. Our method relies on applying multiple filters to identify users who belong to different countries based on their account descriptions and to eliminate tweets that are either written mainly in Modern Standard Arabic or mostly use vulgar language. The resultant dataset contains 540k tweets from 2,525 users who are evenly distributed across 18 Arab countries. Using intrinsic evaluation, we show that the labels of a set of randomly selected tweets are 91.5% accurate. For extrinsic evaluation, we are able to build effective country-level dialect identification on tweets with a macro-averaged F1-score of 60.6% across 18 classes.

2020

A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection
Shammur Absar Chowdhury | Hamdy Mubarak | Ahmed Abdelali | Soon-gyo Jung | Bernard J. Jansen | Joni Salminen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Access to social media often enables users to engage in conversation with limited accountability. This allows a user to share their opinions and ideology, especially regarding public content, occasionally adopting offensive language. This may encourage hate crimes or cause mental harm to targeted individuals or groups. Hence, it is important to detect offensive comments in social media platforms. Typically, most studies focus on offensive commenting in one platform only, even though the problem of offensive language is observed across multiple platforms. Therefore, in this paper, we introduce and make publicly available a new dialectal Arabic news comment dataset, collected from multiple social media platforms, including Twitter, Facebook, and YouTube. We follow two-step crowd-annotator selection criteria for a low-representative language annotation task on a crowdsourcing platform. Furthermore, we analyze the distinctive lexical content along with the use of emojis in offensive comments. We train and evaluate the classifiers using the annotated multi-platform dataset along with other publicly available data. Our results highlight the importance of a multi-platform dataset for (a) cross-platform, (b) cross-domain, and (c) cross-dialect generalization of classifier performance.

Constructing a Bilingual Corpus of Parallel Tweets
Hamdy Mubarak | Sabit Hassan | Ahmed Abdelali
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

In a bid to reach a larger and more diverse audience, Twitter users often post parallel tweets, i.e., tweets that contain the same content but are written in different languages. Parallel tweets can be an important resource for developing machine translation (MT) systems among other natural language processing (NLP) tasks. In this paper, we introduce a generic method for collecting parallel tweets. Using this method, we collect a bilingual corpus of English-Arabic parallel tweets and a list of Twitter accounts that post English-Arabic tweets regularly. Since our method is generic, it can also be used for collecting parallel tweets that cover less-resourced languages such as Serbian and Urdu. Additionally, we annotate a subset of Twitter accounts with their countries of origin and topic of interest, which provides insights about the population who post parallel tweets. This latter information can also be useful for author profiling tasks.

Overview of OSACT4 Arabic Offensive Language Detection Shared Task
Hamdy Mubarak | Kareem Darwish | Walid Magdy | Tamer Elsayed | Hend Al-Khalifa
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

This paper provides an overview of the offensive language detection shared task at the 4th workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4). There were two subtasks, namely: Subtask A, involving the detection of offensive language, which contains unacceptable or vulgar content in addition to any kind of explicit or implicit insults or attacks against individuals or groups; and Subtask B, involving the detection of hate speech, which contains insults or threats targeting a group based on their nationality, ethnicity, race, gender, political or sport affiliation, religious belief, or other common characteristics. In total, 40 teams signed up to participate in Subtask A, and 14 of them submitted test runs. For Subtask B, 33 teams signed up to participate and 13 of them submitted runs. We present and analyze all submissions in this paper.

Arabic Curriculum Analysis
Hamdy Mubarak | Shimaa Amer | Ahmed Abdelali | Kareem Darwish
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

Developing a platform that analyzes the content of curricula can help identify their shortcomings and whether they are tailored to specific desired outcomes. In this paper, we present a system to analyze Arabic curricula and provide insights into their content. It allows users to explore word presence, surface-forms used, as well as contrasting statistics between different countries from which the curricula were selected. Also, it provides a facility to grade text in reference to given grade-level and gives users feedback about the complexity or difficulty of words used in a text.

SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Marcos Zampieri | Preslav Nakov | Sara Rosenthal | Pepa Atanasova | Georgi Karadzhov | Hamdy Mubarak | Leon Derczynski | Zeses Pitenis | Çağrı Çöltekin
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.

ALT Submission for OSACT Shared Task on Offensive Language Detection
Sabit Hassan | Younes Samih | Hamdy Mubarak | Ahmed Abdelali | Ammar Rashed | Shammur Absar Chowdhury
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

In this paper, we describe our efforts at OSACT Shared Task on Offensive Language Detection. The shared task consists of two subtasks: offensive language detection (Subtask A) and hate speech detection (Subtask B). For offensive language detection, a system combination of Support Vector Machines (SVMs) and Deep Neural Networks (DNNs) achieved the best results on development set, which ranked 1st in the official results for Subtask A with F1-score of 90.51% on the test set. For hate speech detection, DNNs were less effective and a system combination of multiple SVMs with different parameters achieved the best results on development set, which ranked 4th in official results for Subtask B with F1-macro score of 80.63% on the test set.

Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
Hend Al-Khalifa | Walid Magdy | Kareem Darwish | Tamer Elsayed | Hamdy Mubarak
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

ALT at SemEval-2020 Task 12: Arabic and English Offensive Language Identification in Social Media
Sabit Hassan | Younes Samih | Hamdy Mubarak | Ahmed Abdelali
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the systems submitted by the Arabic Language Technology group (ALT) at SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media. We focus on sub-task A (Offensive Language Identification) for two languages: Arabic and English. Our efforts for both languages achieved more than 90% macro-averaged F1-score on the official test set. For Arabic, the best results were obtained by a system combination of Support Vector Machine, Deep Neural Network, and fine-tuned Bidirectional Encoder Representations from Transformers (BERT). For English, the best results were obtained by fine-tuning BERT.

2019

QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification
Younes Samih | Hamdy Mubarak | Ahmed Abdelali | Mohammed Attia | Mohamed Eldesouki | Kareem Darwish
Proceedings of the Fourth Arabic Natural Language Processing Workshop

This paper describes the QC-GO team submission to the MADAR Shared Task Subtask 1 (travel domain dialect identification) and Subtask 2 (Twitter user location identification). In our participation in both subtasks, we explored a number of approaches and system combinations to obtain the best performance for both tasks. These include deep neural nets and heuristics. Since individual approaches suffer from various shortcomings, the combination of different approaches was able to fill some of these gaps. Our system achieves F1-Scores of 66.1% and 67.0% on the development sets for Subtasks 1 and 2 respectively.

Highly Effective Arabic Diacritization using Sequence to Sequence Modeling
Hamdy Mubarak | Ahmed Abdelali | Hassan Sajjad | Younes Samih | Kareem Darwish
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Arabic text is typically written without short vowels (or diacritics). However, their presence is required for properly verbalizing Arabic and is hence essential for applications such as text to speech. There are two types of diacritics, namely core-word diacritics and case-endings. Most previous works on automatic Arabic diacritic recovery rely on a large number of manually engineered features, particularly for case-endings. In this work, we present a unified character level sequence-to-sequence deep learning model that recovers both types of diacritics without the use of explicit feature engineering. Specifically, we employ a standard neural machine translation setup on overlapping windows of words (broken down into characters), and then we use voting to select the most likely diacritized form of a word. The proposed model outperforms all previous state-of-the-art systems. Our best settings achieve a word error rate (WER) of 4.49% compared to the state-of-the-art of 12.25% on a standard dataset.

POS Tagging for Improving Code-Switching Identification in Arabic
Mohammed Attia | Younes Samih | Ali Elkahky | Hamdy Mubarak | Ahmed Abdelali | Kareem Darwish
Proceedings of the Fourth Arabic Natural Language Processing Workshop

When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern in improving code-switching identification between Modern Standard Arabic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong the POS signal is in word-level code-switching identification. We build a deep learning model enriched with linguistic features (including POS tags) that outperforms the state-of-the-art results by 1.9% on the development set and 1.0% on the test set. We also show that in intra-sentential code-switching, the selection of lexical items is constrained by POS categories, where function words tend to come more often from the dialectal language while the majority of content words come from the standard language.

A System for Diacritizing Four Varieties of Arabic
Hamdy Mubarak | Ahmed Abdelali | Kareem Darwish | Mohamed Eldesouki | Younes Samih | Hassan Sajjad
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations

Short vowels, aka diacritics, are more often omitted when writing different varieties of Arabic including Modern Standard Arabic (MSA), Classical Arabic (CA), and Dialectal Arabic (DA). However, diacritics are required to properly pronounce words, which makes diacritic restoration (a.k.a. diacritization) essential for language learning and text-to-speech applications. In this paper, we present a system for diacritizing MSA, CA, and two varieties of DA, namely Moroccan and Tunisian. The system uses a character level sequence-to-sequence deep learning model that requires no feature engineering and beats all previous SOTA systems for all the Arabic varieties that we test on.

2018

Build Fast and Accurate Lemmatization for Arabic
Hamdy Mubarak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Part-of-Speech Tagging for Arabic Gulf Dialect Using Bi-LSTM
Randah Alharbi | Walid Magdy | Kareem Darwish | Ahmed AbdelAli | Hamdy Mubarak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Multi-Dialect Arabic POS Tagging: A CRF Approach
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali | Mohamed Eldesouki | Younes Samih | Randah Alharbi | Mohammed Attia | Walid Magdy | Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Arabic POS Tagging: Don’t Abandon Feature Engineering Just Yet
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali | Mohamed Eldesouki
Proceedings of the Third Arabic Natural Language Processing Workshop

This paper focuses on comparing Support Vector Machine based ranking (SVM-Rank) and Bidirectional Long-Short-Term-Memory (bi-LSTM) neural-network based sequence labeling in building a state-of-the-art Arabic part-of-speech tagging system. Using SVM-Rank leads to state-of-the-art results, but with a fair amount of feature engineering. Using bi-LSTM, particularly when combined with word embeddings, may lead to competitive POS-tagging results by automatically deducing latent linguistic features. However, we show that augmenting bi-LSTM sequence labeling with some of the features that we used for the SVM-Rank based tagger yields further improvements. We also show that gains realized by using embeddings may not be additive with the gains achieved by the features. We are open-sourcing both the SVM-Rank and the bi-LSTM based systems for free.

Arabic Diacritization: Stats, Rules, and Hacks
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali
Proceedings of the Third Arabic Natural Language Processing Workshop

In this paper, we present a new and fast state-of-the-art Arabic diacritizer that guesses the diacritics of words and then their case endings. We employ a Viterbi decoder at word-level with back-off to stem, morphological patterns, and transliteration and sequence labeling based diacritization of named entities. For case endings, we use Support Vector Machine (SVM) based ranking coupled with morphological patterns and linguistic rules to properly guess case endings. We achieve a low word-level diacritization error of 3.29% and 12.77% without and with case endings respectively on a new multi-genre, copyright-free test set. We are making the diacritizer available for free for research purposes.

QCRI Live Speech Translation System
Fahim Dalvi | Yifan Zhang | Sameer Khurana | Nadir Durrani | Hassan Sajjad | Ahmed Abdelali | Hamdy Mubarak | Ahmed Ali | Stephan Vogel
Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics

This paper presents QCRI’s Arabic-to-English live speech translation system. It features modern web technologies to capture live audio, and broadcasts Arabic transcriptions and English translations simultaneously. Our Kaldi-based ASR system uses the Time Delay Neural Network (TDNN) architecture, while our Machine Translation (MT) system uses both phrase-based and neural frameworks. Although our neural MT system is slower than the phrase-based system, it produces significantly better translations and is memory efficient. The demo is available at https://st.qcri.org/demos/livetranslation.

Learning from Relatives: Unified Dialectal Arabic Segmentation
Younes Samih | Mohamed Eldesouki | Mohammed Attia | Kareem Darwish | Ahmed Abdelali | Hamdy Mubarak | Laura Kallmeyer
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Arabic dialects do not just share a common koiné, but there are shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is contingent on geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.

SemEval-2017 Task 3: Community Question Answering
Preslav Nakov | Doris Hoogeveen | Lluís Màrquez | Alessandro Moschitti | Hamdy Mubarak | Timothy Baldwin | Karin Verspoor
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We describe SemEval–2017 Task 3 on Community Question Answering. This year, we reran the four subtasks from SemEval-2016: (A) Question–Comment Similarity, (B) Question–Question Similarity, (C) Question–External Comment Similarity, and (D) Rerank the correct answers for a new question in Arabic, providing all the data from 2015 and 2016 for training, and fresh data for testing. Additionally, we added a new subtask E in order to enable experimentation with Multi-domain Question Duplicate Detection in a larger-scale scenario, using StackExchange subforums. A total of 23 teams participated in the task, and submitted a total of 85 runs (36 primary and 49 contrastive) for subtasks A–D. Unfortunately, no teams participated in subtask E. A variety of approaches and features were used by the participating systems to address the different subtasks. The best systems achieved an official score (MAP) of 88.43, 47.22, 15.46, and 61.16 in subtasks A, B, C, and D, respectively. These scores are better than the baselines, especially for subtasks A–C.

Abusive Language Detection on Arabic Social Media
Hamdy Mubarak | Kareem Darwish | Walid Magdy
Proceedings of the First Workshop on Abusive Language Online

In this paper, we present our work on detecting abusive language on Arabic social media. We extract a list of obscene words and hashtags using common patterns used in offensive and rude communications. We also classify Twitter users according to whether they use any of these words or not in their tweets. We expand the list of obscene words using this classification, and we report results on a newly created dataset of classified Arabic tweets (obscene, offensive, and clean). We make this dataset freely available for research, in addition to the list of obscene words and hashtags. We are also publicly releasing a large corpus of classified user comments that were deleted from a popular Arabic news site due to violations of the site’s rules and guidelines.

A Neural Architecture for Dialectal Arabic Segmentation
Younes Samih | Mohammed Attia | Mohamed Eldesouki | Ahmed Abdelali | Hamdy Mubarak | Laura Kallmeyer | Kareem Darwish
Proceedings of the Third Arabic Natural Language Processing Workshop

The automated processing of Arabic dialects is challenging due to the lack of spelling standards and to the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a segmenter can be trained with neural networks on only 350 annotated tweets, without any normalization or use of lexical features or lexical resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that rely on additional resources.

2016

SemEval-2016 Task 3: Community Question Answering
Preslav Nakov | Lluís Màrquez | Alessandro Moschitti | Walid Magdy | Hamdy Mubarak | Abed Alhakim Freihat | Jim Glass | Bilal Randeree
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

Farasa: A New Fast and Accurate Arabic Word Segmenter
Kareem Darwish | Hamdy Mubarak
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

In this paper, we present Farasa (meaning insight in Arabic), which is a fast and accurate Arabic segmenter. Segmentation involves breaking Arabic words into their constituent clitics. Our approach is based on SVMrank using linear kernels. The features that we utilized account for: likelihood of stems, prefixes, suffixes, and their combination; presence in lexicons containing valid stems and named entities; and underlying stem templates. Farasa outperforms or matches state-of-the-art Arabic segmenters, namely QATARA and MADAMIRA. Meanwhile, Farasa is nearly one order of magnitude faster than QATARA and two orders of magnitude faster than MADAMIRA. The segmenter should be able to process one billion words in less than 5 hours. Farasa is written entirely in native Java, with no external dependencies, and is open-source.

Arabic to English Person Name Transliteration using Twitter
Hamdy Mubarak | Ahmed Abdelali
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Social media outlets are providing new opportunities for harvesting valuable resources. We present a novel approach for mining data from Twitter for the purpose of building transliteration resources and systems. Such resources are crucial in translation and retrieval tasks. We demonstrate the benefits of the approach on Arabic to English transliteration. The contributions of this approach include the size of data that can be collected and exploited within a limited time span, its generic nature, which allows it to be adapted to other languages, and its ability to cope with new transliteration phenomena and trends. A statistical transliteration system built using this data improved over a comparable system built from Wikipedia wikilinks data.

Farasa: A Fast and Furious Segmenter for Arabic
Ahmed Abdelali | Kareem Darwish | Nadir Durrani | Hamdy Mubarak
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

2015

Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach
Yonatan Belinkov | Alberto Barrón-Cedeño | Hamdy Mubarak
Proceedings of the Second Workshop on Arabic Natural Language Processing

QCRI: Answer Selection for Community Question Answering - Experiments for Arabic and English
Massimo Nicosia | Simone Filice | Alberto Barrón-Cedeño | Iman Saleh | Hamdy Mubarak | Wei Gao | Preslav Nakov | Giovanni Da San Martino | Alessandro Moschitti | Kareem Darwish | Lluís Màrquez | Shafiq Joty | Walid Magdy
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Classifying Arab Names Geographically
Hamdy Mubarak | Kareem Darwish
Proceedings of the Second Workshop on Arabic Natural Language Processing

QCRI@QALB-2015 Shared Task: Correction of Arabic Text for Native and Non-Native Speakers’ Errors
Hamdy Mubarak | Kareem Darwish | Ahmed Abdelali
Proceedings of the Second Workshop on Arabic Natural Language Processing

Best Practices for Crowdsourcing Dialectal Arabic Speech Transcription
Samantha Wray | Hamdy Mubarak | Ahmed Ali
Proceedings of the Second Workshop on Arabic Natural Language Processing

2014

Using Stem-Templates to Improve Arabic POS and Gender/Number Tagging
Kareem Darwish | Ahmed Abdelali | Hamdy Mubarak
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents an end-to-end automatic processing system for Arabic. The system performs: correction of common spelling errors pertaining to different forms of alef, ta marbouta and ha, and alef maqsoura and ya; context sensitive word segmentation into underlying clitics, POS tagging, and gender and number tagging of nouns and adjectives. We introduce the use of stem templates as a feature to improve POS tagging by 0.5% and to help ascertain the gender and number of nouns and adjectives. For gender and number tagging, we report accuracies that are significantly higher on previously unseen words compared to a state-of-the-art system.

Advances in dialectal Arabic speech recognition: a study using Twitter to improve Egyptian ASR
Ahmed Ali | Hamdy Mubarak | Stephan Vogel
Proceedings of the 11th International Workshop on Spoken Language Translation: Papers

This paper reports results in building an Egyptian Arabic speech recognition system as an example for under-resourced languages. We investigated different approaches to building the system using 10 hours for training the acoustic model, and report results for both a grapheme-based system and a phoneme-based system using MADA. The phoneme-based system shows better results than the grapheme-based system. We also explore the use of tweets written in dialectal Arabic. Using 880K Egyptian tweets reduced the out-of-vocabulary (OOV) rate from 15.1% to 3.2% and the WER from 59.6% to 44.7%, a relative gain of 25% in WER.
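
The relative gain quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not taken from the paper: the helper name `relative_reduction` is hypothetical, and it assumes the usual definition of relative reduction, (before - after) / before, applied to the WER and OOV figures stated in the abstract.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after` as a fraction.

    Assumes the common definition (before - after) / before; the function
    name is hypothetical and used here only for illustration.
    """
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Figures quoted in the abstract (percent values).
wer_gain = relative_reduction(59.6, 44.7)  # word error rate
oov_gain = relative_reduction(15.1, 3.2)   # out-of-vocabulary rate

print(f"Relative WER reduction: {wer_gain:.1%}")
print(f"Relative OOV reduction: {oov_gain:.1%}")
```

Under this definition the WER figures work out to a relative reduction of about 25%, which matches the number stated in the abstract. The same formula applied to the OOV rates gives roughly 79%, a figure the abstract itself does not report.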

Automatic Correction of Arabic Text: a Cascaded Approach
Hamdy Mubarak | Kareem Darwish
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Using Twitter to Collect a Multi-Dialectal Corpus of Arabic
Hamdy Mubarak | Kareem Darwish
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Verifiably Effective Arabic Dialect Identification
Kareem Darwish | Hassan Sajjad | Hamdy Mubarak
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Co-authors

Preslav Nakov 9

Shammur Absar Chowdhury 8

Hend Al-Khalifa 6

Nadir Durrani 6

Mohamed Eldesouki 6

Tamer Elsayed 6

Mohammed Attia 5

Abubakr Mohamed 5

Hassan Sajjad 5

Samir Abdaljalil 4

Giovanni Da San Martino 4

Maram Hasanain 4

Wajdi Zaghouani 4

Abdulmohsen Al-Thubaity 3

Sabri Boughorbel 3

Laura Kallmeyer 3

Alessandro Moschitti 3

Lluís Màrquez 3

Randah Alharbi 2

Timothy Baldwin 2

Alberto Barrón-Cedeño 2

Yassine El Kheir 2

Stephan Vogel 2

Muhammad Abdul-Mageed 1

Asma Al Wazrah 1

Abdulaziz Al-Homaid 1

Rawan Al-Matham 1

Raghad Al-Rasheed 1

Abdulrahman AlOsaimy 1

Eman Albilali 1

Abdullah Alfaifi 1

Emad Alghamdi 1

Sultana Alghurabi 1

Bashar Alhafni 1

Mais Alheraki 1

Muneera Alhoshan 1

Khaloud Suliman Alkhalefah 1

Amal Almazrua 1

Khalid Almubarak 1

Saied Alshahrani 1

Waad Thuwaini Alshammari 1

Areej Alshaqarawi 1

Maryam Alshihri 1

Afrah Altamimi 1

Nora Altwairesh 1

Fakhraddin Alwajih 1

Zaid Alyafeai 1

Norah A. Alzahrani 1

Atikah Alzeghayer 1

Mohamed Anwar 1

Pepa Atanasova 1

Yonatan Belinkov 1

Md. Rafiul Biswas 1

Britt Bruntink 1

Tommaso Caselli 1

Cagri Coltekin 1

Kareem Mohamed Darwish 1

Cenk Demiroglu 1

Leon Derczynski 1

Abdellah El Mekki 1

Abdelrahman El-Sheikh 1

Khalid Elmadani 1

Muhammad Elmallah 1

Youssef Elshahawy 1

Ekaterina Fadeeva 1

Mahmoud Fawzi 1

Simone Filice 1

Abed Alhakim Freihat 1

Fatima Haouari 1

Doris Hoogeveen 1

Bernard J Jansen 1

Soon-gyo Jung 1

Georgi Karadzhov 1

Sameer Khurana 1

Watheq Mansour 1

Nataša Milić-Frayling 1

Muhammad Tasnim Mohiuddin 1

Massimo Nicosia 1

Alexander Panchenko 1

Sergey Petrakov 1

Zeses Pitenis 1

Bilal Randeree 1

Sara Rosenthal 1

Aleksandr Rubashevskii 1

Joni Salminen 1

Shady Shehata 1

Artem Shelmanov 1

Evgenii Tsymbalov 1

Karin Verspoor 1

Samantha Wray 1

Marcos Zampieri 1

Venues