Workshop on Open-Source Arabic Corpora and Processing Tools (2024)


Proceedings of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT) with Shared Tasks on Arabic LLMs Hallucination and Dialect to MSA Machine Translation @ LREC-COLING 2024

Hend Al-Khalifa | Kareem Darwish | Hamdy Mubarak | Mona Ali | Tamer Elsayed

AraTar: A Corpus to Support the Fine-grained Detection of Hate Speech Targets in the Arabic Language
Seham Alghamdi | Youcef Benkhedda | Basma Alharbi | Riza Batista-Navarro

We are currently witnessing a concerning surge in the spread of hate speech across social media platforms, targeting individuals or groups based on protected characteristics such as race, religion, nationality, and gender. This paper focuses on the detection of hate type (Task 1) and hate target (Task 2) in the Arabic language. To address this problem comprehensively, we combined and re-annotated hate speech tweets from existing publicly available corpora, resulting in AraTar, the first and largest Arabic corpus annotated for multi-label classification of both hate speech type and target, with high inter-annotator agreement. Additionally, we sought to determine the most effective machine learning approach for this task. To this end, we compared and evaluated several approaches: (1) traditional machine learning models, (2) deep learning models fed with contextual embeddings, and (3) fine-tuned language models (LMs). Our results demonstrate that fine-tuning LMs, specifically AraBERTv0.2-twitter (base), achieved the highest performance, with micro-averaged F1-scores of 84.5% and 85.03% and macro-averaged F1-scores of 77.46% and 73.15% for Tasks 1 and 2, respectively.
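
As a rough illustration of the best-performing configuration, the sketch below shows how multi-label fine-tuning can be set up with HuggingFace Transformers. The checkpoint id and the label set are assumptions for illustration; the paper does not publish its training code.

```python
# Minimal multi-label fine-tuning sketch (illustrative, not the authors' code).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "aubmindlab/bert-base-arabertv02-twitter"  # assumed id for AraBERTv0.2-twitter (base)
LABELS = ["race", "religion", "nationality", "gender"]  # hypothetical label set

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL,
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # uses BCE-with-logits loss
)

batch = tokenizer(["tweet text here"], truncation=True, padding=True, return_tensors="pt")
targets = torch.tensor([[1.0, 0.0, 0.0, 1.0]])  # multi-hot labels per tweet
loss = model(**batch, labels=targets).loss       # ready for an optimizer step
```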

CLEANANERCorp: Identifying and Correcting Incorrect Labels in the ANERcorp Dataset
Mashael AlDuwais | Hend Al-Khalifa | Abdulmalik AlSalman

Label errors are a common issue in machine learning datasets, particularly for tasks such as Named Entity Recognition (NER). Such label errors can hurt model training, affect evaluation results, and lead to an inaccurate assessment of model performance. In this study, we dived deep into one of the most widely adopted Arabic NER benchmark datasets (ANERcorp) and found a significant number of annotation errors, missing labels, and inconsistencies. We therefore conducted empirical research to understand these errors, correct them, and propose a cleaner version of the dataset, named CLEANANERCorp. CLEANANERCorp will serve the research community as a more accurate and consistent benchmark.

Munazarat 1.0: A Corpus of Arabic Competitive Debates
Mohammad M. Khader | AbdulGabbar Al-Sharafi | Mohamad Hamza Al-Sioufy | Wajdi Zaghouani | Ali Al-Zawqari

This paper introduces Munazarat, the Corpus of Arabic Competitive Debates. Despite the significance of competitive debating as an activity that fosters critical thinking and promotes dialogue, researchers in Arabic Natural Language Processing (NLP), linguistics, argumentation studies, and education have access to very limited datasets on competitive debating. At this stage of the study, we introduce Munazarat 1.0, which comprises approximately 50 hours of recordings from 73 debates at QatarDebate-recognized tournaments, all of which are available on YouTube. Munazarat is a novel specialized Arabic speech corpus, mostly in Modern Standard Arabic (MSA), covering diverse debating topics and accompanied by rich metadata for each debate. The debates were transcribed using Fenek, a speech-to-text tool by Kanari AI, and three native Arabic speakers reviewed each transcription file to improve on the quality provided by the machine. The Munazarat 1.0 dataset can be used to train Arabic NLP tools, develop argumentation mining systems, and analyze Arabic argumentation and rhetorical styles. Keywords: Arabic Speech Corpus, Modern Standard Arabic, Debates

Leveraging Corpus Metadata to Detect Template-based Translation: An Exploratory Case Study of the Egyptian Arabic Wikipedia Edition
Saied Alshahrani | Hesham Haroon Mohammed | Ali Elfilali | Mariama Njie | Jeanna Matthews

Wikipedia articles (content pages) are commonly used corpora in Natural Language Processing (NLP) research, especially for low-resource languages other than English. Yet few studies have examined the three Arabic Wikipedia editions, Arabic Wikipedia (AR), Egyptian Arabic Wikipedia (ARZ), and Moroccan Arabic Wikipedia (ARY), or documented the issues in the Egyptian Arabic Wikipedia edition arising from the massive automatic creation of articles via template-based translation from English to Arabic without human involvement. This practice has overwhelmed the Egyptian Arabic Wikipedia with articles that not only have low-quality content but also fail to represent the Egyptian people, their culture, and their dialect. In this paper, we aim to mitigate this template-translation problem by identifying template-translated articles and their characteristics through exploratory analysis and by building automatic detection systems. We first explore the content of the three Arabic Wikipedia editions in terms of density, quality, and human contributions, and use the resulting insights to build multivariate machine learning classifiers that leverage articles’ metadata to detect template-translated articles automatically. We then publicly deploy and host the best-performing classifier as an online application called ‘Egyptian Wikipedia Scanner’ and release the extracted, filtered, labeled, and preprocessed datasets so that the research community can benefit from our datasets and the online, web-based detection system.
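
The metadata-based detection described above can be pictured with a short scikit-learn sketch; the file name and feature columns below are hypothetical stand-ins for the paper's actual metadata.

```python
# Illustrative metadata classifier for template-translated articles
# (feature names and file are hypothetical; see the paper for the real ones).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("arz_article_metadata.csv")  # hypothetical metadata export
features = ["total_edits", "total_editors", "total_bytes", "total_words"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["template_translated"], test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```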

A Novel Approach for Root Selection in the Dependency Parsing
Sharefah Ahmed Al-Ghamdi | Hend Al-Khalifa | Abdulmalik AlSalman

Although syntactic analysis using the sequence labeling method is promising, it can be problematic when the label sequence does not contain a root label. This can result in errors in the final parse tree when the postprocessing method assumes the first word is the root. In this paper, we present a novel postprocessing method for BERT-based dependency parsing as sequence labeling. Our method leverages the root’s part-of-speech tag to select a more suitable root for the dependency tree, instead of defaulting to the first token. We conducted experiments on nine dependency treebanks from different languages and domains, and demonstrate that our technique consistently improves the labeled attachment score (LAS) on most of them.
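
A minimal sketch of the general idea, selecting a root by part-of-speech when the predicted label sequence yields none, is shown below; the POS priority order is a hypothetical example, not the paper's exact heuristic.

```python
# Root-selection post-processing sketch (illustrative heuristic).
ROOT_POS_PRIORITY = ["VERB", "AUX", "NOUN", "PROPN"]  # hypothetical order

def select_root(pos_tags, predicted_heads):
    """Return the index of the token to treat as the tree root.
    predicted_heads[i] == 0 means token i was labeled as root."""
    roots = [i for i, head in enumerate(predicted_heads) if head == 0]
    if roots:
        return roots[0]                 # sequence already contains a root
    for pos in ROOT_POS_PRIORITY:       # otherwise fall back on POS evidence
        for i, tag in enumerate(pos_tags):
            if tag == pos:
                return i
    return 0                            # last resort: default first token

print(select_root(["PRON", "VERB", "ADV"], [2, 2, 2]))  # -> 1 (the verb)
```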

AraMed: Arabic Medical Question Answering using Pretrained Transformer Language Models
Ashwag Alasmari | Sarah Alhumoud | Waad Alshammari

Medical Question Answering (MQA) systems have gained significant attention in recent years due to their potential to enhance medical decision-making and improve patient care. However, most research in this field has focused on English-language datasets, limiting the generalizability of MQA systems to non-English-speaking regions. This study introduces AraMed, a large-scale Arabic medical question answering dataset that addresses the limited resources available for Arabic medical question answering. AraMed comprises 270k question-answer pairs based on health consumer questions submitted to an online medical forum. Experiments using various deep learning models showcase the dataset’s effectiveness; in particular, AraBERT models achieved the highest results, with AraBERTv2 obtaining an F1-score of 96.73% on the answer selection task. A comparative analysis of the different deep learning models provides insights into their strengths and limitations. These findings highlight the potential of AraMed for advancing Arabic medical question answering research and development.
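
Answer selection of this kind is typically cast as sentence-pair scoring; the sketch below illustrates that setup with a BERT-style cross-encoder. The checkpoint id is the public AraBERTv2 base model, assumed here for illustration; a classifier head fine-tuned on AraMed would be needed for meaningful scores.

```python
# Illustrative answer-selection scoring with a sentence-pair classifier.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "aubmindlab/bert-base-arabertv2"  # assumed base checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

question = "ما هو علاج الصداع النصفي؟"            # a consumer health question
candidates = ["إجابة الطبيب الأولى", "إجابة الطبيب الثانية"]
enc = tokenizer([question] * len(candidates), candidates,
                truncation=True, padding=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**enc).logits.softmax(-1)[:, 1]  # P(relevant) per answer
print(candidates[int(scores.argmax())])             # best-scoring answer
```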

The Multilingual Corpus of World’s Constitutions (MCWC)
Mo El-Haj | Saad Ezzini

The “Multilingual Corpus of World’s Constitutions” (MCWC) serves as a valuable resource for the NLP community, offering a comprehensive collection of constitutions from around the world. Its focus on data quality and breadth of coverage enables advanced research in constitutional analysis, machine translation, and cross-lingual legal studies. The MCWC is prepared to ensure high quality and minimal noise, and it provides valuable mappings of constitutions to their respective countries and continents, facilitating comparative analysis. Notably, the corpus offers pairwise sentence alignments across languages, supporting machine translation experiments. We utilise a leading machine translation model, fine-tuned on the MCWC, to achieve accurate and context-aware translations, and we introduce an independent machine translation model as a comparative baseline. Fine-tuning on the MCWC improves accuracy, highlighting the significance of such a legal corpus for NLP and machine translation. The MCWC’s rich multilingual content and rigorous data quality standards raise the bar for legal text analysis and inspire innovation in the NLP community, opening new avenues for studying constitutional texts and multilingual data analysis.

TafsirExtractor: Text Preprocessing Pipeline preparing Classical Arabic Literature for Machine Learning Applications
Carl Kruse | Sajawel Ahmed

In this paper, we present a comprehensive tool for preprocessing Classical Arabic (CA) literature in the field of historical exegetical studies for machine learning (ML) evaluations. Most recent ML models require training data in a specific format (e.g., XML, TEI, CoNLL) before it can be used for ML applications such as Named Entity Recognition (NER) or Topic Modeling (TM). We report on how our method works and how it can be applied by other researchers with similar endeavors, demonstrating the importance of this preprocessing tool, a novel approach that has no predecessors for CA yet. We achieve results that enable the training of current ML models, leading to state-of-the-art performance for NER and TM on CA literature. We make our tool, along with its source code and data, freely available to the Natural Language Processing (NLP) research community.
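
As a sketch of the kind of output format involved, the snippet below writes token/tag pairs in two-column CoNLL style with blank lines between sentences; the tags and file name are placeholders, since the tool's exact schema is not reproduced here.

```python
# Minimal CoNLL-style writer sketch (placeholder tags and path).
def write_conll(sentences, path):
    """sentences: list of sentences, each a list of (token, tag) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for token, tag in sentence:
                f.write(f"{token}\t{tag}\n")
            f.write("\n")  # blank line marks the sentence boundary

write_conll([[("قال", "O"), ("الطبري", "B-PER")]], "tafsir.conll")
```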

Advancing the Arabic WordNet: Elevating Content Quality
Abed Alhakim Freihat | Hadi Mahmoud Khalilia | Gábor Bella | Fausto Giunchiglia

High-quality wordnets are crucial for achieving high-quality results in NLP applications that rely on such resources. However, the wordnets of most languages suffer from serious issues of correctness and completeness with respect to the words and word meanings they define, such as incorrect lemmas, missing glosses and example sentences, or an inadequate, Western-centric representation of the morphology and semantics of the language. Previous efforts have largely focused on increasing lexical coverage while ignoring other qualitative aspects. In this paper, we focus on the Arabic language and introduce a major revision of the Arabic WordNet that addresses multiple dimensions of lexico-semantic resource quality. As a result, we updated more than 58% of the synsets of the existing Arabic WordNet by adding missing information and correcting errors. To address issues of language diversity and untranslatability, we also extended the wordnet structure with two new elements: phrasets and lexical gaps.

Arabic Speech Recognition of zero-resourced Languages: A case of Shehri (Jibbali) Language
Norah A. Alrashoudi | Omar Said Alshahri | Hend Al-Khalifa

Many under-resourced languages lack the computational resources needed for automatic speech recognition (ASR) due to data scarcity. This makes developing accurate ASR models challenging. Shehri (Jibbali), spoken in Oman, lacks extensive annotated speech data. This paper aims to improve an ASR model for this under-resourced language. We collected a Shehri (Jibbali) speech corpus and applied transfer learning by fine-tuning pre-trained ASR models on this dataset. Specifically, models such as Wav2Vec2.0, HuBERT, and Whisper were fine-tuned using techniques such as parameter-efficient fine-tuning. Evaluation using word error rate (WER) and character error rate (CER) showed that the Whisper model fine-tuned on the Shehri (Jibbali) dataset significantly outperformed the other models, with the best result, a WER of 3.5%, achieved by Whisper-medium. This demonstrates the effectiveness of transfer learning for resource-constrained tasks and the strong zero-shot performance of pre-trained models.
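
The WER/CER evaluation mentioned here can be reproduced on any reference/hypothesis pairs with the jiwer package; the sentences below are placeholders.

```python
# Illustrative WER/CER computation with jiwer (placeholder sentences).
import jiwer

references = ["النص المرجعي الأول", "النص المرجعي الثاني"]
hypotheses = ["النص المرجعي الاول", "نص مرجعي ثان"]

print(f"WER: {jiwer.wer(references, hypotheses):.3f}")  # word error rate
print(f"CER: {jiwer.cer(references, hypotheses):.3f}")  # character error rate
```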

OSACT6 Dialect to MSA Translation Shared Task Overview
Ashraf Hatim Elneima | AhmedElmogtaba Abdelmoniem Ali Abdelaziz | Kareem Darwish

This paper presents the Dialectal Arabic (DA) to Modern Standard Arabic (MSA) Machine Translation (MT) shared task held at the sixth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6). The paper describes the creation of the validation and test data and the metrics used, and provides a brief overview of the submissions to the shared task. In all, 29 teams signed up and 6 teams made submissions. The teams used a variety of datasets and approaches to build their MT systems. The most successful submission used zero-shot and n-shot prompting of ChatGPT.

OSACT 2024 Task 2: Arabic Dialect to MSA Translation
Hanin Atwany | Nour Rabih | Ibrahim Mohammed | Abdul Waheed | Bhiksha Raj

We present the results of the shared task “Dialect to MSA Translation”, which tackles challenges posed by the diverse Arabic dialects in machine translation. Covering the Gulf, Egyptian, Levantine, Iraqi, and Maghrebi dialects, the task offers 1001 sentences in both MSA and dialect for fine-tuning, alongside 1888 blind test sentences. Leveraging GPT-3.5, a state-of-the-art language model, our method achieved a BLEU score of 29.61. This endeavor holds significant implications for Neural Machine Translation (NMT) systems targeting low-resource languages with linguistic variation. Additionally, negative experiments involving fine-tuning AraT5 and No Language Left Behind (NLLB) on the MADAR dataset resulted in BLEU scores of 10.41 and 11.96, respectively. Future directions include expanding the dataset to incorporate more Arabic dialects and exploring alternative NMT architectures to further enhance translation capabilities.
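
For reference, shared-task BLEU scores of this kind are conventionally computed at the corpus level; a small sacrebleu sketch follows, with placeholder sentences (the organizers' exact evaluation script is not shown here).

```python
# Corpus-level BLEU with sacrebleu (placeholder data).
import sacrebleu

hypotheses = ["ذهبت إلى السوق في الصباح"]     # system MSA outputs
references = [["ذهبت إلى السوق صباحا"]]       # one list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```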

ASOS at OSACT6 Shared Task: Investigation of Data Augmentation in Arabic Dialect-MSA Translation
Omer Nacar | Abdullah Alharbi | Serry Sibaee | Samar Ahmed | Lahouari Ghouti | Anis Koubaa

Translation between Modern Standard Arabic (MSA) and the various Arabic dialects presents unique challenges due to the significant linguistic, cultural, and contextual variations across the regions where Arabic is spoken. This paper presents a system description of our participation in the OSACT 2024 Dialect to MSA Translation Shared Task. We explain our comprehensive approach, which combines data augmentation using generative pre-trained transformer models (GPT-3.5 and GPT-4) with fine-tuning of AraT5 V2, a model specifically designed for Arabic translation tasks. Our methodology significantly expanded the training dataset, improving the model’s performance across five major Arabic dialects, namely Gulf, Egyptian, Levantine, Iraqi, and Maghrebi. We rigorously evaluated our approach using the BLEU score to assess translation accuracy, fluency, and preservation of meaning. Our results showcase the effectiveness of our refined models in addressing the challenges posed by diverse Arabic dialects and MSA, achieving a BLEU score of 80% on the validation set and 22.25% on the blind test set. However, while utilizing a larger dataset, such as MADAR + Dev, resulted in significantly higher validation BLEU scores, performance on the blind test set was relatively lower. This observation underscores the importance of dataset size in model training and reveals potential limitations in generalization to unseen data due to variations in data distribution and domain mismatches.

LLM-based MT Data Creation: Dialectal to MSA Translation Shared Task
AhmedElmogtaba Abdelmoniem Ali Abdelaziz | Ashraf Hatim Elneima | Kareem Darwish

This paper presents our approach to the Dialect to Modern Standard Arabic (MSA) Machine Translation shared task, conducted as part of the sixth Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6). Our primary contribution is the development of a novel dataset derived from the Saudi Audio Dataset for Arabic (SADA), an Arabic audio corpus. Using an automated method based on ChatGPT 3.5, we translated the dialectal Arabic texts into their MSA equivalents. This process not only yielded a unique and valuable dataset but also showcased an efficient method for leveraging language models in dataset generation. Using this dataset alongside additional resources, we trained a machine translation model based on the Transformer architecture. Through systematic experimentation with model configurations, we achieved notable improvements in translation quality. Our findings highlight the significance of LLM-assisted dataset creation methodologies and their impact on advancing machine translation systems, particularly for languages with considerable dialectal diversity like Arabic.
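
A minimal sketch of this kind of LLM-assisted data creation is shown below, using the OpenAI chat API; the prompt wording is an assumption, as the paper's exact prompt is not given in the abstract.

```python
# Sketch of LLM-based DA->MSA data creation (assumed prompt wording).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def to_msa(dialect_sentence: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the following dialectal Arabic sentence "
                        "into Modern Standard Arabic. Return only the translation."},
            {"role": "user", "content": dialect_sentence},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```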

Sirius_Translators at OSACT6 2024 Shared Task: Fine-tuning Ara-T5 Models for Translating Arabic Dialectal Text to Modern Standard Arabic
Salwa Saad Alahmari

This paper presents the findings from our participation in the 6th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT6) in 2024. Our focus was on the second task (Task 2), which involved sentence-level translation from five Dialectal Arabic (DA) varieties (Gulf, Egyptian, Levantine, Iraqi, and Maghrebi) into Modern Standard Arabic (MSA). Our team, Sirius_Translators, fine-tuned four AraT5 models, namely AraT5-base, AraT5v2-base-1024, AraT5-MSA-Small, and AraT5-MSA-Base, for the Arabic machine translation (MT) task. These models were fine-tuned on a variety of parallel corpora containing Dialectal Arabic and Modern Standard Arabic. Based on the evaluation results of the OSACT6 2024 Shared Task 2, our fine-tuned AraT5v2-base-1024 model achieved an overall BLEU score of 21.0 on the development (Dev) set and 9.57 on the test set.
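
For orientation, the sketch below loads the public AraT5v2-base-1024 checkpoint for DA-to-MSA generation; the hub id points to the base model, and the paper's fine-tuned weights are assumed not to be publicly available.

```python
# Illustrative AraT5v2 inference sketch (base checkpoint, not the
# authors' fine-tuned weights).
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

inputs = tokenizer("جملة باللهجة الخليجية", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```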

AraT5-MSAizer: Translating Dialectal Arabic to MSA
Murhaf Fares

This paper outlines the process of training the AraT5-MSAizer model, a transformer-based neural machine translation model aimed at translating five regional Arabic dialects into Modern Standard Arabic (MSA). Developed for Task 2 of the 6th Workshop on Open-Source Arabic Corpora and Processing Tools, the model attained a BLEU score of 21.79% on the test set associated with this task.

ASOS at Arabic LLMs Hallucinations 2024: Can LLMs detect their Hallucinations :)
Serry Taiseer Sibaee | Abdullah I. Alharbi | Samar Ahmed | Omar Nacar | Lahouri Ghouti | Anis Koubaa

This research delves into hallucination detection in Large Language Models (LLMs) using Arabic-language datasets. As LLMs are increasingly used in various applications, hallucination, the generation of factually inaccurate content despite grammatical coherence, poses significant challenges. We participated in the OSACT 2024 shared task on Detection of Hallucination in Arabic Factual Claims Generated by ChatGPT and GPT-4. We explore various approaches for detecting and mitigating hallucination, using models such as GPT-4, Mistral, and Gemini within a novel experimental framework. Our findings reveal that the effectiveness of these models in classifying claims into Fact-Claim, Fact-Improvement, and Non-Fact categories varies greatly, underscoring the complexities of addressing hallucination in morphologically rich languages. The study emphasizes the need for advanced modelling and training strategies to enhance the reliability and factual accuracy of LLM-generated content, laying the groundwork for future work on mitigating hallucination risks. In our experiments, we achieved an F1-score of 0.54 with GPT-4.
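
A hypothetical zero-shot prompt for the three-way claim classification described above might look like the following; the teams' actual prompts are not published in this abstract.

```python
# Hypothetical zero-shot claim-classification prompt (illustrative only).
LABELS = ["Fact-Claim", "Fact-Improvement", "Non-Fact"]

def build_prompt(claim: str) -> str:
    return (
        "You will be shown an Arabic factual claim generated by an LLM.\n"
        f"Classify it as exactly one of: {', '.join(LABELS)}.\n"
        f"Claim: {claim}\nLabel:"
    )

print(build_prompt("القاهرة هي عاصمة مصر"))
```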