Nizar Habash - ACL Anthology

Nizar Habash

2026

Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks
Chaimae Abouzahir | Congbo Ma | Nizar Habash | Farah E. Shamout
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)

In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis shows that model-reported confidence and explanations are poor indicators of correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.

Do Diacritics Matter? Evaluating the Impact of Arabic Diacritics on Tokenization and LLM Benchmarks
Go Inoue | Bashar Alhafni | Nizar Habash | Timothy Baldwin
Findings of the Association for Computational Linguistics: EACL 2026

Diacritics are orthographic marks added to letters to specify pronunciation, disambiguate lexical meanings, or indicate grammatical distinctions. Diacritics can significantly influence language processing tasks, especially in languages like Arabic, where diacritic usage varies widely across domains and contexts. While diacritics provide valuable linguistic information, their presence can increase subword fragmentation during tokenization, potentially degrading the performance of NLP models. In this paper, we systematically analyze the impact of diacritics on tokenization and benchmark task performance across major Large Language Models (LLMs). Our results demonstrate that while modern LLMs show robustness to the limited diacritics naturally found in texts, full diacritization leads to substantially increased token fragmentation and degraded performance, highlighting the need for careful handling of diacritics in the future development of Arabic LLMs.
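
As a rough illustration of why full diacritization lengthens the input a tokenizer sees, the sketch below counts and strips combining marks at the character level. It is not the paper's method and uses no LLM tokenizer. The helper names `strip_diacritics` and `diacritic_ratio` are hypothetical, and treating every Unicode nonspacing mark (category `Mn`) as a diacritic is a simplifying assumption.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop every nonspacing combining mark (Unicode category Mn).

    Assumption: Mn is used as a proxy for "Arabic diacritic"; it covers
    fatha, damma, kasra, shadda, sukun and the tanween marks, but also
    other combining marks that a careful pipeline might want to keep.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def diacritic_ratio(text: str) -> float:
    """Return the number of combining marks per letter in the text."""
    marks = sum(1 for ch in text if unicodedata.category(ch) == "Mn")
    letters = sum(1 for ch in text if unicodedata.category(ch).startswith("L"))
    return marks / letters if letters else 0.0


# "kataba" written with a fatha on each of its three letters.
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
plain = strip_diacritics(diacritized)

print(len(diacritized), len(plain))   # code points before and after stripping
print(diacritic_ratio(diacritized))   # marks per letter
```

A fully diacritized word can carry about one mark per letter, so the character sequence roughly doubles in length before any subword segmentation is applied.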

A Tale of Two Scripts: Transliteration and Post-Correction for Judeo-Arabic
Juan Moreno Gonzalez | Bashar Alhafni | Nizar Habash
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Judeo-Arabic refers to Arabic variants historically spoken by Jewish communities across the Arab world, primarily during the Middle Ages. Unlike standard Arabic, it is written in Hebrew script by Jewish writers and for Jewish audiences. Transliterating Judeo-Arabic into Arabic script is challenging due to ambiguous letter mappings, inconsistent orthographic conventions, and frequent code-switching into Hebrew. In this paper, we introduce a two-step approach to automatically transliterate Judeo-Arabic into Arabic script: simple character-level mapping followed by post-correction to address grammatical and orthographic errors. We also present the first benchmark evaluation of LLMs on this task. Finally, we show that transliteration enables Arabic NLP tools to perform morphosyntactic tagging and machine translation, which would not have been feasible on the original texts. We make our code and data publicly available.

Computational Benchmarks for Egyptian Arabic Child Directed Speech
Salam Khalifa | Abed Qaddoumi | Nizar Habash | Owen Rambow
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

We present AraBabyTalk-EGY, an enriched release of the Egyptian Arabic CHILDES corpus that opens the child-adult interactions genre to modern Arabic NLP research. Starting from the original CHILDES recordings and IPA transcriptions of caregiver-child sessions, we (i) map each IPA token to fully diacritized Arabic script, and (ii) add core part-of-speech tags and lemmas aligned with existing dialectal Arabic morphological resources. These layers yield ~26K annotated tokens suitable for both text- and speech-based NLP tasks. We provide a benchmark on morphological disambiguation and Arabic ASR. We outline lexical and morphosyntactic differences between AraBabyTalk-EGY and general Egyptian Arabic resources, highlighting the value of genre-specific training data for language acquisition studies and Arabic speech technology.

2025

NADI 2025: The First Multidialectal Arabic Speech Processing Shared Task
Bashar Talafha | Hawau Olamide Toyin | Peter Sullivan | AbdelRahim A. Elmadany | Abdurrahman Juma | Amirbek Djanibekov | Chiyu Zhang | Hamad Alshehhi | Hanan Aldarmaki | Mustafa Jarrar | Nizar Habash | Muhammad Abdul-Mageed
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present the findings of the sixth Nuanced Arabic Dialect Identification (NADI 2025) Shared Task, which focused on Arabic speech dialect processing across three subtasks: spoken dialect identification (Subtask 1), speech recognition (Subtask 2), and diacritic restoration for spoken dialects (Subtask 3). A total of 44 teams registered, and during the testing phase, 100 valid submissions were received from eight unique teams. The distribution was as follows: 34 submissions for Subtask 1 (five teams), 47 submissions for Subtask 2 (six teams), and 19 submissions for Subtask 3 (two teams). The best-performing systems achieved 79.8% accuracy on Subtask 1, 35.68/12.20 WER/CER (overall average) on Subtask 2, and 55/13 WER/CER on Subtask 3. These results highlight the ongoing challenges of Arabic dialect speech processing, particularly in dialect identification, recognition, and diacritic restoration. We also summarize the methods adopted by participating teams and briefly outline directions for future editions of NADI.
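
For readers unfamiliar with the WER/CER figures quoted above, the sketch below shows the standard definitions: Levenshtein distance between reference and hypothesis, divided by reference length, over words or characters. It is a generic illustration rather than the shared task's scoring script. The function names are hypothetical, and whitespace tokenization and counting spaces as characters are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference item
                            curr[j - 1] + 1,      # insert a hypothesis item
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes spaces count as characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("a b c d", "a x c"))  # one substitution and one deletion over four words
print(cer("abcd", "abxd"))      # one substitution over four characters
```

Both functions return fractions, so multiplying by 100 gives values on the same scale as the figures reported in the abstract.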

A Large and Balanced Corpus for Fine-grained Arabic Readability Assessment
Khalid N. Elmadani | Nizar Habash | Hanada Taha-Thomure
Findings of the Association for Computational Linguistics: ACL 2025

This paper introduces the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale, fine-grained dataset for Arabic readability assessment. BAREC consists of 69,441 sentences spanning 1+ million words, carefully curated to cover 19 readability levels, from kindergarten to postgraduate comprehension. The corpus balances genre diversity, topical coverage, and target audiences, offering a comprehensive resource for evaluating Arabic text complexity. The corpus was fully manually annotated by a large team of annotators. The average pairwise inter-annotator agreement, measured by Quadratic Weighted Kappa, is 81.8%, reflecting a high level of substantial agreement. Beyond presenting the corpus, we benchmark automatic readability assessment across different granularity levels, comparing a range of techniques. Our results highlight the challenges and opportunities in Arabic readability modeling, demonstrating competitive performance across various methods. To support research and education, we make BAREC openly available, along with detailed annotation guidelines and benchmark results: http://barec.camel-lab.com.
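
The agreement figure above uses Quadratic Weighted Kappa (QWK), which penalizes a disagreement by the squared distance between the two assigned levels. The following is a minimal sketch of the standard QWK formula for a single pair of annotators. It is not the BAREC project's evaluation code. The function name is hypothetical, and encoding levels as integers from 0 to `num_levels - 1` is an assumption.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """QWK between two equally long lists of integer labels in [0, num_levels)."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two levels")

    n = len(rater_a)
    # observed[i][j]: items labeled i by rater A and j by rater B
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(row[j] for row in observed) for j in range(num_levels)]

    max_sq = (num_levels - 1) ** 2
    disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / max_sq
            disagreement += weight * observed[i][j]
            # expected count if the two raters labeled independently
            expected_disagreement += weight * hist_a[i] * hist_b[j] / n

    if expected_disagreement == 0.0:
        # Both raters used the same single level for every item.
        return 1.0
    return 1.0 - disagreement / expected_disagreement


print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], num_levels=3))
print(quadratic_weighted_kappa([0, 1, 2], [2, 1, 0], num_levels=3))
```

The function returns a value of at most 1, so a result of 0.818 would correspond to the 81.8% reported in the abstract. An average pairwise figure would be obtained by averaging this value over annotator pairs.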

Evaluating Prompt Relevance in Arabic Automatic Essay Scoring: Insights from Synthetic and Real-World Data
Chatrine Qwaider | Kirill Chirkunov | Bashar Alhafni | Nizar Habash | Ted Briscoe
Proceedings of The Third Arabic Natural Language Processing Conference

Prompt relevance is a critical yet underexplored dimension in Arabic Automated Essay Scoring (AES). We present the first systematic study of binary prompt-essay relevance classification, supporting both AES scoring and dataset annotation. To address data scarcity, we built a synthetic dataset of on-topic and off-topic pairs and evaluated multiple models, including threshold-based classifiers, SVMs, causal LLMs, and a fine-tuned masked SBERT model. For real-data evaluation, we combined QAES with ZAEBUC, creating off-topic pairs via mismatched prompts. We also tested prompt expansion strategies using AraVec, CAMeL, and GPT-4o. Our fine-tuned SBERT achieved 98% F1 on synthetic data and strong results on QAES+ZAEBUC, outperforming SVMs and threshold-based baselines and offering a resource-efficient alternative to LLMs. This work establishes the first benchmark for Arabic prompt relevance and provides practical strategies for low-resource AES.

From Multiple-Choice to Extractive QA: A Case Study for English and Arabic
Teresa Lynn | Malik H. Altakrori | Samar M. Magdy | Rocktim Jyoti Das | Chenyang Lyu | Mohamed Nasr | Younes Samih | Kirill Chirkunov | Alham Fikri Aji | Preslav Nakov | Shantanu Godbole | Salim Roukos | Radu Florian | Nizar Habash
Proceedings of the 31st International Conference on Computational Linguistics

The rapid evolution of Natural Language Processing (NLP) has favoured major languages such as English, leaving a significant gap for many others due to limited resources. This is especially evident in the context of data annotation, a task whose importance cannot be overestimated, but which is time-consuming and costly. Thus, any dataset for resource-poor languages is precious, in particular when it is task-specific. Here, we explore the feasibility of repurposing an existing multilingual dataset for a new NLP task: we repurpose a subset of the BELEBELE dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA), to enable the more practical task of extractive QA (EQA) in the style of machine reading comprehension. We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA). We also present QA evaluation results for several monolingual and cross-lingual QA pairs including English, MSA, and five Arabic dialects. We aim to help others adapt our approach for the remaining 120 BELEBELE language variants, many of which are deemed under-resourced. We also provide a thorough analysis and share insights to deepen understanding of the challenges and opportunities in NLP task reformulation.

Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study
Bashar Alhafni | Nizar Habash
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Text editing frames grammatical error correction (GEC) as a sequence tagging problem, where edit tags are assigned to input tokens, and applying these edits results in the corrected text. This approach has gained attention for its efficiency and interpretability. However, while extensively explored for English, text editing remains largely underexplored for morphologically rich languages like Arabic. In this paper, we introduce a text editing approach that derives edit tags directly from data, eliminating the need for language-specific edits. We demonstrate its effectiveness on Arabic, a diglossic and morphologically rich language, and investigate the impact of different edit representations on model performance. Our approach achieves SOTA results on two Arabic GEC benchmarks and performs on par with SOTA on two others. Additionally, our models are over six times faster than existing Arabic GEC systems, making our approach more practical for real-world applications. Finally, we explore ensemble models, demonstrating how combining different models leads to further performance improvements. We make our code, data, and pretrained models publicly available.

Proceedings of the first International Workshop on Nakba Narratives as Language Resources
Mustafa Jarrar | Nizar Habash | Mo El-Haj | Amal Haddad Haddad | Zeina Jallad | Camille Mansour | Diana Allan | Paul Rayson | Tymaa Hammouda | Sanad Malaysha
Proceedings of the first International Workshop on Nakba Narratives as Language Resources

Data Augmentation for Maltese NLP using Transliterated and Machine Translated Arabic Data
Kurt Micallef | Nizar Habash | Claudia Borg
Findings of the Association for Computational Linguistics: EMNLP 2025

Maltese is a unique Semitic language that has evolved under extensive influence from Romance and Germanic languages, particularly Italian and English. Despite its Semitic roots, its orthography is based on the Latin script, creating a gap between it and its closest linguistic relatives in Arabic. In this paper, we explore whether Arabic-language resources can support Maltese natural language processing (NLP) through cross-lingual augmentation techniques. We investigate multiple strategies for aligning Arabic textual data with Maltese, including various transliteration schemes and machine translation (MT) approaches. As part of this, we also introduce novel transliteration systems that better represent Maltese orthography. We evaluate the impact of these augmentations on monolingual and multilingual models and demonstrate that Arabic-based augmentation can significantly benefit Maltese NLP tasks.

Guidelines for Fine-grained Sentence-level Arabic Readability Annotation
Nizar Habash | Hanada Taha-Thomure | Khalid N. Elmadani | Zeina Zeino | Abdallah Abushmaes
Proceedings of the 19th Linguistic Annotation Workshop (LAW-XIX-2025)

This paper presents the annotation guidelines of the Balanced Arabic Readability Evaluation Corpus (BAREC), a large-scale resource for fine-grained sentence-level readability assessment in Arabic. BAREC includes 69,441 sentences (1M+ words) labeled across 19 levels, from kindergarten to postgraduate. Based on the Taha/Arabi21 framework, the guidelines were refined through iterative training with native Arabic-speaking educators. We highlight key linguistic, pedagogical, and cognitive factors in determining readability and report high inter-annotator agreement: Quadratic Weighted Kappa 81.8% (substantial/excellent agreement) in the last annotation phase. We also benchmark automatic readability models across multiple classification granularities (19-, 7-, 5-, and 3-level). The corpus and guidelines are publicly available: http://barec.camel-lab.com.

Enhancing Arabic Automated Essay Scoring with Synthetic Data and Error Injection
Chatrine Qwaider | Bashar Alhafni | Kirill Chirkunov | Nizar Habash | Ted Briscoe
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Automated Essay Scoring (AES) plays a crucial role in assessing language learners’ writing quality, reducing grading workload, and providing real-time feedback. The lack of annotated essay datasets inhibits the development of Arabic AES systems. This paper leverages Large Language Models (LLMs) and Transformer models to generate synthetic Arabic essays for AES. We prompt an LLM to generate essays across the Common European Framework of Reference (CEFR) proficiency levels and introduce and compare two approaches to error injection. We create a dataset of 3,040 annotated essays with errors injected using our two methods. Additionally, we develop a BERT-based Arabic AES system calibrated to CEFR levels. Our experimental results demonstrate the effectiveness of our synthetic dataset in improving Arabic AES performance. We make our code and data publicly available.

Lemmatization as a Classification Task: Results from Arabic across Multiple Genres
Mostafa Saeed | Nizar Habash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Lemmatization is crucial for NLP tasks in morphologically rich languages with ambiguous orthography like Arabic, but existing tools face challenges due to inconsistent standards and limited genre coverage. This paper introduces two novel approaches that frame lemmatization as classification into a Lemma-POS-Gloss (LPG) tagset, leveraging machine translation and semantic clustering. We also present a new Arabic lemmatization test set covering diverse genres, standardized alongside existing datasets. We evaluate character-level sequence-to-sequence models, which perform competitively and offer complementary value, but are limited to lemma prediction (not LPG) and prone to hallucinating implausible forms. Our results show that classification and clustering yield more robust, interpretable outputs, setting new benchmarks for Arabic lemmatization.

BAREC Demo: Resources and Tools for Sentence-level Arabic Readability Assessment
Kinda Altarbouch | Khalid N. Elmadani | Ossama Obeid | Hanada Taha | Nizar Habash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present BAREC Demo, a web-based system for fine-grained, sentence-level Arabic readability assessment. The demo is part of the Balanced Arabic Readability Evaluation Corpus (BAREC) project, which manually annotated 69,000 sentences (over one million words) from diverse genres and domains using a 19-level readability scale inspired by the Taha/Arabi21 framework, covering reading abilities from kindergarten to postgraduate levels. The project also developed models for automatic readability assessment. The demo provides two main functionalities for educators, content creators, language learners, and researchers: (1) a Search interface to explore the annotated dataset for text selection and resource development, and (2) an Analyze interface, which uses trained models to assign detailed readability labels to Arabic texts at the sentence level. The system and all of its resources are accessible at https://barec.camel-lab.com.

We present the GenAI Content Detection Task 1 – a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams – to the Multilingual. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions.

Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)
Firoj Alam | Preslav Nakov | Nizar Habash | Iryna Gurevych | Shammur Chowdhury | Artem Shelmanov | Yuxia Wang | Ekaterina Artemova | Mucahid Kutlu | George Mikros
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)

Proper Noun Diacritization for Arabic Wikipedia: A Benchmark Dataset
Rawan Bondok | Mayar Nassar | Salam Khalifa | Kurt Micallef | Nizar Habash
Proceedings of the 2nd Workshop on Advancing Natural Language Processing for Wikipedia (WikiNLP 2025)

Proper nouns in Arabic Wikipedia are frequently undiacritized, creating ambiguity in pronunciation and interpretation, especially for transliterated named entities of foreign origin. While transliteration and diacritization have been well-studied separately in Arabic NLP, their intersection remains underexplored. In this paper, we introduce a new manually diacritized dataset of Arabic proper nouns of various origins with their English Wikipedia equivalent glosses, and present the challenges and guidelines we followed to create it. We benchmark GPT-4o on the task of recovering full diacritization given the undiacritized Arabic and English forms, and analyze its performance. Achieving 73% accuracy, our results underscore both the difficulty of the task and the need for improved models and resources. We release our dataset to facilitate further research on Arabic Wikipedia proper noun diacritization.

The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR
Injy Hamed | Thang Vu | Nizar Habash
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching

Code-switching, the act of alternating between languages, has emerged as a prevalent global phenomenon that needs to be addressed for building user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, motivating research in the direction of code-switched data augmentation. However, current literature lacks comprehensive studies that enable us to understand the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research conducted in this direction on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test generalizability of findings. Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. Based on the results of MT, ASR, and ST, we draw conclusions and insights regarding the efficacy of various augmentation techniques and the impact of quality on performance.

A Survey of Code-switched Arabic NLP: Progress, Challenges, and Future Directions
Injy Hamed | Caroline Sabty | Slim Abdennadher | Ngoc Thang Vu | Thamar Solorio | Nizar Habash
Proceedings of the 31st International Conference on Computational Linguistics

Language in the Arab world presents a complex diglossic and multilingual setting, involving the use of Modern Standard Arabic, various dialects and sub-dialects, as well as multiple European languages. This diverse linguistic landscape has given rise to code-switching, both within Arabic varieties and between Arabic and foreign languages. The widespread occurrence of code-switching across the region makes it vital to address these linguistic needs when developing language technologies. In this paper, we provide a review of the current literature in the field of code-switched Arabic NLP, offering a broad perspective on ongoing efforts, challenges, research gaps, and recommendations for future research directions.

AraHealthQA 2025: The First Shared Task on Arabic Health Question Answering
Hassan Alhuzali | Walid Al-Eisawi | Muhammad Abdul-Mageed | Chaimae Abouzahir | Mouath Abu-Daoud | Ashwag Alasmari | Renad Al-Monef | Ali Alqahtani | Lama Ayash | Leen Kharouf | Farah E. Shamout | Nizar Habash
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We introduce AraHealthQA 2025, the Comprehensive Arabic Health Question Answering Shared Task, held in conjunction with ArabicNLP 2025 co-located with EMNLP 2025. This shared task addresses the paucity of high-quality Arabic medical QA resources by offering two complementary tracks: MentalQA, focusing on Arabic mental health Q&A (e.g., anxiety, depression, stigma reduction), and MedArabiQ, covering broader medical domains such as internal medicine, pediatrics, and clinical decision making. Each track comprises multiple subtasks, evaluation datasets, and standardized metrics, facilitating fair benchmarking. The task was structured to promote modeling under realistic, multilingual, and culturally nuanced healthcare contexts. We outline the dataset creation, task design and evaluation framework, participation statistics, and baseline systems, and summarize the overall outcomes. We conclude with reflections on the performance trends observed and prospects for future iterations in Arabic health QA.

ARWI: Arabic Write and Improve
Kirill Chirkunov | Bashar Alhafni | Chatrine Qwaider | Nizar Habash | Ted Briscoe
Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)

Although Arabic is spoken by over 400 million people, advanced Arabic writing assistance tools remain limited. To address this gap, we present ARWI, a new writing assistant that helps learners improve essay writing in Modern Standard Arabic. ARWI is the first publicly available Arabic writing assistant to include a prompt database for different proficiency levels, an Arabic text editor, state-of-the-art grammatical error detection and correction, and automated essay scoring aligned with the Common European Framework of Reference standards for language attainment (https://arwi.mbzuai.ac.ae/). Moreover, ARWI can be used to gather a growing auto-annotated corpus, facilitating further research on Arabic grammar correction and essay scoring, as well as profiling patterns of errors made by native speakers and non-native learners. A preliminary user study shows that ARWI provides actionable feedback, helping learners identify grammatical gaps, assess language proficiency, and guide improvement.

BAREC Shared Task 2025 on Arabic Readability Assessment
Khalid N. Elmadani | Bashar Alhafni | Hanada Taha | Nizar Habash
Proceedings of The Third Arabic Natural Language Processing Conference: Shared Tasks

We present the results and findings of the BAREC Shared Task 2025 on Arabic Readability Assessment, organized as part of The Third Arabic Natural Language Processing Conference (ArabicNLP 2025). The BAREC 2025 shared task focuses on automatic readability assessment using the BAREC Corpus, addressing fine-grained classification into 19 readability levels. The shared task includes two sub-tasks, sentence-level classification and document-level classification, and three tracks: (1) Strict Track, where only the BAREC Corpus is allowed; (2) Constrained Track, restricted to the BAREC Corpus, SAMER Corpus, and SAMER Lexicon; and (3) Open Track, allowing any external resources. A total of 22 teams from 12 countries registered for the task. Among these, 17 teams submitted system description papers. The winning team achieved 87.5 QWK on the sentence-level task and 87.4 QWK on the document-level task.

The Arabic Generality Score: Another Dimension of Modeling Arabic Dialectness
Sanad Sha’ban | Nizar Habash
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Arabic dialects form a diverse continuum, yet NLP models often treat them as discrete categories. Recent work addresses this issue by modeling dialectness as a continuous variable, notably through the Arabic Level of Dialectness (ALDi). However, ALDi reduces complex variation to a single dimension. We propose a complementary measure: the Arabic Generality Score (AGS), which quantifies how widely a word is used across dialects. We introduce a pipeline that combines word alignment, etymology-aware edit distance, and smoothing to annotate a parallel corpus with word-level AGS. A regression model is then trained to predict AGS in context. Our approach outperforms strong baselines, including state-of-the-art dialect ID systems, on a multi-dialect benchmark. AGS offers a scalable, linguistically grounded way to model lexical generality, enriching representations of Arabic dialectness. Code is publicly available at https://github.com/CAMeL-Lab/arabic-generality-score.

Radical Allomorphy: Phonological Surface Forms without Phonology
Salam Khalifa | Nizar Habash | Owen Rambow
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent computational work typically frames morphophonology as generating surface forms (SFs) from abstract underlying representations (URs) by applying phonological rules or constraints. This generative stance presupposes that every morpheme has a well-defined UR from which all allomorphs can be derived, a theory-laden assumption that is expensive to annotate, especially in low-resource settings. We adopt an alternative view. Allomorphs and their phonological variants are treated as the basic, observed lexicon, not as outputs of abstract URs. The modeling task therefore shifts from deriving SFs to selecting the correct SF, given a meaning and a phonological context. This discriminative formulation removes the need to posit or label URs and lets the model exploit the surface evidence directly.

A Derivational ChainBank for Modern Standard Arabic
Reham Marzouk | Sondos Krouna | Nizar Habash
Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script

We introduce the new concept of an Arabic Derivational Chain Bank (CHAINBANK) to leverage the relationship between form and meaning in modeling Arabic derivational morphology. We constructed a knowledge graph network of abstract patterns and their derivational relations, and aligned it with the lemmas of the CAMELMORPH morphological analyzer database. This process produced chains of derived words’ lemmas linked to their corresponding lemma bases through derivational relations, encompassing 23,333 derivational connections. The CHAINBANK is publicly available.

Beyond Cairo: Sa’idi Egyptian Arabic Literary Corpus Construction and Analysis
Mai Mohamed Eida | Nizar Habash
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

Egyptian Arabic (EA) NLP resources have mainly focused on Cairene Egyptian Arabic (CEA), leaving sub-dialects like Sa’idi Egyptian Arabic (SEA) underrepresented. This paper introduces the first SEA corpus – an open-source, 4-million-word literary dataset of a dialect spoken by ~30 million Egyptians. To validate its representation, we analyze SEA-specific linguistic features from dialectal surveys, confirming a higher prevalence in our corpus compared to existing EA datasets. Our findings offer insights into SEA’s orthographic representation in morphology, phonology, and lexicon, incorporating CODA* guidelines for normalization.

BALSAM: A Platform for Benchmarking Arabic Large Language Models
Rawan Al-Matham | Kareem Darwish | Raghad Al-Rasheed | Waad Alshammari | Muneera Alhoshan | Amal Almazrua | Asma Al Wazrah | Mais Alheraki | Firoj Alam | Preslav Nakov | Norah Alzahrani | Eman AlBilali | Nizar Habash | Abdelrahman El-Sheikh | Muhammad Elmallah | Haonan Li | Hamdy Mubarak | Mohamed Anwar | Zaid Alyafeai | Ahmed Abdelali | Nora Altwairesh | Maram Hasanain | Abdulmohsen Al Thubaity | Shady Shehata | Bashar Alhafni | Injy Hamed | Go Inoue | Khalid Elmadani | Ossama Obeid | Fatima Haouari | Tamer Elsayed | Emad Alghamdi | Khalid Almubarak | Saied Alshahrani | Ola Aljarrah | Safa Alajlan | Areej Alshaqarawi | Maryam Alshihri | Sultana Alghurabi | Atikah Alzeghayer | Afrah Altamimi | Abdullah Alfaifi | Abdulrahman AlOsaimy
Proceedings of The Third Arabic Natural Language Processing Conference

The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, due to data scarcity, linguistic diversity of Arabic and its dialects, morphological complexity, etc. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.

Lemmatizing Dialectal Arabic with Sequence-to-Sequence Models
Mostafa Saeed | Nizar Habash
Proceedings of The Third Arabic Natural Language Processing Conference

Lemmatization for dialectal Arabic poses many challenges due to the lack of orthographic standards and limited morphological analyzers. This work explores the effectiveness of Seq2Seq models for lemmatizing dialectal Arabic, both without analyzers and with their integration. We assess how well these models generalize across dialects and benefit from related varieties. Focusing on Egyptian, Gulf, and Levantine dialects with varying resource levels, our analysis highlights both the potential and limitations of data-driven approaches. The proposed method achieves significant gains over baselines, performing well in both low-resource and dialect-rich scenarios.

2024

The ease of access to large language models (LLMs) has enabled the widespread proliferation of machine-generated texts, and now it is often hard to tell whether a piece of text was human-written or machine-generated. This raises concerns about potential misuse, particularly within educational and academic domains. Thus, it is important to develop practical systems that can automate the detection process. Here, we present one such system, LLM-DetectAIve, designed for fine-grained detection. Unlike most previous work on machine-generated text detection, which focused on binary classification, LLM-DetectAIve supports four categories: (i) human-written, (ii) machine-generated, (iii) machine-written, then machine-humanized, and (iv) human-written, then machine-polished. Category (iii) aims to detect attempts to obfuscate the fact that a text was machine-generated, while category (iv) looks for cases where the LLM was used to polish a human-written text, which is typically acceptable in academic writing, but not in education. Our experiments show that LLM-DetectAIve can effectively identify the above four categories, which makes it a potentially useful tool in education, academia, and other domains. LLM-DetectAIve is publicly accessible at https://github.com/mbzuai-nlp/LLM-DetectAIve. The video describing our system is available at https://youtu.be/E8eT_bE7k8c.

Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching
Kurt Micallef | Nizar Habash | Claudia Borg | Fadhl Eryani | Houda Bouamor
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Although multilingual language models exhibit impressive cross-lingual transfer capabilities on unseen languages, the performance on downstream tasks is impacted when there is a script disparity with the languages used in the multilingual model’s pre-training data. Using transliteration offers a straightforward yet effective means to align the script of a resource-rich language with a target language thereby enhancing cross-lingual transfer capabilities. However, for mixed languages, this approach is suboptimal, since only a subset of the language benefits from the cross-lingual transfer while the remainder is impeded. In this work, we focus on Maltese, a Semitic language, with substantial influences from Arabic, Italian, and English, and notably written in Latin script. We present a novel dataset annotated with word-level etymology. We use this dataset to train a classifier that enables us to make informed decisions regarding the appropriate processing of each token in the Maltese language. We contrast indiscriminate transliteration or translation to mixing processing pipelines that only transliterate words of Arabic origin, thereby resulting in text with a mixture of scripts. We fine-tune the processed data on four downstream tasks and show that conditional transliteration based on word etymology yields the best results, surpassing fine-tuning with raw Maltese or Maltese processed with non-selective pipelines.

Arabic Diacritics in the Wild: Exploiting Opportunities for Improved Diacritization
Salman Elgamal | Ossama Obeid | Mhd Kabbani | Go Inoue | Nizar Habash
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The widespread absence of diacritical marks in Arabic text poses a significant challenge for Arabic natural language processing (NLP). This paper explores instances of naturally occurring diacritics, referred to as “diacritics in the wild,” to unveil patterns and latent information across six diverse genres: news articles, novels, children’s books, poetry, political documents, and ChatGPT outputs. We present a new annotated dataset that maps real-world partially diacritized words to their maximal full diacritization in context. Additionally, we propose extensions to the analyze-and-disambiguate approach in Arabic NLP to leverage these diacritics, resulting in notable improvements. Our contributions encompass a thorough analysis, valuable datasets, and an extended diacritization algorithm. We release our code and datasets as open source.

EMAD: A Bridge Tagset for Unifying Arabic POS Annotations
Omar Kallas | Go Inoue | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

There have been many attempts to model the morphological richness and complexity of Arabic, leading to numerous Part-of-Speech (POS) tagsets that differ in terms of (a) which morphological features they represent, (b) how they represent them, and (c) the degree of specification of said features. Tagset granularity plays an important role in determining how annotated data can be used and for what applications. Due to the diversity among existing tagsets, many annotated corpora for Arabic cannot be easily combined, which exacerbates the Arabic resource poverty situation. In this work, we propose an intermediate tagset designed to facilitate the conversion and unification of different tagsets used to annotate Arabic corpora. This new tagset acts as a bridge between different annotation schemes, simplifying the integration of annotated corpora and promoting collaboration across the projects using them.

M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
Yuxia Wang | Jonibek Mansurov | Petar Ivanov | Jinyan Su | Artem Shelmanov | Akim Tsvigun | Osama Mohammed Afzal | Tarek Mahmoud | Giovanni Puccetti | Thomas Arnold | Alham Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing a new benchmark based on a multilingual, multi-domain and multi-generator corpus of MGTs: M4GT-Bench. The benchmark comprises three tasks: (1) mono-lingual and multi-lingual binary MGT detection; (2) multi-way detection, where one needs to identify which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires access to the training data from the same domain and generators. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.

NADI 2024: The Fifth Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | Amr Keleg | AbdelRahim Elmadany | Chiyu Zhang | Injy Hamed | Walid Magdy | Houda Bouamor | Nizar Habash
Proceedings of the Second Arabic Natural Language Processing Conference

We describe the findings of the fifth Nuanced Arabic Dialect Identification Shared Task (NADI 2024). NADI’s objective is to help advance SoTA Arabic NLP by providing guidance, datasets, modeling opportunities, and standardized evaluation conditions that allow researchers to collaboratively compete on prespecified tasks. NADI 2024 targeted dialect identification cast as a multi-label task (Subtask 1), identification of the Arabic level of dialectness (Subtask 2), and dialect-to-MSA machine translation (Subtask 3). A total of 51 unique teams registered for the shared task, of whom 12 teams have participated (with 76 valid submissions during the test phase). Among these, three teams participated in Subtask 1, three in Subtask 2, and eight in Subtask 3. The winning teams achieved 50.57 F1 on Subtask 1, 0.1403 RMSE for Subtask 2, and 20.44 BLEU in Subtask 3, respectively. Results show that Arabic dialect processing tasks such as dialect identification and machine translation remain challenging. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

Palmyra 3.0: A User-Friendly Cloud-Based Platform for Morphology and Dependency Syntax Annotation
Muhammed AbuOdeh | Long Phan | Ahmed Farouk Zakaria Elshabrawy | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present Palmyra 3.0, a cloud-based, configurable, and user-friendly platform for morphology and syntax annotation through dependency-tree visualization. Palmyra 3.0 implements a robust system that stores data on the cloud. By default, Palmyra 3.0 comes with an Arabic dependency parser that generates highly accurate trees, but it is easily configurable to support dependency parsers in other languages. Palmyra 3.0 provides default configuration files for a number of predefined formalisms, such as UD and CATiB, and a number of user-friendly features to support annotators.

Exploiting Dialect Identification in Automatic Dialectal Text Normalization
Bashar Alhafni | Sarah Al-Towaity | Ziyad Fawzy | Fatema Nassar | Fadhl Eryani | Houda Bouamor | Nizar Habash
Proceedings of the Second Arabic Natural Language Processing Conference

Dialectal Arabic is the primary spoken language used by native Arabic speakers in daily communication. The rise of social media platforms has notably expanded its use as a written language. However, Arabic dialects do not have standard orthographies. This, combined with the inherent noise in user-generated content on social media, presents a major challenge to NLP applications dealing with Dialectal Arabic. In this paper, we explore and report on the task of CODAfication, which aims to normalize Dialectal Arabic into the Conventional Orthography for Dialectal Arabic (CODA). We work with a unique parallel corpus of multiple Arabic dialects focusing on five major city dialects. We benchmark newly developed pretrained sequence-to-sequence models on the task of CODAfication. We further show that using dialect identification information improves the performance across all dialects. We make our code, data, and pretrained models publicly available.

HelloThere: A Corpus of Annotated Dialogues and Knowledge Bases of Time-Offset Avatars
Alberto Chierici | Nizar Habash
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

A Time-Offset Interaction Application (TOIA) is a software system that allows people to engage in face-to-face dialogue with previously recorded videos of other people. There are two TOIA usage modes: (a) creation mode, where users pre-record video snippets of themselves representing their answers to possible questions someone may ask them, and (b) interaction mode, where other users of the system can choose to interact with created avatars. This paper presents the HelloThere corpus that has been collected from two user studies involving several people who recorded avatars and many more who engaged in dialogues with them. The interactions with avatars are annotated by people asking them questions through three modes (card selection, text search, and voice input) and rating the appropriateness of their answers on a 1 to 5 scale. The corpus, made available to the research community, comprises 26 avatars’ knowledge bases and 317 dialogues between 64 interrogators and the avatars in text format.

Computational Morphology and Lexicography Modeling of Modern Standard Arabic Nominals
Christian Khairallah | Reham Marzouk | Salam Khalifa | Mayar Nassar | Nizar Habash
Findings of the Association for Computational Linguistics: EACL 2024

Modern Standard Arabic (MSA) nominals present many morphological and lexical modeling challenges that have not been consistently addressed previously. This paper attempts to define the space of such challenges, and leverage a recently proposed morphological framework to build a comprehensive and extensible model for MSA nominals. Our model design addresses the nominals’ intricate morphotactics, as well as their paradigmatic irregularities. Our implementation showcases enhanced accuracy and consistency compared to a commonly used MSA morphological analyzer and generator. We make our models publicly available.

Strategies for Arabic Readability Modeling
Juan Liberato | Bashar Alhafni | Muhamed Khalil | Nizar Habash
Proceedings of the Second Arabic Natural Language Processing Conference

Automatic readability assessment is relevant to building NLP applications for education, content analysis, and accessibility. However, Arabic readability assessment is a challenging task due to Arabic’s morphological richness and limited readability resources. In this paper, we present a set of experimental results on Arabic readability assessment using a diverse range of approaches, from rule-based methods to Arabic pretrained language models. We report our results on a newly created corpus at different textual granularity levels (words and sentence fragments). Our results show that combining different techniques yields the best results, achieving an overall macro F1 score of 86.7 at the word level and 87.9 at the fragment level on a blind test set. We make our code, data, and pretrained models publicly available.

The SAMER Arabic Text Simplification Corpus
Bashar Alhafni | Reem Hazim | Juan David Pineros Liberato | Muhamed Al Khalil | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the SAMER Corpus, the first manually annotated Arabic parallel corpus for text simplification targeting school-aged learners. Our corpus comprises texts of 159K words selected from 15 publicly available Arabic fiction novels most of which were published between 1865 and 1955. Our corpus includes readability level annotations at both the document and word levels, as well as two simplified parallel versions for each text targeting learners at two different readability levels. We describe the corpus selection process, and outline the guidelines we followed to create the annotations and ensure their quality. Our corpus is publicly available to support and encourage research on Arabic text simplification, Arabic automatic readability assessment, and the development of Arabic pedagogical language technologies.

Camel Morph MSA: A Large-Scale Open-Source Morphological Analyzer for Modern Standard Arabic
Christian Khairallah | Salam Khalifa | Reham Marzouk | Mayar Nassar | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present Camel Morph MSA, the largest open-source Modern Standard Arabic morphological analyzer and generator. Camel Morph MSA has over 100K lemmas, and includes rarely modeled morphological features of Modern Standard Arabic with Classical Arabic origins. Camel Morph MSA can produce ∼1.45B analyses and ∼535M unique diacritizations, almost an order of magnitude larger than SAMA (Maamouri et al., 2010c), in addition to having ∼36% less OOV rate than SAMA on a 10B word corpus. Furthermore, Camel Morph MSA fills the gaps of many lemma paradigms by modeling linguistic phenomena consistently. Camel Morph MSA seamlessly integrates with the Camel Tools Python toolkit (Obeid et al., 2020), ensuring ease of use and accessibility.

ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
Fajri Koto | Haonan Li | Sara Shatnawi | Jad Doughman | Abdelrahman Sadallah | Aisha Alraeesi | Khalid Almubarak | Zaid Alyafeai | Neha Sengupta | Shady Shehata | Nizar Habash | Preslav Nakov | Timothy Baldwin
Findings of the Association for Computational Linguistics: ACL 2024

The focus of language model evaluation has transitioned towards reasoning and knowledge-intensive tasks, driven by advancements in pretraining large models. While state-of-the-art models are partially trained on large Arabic texts, evaluating their performance in Arabic remains challenging due to the limited availability of relevant datasets. To bridge this gap, we present ArabicMMLU, the first multi-task language understanding benchmark for the Arabic language, sourced from school exams across diverse educational levels in different countries spanning North Africa, the Levant, and the Gulf regions. Our data comprises 40 tasks and 14,575 multiple-choice questions in Modern Standard Arabic (MSA) and is carefully constructed by collaborating with native speakers in the region. Our comprehensive evaluations of 35 models reveal substantial room for improvement, particularly among the best open-source models. Notably, BLOOMZ, mT0, LLama2, and Falcon struggle to achieve a score of 50%, while even the top-performing Arabic-centric model only achieves a score of 62.3%.

ZAEBUC-Spoken: A Multilingual Multidialectal Arabic-English Speech Corpus
Injy Hamed | Fadhl Eryani | David Palfreyman | Nizar Habash
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present ZAEBUC-Spoken, a multilingual multidialectal Arabic-English speech corpus. The corpus comprises twelve hours of Zoom meetings involving multiple speakers role-playing a work situation where Students brainstorm ideas for a certain topic and then discuss it with an Interlocutor. The meetings cover different topics and are divided into phases with different language setups. The corpus presents a challenging set for automatic speech recognition (ASR), including two languages (Arabic and English) with Arabic spoken in multiple variants (Modern Standard Arabic, Gulf Arabic, and Egyptian Arabic) and English used with various accents. Adding to the complexity of the corpus, there is also code-switching between these languages and dialects. As part of our work, we take inspiration from established sets of transcription guidelines to present a set of guidelines handling issues of conversational speech, code-switching and orthography of both languages. We further enrich the corpus with two layers of annotations; (1) dialectness level annotation for the portion of the corpus where mixing occurs between different variants of Arabic, and (2) automatic morphological annotations, including tokenization, lemmatization, and part-of-speech tagging.

The FIGNEWS Shared Task on News Media Narratives
Wajdi Zaghouani | Mustafa Jarrar | Nizar Habash | Houda Bouamor | Imed Zitouni | Mona Diab | Samhaa El-Beltagy | Muhammed AbuOdeh
Proceedings of the Second Arabic Natural Language Processing Conference

We present an overview of the FIGNEWS shared task, organized as part of the ArabicNLP 2024 conference co-located with ACL 2024. The shared task addresses bias and propaganda annotation in multilingual news posts. We focus on the early days of the Israel War on Gaza as a case study. The task aims to foster collaboration in developing annotation guidelines for subjective tasks by creating frameworks for analyzing diverse narratives highlighting potential bias and propaganda. In a spirit of fostering and encouraging diversity, we address the problem from a multilingual perspective, namely within five languages: English, French, Arabic, Hebrew, and Hindi. A total of 17 teams participated in two annotation subtasks: bias (16 teams) and propaganda (6 teams). The teams competed in four evaluation tracks: guidelines development, annotation quality, annotation quantity, and consistency. Collectively, the teams produced 129,800 data points. Key findings and implications for the field are discussed.

Large language models (LLMs) have demonstrated remarkable capability to generate fluent responses to a wide variety of user queries. However, this has also raised concerns about the potential misuse of such texts in journalism, education, and academia. In this study, we strive to create automated systems that can detect machine-generated texts and pinpoint potential misuse. We first introduce a large-scale benchmark M4, which is a multi-generator, multi-domain, and multi-lingual corpus for machine-generated text detection. Through an extensive empirical study of this dataset, we show that it is challenging for detectors to generalize well on instances from unseen domains or LLMs. In such cases, detectors tend to misclassify machine-generated text as human-written. These results show that the problem is far from solved and that there is a lot of room for improvement. We believe that our dataset will enable future research towards more robust approaches to this pressing societal problem. The dataset is available at https://github.com/mbzuai-nlp/M4

Investigating Gender Bias in STEM Job Advertisements
Malika Dikshit | Houda Bouamor | Nizar Habash
Proceedings of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

Gender inequality has been historically prevalent in academia, especially within the fields of Science, Technology, Engineering, and Mathematics (STEM). In this study, we propose to examine gender bias in academic job descriptions in the STEM fields. We go a step further than previous studies that merely identify individual words as masculine-coded and feminine-coded and delve into the contextual language used in academic job advertisements. We design a novel approach to detect gender biases in job descriptions using Natural Language Processing techniques. Going beyond binary masculine-feminine stereotypes, we propose three big group types to understand gender bias in the language of job descriptions, namely agentic, balanced, and communal. We cluster similar information in job descriptions into these three groups using contrastive learning and various clustering techniques. This research contributes to the field of gender bias detection by providing a novel approach and methodology for categorizing gender bias in job descriptions, which can aid more effective and targeted job advertisements that will be equally appealing across all genders.

2023

Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation
Injy Hamed | Nizar Habash | Slim Abdennadher | Ngoc Thang Vu
Proceedings of the Sixth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2023)

Data sparsity is a main problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model results in more natural CS sentences compared to the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform equally (outperforming dictionary-based replacements). Overall, data augmentation achieves 34% improvement in perplexity, 5.2% relative improvement on WER for ASR task, +4.0-5.1 BLEU points on MT task, and +2.1-2.2 BLEU points on ST over a baseline trained on available data without augmentation.

Proceedings of ArabicNLP 2023
Hassan Sawaf | Samhaa El-Beltagy | Wajdi Zaghouani | Walid Magdy | Ahmed Abdelali | Nadi Tomeh | Ibrahim Abu Farha | Nizar Habash | Salam Khalifa | Amr Keleg | Hatem Haddad | Imed Zitouni | Khalil Mrini | Rawan Almatham
Proceedings of ArabicNLP 2023

Benchmarking Dialectal Arabic-Turkish Machine Translation
Hasan Alkheder | Houda Bouamor | Nizar Habash | Ahmet Zengin
Proceedings of Machine Translation Summit XIX, Vol. 1: Research Track

Due to the significant influx of Syrian refugees in Turkey in recent years, the Syrian Arabic dialect has become increasingly prevalent in certain regions of Turkey. Developing a machine translation system between Turkish and Syrian Arabic would be crucial in facilitating communication between the Turkish and Syrian communities in these regions, which can have a positive impact on various domains such as politics, trade, and humanitarian aid. Such a system would also contribute positively to the growing Arab-focused tourism industry in Turkey. In this paper, we present the first research effort exploring translation between Syrian Arabic and Turkish. We use a set of 2,000 parallel sentences from the MADAR corpus containing 25 different city dialects from different cities across the Arab world, in addition to Modern Standard Arabic (MSA), English, and French. Additionally, we explore the translation performance into Turkish from other Arabic dialects and compare the results to the performance achieved when translating from Syrian Arabic. We build our MADAR-Turk data set by manually translating the set of 2,000 sentences from the Damascus dialect of Syria to Turkish with the help of two native Arabic speakers from Syria who are also highly fluent in Turkish. We evaluate the quality of the translations and report the results achieved. We make this first-of-a-kind data set publicly available to support research in machine translation between these important but less studied language pairs.

Data Augmentation Techniques for Machine Translation of Code-Switched Texts: A Comparative Study
Injy Hamed | Nizar Habash | Thang Vu
Findings of the Association for Computational Linguistics: EMNLP 2023

Code-switching (CSW) text generation has been receiving increasing attention as a solution to address data scarcity. In light of this growing interest, we need more comprehensive studies comparing different augmentation approaches. In this work, we compare three popular approaches: lexical replacements, linguistic theories, and back-translation (BT), in the context of Egyptian Arabic-English CSW. We assess the effectiveness of the approaches on machine translation and the quality of augmentations through human evaluation. We show that BT and CSW predictive-based lexical replacement, being trained on CSW parallel data, perform best on both tasks. Linguistic theories and random lexical replacement prove to be effective in the lack of CSW parallel data, where both approaches achieve similar results.

Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text
Marwa Gaser | Manuel Mager | Injy Hamed | Nizar Habash | Slim Abdennadher | Ngoc Thang Vu
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.

CamelParser2.0: A State-of-the-Art Dependency Parser for Arabic
Ahmed Elshabrawy | Muhammed AbuOdeh | Go Inoue | Nizar Habash
Proceedings of ArabicNLP 2023

We present CamelParser2.0, an open-source Python-based Arabic dependency parser targeting two popular Arabic dependency formalisms, the Columbia Arabic Treebank (CATiB), and Universal Dependencies (UD). The CamelParser2.0 pipeline handles the processing of raw text and produces tokenization, part-of-speech and rich morphological features. As part of developing CamelParser2.0, we explore many system design hyper-parameters, such as parsing model architecture and pretrained language model selection, achieving new state-of-the-art performance across diverse Arabic genres under gold and predicted tokenization settings.

Exploring the Impact of Transliteration on NLP Performance: Treating Maltese as an Arabic Dialect
Kurt Micallef | Fadhl Eryani | Nizar Habash | Houda Bouamor | Claudia Borg
Proceedings of the Workshop on Computation and Written Language (CAWL 2023)

Multilingual models such as mBERT have been demonstrated to exhibit impressive cross-lingual transfer for a number of languages. Despite this, the performance drops for lower-resourced languages, especially when they are not part of the pre-training setup and when there are script differences. In this work we consider Maltese, a low-resource language of Arabic and Romance origins written in Latin script. Specifically, we investigate the impact of transliterating Maltese into Arabic script on a number of downstream tasks: Part-of-Speech Tagging, Dependency Parsing, and Sentiment Analysis. We compare multiple transliteration pipelines ranging from deterministic character maps to more sophisticated alternatives, including manually annotated word mappings and non-deterministic character mappings. For the latter, we show that selection techniques using n-gram language models of Tunisian Arabic, the dialect with the highest degree of mutual intelligibility to Maltese, yield better results on downstream tasks. Moreover, our experiments highlight that the use of an Arabic pre-trained model paired with transliteration outperforms mBERT. Overall, our results show that transliterating Maltese can be considered an option to improve the cross-lingual transfer capabilities.

The User-Aware Arabic Gender Rewriter
Bashar Alhafni | Ossama Obeid | Nizar Habash
Proceedings of the First Workshop on Gender-Inclusive Translation Technologies

We introduce the User-Aware Arabic Gender Rewriter, a user-centric web-based system for Arabic gender rewriting in contexts involving two users. The system takes either Arabic or English sentences as input, and provides users with the ability to specify their desired first and/or second person target genders. The system outputs gender rewritten alternatives of the Arabic sentences (provided directly or as translation outputs) to match the target users’ gender preferences.

NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | AbdelRahim Elmadany | Chiyu Zhang | El Moatez Billah Nagoudi | Houda Bouamor | Nizar Habash
Proceedings of ArabicNLP 2023

We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comparisons between different approaches. NADI 2023 targeted both dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered for the shared task, of whom 18 teams have participated (with 76 valid submissions during the test phase). Among these, 16 teams participated in Subtask 1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning teams achieved 87.27 F1 on Subtask 1, 14.76 Bleu in Subtask 2, and 21.10 Bleu in Subtask 3, respectively. Results show that all three subtasks remain challenging, thereby motivating future work in this area. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation
Bashar Alhafni | Go Inoue | Christian Khairallah | Nizar Habash
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Grammatical error correction (GEC) is a well-explored problem in English with many existing models and datasets. However, research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity. In this paper, we present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models. We also define the task of multi-class Arabic grammatical error detection (GED) and present the first results on multi-class Arabic GED. We show that using GED information as auxiliary input in GEC models improves GEC performance across three datasets spanning different genres. Moreover, we also investigate the use of contextual morphological preprocessing in aiding GEC systems. Our models achieve SOTA results on two Arabic GEC shared task datasets and establish a strong benchmark on a recently created dataset. We make our code, data, and pretrained models publicly available.

2022

Arabic Natural Language Processing
Nizar Habash
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

The Arabic language continues to be the focus of an increasing number of projects in natural language processing (NLP) and computational linguistics (CL). This tutorial provides NLP/CL system developers and researchers (computer scientists and linguists alike) with the necessary background information for working with Arabic in its various forms: Classical, Modern Standard and Dialectal. We discuss various Arabic linguistic phenomena and review the state-of-the-art in Arabic processing from enabling technologies and resources, to common tasks and applications. The tutorial will explain important concepts, common wisdom, and common pitfalls in Arabic processing. Given the wide range of possible issues, we invite tutorial attendees to bring up interesting challenges and problems they are working on to discuss during the tutorial.

ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus
Nizar Habash | David Palfreyman
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present ZAEBUC, an annotated Arabic-English bilingual writer corpus comprising short essays by first-year university students at Zayed University in the United Arab Emirates. We describe and discuss the various guidelines and pipeline processes we followed to create the annotations and quality check them. The annotations include spelling and grammar correction, morphological tokenization, Part-of-Speech tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All of the annotations are done on Arabic and English texts using consistent guidelines as much as possible, with tracked alignments among the different annotations, and to the original raw texts. For morphological tokenization, POS tagging, and lemmatization, we use existing automatic annotation tools followed by manual correction. We also present various measurements and correlations with preliminary insights drawn from the data and annotations. The publicly available ZAEBUC corpus and its annotations are intended to be the stepping stones for additional annotations.

The Shared Task on Gender Rewriting
Bashar Alhafni | Nizar Habash | Houda Bouamor | Ossama Obeid | Sultan Alrowili | Daliyah Alzeer | Khawlah M. Alshanqiti | Ahmed ElBakry | Muhammad ElNokrashy | Mohamed Gabr | Abderrahmane Issam | Abdelrahim Qaddoumi | K. Vijay-Shanker | Mahmoud Zyate
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

In this paper, we present the results and findings of the Shared Task on Gender Rewriting, which was organized as part of the Seventh Arabic Natural Language Processing Workshop. The task of gender rewriting refers to generating alternatives of a given sentence to match different target user gender contexts (e.g., a female speaker with a male listener, a male speaker with a male listener, etc.). This requires changing the grammatical gender (masculine or feminine) of certain words referring to the users. In this task, we focus on Arabic, a gender-marking morphologically rich language. A total of five teams from four countries participated in the shared task.

NADI 2022: The Third Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | Chiyu Zhang | AbdelRahim Elmadany | Houda Bouamor | Nizar Habash
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

We describe the findings of the third Nuanced Arabic Dialect Identification Shared Task (NADI 2022). NADI aims at advancing state-of-the-art Arabic NLP, including Arabic dialects. It does so by affording diverse datasets and modeling opportunities in a standardized context where meaningful comparisons between models and approaches are possible. NADI 2022 targeted both dialect identification (Subtask 1) and dialectal sentiment analysis (Subtask 2) at the country level. A total of 41 unique teams registered for the shared task, of whom 21 teams have participated (with 105 valid submissions). Among these, 19 teams participated in Subtask 1, and 10 participated in Subtask 2. The winning team achieved F1=27.06 on Subtask 1 and F1=75.16 on Subtask 2, reflecting that both subtasks remain challenging and motivating future work in this area. We describe the methods employed by the participating teams and offer an outlook for NADI.

AraBART: a Pretrained Arabic Sequence-to-Sequence Model for Abstractive Summarization
Moussa Kamal Eddine | Nadi Tomeh | Nizar Habash | Joseph Le Roux | Michalis Vazirgiannis
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

Like most natural language understanding and generation tasks, state-of-the-art models for summarization are transformer-based sequence-to-sequence architectures that are pretrained on large corpora. While most existing models focus on English, Arabic remains understudied. In this paper we propose AraBART, the first Arabic model in which the encoder and the decoder are pretrained end-to-end, based on BART. We show that AraBART achieves the best performance on multiple abstractive summarization datasets, outperforming strong baselines including a pretrained Arabic BERT-based model, multilingual BART, Arabic T5, and a multilingual T5 model. AraBART is publicly available.

UniMorph 4.0: Universal Morphology
Khuyagbaatar Batsuren | Omer Goldman | Salam Khalifa | Nizar Habash | Witold Kieraś | Gábor Bella | Brian Leonard | Garrett Nicolai | Kyle Gorman | Yustinus Ghanggo Ate | Maria Ryskina | Sabrina Mielke | Elena Budianskaya | Charbel El-Khaissi | Tiago Pimentel | Michael Gasser | William Abbott Lane | Mohit Raj | Matt Coler | Jaime Rafael Montoya Samame | Delio Siticonatzi Camaiteri | Esaú Zumaeta Rojas | Didier López Francis | Arturo Oncevay | Juan López Bautista | Gema Celeste Silva Villegas | Lucas Torroba Hennigen | Adam Ek | David Guriel | Peter Dirix | Jean-Philippe Bernardy | Andrey Scherbakov | Aziyana Bayyr-ool | Antonios Anastasopoulos | Roberto Zariquiey | Karina Sheifer | Sofya Ganieva | Hilaria Cruz | Ritván Karahóǧa | Stella Markantonatou | George Pavlidis | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Candy Angulo | Jatayu Baxi | Andrew Krizhanovsky | Natalia Krizhanovskaya | Elizabeth Salesky | Clara Vania | Sardana Ivanova | Jennifer White | Rowan Hall Maudslay | Josef Valvoda | Ran Zmigrod | Paula Czarnowska | Irene Nikkarinen | Aelita Salchak | Brijesh Bhatt | Christopher Straughn | Zoey Liu | Jonathan North Washington | Yuval Pinter | Duygu Ataman | Marcin Wolinski | Totok Suhardijanto | Anna Yablonskaya | Niklas Stoehr | Hossep Dolatian | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Aryaman Arora | Richard J. Hatcher | Ritesh Kumar | Jeremiah Young | Daria Rodionova | Anastasia Yemelina | Taras Andrushko | Igor Marchenko | Polina Mashkovtseva | Alexandra Serova | Emily Prud’hommeaux | Maria Nepomniashchaya | Fausto Giunchiglia | Eleanor Chodroff | Mans Hulden | Miikka Silfverberg | Arya D. McCarthy | David Yarowsky | Ryan Cotterell | Reut Tsarfaty | Ekaterina Vylomova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

Arabic Word-level Readability Visualization for Assisted Text Simplification
Reem Hazim | Hind Saddiki | Bashar Alhafni | Muhamed Al Khalil | Nizar Habash
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

This demo paper presents a Google Docs add-on for automatic Arabic word-level readability visualization. The add-on includes a lemmatization component that is connected to a five-level readability lexicon and Arabic WordNet-based substitution suggestions. The add-on can be used for assessing the reading difficulty of a text and identifying difficult words as part of the task of manual text simplification. We make our add-on and its code publicly available.

The Arabic Parallel Gender Corpus 2.0: Extensions and Analyses
Bashar Alhafni | Nizar Habash | Houda Bouamor
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Gender bias in natural language processing (NLP) applications, particularly machine translation, has been receiving increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Addressing the problem in poorly resourced and/or morphologically rich languages has lagged behind, largely due to the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English, and English to Arabic machine translation output. This corpus expands on Habash et al. (2019)’s Arabic Parallel Gender Corpus (APGC v1.0) by adding second person targets as well as increasing the total number of sentences over 6.5 times, reaching over 590K words. Our new dataset will aid the research and development of gender identification, controlled text generation, and post-editing rewrite systems that could be used to personalize NLP applications and provide users with the correct outputs based on their grammatical gender preferences. We make the Arabic Parallel Gender Corpus (APGC v2.0) publicly available.

Maknuune: A Large Open Palestinian Arabic Lexicon
Shahd Salah Uddin Dibas | Christian Khairallah | Nizar Habash | Omar Fayez Sadi | Tariq Sairafy | Karmel Sarabta | Abrar Ardah
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

We present Maknuune, a large open lexicon for the Palestinian Arabic dialect. Maknuune has over 36K entries from 17K lemmas, and 3.7K roots. All entries include diacritized Arabic orthography, phonological transcription and English glosses. Some entries are enriched with additional information such as broken plurals and templatic feminine forms, associated phrases and collocations, Standard Arabic glosses, and examples or notes on grammar, usage, or location of collected entry.

Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects
Go Inoue | Salam Khalifa | Nizar Habash
Findings of the Association for Computational Linguistics: ACL 2022

We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models. Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study, achieving 2.6% absolute improvement over the previous state-of-the-art in Modern Standard Arabic, 2.8% in Gulf, 1.6% in Egyptian, and 8.3% in Levantine. We explore different training setups for fine-tuning pre-trained transformer language models, including training data size, the use of external linguistic resources, and the use of annotated data from other dialects in a low-resource scenario. Our results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect. Additionally, we show that high-quality morphological analyzers as external linguistic resources are beneficial especially in low-resource settings.

AraSAS: The Open Source Arabic Semantic Tagger
Mahmoud El-Haj | Elvis de Souza | Nouran Khallaf | Paul Rayson | Nizar Habash
Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

This paper presents AraSAS, the first open-source Arabic semantic analysis tagging system. AraSAS is a software framework that provides full semantic tagging of text written in Arabic. AraSAS is based on the UCREL Semantic Analysis System (USAS) which was first developed to semantically tag English text. Similarly to USAS, AraSAS uses a hierarchical semantic tag set that contains 21 major discourse fields and 232 fine-grained semantic field tags. The paper describes the creation, validation and evaluation of AraSAS. In addition, we demonstrate a first case study to illustrate the affordances of applying USAS and AraSAS semantic taggers on the Zayed University Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) (Palfreyman and Habash, 2022), where we show and compare the coverage of the two semantic taggers through running them on Arabic and English essays on different topics. The analysis expands to compare the taggers when run on texts in Arabic and English written by the same writer and texts written by male and by female students. Variables for comparison include frequency of use of particular semantic sub-domains, as well as the diversity of semantic elements within a text.

Hierarchical Aggregation of Dialectal Data for Arabic Dialect Identification
Nurpeiis Baimukan | Houda Bouamor | Nizar Habash
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Arabic is a collection of dialectal variants that are historically related but significantly different. These differences can be seen across regions, countries, and even cities in the same countries. Previous work on Arabic Dialect identification has focused mainly on specific dialect levels (region, country, province, or city) using level-specific resources; and different efforts used different schemas and labels. In this paper, we present the first effort aiming at defining a standard unified three-level hierarchical schema (region-country-city) for dialectal Arabic classification. We map 29 different data sets to this unified schema, and use the common mapping to facilitate aggregating these data sets. We test the value of such aggregation by building language models and using them in dialect identification. We make our label mapping code and aggregated language models publicly available.

User-Centric Gender Rewriting
Bashar Alhafni | Nizar Habash | Houda Bouamor
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

In this paper, we define the task of gender rewriting in contexts involving two users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. We develop a multi-step system that combines the positive aspects of both rule-based and neural rewriting models. Our results successfully demonstrate the viability of this approach on a recently created corpus for Arabic gender rewriting, achieving 88.42 M2 F0.5 on a blind test set. Our proposed system improves over previous work on the first-person-only version of this task, by 3.05 absolute increase in M2 F0.5. We demonstrate a use case of our gender rewriting system by using it to post-edit the output of a commercial MT system to provide personalized outputs based on the users’ grammatical gender preferences. We make our code, data, and pretrained models publicly available.

Camel Treebank: An Open Multi-genre Arabic Dependency Treebank
Nizar Habash | Muhammed AbuOdeh | Dima Taji | Reem Faraj | Jamila El Gizuli | Omar Kallas
Proceedings of the Thirteenth Language Resources and Evaluation Conference

We present the Camel Treebank (CAMELTB), a 188K word open-source dependency treebank of Modern Standard and Classical Arabic. CAMELTB 1.0 includes 13 sub-corpora comprising selections of texts from pre-Islamic poetry to social media online commentaries, and covering a range of genres from religious and philosophical texts to news, novels, and student essays. The texts are all publicly available (out of copyright, creative commons, or under open licenses). The texts were morphologically tokenized and syntactically parsed automatically, and then manually corrected by a team of trained annotators. The annotations follow the guidelines of the Columbia Arabic Treebank (CATiB) dependency representation. We discuss our annotation process and guideline extensions, and we present some initial observations on lexical and syntactic differences among the annotated sub-corpora. This corpus will be publicly available to support and encourage research on Arabic NLP in general and on new, previously unexplored genres that are of interest to a wider spectrum of researchers, from historical linguistics and digital humanities to computer-assisted language pedagogy.

The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic
Dana Abdulrahim | Go Inoue | Latifa Shamsan | Salam Khalifa | Nizar Habash
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In recent years, the focus on developing natural language processing (NLP) tools for Arabic has shifted from Modern Standard Arabic to various Arabic dialects. Corpora of various sizes, representing different genres, have been created for a number of Arabic dialects. As far as Gulf Arabic is concerned, Gumar Corpus (Khalifa et al., 2016) is the largest corpus, to date, that includes data representing the dialectal Arabic of the six Gulf Cooperation Council countries (Bahrain, Kuwait, Saudi Arabia, Qatar, United Arab Emirates, and Oman), particularly in the genre of “online forum novels”. In this paper, we present the Bahrain Corpus. Our objective is to create a specialized corpus of the Bahraini Arabic dialect, which includes written texts as well as transcripts of audio files, belonging to a different genre (folktales, comedy shows, plays, cooking shows, etc.). The corpus comprises 620K words, carefully curated. We provide automatic morphological annotations of the full corpus using state-of-the-art morphosyntactic disambiguation for Gulf Arabic. We validate the quality of the annotations on a 7.6K word sample. We plan to make the annotated sample as well as the full corpus publicly available to support researchers interested in Arabic NLP.

ArzEn-ST: A Three-way Speech Translation Corpus for Code-Switched Egyptian Arabic-English
Injy Hamed | Nizar Habash | Slim Abdennadher | Ngoc Thang Vu
Proceedings of the Seventh Arabic Natural Language Processing Workshop (WANLP)

We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic-English Speech Translation Corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, monolingual Egyptian Arabic and monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and corpus publicly available. We also report results for baseline systems for machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research studying the code-switching phenomenon from a linguistic perspective and can be used to train and evaluate NLP systems.

Camelira: An Arabic Multi-Dialect Morphological Disambiguator
Ossama Obeid | Go Inoue | Nizar Habash
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present Camelira, a web-based Arabic multi-dialect morphological disambiguation tool that covers four major variants of Arabic: Modern Standard Arabic, Egyptian, Gulf, and Levantine. Camelira offers a user-friendly web interface that allows researchers and language learners to explore various linguistic information, such as part-of-speech, morphological features, and lemmas. Our system also provides an option to automatically choose an appropriate dialect-specific disambiguator based on the prediction of a dialect identification component. Camelira is publicly accessible at http://camelira.camel-lab.com.

Morphotactic Modeling in an Open-source Multi-dialectal Arabic Morphological Analyzer and Generator
Nizar Habash | Reham Marzouk | Christian Khairallah | Salam Khalifa
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Arabic is a morphologically rich and complex language, with numerous dialectal variants. Previous efforts on Arabic morphology modeling focused on specific variants and specific domains using a range of techniques with different degrees of linguistic modeling transparency. In this paper we propose a new approach to modeling Arabic morphology with an eye towards multi-dialectness, resource openness, and easy extensibility and use. We demonstrate our approach by modeling verbs from Standard Arabic and Egyptian Arabic, within a common framework, and with high coverage.

2021

Proceedings of the Sixth Arabic Natural Language Processing Workshop
Nizar Habash | Houda Bouamor | Hazem Hajj | Walid Magdy | Wajdi Zaghouani | Fethi Bougares | Nadi Tomeh | Ibrahim Abu Farha | Samia Touileb
Proceedings of the Sixth Arabic Natural Language Processing Workshop

A Cloud-based User-Centered Time-Offset Interaction Application
Alberto Chierici | Tyeece Kiana Fredorcia Hensley | Wahib Kamran | Kertu Koss | Armaan Agrawal | Erin Meekhof | Goffredo Puccetti | Nizar Habash
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue

Time-offset interaction applications (TOIA) allow simulating conversations with people who have previously recorded relevant video utterances, which are played in response to their interacting user. TOIAs have great potential for preserving cross-generational and cross-cultural histories, online teaching, simulated interviews, etc. Current TOIAs exist in niche contexts involving high production costs. Democratizing TOIA presents different challenges when creating appropriate pre-recordings, designing different user stories, and creating simple online interfaces for experimentation. We open-source TOIA 2.0, a user-centered time-offset interaction application, and make it available for everyone who wants to interact with people’s pre-recordings, or create their pre-recordings.

A View From the Crowd: Evaluation Challenges for Time-Offset Interaction Applications
Alberto Chierici | Nizar Habash
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Dialogue systems like chatbots, and tasks like question-answering (QA) have gained traction in recent years; yet evaluating such systems remains difficult. Reasons include the great variety in contexts and use cases for these systems as well as the high cost of human evaluation. In this paper, we focus on a specific type of dialogue system: Time-Offset Interaction Applications (TOIAs), intelligent conversational software that simulates face-to-face conversations between humans and pre-recorded human avatars. Under the constraint that a TOIA is a single output system interacting with users with different expectations, we identify two challenges: first, how do we define a ‘good’ answer? and second, what’s an appropriate metric to use? We explore both challenges through the creation of a novel dataset that identifies multiple good answers to specific TOIA questions through the help of Amazon Mechanical Turk workers. This ‘view from the crowd’ allows us to study the variations of how TOIA interrogators perceive its answers. Our contributions include the annotated dataset that we make publicly available and the proposal of Success Rate @k as an evaluation metric that is more appropriate than traditional QA and information retrieval metrics.

NADI 2021: The Second Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | Chiyu Zhang | AbdelRahim Elmadany | Houda Bouamor | Nizar Habash
Proceedings of the Sixth Arabic Natural Language Processing Workshop

We present the findings and results of the Second Nuanced Arabic Dialect Identification Shared Task (NADI 2021). This Shared Task includes four subtasks: country-level Modern Standard Arabic (MSA) identification (Subtask 1.1), country-level dialect identification (Subtask 1.2), province-level MSA identification (Subtask 2.1), and province-level sub-dialect identification (Subtask 2.2). The shared task dataset covers a total of 100 provinces from 21 Arab countries, collected from the Twitter domain. A total of 53 teams from 23 countries registered to participate in the tasks, thus reflecting the interest of the community in this area. We received 16 submissions for Subtask 1.1 from five teams, 27 submissions for Subtask 1.2 from eight teams, 12 submissions for Subtask 2.1 from four teams, and 13 submissions for Subtask 2.2 from four teams.

Automatic Romanization of Arabic Bibliographic Records
Fadhl Eryani | Nizar Habash
Proceedings of the Sixth Arabic Natural Language Processing Workshop

International library standards require cataloguers to tediously input Romanization of their catalogue records for the benefit of library users without specific language expertise. In this paper, we present the first reported results on the task of automatic Romanization of undiacritized Arabic bibliographic entries. This complex task requires the modeling of Arabic phonology, morphology, and even semantics. We collected a 2.5M word corpus of parallel Arabic and Romanized bibliographic entries, and benchmarked a number of models that vary in terms of complexity and resource dependence. Our best system reaches 89.3% exact word Romanization on a blind test set. We make our data and code publicly available.

Automatic Error Type Annotation for Arabic
Riadh Belkebir | Nizar Habash
Proceedings of the 25th Conference on Computational Natural Language Learning

We present ARETA, an automatic error type annotation system for Modern Standard Arabic. We design ARETA to address Arabic’s morphological richness and orthographic ambiguity. We base our error taxonomy on the Arabic Learner Corpus (ALC) Error Tagset with some modifications. ARETA achieves a performance of 85.8% (micro average F1 score) on a manually annotated blind test portion of ALC. We also demonstrate ARETA’s usability by applying it to a number of submissions from the QALB 2014 shared task for Arabic grammatical error correction. The resulting analyses give helpful insights on the strengths and weaknesses of different submissions, which is more useful than the opaque M2 scoring metrics used in the shared task. ARETA employs a large Arabic morphological analyzer, but is completely unsupervised otherwise. We make ARETA publicly available.

The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models
Go Inoue | Bashar Alhafni | Nurpeiis Baimukan | Houda Bouamor | Nizar Habash
Proceedings of the Sixth Arabic Natural Language Processing Workshop

In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.

SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
Tiago Pimentel | Maria Ryskina | Sabrina J. Mielke | Shijie Wu | Eleanor Chodroff | Brian Leonard | Garrett Nicolai | Yustinus Ghanggo Ate | Salam Khalifa | Nizar Habash | Charbel El-Khaissi | Omer Goldman | Michael Gasser | William Lane | Matt Coler | Arturo Oncevay | Jaime Rafael Montoya Samame | Gema Celeste Silva Villegas | Adam Ek | Jean-Philippe Bernardy | Andrey Shcherbakov | Aziyana Bayyr-ool | Karina Sheifer | Sofya Ganieva | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Andrew Krizhanovsky | Natalia Krizhanovsky | Clara Vania | Sardana Ivanova | Aelita Salchak | Christopher Straughn | Zoey Liu | Jonathan North Washington | Duygu Ataman | Witold Kieraś | Marcin Woliński | Totok Suhardijanto | Niklas Stoehr | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Richard J. Hatcher | Emily Prud’hommeaux | Ritesh Kumar | Mans Hulden | Botond Barta | Dorina Lakatos | Gábor Szolnok | Judit Ács | Mohit Raj | David Yarowsky | Ryan Cotterell | Ben Ambridge | Ekaterina Vylomova
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

2020

CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing
Ossama Obeid | Nasser Zalmout | Salam Khalifa | Dima Taji | Mai Oudah | Bashar Alhafni | Go Inoue | Fadhl Eryani | Alexander Erdmann | Nizar Habash
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, dialect identification, named entity recognition, and sentiment analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.

Joint Diacritization, Lemmatization, Normalization, and Fine-Grained Morphological Tagging
Nasser Zalmout | Nizar Habash
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The written forms of Semitic languages are both highly ambiguous and morphologically rich: a word can have multiple interpretations and is one of many inflected forms of the same concept or lemma. This is further exacerbated for dialectal content, which is more prone to noise and lacks a standard orthography. The morphological features can be lexicalized, like lemmas and diacritized forms, or non-lexicalized, like gender, number, and part-of-speech tags, among others. Joint modeling of the lexicalized and non-lexicalized features can identify more intricate morphological patterns, which provide better context modeling, and further disambiguate ambiguous lexical choices. However, the different modeling granularity can make joint modeling more difficult. Our approach models the different features jointly, whether lexicalized (on the character-level), or non-lexicalized (on the word-level). We use Arabic as a test case, and achieve state-of-the-art results for Modern Standard Arabic with 20% relative error reduction, and Egyptian Arabic with 11% relative error reduction.

The Margarita Dialogue Corpus: A Data Set for Time-Offset Interactions and Unstructured Dialogue Systems
Alberto Chierici | Nizar Habash | Margarita Bicec
Proceedings of the Twelfth Language Resources and Evaluation Conference

Time-Offset Interaction Applications (TOIAs) are systems that simulate face-to-face conversations between humans and digital human avatars recorded in the past. Developing a well-functioning TOIA involves several research areas: artificial intelligence, human-computer interaction, natural language processing, question answering, and dialogue systems. The first challenges are to define a sensible methodology for data collection and to create useful data sets for training the system to retrieve the best answer to a user’s question. In this paper, we present three main contributions: a methodology for creating the knowledge base for a TOIA, a dialogue corpus, and baselines for single-turn answer retrieval. We develop the methodology using a two-step strategy. First, we let the avatar maker list question-answer pairs by intuition, guessing what questions a user may ask the avatar. Second, we record actual dialogues between random individuals and the avatar maker. We make the Margarita Dialogue Corpus available to the research community. This corpus comprises the knowledge base in text format, the video clips for each answer, and the annotated dialogues.

An Online Readability Leveled Arabic Thesaurus
Zhengyang Jiang | Nizar Habash | Muhamed Al Khalil
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

This demo paper introduces the online Readability Leveled Arabic Thesaurus interface. For a given user input word, this interface provides the word’s possible lemmas, roots, English glosses, related Arabic words and phrases, and readability on a five-level readability scale. This interface builds on and connects multiple existing Arabic resources and processing tools. This one-of-a-kind system enables Arabic speakers and learners to benefit from advances in Arabic computational linguistics technologies. Feedback from users of the system will help the developers to identify lexical coverage gaps and errors. A live link to the demo is available at: http://samer.camel-lab.com/.

A Unified Model for Arabizi Detection and Transliteration using Sequence-to-Sequence Models
Ali Shazal | Aiza Usman | Nizar Habash
Proceedings of the Fifth Arabic Natural Language Processing Workshop

While online Arabic is primarily written using the Arabic script, a Roman-script variety called Arabizi is often seen on social media. Although this representation captures the phonology of the language, it is not a one-to-one mapping with the Arabic script version. This issue is exacerbated by the fact that Arabizi on social media is Dialectal Arabic, which does not have a standard orthography. Furthermore, Arabizi tends to include a lot of code mixing between Arabic and English (or French). To map Arabizi text to Arabic script in the context of complete utterances, previously published efforts have split Arabizi detection and Arabic script target into two separate tasks. In this paper, we present the first effort on a unified model for Arabizi detection and transliteration into a code-mixed output with consistent Arabic spelling conventions, using a sequence-to-sequence deep learning model. Our best system achieves 80.6% word accuracy and 58.7% BLEU on a blind test set.

Utilizing Subword Entities in Character-Level Sequence-to-Sequence Lemmatization Models
Nasser Zalmout | Nizar Habash
Proceedings of the 28th International Conference on Computational Linguistics

In this paper we present a character-level sequence-to-sequence lemmatization model, utilizing several subword features in multiple configurations. In addition to generic n-gram embeddings (using FastText), we experiment with concatenative (stems) and templatic (roots and patterns) morphological subwords. We present several architectures that embed these features directly at the encoder side, or learn them jointly at the decoder side with a multitask learning architecture. The results indicate that using the generic n-gram embeddings (through FastText) outperforms the other linguistically driven subwords. We use Modern Standard Arabic and Egyptian Arabic as test cases, with up to 22% and 13% relative error reduction, respectively, from a strong baseline. An error analysis shows that our best system is even able to handle word/lemma pairs that are both unseen in the training data.

Gender-Aware Reinflection using Linguistically Enhanced Neural Models
Bashar Alhafni | Nizar Habash | Houda Bouamor
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

In this paper, we present an approach for sentence-level gender reinflection using linguistically enhanced sequence-to-sequence models. Our system takes an Arabic sentence and a given target gender as input and generates a gender-reinflected sentence based on the target gender. We formulate the problem as a user-aware grammatical error correction task and build an encoder-decoder architecture to jointly model reinflection for both masculine and feminine grammatical genders. We also show that adding linguistic features to our model leads to better reinflection results. The results on a blind test set using our best system show improvements over previous work, with a 3.6% absolute increase in M2 F0.5.

Multitask Easy-First Dependency Parsing: Exploiting Complementarities of Different Dependency Representations
Yash Kankanampati | Joseph Le Roux | Nadi Tomeh | Dima Taji | Nizar Habash
Proceedings of the 28th International Conference on Computational Linguistics

In this paper we present a parsing model for projective dependency trees which takes advantage of the existence of complementary dependency annotations, as is the case in Arabic with the availability of the CATiB and UD treebanks. Our system performs syntactic parsing according to both annotation types jointly as a sequence of arc-creating operations, and partially created trees for one annotation are also available to the other as features for the score function. This method gives an error reduction of 9.9% on CATiB and 6.1% on UD compared to a strong baseline, and ablation tests show that the main contribution to this reduction comes from sharing tree representations between tasks, and not simply from sharing BiLSTM layers as is often done in NLP multitask systems.

A Spelling Correction Corpus for Multiple Arabic Dialects
Fadhl Eryani | Nizar Habash | Houda Bouamor | Salam Khalifa
Proceedings of the Twelfth Language Resources and Evaluation Conference

Arabic dialects are the non-standard varieties of Arabic commonly spoken – and increasingly written on social media – across the Arab world. Arabic dialects do not have standard orthographies, a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are in parallel across the cities (2,000 sentences from each city). This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects. We present results on a bootstrapping technique we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.

PALMYRA 2.0: A Configurable Multilingual Platform Independent Tool for Morphology and Syntax Annotation
Dima Taji | Nizar Habash
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

We present PALMYRA 2.0, a graphical dependency-tree visualization and editing software. PALMYRA 2.0 is designed to be highly configurable to any dependency parsing representation, and to enable the annotation of a multitude of linguistic features. It uses an intuitive interface that relies on drag-and-drop utilities as well as pop-up menus and keyboard shortcuts that can be easily specified.

Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods
Salam Khalifa | Nasser Zalmout | Nizar Habash
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we present the first full morphological analysis and disambiguation system for Gulf Arabic. We use an existing state-of-the-art morphological disambiguation system to investigate the effects of different data sizes and different combinations of morphological analyzers for Modern Standard Arabic, Egyptian Arabic, and Gulf Arabic. We find that in very low-resource settings, morphological analyzers help boost the performance of the full morphological disambiguation task. However, as the size of resources increases, the value of the morphological analyzers decreases.

The Paradigm Discovery Problem
Alexander Erdmann | Micha Elsner | Shijie Wu | Ryan Cotterell | Nizar Habash
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work.

A Large-Scale Leveled Readability Lexicon for Standard Arabic
Muhamed Al Khalil | Nizar Habash | Zhengyang Jiang
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a large-scale 26,000-lemma leveled readability lexicon for Modern Standard Arabic. The lexicon was manually annotated in triplicate by language professionals from three regions in the Arab world. The annotations show a high degree of agreement, and major differences were limited to regional variations. Comparing lemma readability levels with their frequencies provided good insights into the benefits and pitfalls of frequency-based readability approaches. The lexicon will be publicly available.

NADI 2020: The First Nuanced Arabic Dialect Identification Shared Task
Muhammad Abdul-Mageed | Chiyu Zhang | Houda Bouamor | Nizar Habash
Proceedings of the Fifth Arabic Natural Language Processing Workshop

We present the results and findings of the First Nuanced Arabic Dialect Identification Shared Task (NADI). This Shared Task includes two subtasks: country-level dialect identification (Subtask 1) and province-level sub-dialect identification (Subtask 2). The data for the shared task covers a total of 100 provinces from 21 Arab countries and is collected from the Twitter domain. As such, NADI is the first shared task to target naturally-occurring fine-grained dialectal text at the sub-country level. A total of 61 teams from 25 countries registered to participate in the tasks, thus reflecting the interest of the community in this area. We received 47 submissions for Subtask 1 from 18 teams and 9 submissions for Subtask 2 from 9 teams.

2019

Automatic Gender Identification and Reinflection in Arabic
Nizar Habash | Houda Bouamor | Christine Chung
Proceedings of the First Workshop on Gender Bias in Natural Language Processing

The impressive progress in many Natural Language Processing (NLP) applications has increased the awareness of some of the biases these NLP systems have with regard to gender identities. In this paper, we propose an approach to extend biased single-output gender-blind NLP systems with gender-specific alternative reinflections. We focus on Arabic, a gender-marking morphologically rich language, in the context of machine translation (MT) from English, and for first-person-singular constructions only. Our contributions are the development of a system-independent gender-awareness wrapper, and the building of a corpus for training and evaluating first-person-singular gender identification and reinflection in Arabic. Our results successfully demonstrate the viability of this approach with an 8% relative increase in BLEU score for first-person-singular feminine, and a comparable 5.3% increase for first-person-singular masculine, on top of a state-of-the-art gender-blind MT system on a held-out test set.

The MADAR Shared Task on Arabic Fine-Grained Dialect Identification
Houda Bouamor | Sabit Hassan | Nizar Habash
Proceedings of the Fourth Arabic Natural Language Processing Workshop

In this paper, we present the results and findings of the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. This shared task was organized as part of the Fourth Arabic Natural Language Processing Workshop, co-located with ACL 2019. The shared task includes two subtasks: the MADAR Travel Domain Dialect Identification subtask (Subtask 1) and the MADAR Twitter User Dialect Identification subtask (Subtask 2). This shared task is the first to target a large set of dialect labels at the city and country levels. The data for the shared task was created or collected under the Multi-Arabic Dialect Applications and Resources (MADAR) project. A total of 21 teams from 15 countries participated in the shared task.

The Effectiveness of Simple Hybrid Systems for Hypernym Discovery
William Held | Nizar Habash
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Hypernymy modeling has largely been separated according to two paradigms, pattern-based methods and distributional methods. However, recent works utilizing a mix of these strategies have yielded state-of-the-art results. This paper evaluates the contribution of both paradigms to hybrid success by evaluating the benefits of hybrid treatment of baseline models from each paradigm. Even with a simple methodology for each individual system, utilizing a hybrid approach establishes new state-of-the-art results on two domain-specific English hypernym discovery tasks and outperforms all non-hybrid approaches in a general English hypernym discovery task.

ADIDA: Automatic Dialect Identification for Arabic
Ossama Obeid | Mohammad Salameh | Houda Bouamor | Nizar Habash
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations)

This demo paper describes ADIDA, a web-based system for automatic dialect identification for Arabic text. The system distinguishes among the dialects of 25 Arab cities (from Rabat to Muscat) in addition to Modern Standard Arabic. The results are presented with either a point map or a heat map visualizing the automatic identification probabilities over a geographical map of the Arab World.

Adversarial Multitask Learning for Joint Multi-Feature and Multi-Dialect Morphological Modeling
Nasser Zalmout | Nizar Habash
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Morphological tagging is challenging for morphologically rich languages due to the large target space and the need for more training data to minimize model sparsity. Dialectal variants of morphologically rich languages suffer more as they tend to be noisier and have fewer resources. In this paper we explore the use of multitask learning and adversarial training to address morphological richness and dialectal variations in the context of full morphological tagging. We use multitask learning for joint morphological modeling for the features within two dialects, and as a knowledge-transfer scheme for cross-dialectal modeling. We use adversarial training to learn dialect-invariant features that can help the knowledge-transfer scheme from the high-resource to the low-resource variants. We work with two dialectal variants: Modern Standard Arabic (high-resource “dialect”) and Egyptian Arabic (low-resource dialect) as a case study. Our models achieve state-of-the-art results for both. Furthermore, adversarial training provides a more significant improvement when using smaller training datasets in particular.

Morphologically Annotated Corpora for Seven Arabic Dialects: Taizi, Sanaani, Najdi, Jordanian, Syrian, Iraqi and Moroccan
Faisal Alshargi | Shahd Dibas | Sakhar Alkhereyf | Reem Faraj | Basmah Abdulkareem | Sane Yagi | Ouafaa Kacha | Nizar Habash | Owen Rambow
Proceedings of the Fourth Arabic Natural Language Processing Workshop

We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.

A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance
Alexander Erdmann | Salam Khalifa | Mai Oudah | Nizar Habash | Houda Bouamor
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus, which are then disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.

The Impact of Preprocessing on Arabic-English Statistical and Neural Machine Translation
Mai Oudah | Amjad Almahairi | Nizar Habash
Proceedings of Machine Translation Summit XVII: Research Track

2018

A Bilingual Interactive Human Avatar Dialogue System
Dana Abu Ali | Muaz Ahmad | Hayat Al Hassan | Paula Dozsa | Ming Hu | Jose Varias | Nizar Habash
Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue

This demonstration paper presents a bilingual (Arabic-English) interactive human avatar dialogue system. The system is named TOIA (time-offset interaction application), as it simulates face-to-face conversations between humans using digital human avatars recorded in the past. TOIA is a conversational agent, similar to a chat bot, except that it is based on an actual human being and can be used to preserve and tell stories. The system is designed to allow anybody, simply using a laptop, to create an avatar of themselves, thus facilitating cross-cultural and cross-generational sharing of narratives to wider audiences. The system currently supports monolingual and cross-lingual dialogues in Arabic and English, but can be extended to other languages.

Improving Domain Independent Question Parsing with Synthetic Treebanks
Halim-Antoine Boukaram | Nizar Habash | Micheline Ziadee | Majd Sakr
Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)

Automatic syntactic parsing for question constructions is a challenging task due to the paucity of training examples in most treebanks. The near absence of question constructions is due to the dominance of the news domain in treebanking efforts. In this paper, we compare two synthetic low-cost question treebank creation methods with a conventional manual high-cost annotation method in the context of three domains (news questions, political talk shows, and chatbots) for Modern Standard Arabic, a language with relatively low resources and rich morphology. Our results show that synthetic methods can be effective at significantly reducing parsing errors for a target domain without having to invest large resources on manual annotation; and the combination of manual and synthetic methods is our best domain-independent performer.

A Parallel Corpus of Arabic-Japanese News Articles
Go Inoue | Nizar Habash | Yuji Matsumoto | Hiroyuki Aoyama
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

CoNLL-UL: Universal Morphological Lattices for Universal Dependency Parsing
Amir More | Özlem Çetinoğlu | Çağrı Çöltekin | Nizar Habash | Benoît Sagot | Djamé Seddah | Dima Taji | Reut Tsarfaty
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

The MADAR Arabic Dialect Corpus and Lexicon
Houda Bouamor | Nizar Habash | Mohammad Salameh | Wajdi Zaghouani | Owen Rambow | Dana Abdulrahim | Ossama Obeid | Salam Khalifa | Fadhl Eryani | Alexander Erdmann | Kemal Oflazer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Utilizing Character and Word Embeddings for Text Normalization with Sequence-to-Sequence Models
Daniel Watson | Nasser Zalmout | Nizar Habash
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Text normalization is an important enabling technology for several NLP tasks. Recently, neural-network-based approaches have outperformed well-established models in this task. However, in languages other than English, there has been little exploration in this direction. Both the scarcity of annotated data and the complexity of the language increase the difficulty of the problem. To address these challenges, we use a sequence-to-sequence model with character-based attention, which in addition to its self-learned character embeddings, uses word embeddings pre-trained with an approach that also models subword information. This provides the neural model with access to more linguistic information especially suitable for text normalization, without large parallel corpora. We show that providing the model with word-level features bridges the gap for the neural network approach to achieve a state-of-the-art F1 score on a standard Arabic language correction shared task dataset.

Complementary Strategies for Low Resourced Morphological Modeling
Alexander Erdmann | Nizar Habash
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Morphologically rich languages are challenging for natural language processing tasks due to data sparsity. This can be addressed either by introducing out-of-context morphological knowledge, or by developing machine learning architectures that specifically target data sparsity and/or morphological information. We find these approaches to complement each other in a morphological paradigm modeling task in Modern Standard Arabic, which, in addition to being morphologically complex, features ubiquitous ambiguity, exacerbating sparsity with noise. Given a small number of out-of-context rules describing closed class morphology, we combine them with word embeddings leveraging subword strings and noise reduction techniques. The combination outperforms both approaches individually by about 20% absolute. While morphological resources already exist for Modern Standard Arabic, our results inform how comparable resources might be constructed for non-standard dialects or any morphologically rich, low resourced language, given scarcity of time and funding.

MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction
Ossama Obeid | Salam Khalifa | Nizar Habash | Houda Bouamor | Wajdi Zaghouani | Kemal Oflazer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Feature Optimization for Predicting Readability of Arabic L1 and L2
Hind Saddiki | Nizar Habash | Violetta Cavalli-Sforza | Muhamed Al Khalil
Proceedings of the 5th Workshop on Natural Language Processing Techniques for Educational Applications

Advances in automatic readability assessment can impact the way people consume information in a number of domains. Arabic, being a low-resource and morphologically complex language, presents numerous challenges to the task of automatic readability assessment. In this paper, we present the largest and most in-depth computational readability study for Arabic to date. We study a large set of features with varying depths, from shallow words to syntactic trees, for both L1 and L2 readability tasks. Our best L1 readability accuracy result is 94.8% (75% error reduction from a commonly used baseline). The comparable results for L2 are 72.4% (45% error reduction). We also demonstrate the added value of leveraging L1 features for L2 readability prediction.
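As a rough check on how the accuracy and error-reduction figures above relate, the sketch below backs out the baseline accuracy that each pair implies. It assumes error is simply 100 minus accuracy. The function name is illustrative, and the baseline values it prints are inferred from the rounded figures in the abstract, not reported by the authors.

```python
def implied_baseline_accuracy(accuracy: float, rel_error_reduction: float) -> float:
    """Return the baseline accuracy (in percent) implied by a reported
    accuracy and a relative error reduction, both given in percent."""
    error = 100.0 - accuracy
    # relative reduction r means: error = baseline_error * (1 - r)
    baseline_error = error / (1.0 - rel_error_reduction / 100.0)
    return 100.0 - baseline_error


# Figures quoted in the abstract: (accuracy, relative error reduction)
l1_baseline = implied_baseline_accuracy(94.8, 75.0)
l2_baseline = implied_baseline_accuracy(72.4, 45.0)

print(f"L1 implied baseline accuracy: {l1_baseline:.1f}%")
print(f"L2 implied baseline accuracy: {l2_baseline:.1f}%")
```

Because the published percentages are rounded, the implied baselines are approximate.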

A Leveled Reading Corpus of Modern Standard Arabic
Muhamed Al Khalil | Hind Saddiki | Nizar Habash | Latifa Alfalasi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

A Morphologically Annotated Corpus of Emirati Arabic
Salam Khalifa | Nizar Habash | Fadhl Eryani | Ossama Obeid | Dana Abdulrahim | Meera Al Kaabi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

An Arabic Morphological Analyzer and Generator with Copious Features
Dima Taji | Salam Khalifa | Ossama Obeid | Fadhl Eryani | Nizar Habash
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

We introduce CALIMA-Star, a very rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality and much more. This tool includes a fast engine that can be easily integrated into other systems, as well as an easy-to-use API and a web interface. CALIMA-Star also supports morphological reinflection. We evaluate CALIMA-Star against four commonly used analyzers for Arabic in terms of speed and morphological content.

A Cross-lingual Messenger with Keyword Searchable Phrases for the Travel Domain
Shehroze Khan | Jihyun Kim | Tarik Zulfikarpasic | Peter Chen | Nizar Habash
Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations

We present Qutr (Query Translator), a smart cross-lingual communication application for the travel domain. Qutr is a real-time messaging app that automatically translates conversations while supporting keyword-to-sentence matching. Qutr relies on querying a database that holds commonly used pre-translated travel-domain phrases and phrase templates in different languages with the use of keywords. The query matching supports paraphrases, incomplete keywords and some input spelling errors. The application addresses common cross-lingual communication issues such as translation accuracy, speed, privacy, and personalization.

Fine-Grained Arabic Dialect Identification
Mohammad Salameh | Houda Bouamor | Nizar Habash
Proceedings of the 27th International Conference on Computational Linguistics

Previous work on the problem of Arabic Dialect Identification typically targeted five coarse-grained dialect classes plus Standard Arabic (6-way classification). This paper presents the first results on a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic, a very challenging task. We build several classification systems and explore a large space of features. Our results show that we can identify the exact city of a speaker at an accuracy of 67.9% for sentences with an average length of 7 words (a 9% relative error reduction over the state-of-the-art technique for Arabic dialect identification) and reach more than 90% when we consider 16 words. We also report on additional insights from a data analysis of similarity and difference across Arabic dialects.

Noise-Robust Morphological Disambiguation for Dialectal Arabic
Nasser Zalmout | Alexander Erdmann | Nizar Habash
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

User-generated text tends to be noisy with many lexical and orthographic inconsistencies, making natural language processing (NLP) tasks more challenging. The challenging nature of noisy text processing is exacerbated for dialectal content, where in addition to spelling and lexical differences, dialectal text is characterized by morpho-syntactic and phonetic variations. These issues increase sparsity in NLP models and reduce accuracy. We present a neural morphological tagging and disambiguation model for Egyptian Arabic, with various extensions to handle noisy and inconsistent content. Our models achieve about 5% relative error reduction (1.1% absolute improvement) for full morphological analysis, and around 22% relative error reduction (1.8% absolute improvement) for part-of-speech tagging, over a state-of-the-art baseline.

Palmyra: A Platform Independent Dependency Annotation Tool for Morphologically Rich Languages
Talha Javed | Nizar Habash | Dima Taji
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Addressing Noise in Multidialectal Word Embeddings
Alexander Erdmann | Nasser Zalmout | Nizar Habash
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Word embeddings are crucial to many natural language processing tasks. The quality of embeddings relies on large non-noisy corpora. Arabic dialects lack large corpora and are noisy, being linguistically disparate with no standardized spelling. We make three contributions to address this noise. First, we describe simple but effective adaptations to word embedding tools to maximize the informative content leveraged in each training sentence. Second, we analyze methods for representing disparate dialects in one embedding space, either by mapping individual dialects into a shared space or learning a joint model of all dialects. Finally, we evaluate via dictionary induction, showing that two metrics not typically reported in the task enable us to analyze our contributions’ effects on low and high frequency words. In addition to boosting performance between 2-53%, we specifically improve on noisy, low frequency forms without compromising accuracy on high frequency forms.

2017

Proceedings of the Third Arabic Natural Language Processing Workshop
Nizar Habash | Mona Diab | Kareem Darwish | Wassim El-Hajj | Hend Al-Khalifa | Houda Bouamor | Nadi Tomeh | Mahmoud El-Haj | Wajdi Zaghouani
Proceedings of the Third Arabic Natural Language Processing Workshop

A Characterization Study of Arabic Twitter Data with a Benchmarking for State-of-the-Art Opinion Mining Models
Ramy Baly | Gilbert Badaro | Georges El-Khoury | Rawan Moukalled | Rita Aoun | Hazem Hajj | Wassim El-Hajj | Nizar Habash | Khaled Shaban
Proceedings of the Third Arabic Natural Language Processing Workshop

Opinion mining in Arabic is a challenging task given the rich morphology of the language. The task becomes more challenging when it is applied to Twitter data, which contains additional sources of noise, such as the use of unstandardized dialectal variations, the nonconformation to grammatical rules, the use of Arabizi and code-switching, and the use of non-text objects such as images and URLs to express opinion. In this paper, we perform an analytical study to observe how such linguistic phenomena vary across different Arab regions. This study of Arabic Twitter characterization aims at providing better understanding of Arabic Tweets, and fostering advanced research on the topic. Furthermore, we explore the performance of the two schools of machine learning on Arabic Twitter, namely the feature engineering approach and the deep learning approach. We consider models that have achieved state-of-the-art performance for opinion mining in English. Results highlight the advantages of using deep learning-based models, and confirm the importance of using morphological abstractions to address Arabic’s complex morphology.

OMAM at SemEval-2017 Task 4: Evaluation of English State-of-the-Art Sentiment Analysis Models for Arabic and a New Topic-based Model
Ramy Baly | Gilbert Badaro | Ali Hamdi | Rawan Moukalled | Rita Aoun | Georges El-Khoury | Ahmad Al Sallab | Hazem Hajj | Nizar Habash | Khaled Shaban | Wassim El-Hajj
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

While sentiment analysis in English has achieved significant progress, it remains a challenging task in Arabic given the rich morphology of the language. It becomes more challenging when applied to Twitter data that comes with additional sources of noise including dialects, misspellings, grammatical mistakes, code switching and the use of non-textual objects to express sentiments. This paper describes the “OMAM” systems that we developed as part of SemEval-2017 task 4. We evaluate English state-of-the-art methods on Arabic tweets for subtask A. As for the remaining subtasks, we introduce a topic-based approach that accounts for topic specificities by predicting topics or domains of upcoming tweets, and then using this information to predict their sentiment. Results indicate that applying the English state-of-the-art method to Arabic has achieved solid results without significant enhancements. Furthermore, the topic-based method ranked 1st in subtasks C and E, and 2nd in subtask D.

OMAM at SemEval-2017 Task 4: English Sentiment Analysis with Conditional Random Fields
Chukwuyem Onyibe | Nizar Habash
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We describe a supervised system that uses optimized Conditional Random Fields and lexical features to predict the sentiment of a tweet. The system was submitted to the English version of all subtasks in SemEval-2017 Task 4.

Robust Dictionary Lookup in Multiple Noisy Orthographies
Lingliang Zhang | Nizar Habash | Godfried Toussaint
Proceedings of the Third Arabic Natural Language Processing Workshop

We present the MultiScript Phonetic Search algorithm to address the problem of language learners looking up unfamiliar words that they heard. We apply it to Arabic dictionary lookup with noisy queries done using both the Arabic and Roman scripts. Our algorithm is based on a computational phonetic distance metric that can be optionally machine learned. To benchmark our performance, we created the ArabScribe dataset, containing 10,000 noisy transcriptions of random Arabic dictionary words. Our algorithm outperforms Google Translate’s “did you mean” feature, as well as the Yamli smart Arabic keyboard.

Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic
Alexander Erdmann | Nizar Habash | Dima Taji | Houda Bouamor
Proceedings of Machine Translation Summit XVI: Research Track

NLP for Arabic and Related Languages
Mona Diab | Nizar Habash | Imed Zitouni
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]

Universal Dependencies for Arabic
Dima Taji | Nizar Habash | Daniel Zeman
Proceedings of the Third Arabic Natural Language Processing Workshop

We describe the process of creating NUDAR, a Universal Dependency treebank for Arabic. We present the conversion from the Penn Arabic Treebank to the Universal Dependency syntactic representation through an intermediate dependency representation. We discuss the challenges faced in the conversion of the trees, the decisions we made to solve them, and the validation of our conversion. We also present initial parsing results on NUDAR.

Don’t Throw Those Morphological Analyzers Away Just Yet: Neural Morphological Disambiguation for Arabic
Nasser Zalmout | Nizar Habash
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

This paper presents a model for Arabic morphological disambiguation based on Recurrent Neural Networks (RNN). We train Long Short-Term Memory (LSTM) cells in several configurations and embedding levels to model the various morphological features. Our experiments show that these models outperform state-of-the-art systems without explicit use of feature engineering. However, adding learning features from a morphological analyzer to model the space of possible analyses provides additional improvement. We make use of the resulting morphological models for scoring and ranking the analyses of the morphological analyzer for morphological disambiguation. The results show significant gains in accuracy across several evaluation metrics. Our system results in 4.4% absolute increase over the state-of-the-art in full morphological analysis accuracy (30.6% relative error reduction), and 10.6% (31.5% relative error reduction) for out-of-vocabulary words.

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]
Mona Diab | Nizar Habash | Imed Zitouni
Traitement Automatique des Langues, Volume 58, Numéro 3 : Traitement automatique de l'arabe et des langues apparentées [NLP for Arabic and Related Languages]

A Parallel Corpus for Evaluating Machine Translation between Arabic and European Languages
Nizar Habash | Nasser Zalmout | Dima Taji | Hieu Hoang | Maverick Alzate
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present Arab-Acquis, a large publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words. The corpus follows previous data splits in the literature for tuning, development, and testing. We describe the corpus and how it was created. We also present the first benchmarking results on translating to and from Arabic for 22 European languages.

A Morphological Analyzer for Gulf Arabic Verbs
Salam Khalifa | Sara Hassan | Nizar Habash
Proceedings of the Third Arabic Natural Language Processing Workshop

We present CALIMA-GLF, a Gulf Arabic morphological analyzer currently covering over 2,600 verbal lemmas. We describe in detail the process of building the analyzer starting from phonetic dictionary entries to fully inflected orthographic paradigms and associated lexicon and orthographic variants. We evaluate the coverage of CALIMA-GLF against Modern Standard Arabic and Egyptian Arabic analyzers on part of a Gulf Arabic novel. CALIMA-GLF verb analysis token recall for identifying correct POS tag outperforms both the Modern Standard Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute, respectively.

2016

The Columbia University - New York University Abu Dhabi SIGMORPHON 2016 Morphological Reinflection Shared Task Submission
Dima Taji | Ramy Eskander | Nizar Habash | Owen Rambow
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

SPLIT: Smart Preprocessing (Quasi) Language Independent Tool
Mohamed Al-Badrashiny | Arfath Pasha | Mona Diab | Nizar Habash | Owen Rambow | Wael Salloum | Ramy Eskander
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover replicability and comparability, as much as feasible, is one of the goals of our scientific enterprise, thus building systems that can ensure the consistency in our various pipelines would contribute significantly to our goals. The problem has become quite pronounced with the abundance of NLP tools becoming more and more available yet with different levels of specifications. In this paper, we present a dynamic unified preprocessing framework and tool, SPLIT, that is highly configurable based on user requirements which serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by allowing for a unified API that could be exchanged across different researchers to ensure complete transparency in replication. The user is able to select the required preprocessing tasks among a long list of preprocessing steps. The user is also able to specify the order of execution which in turn affects the final preprocessing output.

DALILA: The Dialectal Arabic Linguistic Learning Assistant
Salam Khalifa | Houda Bouamor | Nizar Habash
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Dialectal Arabic (DA) poses serious challenges for Natural Language Processing (NLP). The number and sophistication of tools and datasets in DA are very limited in comparison to Modern Standard Arabic (MSA) and other languages. MSA tools do not effectively model DA which makes the direct use of MSA NLP tools for handling dialects impractical. This is particularly a challenge for the creation of tools to support learning Arabic as a living language on the web, where authentic material can be found in both MSA and DA. In this paper, we present the Dialectal Arabic Linguistic Learning Assistant (DALILA), a Chrome extension that utilizes cutting-edge Arabic dialect NLP research to assist learners and non-native speakers in understanding text written in either MSA or DA. DALILA provides dialectal word analysis and English gloss corresponding to each word.

Building an Arabic Machine Translation Post-Edited Corpus: Guidelines and Annotation
Wajdi Zaghouani | Nizar Habash | Ossama Obeid | Behrang Mohit | Houda Bouamor | Kemal Oflazer
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present our guidelines and annotation procedure to create a human-corrected, post-edited machine translation corpus for Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can be used to help accelerate the human revision process of translated texts. The creation of any manually annotated corpus usually presents many challenges. In order to address these challenges, we created comprehensive and simplified annotation guidelines which were used by a team of five annotators and one lead annotator. In order to ensure a high annotation agreement between the annotators, multiple training sessions were held and regular inter-annotator agreement measures were performed to check the annotation quality. The created corpus of manual post-edited translations of English to Arabic articles is the largest to date for this language pair.

Applying the Cognitive Machine Translation Evaluation Approach to Arabic
Irina Temnikova | Wajdi Zaghouani | Stephan Vogel | Nizar Habash
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

The goal of the cognitive machine translation (MT) evaluation approach is to build classifiers which assign post-editing effort scores to new texts. The approach helps estimate fair compensation for post-editors in the translation industry by evaluating the cognitive difficulty of post-editing MT output. The approach counts the number of errors classified in different categories on the basis of how much cognitive effort they require in order to be corrected. In this paper, we present the results of applying an existing cognitive evaluation approach to Modern Standard Arabic (MSA). We provide a comparison of the number of errors and categories of errors in three MSA texts of different MT quality (without any language-specific adaptation), as well as a comparison between MSA texts and texts from three Indo-European languages (Russian, Spanish, and Bulgarian), taken from a previous experiment. The results show how the error distributions change passing from the MSA texts of worse MT quality to MSA texts of better MT quality, as well as a similarity in distinguishing the texts of better MT quality for all four languages.

Analysis of Foreign Language Teaching Methods: An Automatic Readability Approach
Nasser Zalmout | Hind Saddiki | Nizar Habash
Proceedings of the 3rd Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA2016)

Much research in education has been done on the study of different language teaching methods. However, there has been little investigation using computational analysis to compare such methods in terms of readability or complexity progression. In this paper, we make use of existing readability scoring techniques and our own classifiers to analyze the textbooks used in two very different teaching methods for English as a Second Language – the grammar-based and the communicative methods. Our analysis indicates that the grammar-based curriculum shows a more coherent readability progression compared to the communicative curriculum. This finding corroborates the expectations about the differences between these two methods and validates our approach’s value in comparing different teaching methods quantitatively.

A Large Scale Corpus of Gulf Arabic
Salam Khalifa | Nizar Habash | Dana Abdulrahim | Sara Hassan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.

Arabic Corpora for Credibility Analysis
Ayman Al Zaatari | Rim El Ballouli | Shady ELbassouni | Wassim El-Hajj | Hazem Hajj | Khaled Shaban | Nizar Habash | Emad Yahya
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

A significant portion of data generated on blogging and microblogging websites is non-credible as shown in many recent studies. To filter out such non-credible information, machine learning can be deployed to build automatic credibility classifiers. However, as is the case with most supervised machine learning approaches, sufficiently large and accurate training data must be available. In this paper, we focus on building a public Arabic corpus of blogs and microblogs that can be used for credibility classification. We focus on Arabic due to the recent popularity of blogs and microblogs in the Arab World and due to the lack of any such public corpora in Arabic. We discuss our data acquisition approach and annotation process, provide rigid analysis on the annotated data and finally report some results on the effectiveness of our data for credibility classification.

Creating Resources for Dialectal Arabic from a Single Annotation: A Case Study on Egyptian and Levantine
Ramy Eskander | Nizar Habash | Owen Rambow | Arfath Pasha
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Arabic dialects present a special problem for natural language processing because there are few resources, they have no standard orthography, and have not been studied much. However, as more and more written dialectal Arabic is found in social media, NLP for Arabic dialects becomes an important goal. We present a methodology for creating a morphological analyzer and a morphological tagger for dialectal Arabic, and we illustrate it on Egyptian and Levantine Arabic. To our knowledge, these are the first analyzer and tagger for Levantine.

Morphologically Annotated Corpora and Morphological Analyzers for Moroccan and Sanaani Yemeni Arabic
Faisal Al-Shargi | Aidan Kaplan | Ramy Eskander | Nizar Habash | Owen Rambow
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present new language resources for Moroccan and Sanaani Yemeni Arabic. The resources include corpora for each dialect which have been morphologically annotated, and morphological analyzers for each dialect which are derived from these corpora. These are the first sets of resources for Moroccan and Yemeni Arabic. The resources will be made available to the public.

YAMAMA: Yet Another Multi-Dialect Arabic Morphological Analyzer
Salam Khalifa | Nasser Zalmout | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper, we present YAMAMA, a multi-dialect Arabic morphological analyzer and disambiguator. Our system is almost five times faster than the state-of-the-art MADAMIRA system with a slightly lower quality. In addition to speed, YAMAMA outputs a rich representation which allows for a wider spectrum of use. In this regard, YAMAMA transcends other systems, such as FARASA, which is faster but provides specific outputs catering to specific applications.

Botta: An Arabic Dialect Chatbot
Dana Abu Ali | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

This paper presents BOTTA, the first Arabic dialect chatbot. We explore the challenges of creating a conversational agent that aims to simulate friendly conversations using the Egyptian Arabic dialect. We present a number of solutions and describe the different components of the BOTTA chatbot. The BOTTA database files are publicly available for researchers working on Arabic chatbot technologies. The BOTTA chatbot is also publicly available for any users who want to chat with it online.

CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation
Anas Shahrour | Salam Khalifa | Dima Taji | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features. CamelParser uses a state-of-the-art morphological disambiguator and improves its results using syntactically driven features. The system offers a number of output formats that include basic dependency with morphological features, two tree visualization modes, and traditional Arabic grammatical analysis.

Exploiting Arabic Diacritization for High Quality Automatic Annotation
Nizar Habash | Anas Shahrour | Muhamed Al-Khalil
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.

Machine Translation Evaluation for Arabic using Morphologically-enriched Embeddings
Francisco Guzmán | Houda Bouamor | Ramy Baly | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Evaluation of machine translation (MT) into morphologically rich languages (MRL) has not been well studied despite posing many challenges. In this paper, we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve MT evaluation into an MRL. Specifically we report on Arabic, a language with complex and rich morphology. Our results show that using a neural-network model with different input representations produces results that clearly outperform the state-of-the-art for MT evaluation into Arabic, by almost over 75% increase in correlation with human judgments on pairwise MT evaluation quality task. More importantly, we demonstrate the usefulness of morpho-syntactic representations to model sentence similarity for MT evaluation and address complex linguistic phenomena of Arabic.

2015

A Conventional Orthography for Algerian Arabic
Houda Saadane | Nizar Habash
Proceedings of the Second Workshop on Arabic Natural Language Processing

Annotating Targets of Opinions in Arabic using Crowdsourcing
Noura Farra | Kathy McKeown | Nizar Habash
Proceedings of the Second Workshop on Arabic Natural Language Processing

Predicting the Structure of Cooking Recipes
Jermsak Jermsurawong | Nizar Habash
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

POS-tagging of Tunisian Dialect Using Standard Arabic Resources and Tools
Ahmed Hamdi | Alexis Nasr | Nizar Habash | Núria Gala
Proceedings of the Second Workshop on Arabic Natural Language Processing

Morphological constraints for phrase pivot statistical machine translation
Ahmed El Kholy | Nizar Habash
Proceedings of Machine Translation Summit XV: Papers

The Second QALB Shared Task on Automatic Text Correction for Arabic
Alla Rozovskaya | Houda Bouamor | Nizar Habash | Wajdi Zaghouani | Ossama Obeid | Behrang Mohit
Proceedings of the Second Workshop on Arabic Natural Language Processing

Correction Annotation for Non-Native Arabic Texts: Guidelines and Corpus
Wajdi Zaghouani | Nizar Habash | Houda Bouamor | Alla Rozovskaya | Behrang Mohit | Abeer Heider | Kemal Oflazer
Proceedings of the 9th Linguistic Annotation Workshop

Proceedings of the Second Workshop on Arabic Natural Language Processing
Nizar Habash | Stephan Vogel | Kareem Darwish
Proceedings of the Second Workshop on Arabic Natural Language Processing

Improving Arabic Diacritization through Syntactic Analysis
Anas Shahrour | Salam Khalifa | Nizar Habash
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

2014

Domain and Dialect Adaptation for Machine Translation into Egyptian Arabic
Serena Jeblee | Weston Feely | Houda Bouamor | Alon Lavie | Nizar Habash | Kemal Oflazer
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

A Corpus and Phonetic Dictionary for Tunisian Arabic Speech Recognition
Abir Masmoudi | Mariem Ellouze Khmekhem | Yannick Estève | Lamia Hadrich Belguith | Nizar Habash
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. The method proposed in this paper, to automatically generate a phonetic dictionary, is rule based. For that reason, we define a set of pronunciation rules and a lexicon of exceptions. To determine the performance of our phonetic rules, we chose to evaluate our pronunciation dictionary on two types of corpora. The word error rate of word grapheme-to-phoneme mapping is around 9%.

Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)
Nizar Habash | Stephan Vogel
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development
Mohamed Maamouri | Ann Bies | Seth Kulick | Michael Ciul | Nizar Habash | Ramy Eskander
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper describes the parallel development of an Egyptian Arabic Treebank and a morphological analyzer for Egyptian Arabic (CALIMA). By the very nature of Egyptian Arabic, the data collected is informal, for example Discussion Forum text, which we use for the treebank discussed here. In addition, Egyptian Arabic, like other Arabic dialects, is sufficiently different from Modern Standard Arabic (MSA) that tools and techniques developed for MSA cannot be simply transferred over to work on Egyptian Arabic. In particular, a morphological analyzer for Egyptian Arabic is needed to mediate between the written text and the segmented, vocalized form used for the syntactic trees. This led to the necessity of a feedback loop between the treebank team and the analyzer team, as improvements in each area were fed to the other. Therefore, by necessity, there needed to be close cooperation between the annotation team and the tool development team, which was to their mutual benefit. Collaboration on this type of challenge, where tools and resources are limited, proved to be remarkably synergistic and opens the way to further fruitful work on Arabic dialects.

A Pipeline Approach to Supervised Error Correction for the QALB-2014 Shared Task
Nadi Tomeh | Nizar Habash | Ramy Eskander | Joseph Le Roux
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

The Illinois-Columbia System in the CoNLL-2014 Shared Task
Alla Rozovskaya | Kai-Wei Chang | Mark Sammons | Dan Roth | Nizar Habash
Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task

Automatic Transliteration of Romanized Dialectal Arabic
Mohamed Al-Badrashiny | Ramy Eskander | Nizar Habash | Owen Rambow
Proceedings of the Eighteenth Conference on Computational Natural Language Learning

A Conventional Orthography for Tunisian Arabic
Inès Zribi | Rahma Boujelbane | Abir Masmoudi | Mariem Ellouze | Lamia Belguith | Nizar Habash
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum dominated by mixed forms. In this paper, we present a conventional orthography for Tunisian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA) demonstrated for Egyptian Arabic. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Tunisian Arabic.

Transliteration of Arabizi into Arabic Orthography: Developing a Parallel Annotated Arabizi-Arabic Script SMS/Chat Corpus
Ann Bies | Zhiyi Song | Mohamed Maamouri | Stephen Grimes | Haejoong Lee | Jonathan Wright | Stephanie Strassel | Nizar Habash | Ramy Eskander | Owen Rambow
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Foreign Words and the Automatic Processing of Arabic Social Media Text Written in Roman Script
Ramy Eskander | Mohamed Al-Badrashiny | Nizar Habash | Owen Rambow
Proceedings of the First Workshop on Computational Approaches to Code Switching

A Large Scale Arabic Sentiment Lexicon for Arabic Opinion Mining
Gilbert Badaro | Ramy Baly | Hazem Hajj | Nizar Habash | Wassim El-Hajj
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Building a Corpus for Palestinian Arabic: a Preliminary Study
Mustafa Jarrar | Nizar Habash | Diyam Akra | Nasser Zalmout
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Sentence Level Dialect Identification for Machine Translation System Selection
Wael Salloum | Heba Elfardy | Linda Alamir-Salloum | Nizar Habash | Mona Diab
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Natural Language Processing of Arabic and its Dialects
Mona Diab | Nizar Habash
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

This tutorial introduces the different challenges and current solutions to the automatic processing of Arabic and its dialects. The tutorial has two parts: First, we present a discussion of generic issues relevant to Arabic NLP and detail dialectal linguistic issues and the challenges they pose for NLP. In the second part, we review the state-of-the-art in Arabic processing covering several enabling technologies and applications, e.g., dialect identification, morphological processing (analysis, disambiguation, tokenization, POS tagging), parsing, and machine translation.

A Multidialectal Parallel Corpus of Arabic
Houda Bouamor | Nizar Habash | Kemal Oflazer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic speaking communities. These dialects vary widely from region to region and to a lesser extent from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However, they are the primary vehicles of communication (face-to-face and recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.

Large Scale Arabic Error Annotation: Guidelines and Framework
Wajdi Zaghouani | Behrang Mohit | Nizar Habash | Ossama Obeid | Nadi Tomeh | Alla Rozovskaya | Noura Farra | Sarah Alkuhlani | Kemal Oflazer
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types. Such a corpus will be invaluable for developing Arabic error correction tools, both for training models and as a gold standard for evaluating error correction algorithms. We summarize the guidelines we created. We also describe issues encountered during the training of the annotators, as well as problems that are specific to the Arabic language that arose during the annotation process. Finally, we present the annotation tool that was developed as part of this project, the annotation pipeline, and the quality of the resulting annotations.

Unsupervised Morphology-Based Vocabulary Expansion
Mohammad Sadegh Rasooli | Thomas Lippincott | Nizar Habash | Owen Rambow
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon
Mona Diab | Mohamed Al-Badrashiny | Maryam Aminian | Mohammed Attia | Heba Elfardy | Nizar Habash | Abdelati Hawwari | Wael Salloum | Pradeep Dasigi | Ramy Eskander
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwa's creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from both monolingual and bilingual existing resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both Theoretical Linguistics as well as Computational Linguistics research.

Generalized Character-Level Spelling Error Correction
Noura Farra | Nadi Tomeh | Alla Rozovskaya | Nizar Habash
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

INVITED TALK 1: Computational Processing of Arabic Dialects
Nizar Habash
Proceedings of the EMNLP’2014 Workshop on Language Technology for Closely Related Languages and Language Variants

Alignment symmetrisation optimization targeting phrase pivot statistical machine translation
Ahmed El Kholy | Nizar Habash
Proceedings of the 17th Annual Conference of the European Association for Machine Translation

The Columbia System in the QALB-2014 Shared Task on Arabic Error Correction
Alla Rozovskaya | Nizar Habash | Ramy Eskander | Noura Farra | Wael Salloum
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

MADAMIRA: A Fast, Comprehensive Tool for Morphological Analysis and Disambiguation of Arabic
Arfath Pasha | Mohamed Al-Badrashiny | Mona Diab | Ahmed El Kholy | Ramy Eskander | Nizar Habash | Manoj Pooleery | Owen Rambow | Ryan Roth
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, and extensible, and that is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.

The First QALB Shared Task on Automatic Text Correction for Arabic
Behrang Mohit | Alla Rozovskaya | Nizar Habash | Wajdi Zaghouani | Ossama Obeid
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

2013

Automatic Extraction of Morphological Lexicons from Morphologically Annotated Corpora
Ramy Eskander | Nizar Habash | Owen Rambow
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

The Effects of Factorizing Root and Pattern Mapping in Bidirectional Tunisian - Standard Arabic Machine Translation
Ahmed Hamdi | Rahma Boujelbane | Nizar Habash | Alexis Nasr
Proceedings of Machine Translation Summit XIV: Papers

Automatic Correction and Extension of Morphological Annotations
Ramy Eskander | Nizar Habash | Ann Bies | Seth Kulick | Mohamed Maamouri
Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse

Reranking with Linguistic and Semantic Features for Arabic Optical Character Recognition
Nadi Tomeh | Nizar Habash | Ryan Roth | Noura Farra | Pradeep Dasigi | Mona Diab
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Morphological Analysis and Disambiguation for Dialectal Arabic
Nizar Habash | Ryan Roth | Owen Rambow | Ramy Eskander | Nadi Tomeh
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Language Independent Connectivity Strength Features for Phrase Pivot Statistical Machine Translation
Ahmed El Kholy | Nizar Habash | Gregor Leusch | Evgeny Matusov | Hassan Sawaf
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

SPMRL'13 Shared Task System: The CADIM Arabic Dependency Parser
Yuval Marton | Nizar Habash | Owen Rambow | Sarah Alkuhlani
Proceedings of the Fourth Workshop on Statistical Parsing of Morphologically-Rich Languages

Selective Combination of Pivot and Direct Statistical Machine Translation Models
Ahmed El Kholy | Nizar Habash | Gregor Leusch | Evgeny Matusov | Hassan Sawaf
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Processing Spontaneous Orthography
Ramy Eskander | Nizar Habash | Owen Rambow | Nadi Tomeh
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Dialectal Arabic to English Machine Translation: Pivoting through Modern Standard Arabic
Wael Salloum | Nizar Habash
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Orthographic and Morphological Processing for Persian-to-English Statistical Machine Translation
Mohammad Sadegh Rasooli | Ahmed El Kholy | Nizar Habash
Proceedings of the Sixth International Joint Conference on Natural Language Processing

Dependency Parsing of Modern Standard Arabic with Lexical and Inflectional Features
Yuval Marton | Nizar Habash | Owen Rambow
Computational Linguistics, Volume 39, Issue 1 - March 2013

A Web-based Annotation Framework For Large-Scale Text Correction
Ossama Obeid | Wajdi Zaghouani | Behrang Mohit | Nizar Habash | Kemal Oflazer | Nadi Tomeh
The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations

Automatic Morphological Enrichment of a Morphologically Underspecified Treebank
Sarah Alkuhlani | Nizar Habash | Ryan Roth
Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

DIRA: Dialectal Arabic Information Retrieval Assistant
Arfath Pasha | Mohammad Al-Badrashiny | Mohamed Altantawy | Nizar Habash | Manoj Pooleery | Owen Rambow | Ryan M. Roth | Mona Diab
The Companion Volume of the Proceedings of IJCNLP 2013: System Demonstrations

Translating verbs between MSA and Arabic dialects through deep morphological analysis (Un système de traduction de verbes entre arabe standard et arabe dialectal par analyse morphologique profonde) [in French]
Ahmed Hamdi | Rahma Boujelbane | Nizar Habash | Alexis Nasr
Proceedings of TALN 2013 (Volume 1: Long Papers)

2012

Hebrew Morphological Preprocessing for Statistical Machine Translation
Nimesh Singh | Nizar Habash
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

A Morphological Analyzer for Egyptian Arabic
Nizar Habash | Ramy Eskander | Abdelati Hawwari
Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology

Translate, Predict or Generate: Modeling Rich Morphology in Statistical Machine Translation
Ahmed El Kholy | Nizar Habash
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

Conventional Orthography for Dialectal Arabic
Nizar Habash | Mona Diab | Owen Rambow
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Dialectal Arabic (DA) refers to the day-to-day vernaculars spoken in the Arab world. DA lives side-by-side with the official language, Modern Standard Arabic (MSA). DA differs from MSA on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. Unlike MSA, DA has no standard orthography since there are no Arabic dialect academies, nor is there a large edited body of dialectal literature that follows the same spelling standard. In this paper, we present CODA, a conventional orthography for dialectal Arabic; it is designed primarily for the purpose of developing computational models of Arabic dialects. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Egyptian Arabic.

Can Automatic Post-Editing Make MT More Meaningful
Kristen Parton | Nizar Habash | Kathleen McKeown | Gonzalo Iglesias | Adrià de Gispert
Proceedings of the 16th Annual Conference of the European Association for Machine Translation

Arabic Dialect Processing Tutorial
Mona Diab | Nizar Habash
Tutorial Abstracts at the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

MT and Arabic Language Issues
Nizar Habash
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Tutorials

Arabic poses many interesting challenges to machine translation: ambiguous orthography, rich morphology, complex morpho-syntactic behavior, and numerous dialects. In this tutorial, we introduce the most important themes of challenges and solutions for people working on translation from/to Arabic or any of its dialects. The tutorial is intended for researchers and developers working on MT. The discussion of linguistic issues and how they are addressed in MT will help linguists and professional translators understand the issues machine translation faces when dealing with Arabic and other morphologically rich languages. The tutorial does not expect the attendees to be able to speak/read/write Arabic.

Lost & Found in Translation: Impact of Machine Translated Results on Translingual Information Retrieval
Kristen Parton | Nizar Habash | Kathleen McKeown
Proceedings of the 10th Conference of the Association for Machine Translation in the Americas: Research Papers

In an ideal cross-lingual information retrieval (CLIR) system, a user query would generate a search over documents in a different language and the relevant results would be presented in the user’s language. In practice, CLIR systems are typically evaluated by judging result relevance in the document language, to factor out the effects of translating the results using machine translation (MT). In this paper, we investigate the influence of four different approaches for integrating MT and CLIR on both retrieval accuracy and user judgment of relevancy. We create a corpus with relevance judgments for both human and machine translated results, and use it to quantify the effect that MT quality has on end-to-end relevance. We find that MT errors result in a 16-39% decrease in mean average precision over the ground truth system that uses human translations. MT errors also caused relevant sentences to appear irrelevant – 5-19% of sentences were relevant in human translation, but were judged irrelevant in MT. To counter this degradation, we present two hybrid retrieval models and two automatic MT post-editing techniques and show that these approaches substantially mitigate the errors and improve the end-to-end relevance.

Identifying Broken Plurals, Irregular Gender, and Rationality in Arabic Text
Sarah Alkuhlani | Nizar Habash
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

Rich Morphology Generation Using Statistical Machine Translation
Ahmed El Kholy | Nizar Habash
INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference

Elissa: A Dialectal to Standard Arabic Machine Translation System
Wael Salloum | Nizar Habash
Proceedings of COLING 2012: Demonstration Papers

2011

Using Deep Morphology to Improve Automatic Error Detection in Arabic Handwriting Recognition
Nizar Habash | Ryan Roth
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

One-Step Statistical Parsing of Hybrid Dependency-Constituency Syntactic Representations
Kais Dukes | Nizar Habash
Proceedings of the 12th International Conference on Parsing Technologies

Automatic Error Analysis for Morphologically Rich Languages
Ahmed El Kholy | Nizar Habash
Proceedings of Machine Translation Summit XIII: Papers

Fuzzy Syntactic Reordering for Phrase-based Statistical Machine Translation
Jacob Andreas | Nizar Habash | Owen Rambow
Proceedings of the Sixth Workshop on Statistical Machine Translation

Dialectal to Standard Arabic Paraphrasing to Improve Arabic-English Statistical Machine Translation
Wael Salloum | Nizar Habash
Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties

Filtering Antonymous, Trend-Contrasting, and Polarity-Dissimilar Distributional Paraphrases for Improving Statistical Machine Translation
Yuval Marton | Ahmed El Kholy | Nizar Habash
Proceedings of the Sixth Workshop on Statistical Machine Translation

Improving Arabic Dependency Parsing with Form-based and Functional Morphological Features
Yuval Marton | Nizar Habash | Owen Rambow
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

A Corpus for Modeling Morpho-Syntactic Agreement in Arabic: Gender, Number and Rationality
Sarah Alkuhlani | Nizar Habash
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Fast Yet Rich Morphological Analysis
Mohamed Altantawy | Nizar Habash | Owen Rambow
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing

2010

Reordering Matrix Post-verbal Subjects for Arabic-to-English SMT
Marine Carpuat | Yuval Marton | Nizar Habash
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

We improve our recently proposed technique for integrating Arabic verb-subject constructions in SMT word alignment (Carpuat et al., 2010) by distinguishing between matrix (or main clause) and non-matrix Arabic verb-subject constructions. In gold translations, most matrix VS (main clause verb-subject) constructions are translated in inverted SV order, while non-matrix (subordinate clause) VS constructions are inverted in only half the cases. In addition, while detecting verbs and their subjects is a hard task, our syntactic parser detects VS constructions better in matrix than in non-matrix clauses. As a result, reordering only matrix VS for word alignment consistently improves translation quality over a phrase-based SMT baseline, and over reordering all VS constructions, in both medium- and large-scale settings. In fact, the improvements obtained by reordering matrix VS on the medium-scale setting remarkably represent 44% of the gain in BLEU and 51% of the gain in TER obtained with a word alignment training bitext that is 5 times larger.

Improving Arabic Dependency Parsing with Lexical and Inflectional Morphological Features
Yuval Marton | Nizar Habash | Owen Rambow
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

Improving Arabic-to-English Statistical Machine Translation by Reordering Post-Verbal Subjects for Alignment
Marine Carpuat | Yuval Marton | Nizar Habash
Proceedings of the ACL 2010 Conference Short Papers

Machine Translation between Hebrew and Arabic: Needs, Challenges and Preliminary Solutions
Reshef Shilon | Nizar Habash | Alon Lavie | Shuly Wintner
Proceedings of the 9th Conference of the Association for Machine Translation in the Americas: Student Research Workshop

Hebrew and Arabic are related but mutually incomprehensible languages with complex morphology and scarce parallel corpora. Machine translation between the two languages is therefore interesting and challenging. We discuss similarities and differences between Hebrew and Arabic, the benefits and challenges that they induce, respectively, and their implications for machine translation. We highlight the shortcomings of using English as a pivot language and advocate a direct, transfer-based and linguistically-informed (but still statistical, and hence scalable) approach. We report preliminary results of such a system that we are currently developing.

Morphological Annotation of Quranic Arabic
Kais Dukes | Nizar Habash
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Quranic Arabic Corpus (http://corpus.quran.com) is an annotated linguistic resource with multiple layers of annotation including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar. The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400 year old central religious text of Islam. This paper describes a new approach to morphological annotation of Quranic Arabic, a genre difficult to compare with other forms of Arabic. Processing Quranic Arabic is a unique challenge from a computational point of view, since the vocabulary and spelling differ from Modern Standard Arabic. The Quranic Arabic Corpus differs from other Arabic computational resources in adopting a tagset that closely follows traditional Arabic grammar. We made this decision in order to leverage a large body of existing historical grammatical analysis, and to encourage online collaborative annotation. In this paper, we discuss how the unique challenge of morphological annotation of Quranic Arabic is solved using a multi-stage approach. The different stages include automatic morphological tagging using diacritic edit-distance, two-pass manual verification, and online collaborative annotation. This process is evaluated to validate the appropriateness of the chosen methodology.

Orthographic and Morphological Processing for English-Arabic Statistical Machine Translation
Ahmed El Kholy | Nizar Habash
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles longs

Much of the work on Statistical Machine Translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this paper, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We only evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text then jointly enriching and detokenizing the output outperforms training on enriched text.

Morphological Analysis and Generation of Arabic Nouns: A Morphemic Functional Approach
Mohamed Altantawy | Nizar Habash | Owen Rambow | Ibrahim Saleh
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

MAGEAD is a morphological analyzer and generator for Modern Standard Arabic (MSA) and its dialects. We introduced MAGEAD in previous work with an implementation of MSA and Levantine Arabic verbs. In this paper, we port that system to MSA nominals (nouns and adjectives), which are far more complex to model than verbs. Our system is a functional morphological analyzer and generator, i.e., it analyzes to and generates from a representation consisting of a lexeme and linguistic feature-value pairs, where the features are syntactically (and perhaps semantically) meaningful, rather than just morphologically. A detailed evaluation of the current implementation comparing it to a commonly used morphological analyzer shows that it has good morphological coverage with precision and recall scores in the 90s. An error analysis reveals that the majority of recall and precision errors are problems in the gold standard or a result of the discrepancy between different models of form-based/functional morphology.

2009

Improving the Arabic Pronunciation Dictionary for Phone and Word Recognition with Linguistically-Based Pronunciation Rules
Fadi Biadsy | Nizar Habash | Julia Hirschberg
Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Spoken Arabic Dialect Identification Using Phonotactic Modeling
Fadi Biadsy | Julia Hirschberg | Nizar Habash
Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages

Automatic Extraction of Lemma-based Bilingual Dictionaries for Morphologically Rich Languages
Ibrahim M. Saleh | Nizar Habash
Proceedings of the Third Workshop on Computational Approaches to Arabic-Script-based Languages (CAASL3)

Improving Arabic-Chinese Statistical Machine Translation using English as Pivot Language
Nizar Habash | Jun Hu
Proceedings of the Fourth Workshop on Statistical Machine Translation

CATiB: The Columbia Arabic Treebank
Nizar Habash | Ryan Roth
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers

Syntactic Reordering for English-Arabic Phrase-Based Machine Translation
Jakob Elming | Nizar Habash
Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages

2008

Improving NER in Arabic Using a Morphological Tagger
Benjamin Farber | Dayne Freitag | Nizar Habash | Owen Rambow
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We discuss a named entity recognition system for Arabic, and show how we incorporated the information provided by MADA, a full morphological tagger which uses a morphological analyzer. Surprisingly, the relevant features used are the capitalization of the English gloss chosen by the tagger, and the fact that an analysis is returned (that a word is not OOV to the morphological analyzer). The use of the tagger also improves over a third system which just uses a morphological analyzer, yielding a 14% reduction in error over the baseline. We conduct a thorough error analysis to identify sources of success and failure among the variations, and show that by combining the systems in simple ways we can significantly influence the precision-recall trade-off.

Identification of Naturally Occurring Numerical Expressions in Arabic
Nizar Habash | Ryan Roth
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

In this paper, we define the task of Number Identification in natural context. We present and validate a language-independent semi-automatic approach to quickly building a gold standard for evaluating number identification systems by exploiting hand-aligned parallel data. We also present and extensively evaluate a robust rule-based system for number identification in natural context for Arabic for a variety of number formats and types. The system is shown to have strong performance, achieving, on a blind test, a 94.8% F-score for the task of correctly identifying number expression spans in natural text, and a 92.1% F-score for the task of correctly determining the core numerical value.

Four Techniques for Online Handling of Out-of-Vocabulary Words in Arabic-English Statistical Machine Translation
Nizar Habash
Proceedings of ACL-08: HLT, Short Papers

Arabic Morphological Tagging, Diacritization, and Lemmatization Using Lexeme Models and Feature Ranking
Ryan Roth | Owen Rambow | Nizar Habash | Mona Diab | Cynthia Rudin
Proceedings of ACL-08: HLT, Short Papers

Using Shallow Syntax Information to Improve Word Alignment and Reordering for SMT
Josep M. Crego | Nizar Habash
Proceedings of the Third Workshop on Statistical Machine Translation

Automatic Learning of Morphological Variations for Handling Out-of-Vocabulary Terms in Urdu-English MT
Nizar Habash | Hayden Metsky
Proceedings of the 8th Conference of the Association for Machine Translation in the Americas: Research Papers

We present an approach for online handling of Out-of-Vocabulary (OOV) terms in Urdu-English MT. Since Urdu is morphologically richer than English, we expect a large portion of the OOV terms to be Urdu morphological variations that are irrelevant to English. We describe an approach to automatically learn English-irrelevant (target-irrelevant) Urdu (source) morphological variation rules from standard phrase tables. These rules are learned in an unsupervised (or lightly supervised) manner by exploiting redundancy in Urdu and collocation with English translations. We use these rules to hypothesize in-vocabulary alternatives to the OOV terms. Our results show that we reduce the OOV rate from a standard baseline average of 2.6% to an average of 0.3% (or 89% relative decrease). We also increase the BLEU score by 0.45 (absolute) and 2.8% (relative) on a standard test set. A manual error analysis shows that 28% of handled OOV cases produce acceptable translations in context.

2007

Syntactic preprocessing for statistical machine translation
Nizar Habash
Proceedings of Machine Translation Summit XI: Papers

Combination of Statistical Word Alignments Based on Multiple Preprocessing Schemes
Jakob Elming | Nizar Habash
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

Determining Case in Arabic: Learning Complex Linguistic Behavior Requires Complex Linguistic Features
Nizar Habash | Ryan Gabbard | Owen Rambow | Seth Kulick | Mitch Marcus
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

NLG is still relevant to MT
Nizar Habash
Proceedings of the Workshop on Using corpora for natural language generation

Arabic diacritization in the context of statistical machine translation
Mona Diab | Mahmoud Ghoneim | Nizar Habash
Proceedings of Machine Translation Summit XI: Papers

Arabic Dialect Processing Tutorial
Mona Diab | Nizar Habash
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Tutorial Abstracts

Semi-automatic error analysis for large-scale statistical machine translation
Katrin Kirchhoff | Owen Rambow | Nizar Habash | Mona Diab
Proceedings of Machine Translation Summit XI: Papers

Arabic Diacritization through Full Morphological Tagging
Nizar Habash | Owen Rambow
Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Companion Volume, Short Papers

2006

Arabic Dialect Processing
Mona Diab | Nizar Habash
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Tutorials

Arabic Preprocessing Schemes for Statistical Machine Translation
Nizar Habash | Fatiha Sadat
Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers

Challenges in Building an Arabic-English GHMT System with SMT Components
Nizar Habash | Bonnie Dorr | Christof Monz
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers

The research context of this paper is developing hybrid machine translation (MT) systems that exploit the advantages of linguistic rule-based and statistical MT systems. Arabic, as a morphologically rich language, is especially challenging even without addressing the hybridization question. In this paper, we describe the challenges in building an Arabic-English generation-heavy machine translation (GHMT) system and boosting it with statistical machine translation (SMT) components. We present an extensive evaluation of multiple system variants and report positive results on the advantages of hybridization.

Developing and Using a Pilot Dialectal Arabic Treebank
Mohamed Maamouri | Ann Bies | Tim Buckwalter | Mona Diab | Nizar Habash | Owen Rambow | Dalila Tabessi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our pre-existing MSA resources and the new dialectal corpus.

Presentation
Nizar Habash
Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Panel on hybrid machine translation: why and how?

Parsing Arabic Dialects
David Chiang | Mona Diab | Nizar Habash | Owen Rambow | Safiullah Shareef
11th Conference of the European Chapter of the Association for Computational Linguistics

Combination of Arabic Preprocessing Schemes for Statistical Machine Translation
Fatiha Sadat | Nizar Habash
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

Parallel Syntactic Annotation of Multiple Languages
Owen Rambow | Bonnie Dorr | David Farwell | Rebecca Green | Nizar Habash | Stephen Helmreich | Eduard Hovy | Lori Levin | Keith J. Miller | Teruko Mitamura | Florence Reeder | Advaith Siddharthan
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes an effort to investigate the incrementally deepening development of an interlingua notation, validated by human annotation of texts in English plus six languages. We begin with deep syntactic annotation, and in this paper present a series of annotation manuals for six different languages at the deep-syntactic level of representation. Many syntactic differences between languages are removed in the proposed syntactic annotation, making them useful resources for multilingual NLP projects with semantic components.

Inter-annotator Agreement on a Multilingual Semantic Annotation Task
Rebecca Passonneau | Nizar Habash | Owen Rambow
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Six sites participated in the Interlingual Annotation of Multilingual Text Corpora (IAMTC) project (Dorr et al., 2004; Farwell et al., 2004; Mitamura et al., 2004). Parsed versions of English translations of news articles in Arabic, French, Hindi, Japanese, Korean and Spanish were annotated by up to ten annotators. Their task was to match open-class lexical items (nouns, verbs, adjectives, adverbs) to one or more concepts taken from the Omega ontology (Philpot et al., 2003), and to identify theta roles for verb arguments. The annotated corpus is intended to be a resource for meaning-based approaches to machine translation. Here we discuss inter-annotator agreement for the corpus. The annotation task is characterized by annotators’ freedom to select multiple concepts or roles per lexical item. As a result, the annotation categories are sets, the number of which is bounded only by the number of distinct annotator-lexical item pairs. We use a reliability metric designed to handle partial agreement between sets. The best results pertain to the part of the ontology derived from WordNet. We examine change over the course of the project, differences among annotators, and differences across parts of speech. Our results suggest a strong learning effect early in the project.

Design, Construction and Validation of an Arabic-English Conceptual Interlingua for Cross-lingual Information Retrieval
Nizar Habash | Clinton Mah | Sabiha Imran | Randy Calistri-Yeh | Páraic Sheridan
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the issues involved in extending a trans-lingual lexicon, the TextWise Conceptual Interlingua (CI), with Arabic terms. The Conceptual Interlingua is based on the Princeton English WordNet (Fellbaum, 1998). It is a central component in the cross-lingual information retrieval (CLIR) system CINDOR (Conceptual INterlingua for DOcument Retrieval). Arabic has a rich morphological system combining templatic and affixational paradigms for both inflectional and derivational morphology. This rich morphology poses a major challenge to the design and building of the Arabic CI and also its validation. This is because the available resources for Arabic, whether manually constructed bilingual lexicons or lexicons automatically derived from bilingual parallel corpora, exist at different levels of morphological representation. We describe here the issues and decisions made in the design and construction of the Arabic-English CI using different types of manual and automatic resources. We also present the results of an extensive validation of the Arabic CI and briefly discuss the evaluation of its use for CLIR on the TREC Arabic Benchmark collection.

MAGEAD: A Morphological Analyzer and Generator for the Arabic Dialects
Nizar Habash | Owen Rambow
Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics

2005

Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages
Kareem Darwish | Mona Diab | Nizar Habash
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages

Morphological Analysis and Generation for Arabic Dialects
Nizar Habash | Owen Rambow | George Kiraz
Proceedings of the ACL Workshop on Computational Approaches to Semitic Languages

Arabic Tokenization, Part-of-Speech Tagging and Morphological Disambiguation in One Fell Swoop
Nizar Habash | Owen Rambow
Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05)

2004

Interlingual Annotation of Multilingual Text Corpora
Stephen Helmreich | David Farwell | Bonnie Dorr | Nizar Habash | Lori Levin | Teruko Mitamura | Florence Reeder | Keith Miller | Eduard Hovy | Owen Rambow | Advaith Siddharthan
Proceedings of the Workshop Frontiers in Corpus Annotation at HLT-NAACL 2004

Multi-Align: combining linguistic and statistical techniques to improve alignments for adaptable MT
Necip Fazil Ayan | Bonnie Dorr | Nizar Habash
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

An adaptable statistical or hybrid MT system relies heavily on the quality of word-level alignments of real-world data. Statistical alignment approaches provide a reasonable initial estimate for word alignment. However, they cannot handle certain types of linguistic phenomena such as long-distance dependencies and structural differences between languages. We address this issue in Multi-Align, a new framework for incremental testing of different alignment algorithms and their combinations. Our design allows users to tune their systems to the properties of a particular genre/domain while still benefiting from general linguistic knowledge associated with a language pair. We demonstrate that a combination of statistical and linguistically-informed alignments can resolve translation divergences during the alignment process.

Interlingual annotation for MT development
Florence Reeder | Bonnie Dorr | David Farwell | Nizar Habash | Stephen Helmreich | Eduard Hovy | Lori Levin | Teruko Mitamura | Keith Miller | Owen Rambow | Advaith Siddharthan
Proceedings of the 6th Conference of the Association for Machine Translation in the Americas: Technical Papers

MT systems that use only superficial representations, including the current generation of statistical MT systems, have been successful and useful. However, they will experience a plateau in quality, much like other “silver bullet” approaches to MT. We pursue work on the development of interlingual representations for use in symbolic or hybrid MT systems. In this paper, we describe the creation of an interlingua and the development of a corpus of semantically annotated text, to be validated in six languages and evaluated in several ways. We have established a distributed, well-functioning research methodology, designed a preliminary interlingua notation, created annotation manuals and tools, developed a test collection in six languages with associated English translations, annotated some 150 translations, and designed and applied various annotation metrics. We describe the data sets being annotated and the interlingual (IL) representation language which uses two ontologies and a systematic theta-role list. We present the annotation tools built and outline the annotation process. Following this, we describe our evaluation methodology and conclude with a summary of issues that have arisen.

2003

CatVar: a database of categorial variations for English
Nizar Habash | Bonnie Dorr
Proceedings of Machine Translation Summit IX: System Presentations

We present a new large-scale database called “CatVar” (Habash and Dorr, 2003) which contains categorial variations of English lexemes. Due to the prevalence of cross-language categorial variation in multilingual applications, our categorial-variation resource may serve as an integral part of a diverse range of natural language applications. Thus, the research reported herein overlaps heavily with that of the machine-translation, lexicon-construction, and information-retrieval communities. We demonstrate this database, embedded in a graphical interface; we also show a GUI for user input of corrections to the database.

A Categorial Variation Database for English
Nizar Habash | Bonnie Dorr
Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics

Semitic linguistic phenomena and variations
Nizar Habash
Workshop on Machine Translation for Semitic languages: issues and approaches

Matador: a large-scale Spanish-English GHMT system
Nizar Habash
Proceedings of Machine Translation Summit IX: Papers

This paper describes and evaluates Matador, an implemented large-scale Spanish-English MT system built in the Generation-Heavy Hybrid Machine Translation (GHMT) approach. An extensive evaluation shows that Matador has a higher degree of robustness and superior output quality, in terms of grammaticality and accuracy, when compared to a primarily statistical approach.

Matador: Spanish-English GHMT
Nizar Habash
Proceedings of Machine Translation Summit IX: System Presentations

This paper presents the online demo of Matador, a large-scale Spanish-English machine translation system implemented following the Generation-heavy Hybrid Machine Translation (GHMT) approach.

2002

Handling translation divergences: combining statistical and symbolic techniques in generation-heavy machine translation
Nizar Habash | Bonnie Dorr
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

This paper describes a novel approach to handling translation divergences in a Generation-Heavy Hybrid Machine Translation (GHMT) system. The translation divergence problem is usually reserved for Transfer and Interlingual MT because it requires a large combination of complex lexical and structural mappings. A major requirement of these approaches is the accessibility of large amounts of explicit symmetric knowledge for both source and target languages. This limitation renders Transfer and Interlingual approaches ineffective in the face of structurally-divergent language pairs with asymmetric resources. GHMT addresses the more common form of this problem, source-poor/target-rich, by fully exploiting symbolic and statistical target-language resources. This non-interlingual non-transfer approach is accomplished by using target-language lexical semantics, categorial variations and subcategorization frames to overgenerate multiple lexico-structural variations from a target-glossed syntactic dependency of the source-language sentence. The symbolic overgeneration, which accounts for different possible translation divergences, is constrained by a statistical target-language model.

DUSTer: a method for unraveling cross-language divergences for statistical word-level alignment
Bonnie Dorr | Lisa Pearl | Rebecca Hwa | Nizar Habash
Proceedings of the 5th Conference of the Association for Machine Translation in the Americas: Technical Papers

The frequent occurrence of divergences (structural differences between languages) presents a great challenge for statistical word-level alignment. In this paper, we introduce DUSTer, a method for systematically identifying common divergence types and transforming an English sentence structure to bear a closer resemblance to that of another language. Our ultimate goal is to enable more accurate alignment and projection of dependency trees in another language without requiring any training on dependency-tree data in that language. We present an empirical analysis comparing the complexities of performing word-level alignments with and without divergence handling. Our results suggest that our approach facilitates word-level alignment, particularly for sentence pairs containing divergences.

Generation-Heavy Hybrid Machine Translation
Nizar Habash
Proceedings of the International Natural Language Generation Conference

2001

Large scale language independent generation using thematic hierarchies
Nizar Habash | Bonnie Dorr
Proceedings of Machine Translation Summit VIII

2000

Generation from Lexical Conceptual Structures
David Traum | Nizar Habash
NAACL-ANLP 2000 Workshop: Applied Interlinguas: Practical Applications of Interlingual Approaches to NLP

Oxygen: a language independent linearization engine
Nizar Habash
Proceedings of the Fourth Conference of the Association for Machine Translation in the Americas: Technical Papers

This paper describes a language independent linearization engine, oxyGen. This system compiles target language grammars into programs that take feature graphs as inputs and generate word lattices that can be passed along to the statistical extraction module of the generation system Nitrogen. The grammars are written using a flexible and powerful language, oxyL, that has the power of a programming language but focuses on natural language realization. This engine has been used successfully in creating an English linearization program that is currently employed as part of a Chinese-English machine translation system.

1998

A thematic hierarchy for efficient generation from lexical-conceptual structure
Bonnie Dorr | Nizar Habash | David Traum
Proceedings of the Third Conference of the Association for Machine Translation in the Americas: Technical Papers

This paper describes an implemented algorithm for syntactic realization of a target-language sentence from an interlingual representation called Lexical Conceptual Structure (LCS). We provide a mapping between LCS thematic roles and Abstract Meaning Representation (AMR) relations; these relations serve as input to an off-the-shelf generator (Nitrogen). There are two contributions of this work: (1) the development of a thematic hierarchy that provides ordering information for realization of arguments in their surface positions; (2) the provision of a diagnostic tool for detecting inconsistencies in an existing online LCS-based lexicon that allows us to enhance principles for thematic-role assignment.

Co-authors

Ramy Eskander 19

Ossama Obeid 17

Wajdi Zaghouani 14

Nasser Zalmout 14

Fadhl Eryani 11

Ahmed El Kholy 11

Alexander Erdmann 9

Preslav Nakov 8

Kemal Oflazer 8

Muhammad Abdul-Mageed 7

Muhamed Al-Khalil 7

Alla Rozovskaya 7

Behrang Mohit 6

Dana Abdulrahim 5

Mohamed Al-Badrashiny 5

Wassim El-Hajj 5

Iryna Gurevych 5

Christian Khairallah 5

Artem Shelmanov 5

Slim Abdennadher 4

Muhammed AbuOdeh 4

Osama Mohammed Afzal 4

Alham Fikri Aji 4

Sarah Alkuhlani 4

Alberto Chierici 4

Kirill Chirkunov 4

Kareem Darwish 4

Khalid N. Elmadani 4

Abdelrahim Elmadany 4

Mustafa Jarrar 4

Mohamed Maamouri 4

Tarek Mahmoud 4

Jonibek Mansurov 4

Reham Marzouk 4

Kurt Micallef 4

Mohammad Salameh 4

Hanada Taha-Thomure 4

Ngoc Thang Vu 4

Ahmed Abdelali 3

Ibrahim Abu Farha 3

Mohamed Altantawy 3

Ekaterina Artemova 3

Gilbert Badaro 3

Rahma Boujelbane 3

Ryan Cotterell 3

David Farwell 3

Stephen Helmreich 3

Joseph Le Roux 3

Kathleen McKeown 3

Keith J. Miller 3

Teruko Mitamura 3

Chatrine Qwaider 3

Florence Reeder 3

Khaled Shaban 3

Anas Shahrour 3

Advaith Siddharthan 3

Reut Tsarfaty 3

Francis Tyers 3

Stephan Vogel 3

Marcin Woliński 3

Mervat Abassy 2

Chaimae Abouzahir 2

Faisal Al-Shargi 2

Sakhar Alkhereyf 2

Rawan Almatham 2

Khalid Almubarak 2

Zaid Alyafeai 2

Thomas Arnold 2

Yustinus Ghanggo Ate 2

Mohammed Attia 2

Nurpeiis Baimukan 2

Timothy Baldwin 2

Aziyana Bayyr-ool 2

Jean-Philippe Bernardy 2

Marine Carpuat 2

Eleanor Chodroff 2

Cagri Coltekin 2

Pradeep Dasigi 2

Samhaa R. El-Beltagy 2

Mahmoud El-Haj 2

Charbel El-Khaissi 2

Georges El-Khoury 2

Mariem Ellouze Khemekhem 2

Sofya Ganieva 2

Michael Gasser 2

Lamia Hadrich Belguith 2

Richard J. Hatcher 2

Abdelati Hawwari 2

Julia Hirschberg 2

Sardana Ivanova 2

Zhengyang Jiang 2

Witold Kieraś 2

Elena Klyachko 2

Andrew Krizhanovsky 2

Brian Leonard 2

Gregor Leusch 2

Abir Masmoudi 2

Evgeny Matusov 2

Sabrina J. Mielke 2

Vladislav Mikhailov 2

Rawan Moukalled 2

Garrett Nicolai 2

Zahroh Nuriah 2

Arturo Oncevay 2

David Palfreyman 2

Kristen Parton 2

Tiago Pimentel 2

Matvey Plugaryov 2

Edoardo M. Ponti 2

Manoj Pooleery 2

Emily Prud’hommeaux 2

Giovanni Puccetti 2

Mohammad Sadegh Rasooli 2

Maria Ryskina 2

Mostafa Saeed 2

Aelita Salchak 2

Jaime Rafael Montoya Samame 2

Djamé Seddah 2

Farah E. Shamout 2

Shady Shehata 2

Karina Sheifer 2

Niklas Stoehr 2

Christopher Straughn 2

Totok Suhardijanto 2

Raj Vardhan Tomar 2

Samia Touileb 2

Gema Celeste Silva Villegas 2

Ekaterina Vylomova 2

Jonathan Washington 2

David Yarowsky 2

Basmah Abdulkareem 1

Basma Abdulkareem 1

Mouath Abu-Daoud 1

Abdallah Abushmaes 1

Bimarsha Adhikari 1

Armaan Agrawal 1

Saad El Dine Ahmed 1

Hayat Al Hassan 1

Meera Al Kaabi 1

Ahmad Al Sallab 1

Asma Al Wazrah 1

Mohammad Al-Badrashiny 1

Walid Al-Eisawi 1

Hend Al-Khalifa 1

Rawan Al-Matham 1

Renad Al-Monef 1

Raghad Al-Rasheed 1

Abdulmohsen Al-Thubaity 1

Sarah Al-Towaity 1

Abdulrahman AlOsaimy 1

Linda Alamir-Salloum 1

Ashwag Alasmari 1

Eman Albilali 1

Hector Fernandez Alcalde 1

Hanan Aldarmaki 1

Abdullah Alfaifi 1

Latifa Alfalasi 1

Emad Alghamdi 1

Sultana Alghurabi 1

Mais Alheraki 1

Muneera Alhoshan 1

Hassan Alhuzali 1

Badr Alkhamissi 1

Hasan Alkheder 1

Sarah Alkhulani 1

Amjad Almahairi 1

Amal Almazrua 1

Ali Alqahtani 1

Aisha Alraeesi 1

Sultan Alrowili 1

Saied Alshahrani 1

Waad Thuwaini Alshammari 1

Khawlah M. Alshanqiti 1

Areej Alshaqarawi 1

Faisal Alshargi 1

Hamad Alshehhi 1

Maryam Alshihri 1

Malik H. Altakrori 1

Afrah Altamimi 1

Kinda Altarbouch 1

Nora Altwairesh 1

Norah A. Alzahrani 1

Maverick Alzate 1

Daliyah Alzeer 1

Atikah Alzeghayer 1

Maryam Aminian 1

Antonios Anastasopoulos 1

Jacob Andreas 1

Taras Andrushko 1

Wissam Antoun 1

Mohamed Anwar 1

Hiroyuki Aoyama 1

Aryaman Arora 1

Necip Fazil Ayan 1

Alexander Aziz 1

Elena Badmaeva 1

Rim El Ballouli 1

Esha Banerjee 1

Khuyagbaatar Batsuren 1

Riadh Belkebir 1

Brijesh Bhatt 1

Margarita Bicec 1

Fethi Bougares 1

Halim-Antoine Boukaram 1

Tim Buckwalter 1

Elena Budianskaya 1

Aljoscha Burchardt 1

Randy Calistri-Yeh 1

Delio Siticonatzi Camaiteri 1

Marie Candito 1

Violetta Cavalli-Sforza 1

Kai-Wei Chang 1

Yuanzhu Peter Chen 1

Jinho D. Choi 1

Shammur Absar Chowdhury 1

Christine Chung 1

Silvie Cinková 1

Josep M. Crego 1

Paula Czarnowska 1

Rocktim Jyoti Das 1

Shahd Salah Uddin Dibas 1

Malika Dikshit 1

Amirbek Djanibekov 1

Hossep Dolatian 1

Kira Droganova 1

Shady ELbassouni 1

Moussa Kamal Eddine 1

Mai Mohamed Eida 1

Saad El Dine Ahmed El Etter 1

Jamila El Gizuli 1

Abdelrahman El-Sheikh 1

Muhammad N. ElNokrashy 1

Ahmed Elbakry 1

Salman Elgamal 1

Khalid Elmadani 1

AbdelRahim A. Elmadany 1

Muhammad Elmallah 1

Kareem Elozeiri 1

Kareem Ashraf Elozeiri 1

Tamer Elsayed 1

Ahmed Elshabrawy 1

Yannick Estève 1

Benjamin Farber 1

Richárd Farkas 1

Ahmed Farouk Zakaria Elshabrawy 1

Jennifer Foster 1

Dayne Freitag 1

Mahmoud Ghoneim 1

Fausto Giunchiglia 1

Shantanu Godbole 1

Iakes Goenaga 1

Koldo Gojenola 1

Yoav Goldberg 1

Maiya Goloburda 1

Rebecca Green 1

Stephen Grimes 1

Francisco Guzmán 1

Memduh Gökırmak 1

Amal Haddad Haddad 1

Jan Hajič jr. 1

Tymaa Hammouda 1

Fatima Haouari 1

Maram Hasanain 1

Tyeece Kiana Fredorcia Hensley 1

Jaroslava Hlaváčová 1

Gonzalo Iglesias 1

Abderrahmane Issam 1

Serena Jeblee 1

Jermsak Jermsurawong 1

Abdurrahman Juma 1

Hiroshi Kanayama 1

Masahiro Kaneko 1

Jenna Kanerva 1

Yash Kankanampati 1

Ritván Karahóǧa 1

Tolga Kayadelen 1

Václava Kettnerová 1

Muhamed Khalil 1

Nouran Khallaf 1

Shehroze Khan 1

George Anton Kiraz 1

Katrin Kirchhoff 1

Jesse Kirchner 1

Natalia Krizhanovskaya 1

Natalia Krizhanovsky 1

Sondos Krouna 1

Marco Kuhlmann 1

Mucahid Kutlu 1

Sookyoung Kwak 1

Sandra Kübler 1

Nurkhan Laiyk 1

Dorina Lakatos 1

Tatiana Lando 1

William Abbott Lane 1

Saran Lertpradit 1

Juan Liberato 1

Tom Lippincott 1

Juhani Luotolahti 1

Juan López Bautista 1

Didier López Francis 1

Vivien Macketanz 1

Samar Mohamed Magdy 1

Wolfgang Maier 1

Sanad Malaysha 1

Michael Mandel 1

Christopher D. Manning 1

Camille Mansour 1

Ruli Manurung 1

Igor Marchenko 1

Katrin Marheinecke 1

Stella Markantonatou 1

Héctor Martínez Alonso 1

Polina Mashkovtseva 1

Yuji Matsumoto 1

Rowan Hall Maudslay 1

Arya D. McCarthy 1

Gustavo Mendonca 1

Hayden Metsky 1

George Mikros 1

Anna Missilä 1

Christof Monz 1

Juan Moreno Gonzalez 1

Hamdy Mubarak 1

Zain Muhammad Mujahid 1

El-Moatez-Billah Nagoudi 1

Fatema Nassar 1

Anna Nedoluzhko 1

Maria Nepomniashchaya 1

Irene Nikkarinen 1

Rattima Nitisaroj 1

Yaser Onaizan 1

Chukwuyem Onyibe 1

Rebecca J. Passonneau 1

George Pavlidis 1

Juan David Pineros Liberato 1

Martin Potthast 1

Adam Przepiórkowski 1

Goffredo Puccetti 1

Sampo Pyysalo 1

Abdelrahim Qaddoumi 1

Abed Qaddoumi 1

Daria Rodionova 1

Esaú Zumaeta Rojas 1

Cynthia Rudin 1

Houda Saadane 1

Caroline Sabty 1

Abdelrahman Sadallah 1

Omar Fayez Sadi 1

Benoît Sagot 1

Tariq Sairafy 1

Ibrahim M. Saleh 1

Ibrahim Saleh 1

Elizabeth Salesky 1

Manuela Sanguinetti 1

Karmel Sarabta 1

Andrey Scherbakov 1

Sebastian Schuster 1

Wolfgang Seeker 1

Neha Sengupta 1

Alexandra Serova 1

Latifa Shamsan 1

Safiullah Shareef 1

Sara Shatnawi 1

Sanad Sha’ban 1

Andrey Shcherbakov 1

Páraic Sheridan 1

Reshef Shilon 1

Atsuko Shimada 1

Miikka Silfverberg 1

Thamar Solorio 1

Elvis de Souza 1

Antonio Stella 1

Stephanie Strassel 1

Jana Strnadová 1

Peter Sullivan 1

Umut Sulubacak 1

Gábor Szolnok 1

Dalila Tabessi 1

Bashar Talafha 1

Irina Temnikova 1

Lucas Torroba Hennigen 1

Godfried Toussaint 1

Hawau Olamide Toyin 1

Zdenka Uresova 1

Hans Uszkoreit 1

Josef Valvoda 1

Michalis Vazirgiannis 1

Yannick Versley 1

K. Vijay-Shanker 1

Éric Villemonte de la Clergerie 1

Veronika Vincze 1

Daniel Watson 1

Jennifer White 1

Chenxi Whitehouse 1

Shuly Wintner 1

Jonathan Wright 1

Alina Wróblewska 1

Anna Yablonskaya 1

Anastasia Yemelina 1

Jeremiah Young 1

Ayman Al Zaatari 1

Roberto Zariquiey 1

Lingliang Zhang 1

Micheline Ziadee 1

Tarik Zulfikarpasic 1

Mahmoud Zyate 1

Adrià de Gispert 1

Marie-Catherine de Marneffe 1

Valeria de Paiva 1

Özlem Çetinoğlu 1

Venues

JEP/TALN/RECITAL 3