Salam Khalifa


2022

pdf bib
Morphosyntactic Tagging with Pre-trained Language Models for Arabic and its Dialects
Go Inoue | Salam Khalifa | Nizar Habash
Findings of the Association for Computational Linguistics: ACL 2022

We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models. Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study, achieving 2.6% absolute improvement over the previous state-of-the-art in Modern Standard Arabic, 2.8% in Gulf, 1.6% in Egyptian, and 8.3% in Levantine. We explore different training setups for fine-tuning pre-trained transformer language models, including training data size, the use of external linguistic resources, and the use of annotated data from other dialects in a low-resource scenario. Our results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect. Additionally, we show that high-quality morphological analyzers as external linguistic resources are beneficial especially in low-resource settings.

pdf bib
UniMorph 4.0: Universal Morphology
Khuyagbaatar Batsuren | Omer Goldman | Salam Khalifa | Nizar Habash | Witold Kieraś | Gábor Bella | Brian Leonard | Garrett Nicolai | Kyle Gorman | Yustinus Ghanggo Ate | Maria Ryskina | Sabrina Mielke | Elena Budianskaya | Charbel El-Khaissi | Tiago Pimentel | Michael Gasser | William Abbott Lane | Mohit Raj | Matt Coler | Jaime Rafael Montoya Samame | Delio Siticonatzi Camaiteri | Esaú Zumaeta Rojas | Didier López Francis | Arturo Oncevay | Juan López Bautista | Gema Celeste Silva Villegas | Lucas Torroba Hennigen | Adam Ek | David Guriel | Peter Dirix | Jean-Philippe Bernardy | Andrey Scherbakov | Aziyana Bayyr-ool | Antonios Anastasopoulos | Roberto Zariquiey | Karina Sheifer | Sofya Ganieva | Hilaria Cruz | Ritván Karahóǧa | Stella Markantonatou | George Pavlidis | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Candy Angulo | Jatayu Baxi | Andrew Krizhanovsky | Natalia Krizhanovskaya | Elizabeth Salesky | Clara Vania | Sardana Ivanova | Jennifer White | Rowan Hall Maudslay | Josef Valvoda | Ran Zmigrod | Paula Czarnowska | Irene Nikkarinen | Aelita Salchak | Brijesh Bhatt | Christopher Straughn | Zoey Liu | Jonathan North Washington | Yuval Pinter | Duygu Ataman | Marcin Wolinski | Totok Suhardijanto | Anna Yablonskaya | Niklas Stoehr | Hossep Dolatian | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Aryaman Arora | Richard J. Hatcher | Ritesh Kumar | Jeremiah Young | Daria Rodionova | Anastasia Yemelina | Taras Andrushko | Igor Marchenko | Polina Mashkovtseva | Alexandra Serova | Emily Prud’hommeaux | Maria Nepomniashchaya | Fausto Giunchiglia | Eleanor Chodroff | Mans Hulden | Miikka Silfverberg | Arya D. McCarthy | David Yarowsky | Ryan Cotterell | Reut Tsarfaty | Ekaterina Vylomova
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive.In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

pdf bib
The Bahrain Corpus: A Multi-genre Corpus of Bahraini Arabic
Dana Abdulrahim | Go Inoue | Latifa Shamsan | Salam Khalifa | Nizar Habash
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In recent years, the focus on developing natural language processing (NLP) tools for Arabic has shifted from Modern Standard Arabic to various Arabic dialects. Various corpora of various sizes and representing different genres, have been created for a number of Arabic dialects. As far as Gulf Arabic is concerned, Gumar Corpus (Khalifa et al., 2016) is the largest corpus, to date, that includes data representing the dialectal Arabic of the six Gulf Cooperation Council countries (Bahrain, Kuwait, Saudi Arabia, Qatar, United Arab Emirates, and Oman), particularly in the genre of “online forum novels”. In this paper, we present the Bahrain Corpus. Our objective is to create a specialized corpus of the Bahraini Arabic dialect, which includes written texts as well as transcripts of audio files, belonging to a different genre (folktales, comedy shows, plays, cooking shows, etc.). The corpus comprises 620K words, carefully curated. We provide automatic morphological annotations of the full corpus using state-of-the-art morphosyntactic disambiguation for Gulf Arabic. We validate the quality of the annotations on a 7.6K word sample. We plan to make the annotated sample as well as the full corpus publicly available to support researchers interested in Arabic NLP.

pdf bib
Morphotactic Modeling in an Open-source Multi-dialectal Arabic Morphological Analyzer and Generator
Nizar Habash | Reham Marzouk | Christian Khairallah | Salam Khalifa
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Arabic is a morphologically rich and complex language, with numerous dialectal variants. Previous efforts on Arabic morphology modeling focused on specific variants and specific domains using a range of techniques with different degrees of linguistic modeling transparency. In this paper we propose a new approach to modeling Arabic morphology with an eye towards multi-dialectness, resource openness, and easy extensibility and use. We demonstrate our approach by modeling verbs from Standard Arabic and Egyptian Arabic, within a common framework, and with high coverage.

pdf bib
SIGMORPHONUniMorph 2022 Shared Task 0: Modeling Inflection in Language Acquisition
Jordan Kodner | Salam Khalifa
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This year’s iteration of the SIGMORPHONUniMorph shared task on “human-like” morphological inflection generation focuses on generalization and errors in language acquisition. Systems are trained on data sets extracted from corpora of child-directed speech in order to simulate a natural learning setting, and their predictions are evaluated against what is known about children’s developmental trajectories for three well-studied patterns: English past tense, German noun plurals, and Arabic noun plurals. Three submitted neural systems were evaluated together with two baselines. Performance was generally good, and all systems were prone to human-like over-regularization. However, all systems were also prone to non-human-like over-irregularization and nonsense productions to varying degrees. We situate this behavior in a discussion of the Past Tense Debate.

pdf bib
SIGMORPHONUniMorph 2022 Shared Task 0: Generalization and Typologically Diverse Morphological Inflection
Jordan Kodner | Salam Khalifa | Khuyagbaatar Batsuren | Hossep Dolatian | Ryan Cotterell | Faruk Akkus | Antonios Anastasopoulos | Taras Andrushko | Aryaman Arora | Nona Atanalov | Gábor Bella | Elena Budianskaya | Yustinus Ghanggo Ate | Omer Goldman | David Guriel | Simon Guriel | Silvia Guriel-Agiashvili | Witold Kieraś | Andrew Krizhanovsky | Natalia Krizhanovsky | Igor Marchenko | Magdalena Markowska | Polina Mashkovtseva | Maria Nepomniashchaya | Daria Rodionova | Karina Scheifer | Alexandra Sorova | Anastasia Yemelina | Jeremiah Young | Ekaterina Vylomova
Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

The 2022 SIGMORPHON–UniMorph shared task on large scale morphological inflection generation included a wide range of typologically diverse languages: 33 languages from 11 top-level language families: Arabic (Modern Standard), Assamese, Braj, Chukchi, Eastern Armenian, Evenki, Georgian, Gothic, Gujarati, Hebrew, Hungarian, Itelmen, Karelian, Kazakh, Ket, Khalkha Mongolian, Kholosi, Korean, Lamahalot, Low German, Ludic, Magahi, Middle Low German, Old English, Old High German, Old Norse, Polish, Pomak, Slovak, Turkish, Upper Sorbian, Veps, and Xibe. We emphasize generalization along different dimensions this year by evaluating test items with unseen lemmas and unseen features separately under small and large training conditions. Across the five submitted systems and two baselines, the prediction of inflections with unseen features proved challenging, with average performance decreased substantially from last year. This was true even for languages for which the forms were in principle predictable, which suggests that further work is needed in designing systems that capture the various types of generalization required for the world’s languages.

2021

pdf bib
SIGMORPHON 2021 Shared Task on Morphological Reinflection: Generalization Across Languages
Tiago Pimentel | Maria Ryskina | Sabrina J. Mielke | Shijie Wu | Eleanor Chodroff | Brian Leonard | Garrett Nicolai | Yustinus Ghanggo Ate | Salam Khalifa | Nizar Habash | Charbel El-Khaissi | Omer Goldman | Michael Gasser | William Lane | Matt Coler | Arturo Oncevay | Jaime Rafael Montoya Samame | Gema Celeste Silva Villegas | Adam Ek | Jean-Philippe Bernardy | Andrey Shcherbakov | Aziyana Bayyr-ool | Karina Sheifer | Sofya Ganieva | Matvey Plugaryov | Elena Klyachko | Ali Salehi | Andrew Krizhanovsky | Natalia Krizhanovsky | Clara Vania | Sardana Ivanova | Aelita Salchak | Christopher Straughn | Zoey Liu | Jonathan North Washington | Duygu Ataman | Witold Kieraś | Marcin Woliński | Totok Suhardijanto | Niklas Stoehr | Zahroh Nuriah | Shyam Ratan | Francis M. Tyers | Edoardo M. Ponti | Grant Aiton | Richard J. Hatcher | Emily Prud'hommeaux | Ritesh Kumar | Mans Hulden | Botond Barta | Dorina Lakatos | Gábor Szolnok | Judit Ács | Mohit Raj | David Yarowsky | Ryan Cotterell | Ben Ambridge | Ekaterina Vylomova
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.

2020

pdf bib
Morphological Analysis and Disambiguation for Gulf Arabic: The Interplay between Resources and Methods
Salam Khalifa | Nasser Zalmout | Nizar Habash
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper we present the first full morphological analysis and disambiguation system for Gulf Arabic. We use an existing state-of-the-art morphological disambiguation system to investigate the effects of different data sizes and different combinations of morphological analyzers for Modern Standard Arabic, Egyptian Arabic, and Gulf Arabic. We find that in very low settings, morphological analyzers help boost the performance of the full morphological disambiguation task. However, as the size of resources increase, the value of the morphological analyzers decreases.

pdf bib
A Spelling Correction Corpus for Multiple Arabic Dialects
Fadhl Eryani | Nizar Habash | Houda Bouamor | Salam Khalifa
Proceedings of the Twelfth Language Resources and Evaluation Conference

Arabic dialects are the non-standard varieties of Arabic commonly spoken – and increasingly written on social media – across the Arab world. Arabic dialects do not have standard orthographies, a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are in parallel across the cities (2,000 sentences from each city). This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects. We present results on a bootstrapping technique we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.

pdf bib
CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing
Ossama Obeid | Nasser Zalmout | Salam Khalifa | Dima Taji | Mai Oudah | Bashar Alhafni | Go Inoue | Fadhl Eryani | Alexander Erdmann | Nizar Habash
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, Dialect Identification, Named Entity Recognition and Sentiment Analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.

2019

pdf bib
A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance
Alexander Erdmann | Salam Khalifa | Mai Oudah | Nizar Habash | Houda Bouamor
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language specific input. Our technique involves creating a small grammar of closed-class affixes which can be written in a few hours. The grammar over generates analyses for word forms attested in a raw corpus which are disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.

2018

pdf bib
An Arabic Morphological Analyzer and Generator with Copious Features
Dima Taji | Salam Khalifa | Ossama Obeid | Fadhl Eryani | Nizar Habash
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

We introduce CALIMA-Star, a very rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality and much more. This tool includes a fast engine that can be easily integrated into other systems, as well as an easy-to-use API and a web interface. CALIMA-Star also supports morphological reinflection. We evaluate CALIMA-Star against four commonly used analyzers for Arabic in terms of speed and morphological content.

pdf bib
MADARi: A Web Interface for Joint Arabic Morphological Annotation and Spelling Correction
Ossama Obeid | Salam Khalifa | Nizar Habash | Houda Bouamor | Wajdi Zaghouani | Kemal Oflazer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
The MADAR Arabic Dialect Corpus and Lexicon
Houda Bouamor | Nizar Habash | Mohammad Salameh | Wajdi Zaghouani | Owen Rambow | Dana Abdulrahim | Ossama Obeid | Salam Khalifa | Fadhl Eryani | Alexander Erdmann | Kemal Oflazer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
Unified Guidelines and Resources for Arabic Dialect Orthography
Nizar Habash | Fadhl Eryani | Salam Khalifa | Owen Rambow | Dana Abdulrahim | Alexander Erdmann | Reem Faraj | Wajdi Zaghouani | Houda Bouamor | Nasser Zalmout | Sara Hassan | Faisal Al-Shargi | Sakhar Alkhereyf | Basma Abdulkareem | Ramy Eskander | Mohammad Salameh | Hind Saddiki
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

pdf bib
A Morphologically Annotated Corpus of Emirati Arabic
Salam Khalifa | Nizar Habash | Fadhl Eryani | Ossama Obeid | Dana Abdulrahim | Meera Al Kaabi
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf bib
A Morphological Analyzer for Gulf Arabic Verbs
Salam Khalifa | Sara Hassan | Nizar Habash
Proceedings of the Third Arabic Natural Language Processing Workshop

We present CALIMAGLF, a Gulf Arabic morphological analyzer currently covering over 2,600 verbal lemmas. We describe in detail the process of building the analyzer starting from phonetic dictionary entries to fully inflected orthographic paradigms and associated lexicon and orthographic variants. We evaluate the coverage of CALIMA-GLF against Modern Standard Arabic and Egyptian Arabic analyzers on part of a Gulf Arabic novel. CALIMA-GLF verb analysis token recall for identifying correct POS tag outperforms both the Modern Standard Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute, respectively.

2016

pdf bib
YAMAMA: Yet Another Multi-Dialect Arabic Morphological Analyzer
Salam Khalifa | Nasser Zalmout | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper, we present YAMAMA, a multi-dialect Arabic morphological analyzer and disambiguator. Our system is almost five times faster than the state-of-art MADAMIRA system with a slightly lower quality. In addition to speed, YAMAMA outputs a rich representation which allows for a wider spectrum of use. In this regard, YAMAMA transcends other systems, such as FARASA, which is faster but provides specific outputs catering to specific applications.

pdf bib
CamelParser: A system for Arabic Syntactic Analysis and Morphological Disambiguation
Anas Shahrour | Salam Khalifa | Dima Taji | Nizar Habash
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features. CamelParser uses a state-of-the-art morphological disambiguator and improves its results using syntactically driven features. The system offers a number of output formats that include basic dependency with morphological features, two tree visualization modes, and traditional Arabic grammatical analysis.

pdf bib
DALILA: The Dialectal Arabic Linguistic Learning Assistant
Salam Khalifa | Houda Bouamor | Nizar Habash
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Dialectal Arabic (DA) poses serious challenges for Natural Language Processing (NLP). The number and sophistication of tools and datasets in DA are very limited in comparison to Modern Standard Arabic (MSA) and other languages. MSA tools do not effectively model DA which makes the direct use of MSA NLP tools for handling dialects impractical. This is particularly a challenge for the creation of tools to support learning Arabic as a living language on the web, where authentic material can be found in both MSA and DA. In this paper, we present the Dialectal Arabic Linguistic Learning Assistant (DALILA), a Chrome extension that utilizes cutting-edge Arabic dialect NLP research to assist learners and non-native speakers in understanding text written in either MSA or DA. DALILA provides dialectal word analysis and English gloss corresponding to each word.

pdf bib
A Large Scale Corpus of Gulf Arabic
Salam Khalifa | Nizar Habash | Dana Abdulrahim | Sara Hassan
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.

2015

pdf bib
Improving Arabic Diacritization through Syntactic Analysis
Anas Shahrour | Salam Khalifa | Nizar Habash
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

Search
Co-authors