In this paper, we define the task of gender rewriting in contexts involving two users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. We develop a multi-step system that combines the positive aspects of both rule-based and neural rewriting models. Our results successfully demonstrate the viability of this approach on a recently created corpus for Arabic gender rewriting, achieving 88.42 M2 F0.5 on a blind test set. Our proposed system improves over previous work on the first-person-only version of this task, by 3.05 absolute increase in M2 F0.5. We demonstrate a use case of our gender rewriting system by using it to post-edit the output of a commercial MT system to provide personalized outputs based on the users’ grammatical gender preferences. We make our code, data, and pretrained models publicly available.
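The F0.5 figure reported above is the standard F-beta score with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of that formula follows, assuming edit-level true positive, false positive, and false negative counts are already available. The function name and the zero-count handling are illustrative assumptions, not the behavior of the M2 scorer itself.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Compute F-beta from edit-level counts (illustrative helper, not the M2 scorer)."""
    # Precision and recall; returning 0.0 for empty denominators is an assumption.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    # F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: 8 correct edits, 2 spurious edits, 8 missed edits.
score_f05 = f_beta(8, 2, 8)            # precision 0.8, recall 0.5
score_f1 = f_beta(8, 2, 8, beta=1.0)
```

With precision above recall, as in these hypothetical counts, F0.5 comes out higher than F1, which is the intended effect of favoring precision.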
This paper presents AraSAS, the first open-source Arabic semantic analysis tagging system. AraSAS is a software framework that provides full semantic tagging of text written in Arabic. AraSAS is based on the UCREL Semantic Analysis System (USAS), which was first developed to semantically tag English text. Like USAS, AraSAS uses a hierarchical semantic tag set that contains 21 major discourse fields and 232 fine-grained semantic field tags. The paper describes the creation, validation and evaluation of AraSAS. In addition, we demonstrate a first case study to illustrate the affordances of applying USAS and AraSAS semantic taggers on the Zayed University Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) (Palfreyman and Habash, 2022), where we show and compare the coverage of the two semantic taggers through running them on Arabic and English essays on different topics. The analysis expands to compare the taggers when run on texts in Arabic and English written by the same writer and texts written by male and by female students. Variables for comparison include frequency of use of particular semantic sub-domains, as well as the diversity of semantic elements within a text.
Arabic is a morphologically rich and complex language, with numerous dialectal variants. Previous efforts on Arabic morphology modeling focused on specific variants and specific domains using a range of techniques with different degrees of linguistic modeling transparency. In this paper we propose a new approach to modeling Arabic morphology with an eye towards multi-dialectness, resource openness, and easy extensibility and use. We demonstrate our approach by modeling verbs from Standard Arabic and Egyptian Arabic, within a common framework, and with high coverage.
We present ZAEBUC, an annotated Arabic-English bilingual writer corpus comprising short essays by first-year university students at Zayed University in the United Arab Emirates. We describe and discuss the various guidelines and pipeline processes we followed to create the annotations and quality check them. The annotations include spelling and grammar correction, morphological tokenization, Part-of-Speech tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All of the annotations are done on Arabic and English texts using consistent guidelines as much as possible, with tracked alignments among the different annotations, and to the original raw texts. For morphological tokenization, POS tagging, and lemmatization, we use existing automatic annotation tools followed by manual correction. We also present various measurements and correlations with preliminary insights drawn from the data and annotations. The publicly available ZAEBUC corpus and its annotations are intended to be the stepping stones for additional annotations.
The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
Gender bias in natural language processing (NLP) applications, particularly machine translation, has been receiving increasing attention. Much of the research on this issue has focused on mitigating gender bias in English NLP models and systems. Addressing the problem in poorly resourced and/or morphologically rich languages has lagged behind, largely due to the lack of datasets and resources. In this paper, we introduce a new corpus for gender identification and rewriting in contexts involving one or two target users (I and/or You) – first and second grammatical persons with independent grammatical gender preferences. We focus on Arabic, a gender-marking morphologically rich language. The corpus has multiple parallel components: four combinations of 1st and 2nd person in feminine and masculine grammatical genders, as well as English, and English to Arabic machine translation output. This corpus expands on Habash et al. (2019)’s Arabic Parallel Gender Corpus (APGC v1.0) by adding second person targets as well as increasing the total number of sentences over 6.5 times, reaching over 590K words. Our new dataset will aid the research and development of gender identification, controlled text generation, and post-editing rewrite systems that could be used to personalize NLP applications and provide users with the correct outputs based on their grammatical gender preferences. We make the Arabic Parallel Gender Corpus (APGC v2.0) publicly available.
In recent years, the focus on developing natural language processing (NLP) tools for Arabic has shifted from Modern Standard Arabic to various Arabic dialects. Corpora of various sizes, representing different genres, have been created for a number of Arabic dialects. As far as Gulf Arabic is concerned, Gumar Corpus (Khalifa et al., 2016) is the largest corpus, to date, that includes data representing the dialectal Arabic of the six Gulf Cooperation Council countries (Bahrain, Kuwait, Saudi Arabia, Qatar, United Arab Emirates, and Oman), particularly in the genre of “online forum novels”. In this paper, we present the Bahrain Corpus. Our objective is to create a specialized corpus of the Bahraini Arabic dialect, which includes written texts as well as transcripts of audio files, belonging to a different genre (folktales, comedy shows, plays, cooking shows, etc.). The corpus comprises 620K words, carefully curated. We provide automatic morphological annotations of the full corpus using state-of-the-art morphosyntactic disambiguation for Gulf Arabic. We validate the quality of the annotations on a 7.6K word sample. We plan to make the annotated sample as well as the full corpus publicly available to support researchers interested in Arabic NLP.
We present the Camel Treebank (CAMELTB), a 188K word open-source dependency treebank of Modern Standard and Classical Arabic. CAMELTB 1.0 includes 13 sub-corpora comprising selections of texts from pre-Islamic poetry to social media online commentaries, and covering a range of genres from religious and philosophical texts to news, novels, and student essays. The texts are all publicly available (out of copyright, creative commons, or under open licenses). The texts were morphologically tokenized and syntactically parsed automatically, and then manually corrected by a team of trained annotators. The annotations follow the guidelines of the Columbia Arabic Treebank (CATiB) dependency representation. We discuss our annotation process and guideline extensions, and we present some initial observations on lexical and syntactic differences among the annotated sub-corpora. This corpus will be publicly available to support and encourage research on Arabic NLP in general and on new, previously unexplored genres that are of interest to a wider spectrum of researchers, from historical linguistics and digital humanities to computer-assisted language pedagogy.
Arabic is a collection of dialectal variants that are historically related but significantly different. These differences can be seen across regions, countries, and even cities in the same countries. Previous work on Arabic Dialect identification has focused mainly on specific dialect levels (region, country, province, or city) using level-specific resources; and different efforts used different schemas and labels. In this paper, we present the first effort aiming at defining a standard unified three-level hierarchical schema (region-country-city) for dialectal Arabic classification. We map 29 different data sets to this unified schema, and use the common mapping to facilitate aggregating these data sets. We test the value of such aggregation by building language models and using them in dialect identification. We make our label mapping code and aggregated language models publicly available.
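The idea of mapping heterogeneous data set labels onto a single region-country-city schema can be sketched as follows. This is a minimal illustration only: the data set names, source labels, region names, and helper functions are hypothetical and are not taken from the released label mapping code. Levels that a source label does not specify are left as `None`.

```python
from typing import NamedTuple, Optional


class DialectLabel(NamedTuple):
    """A unified three-level label; unspecified levels are None."""
    region: Optional[str]
    country: Optional[str]
    city: Optional[str]


# Hypothetical mapping from (data set name, original label) to the unified schema.
LABEL_MAP = {
    ("dataset_a", "EGY"): DialectLabel("Nile Basin", "Egypt", None),
    ("dataset_b", "CAI"): DialectLabel("Nile Basin", "Egypt", "Cairo"),
    ("dataset_c", "Gulf"): DialectLabel("Gulf", None, None),
}


def to_unified(dataset: str, label: str) -> DialectLabel:
    """Look up the unified label for a data-set-specific label."""
    return LABEL_MAP[(dataset, label)]


def share_level(a: DialectLabel, b: DialectLabel, level: str) -> bool:
    """True if both labels specify the given level and agree on it."""
    va, vb = getattr(a, level), getattr(b, level)
    return va is not None and va == vb


egy = to_unified("dataset_a", "EGY")
cai = to_unified("dataset_b", "CAI")
same_country = share_level(egy, cai, "country")  # both map to Egypt
same_city = share_level(egy, cai, "city")        # dataset_a has no city level
```

Once labels live in one schema, data sets annotated at different granularities can be pooled at whichever level they share, which is the kind of aggregation the abstract describes.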
Like most natural language understanding and generation tasks, state-of-the-art models for summarization are transformer-based sequence-to-sequence architectures that are pretrained on large corpora. While most existing models focus on English, Arabic remains understudied. In this paper we propose AraBART, the first Arabic model in which the encoder and the decoder are pretrained end-to-end, based on BART. We show that AraBART achieves the best performance on multiple abstractive summarization datasets, outperforming strong baselines including a pretrained Arabic BERT-based model, multilingual BART, Arabic T5, and a multilingual T5 model. AraBART is publicly available.
We describe the findings of the third Nuanced Arabic Dialect Identification Shared Task (NADI 2022). NADI aims at advancing state-of-the-art Arabic NLP, including Arabic dialects. It does so by affording diverse datasets and modeling opportunities in a standardized context where meaningful comparisons between models and approaches are possible. NADI 2022 targeted both dialect identification (Subtask 1) and dialectal sentiment analysis (Subtask 2) at the country level. A total of 41 unique teams registered for the shared task, of whom 21 teams have participated (with 105 valid submissions). Among these, 19 teams participated in Subtask 1, and 10 participated in Subtask 2. The winning team achieved F1=27.06 on Subtask 1 and F1=75.16 on Subtask 2, reflecting that both subtasks remain challenging and motivating future work in this area. We describe the methods employed by the participating teams and offer an outlook for NADI.
In this paper, we present the results and findings of the Shared Task on Gender Rewriting, which was organized as part of the Seventh Arabic Natural Language Processing Workshop. The task of gender rewriting refers to generating alternatives of a given sentence to match different target user gender contexts (e.g., a female speaker with a male listener, a male speaker with a male listener, etc.). This requires changing the grammatical gender (masculine or feminine) of certain words referring to the users. In this task, we focus on Arabic, a gender-marking morphologically rich language. A total of five teams from four countries participated in the shared task.
We present our work on collecting ArzEn-ST, a code-switched Egyptian Arabic-English Speech Translation Corpus. This corpus is an extension of the ArzEn speech corpus, which was collected through informal interviews with bilingual speakers. In this work, we collect translations in both directions, monolingual Egyptian Arabic and monolingual English, forming a three-way speech translation corpus. We make the translation guidelines and corpus publicly available. We also report results for baseline systems for machine translation and speech translation tasks. We believe this is a valuable resource that can motivate and facilitate further research studying the code-switching phenomenon from a linguistic perspective and can be used to train and evaluate NLP systems.
We present Maknuune, a large open lexicon for the Palestinian Arabic dialect. Maknuune has over 36K entries from 17K lemmas, and 3.7K roots. All entries include diacritized Arabic orthography, phonological transcription and English glosses. Some entries are enriched with additional information such as broken plurals and templatic feminine forms, associated phrases and collocations, Standard Arabic glosses, and examples or notes on grammar, usage, or location of collected entry.
We present state-of-the-art results on morphosyntactic tagging across different varieties of Arabic using fine-tuned pre-trained transformer language models. Our models consistently outperform existing systems in Modern Standard Arabic and all the Arabic dialects we study, achieving 2.6% absolute improvement over the previous state-of-the-art in Modern Standard Arabic, 2.8% in Gulf, 1.6% in Egyptian, and 8.3% in Levantine. We explore different training setups for fine-tuning pre-trained transformer language models, including training data size, the use of external linguistic resources, and the use of annotated data from other dialects in a low-resource scenario. Our results show that strategic fine-tuning using datasets from other high-resource dialects is beneficial for a low-resource dialect. Additionally, we show that high-quality morphological analyzers as external linguistic resources are beneficial especially in low-resource settings.
The Arabic language continues to be the focus of an increasing number of projects in natural language processing (NLP) and computational linguistics (CL). This tutorial provides NLP/CL system developers and researchers (computer scientists and linguists alike) with the necessary background information for working with Arabic in its various forms: Classical, Modern Standard and Dialectal. We discuss various Arabic linguistic phenomena and review the state-of-the-art in Arabic processing from enabling technologies and resources, to common tasks and applications. The tutorial will explain important concepts, common wisdom, and common pitfalls in Arabic processing. Given the wide range of possible issues, we invite tutorial attendees to bring up interesting challenges and problems they are working on to discuss during the tutorial.
This demo paper presents a Google Docs add-on for automatic Arabic word-level readability visualization. The add-on includes a lemmatization component that is connected to a five-level readability lexicon and Arabic WordNet-based substitution suggestions. The add-on can be used for assessing the reading difficulty of a text and identifying difficult words as part of the task of manual text simplification. We make our add-on and its code publicly available.
We present Camelira, a web-based Arabic multi-dialect morphological disambiguation tool that covers four major variants of Arabic: Modern Standard Arabic, Egyptian, Gulf, and Levantine. Camelira offers a user-friendly web interface that allows researchers and language learners to explore various linguistic information, such as part-of-speech, morphological features, and lemmas. Our system also provides an option to automatically choose an appropriate dialect-specific disambiguator based on the prediction of a dialect identification component. Camelira is publicly accessible at http://camelira.camel-lab.com.
We present ARETA, an automatic error type annotation system for Modern Standard Arabic. We design ARETA to address Arabic’s morphological richness and orthographic ambiguity. We base our error taxonomy on the Arabic Learner Corpus (ALC) Error Tagset with some modifications. ARETA achieves a performance of 85.8% (micro average F1 score) on a manually annotated blind test portion of ALC. We also demonstrate ARETA’s usability by applying it to a number of submissions from the QALB 2014 shared task for Arabic grammatical error correction. The resulting analyses give helpful insights on the strengths and weaknesses of different submissions, which is more useful than the opaque M2 scoring metrics used in the shared task. ARETA employs a large Arabic morphological analyzer, but is completely unsupervised otherwise. We make ARETA publicly available.
In this paper, we explore the effects of language variants, data sizes, and fine-tuning task types in Arabic pre-trained language models. To do so, we build three pre-trained language models across three variants of Arabic: Modern Standard Arabic (MSA), dialectal Arabic, and classical Arabic, in addition to a fourth language model which is pre-trained on a mix of the three. We also examine the importance of pre-training data size by building additional models that are pre-trained on a scaled-down set of the MSA variant. We compare our different models to each other, as well as to eight publicly available models by fine-tuning them on five NLP tasks spanning 12 datasets. Our results suggest that the variant proximity of pre-training data to fine-tuning data is more important than the pre-training data size. We exploit this insight in defining an optimized system selection model for the studied tasks.
International library standards require cataloguers to tediously input Romanization of their catalogue records for the benefit of library users without specific language expertise. In this paper, we present the first reported results on the task of automatic Romanization of undiacritized Arabic bibliographic entries. This complex task requires the modeling of Arabic phonology, morphology, and even semantics. We collected a 2.5M word corpus of parallel Arabic and Romanized bibliographic entries, and benchmarked a number of models that vary in terms of complexity and resource dependence. Our best system reaches 89.3% exact word Romanization on a blind test set. We make our data and code publicly available.
We present the findings and results of the Second Nuanced Arabic Dialect Identification Shared Task (NADI 2021). This Shared Task includes four subtasks: country-level Modern Standard Arabic (MSA) identification (Subtask 1.1), country-level dialect identification (Subtask 1.2), province-level MSA identification (Subtask 2.1), and province-level sub-dialect identification (Subtask 2.2). The shared task dataset covers a total of 100 provinces from 21 Arab countries, collected from the Twitter domain. A total of 53 teams from 23 countries registered to participate in the tasks, thus reflecting the interest of the community in this area. We received 16 submissions for Subtask 1.1 from five teams, 27 submissions for Subtask 1.2 from eight teams, 12 submissions for Subtask 2.1 from four teams, and 13 submissions for Subtask 2.2 from four teams.
This year's iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems' predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems' performance on previously unseen lemmas.
Time-offset interaction applications (TOIA) allow simulating conversations with people who have previously recorded relevant video utterances, which are played in response to their interacting user. TOIAs have great potential for preserving cross-generational and cross-cultural histories, online teaching, simulated interviews, etc. Current TOIAs exist in niche contexts involving high production costs. Democratizing TOIA presents different challenges when creating appropriate pre-recordings, designing different user stories, and creating simple online interfaces for experimentation. We open-source TOIA 2.0, a user-centered time-offset interaction application, and make it available for everyone who wants to interact with people’s pre-recordings, or create their pre-recordings.
Dialogue systems like chatbots and tasks like question answering (QA) have gained traction in recent years, yet evaluating such systems remains difficult. Reasons include the great variety in contexts and use cases for these systems as well as the high cost of human evaluation. In this paper, we focus on a specific type of dialogue system: Time-Offset Interaction Applications (TOIAs), intelligent conversational software that simulates face-to-face conversations between humans and pre-recorded human avatars. Under the constraint that a TOIA is a single output system interacting with users with different expectations, we identify two challenges: first, how do we define a ‘good’ answer? and second, what’s an appropriate metric to use? We explore both challenges through the creation of a novel dataset that identifies multiple good answers to specific TOIA questions through the help of Amazon Mechanical Turk workers. This ‘view from the crowd’ allows us to study the variations of how TOIA interrogators perceive its answers. Our contributions include the annotated dataset that we make publicly available and the proposal of Success Rate @k as an evaluation metric that is more appropriate than traditional QA and information retrieval metrics.
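The abstract above does not define Success Rate @k, so the following is only one plausible reading, stated as an assumption: a question counts as a success if at least one of the system's top-k ranked answers is among the answers annotators marked as good, and the metric is the fraction of questions that succeed. The function name and the toy data are hypothetical.

```python
from typing import List, Set


def success_rate_at_k(ranked: List[List[str]], good: List[Set[str]], k: int) -> float:
    """Fraction of questions with at least one good answer in the top k (assumed definition)."""
    if k <= 0:
        raise ValueError("k must be positive")
    if len(ranked) != len(good):
        raise ValueError("ranked and good must have one entry per question")
    if not ranked:
        return 0.0
    # A question is a hit if any of its top-k answers is in its set of good answers.
    hits = sum(
        1
        for answers, accepted in zip(ranked, good)
        if any(answer in accepted for answer in answers[:k])
    )
    return hits / len(ranked)


# Hypothetical data: two questions, three ranked answer IDs each.
ranked_answers = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
good_answers = [{"a2"}, {"b3"}]
rates = [success_rate_at_k(ranked_answers, good_answers, k) for k in (1, 2, 3)]
```

Under this reading, allowing several good answers per question is what distinguishes the metric from exact-match accuracy against a single reference.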
We present PALMYRA 2.0, a graphical dependency-tree visualization and editing software. PALMYRA 2.0 is designed to be highly configurable to any dependency parsing representation, and to enable the annotation of a multitude of linguistic features. It uses an intuitive interface that relies on drag-and-drop utilities as well as pop-up menus and keyboard shortcuts that can be easily specified.
In this paper, we present an approach for sentence-level gender reinflection using linguistically enhanced sequence-to-sequence models. Our system takes an Arabic sentence and a given target gender as input and generates a gender-reinflected sentence based on the target gender. We formulate the problem as a user-aware grammatical error correction task and build an encoder-decoder architecture to jointly model reinflection for both masculine and feminine grammatical genders. We also show that adding linguistic features to our model leads to better reinflection results. The results on a blind test set using our best system show improvements over previous work, with a 3.6% absolute increase in M2 F0.5.
We present the results and findings of the First Nuanced Arabic Dialect Identification Shared Task (NADI). This Shared Task includes two subtasks: country-level dialect identification (Subtask 1) and province-level sub-dialect identification (Subtask 2). The data for the shared task covers a total of 100 provinces from 21 Arab countries and is collected from the Twitter domain. As such, NADI is the first shared task to target naturally-occurring fine-grained dialectal text at the sub-country level. A total of 61 teams from 25 countries registered to participate in the tasks, thus reflecting the interest of the community in this area. We received 47 submissions for Subtask 1 from 18 teams and 9 submissions for Subtask 2 from 9 teams.
While online Arabic is primarily written using the Arabic script, a Roman-script variety called Arabizi is often seen on social media. Although this representation captures the phonology of the language, it is not a one-to-one mapping with the Arabic script version. This issue is exacerbated by the fact that Arabizi on social media is Dialectal Arabic which does not have a standard orthography. Furthermore, Arabizi tends to include a lot of code mixing between Arabic and English (or French). To map Arabizi text to Arabic script in the context of complete utterances, previously published efforts have split Arabizi detection and Arabic script target in two separate tasks. In this paper, we present the first effort on a unified model for Arabizi detection and transliteration into a code-mixed output with consistent Arabic spelling conventions, using a sequence-to-sequence deep learning model. Our best system achieves 80.6% word accuracy and 58.7% BLEU on a blind test set.
This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work.
The written forms of Semitic languages are both highly ambiguous and morphologically rich: a word can have multiple interpretations and is one of many inflected forms of the same concept or lemma. This is further exacerbated for dialectal content, which is more prone to noise and lacks a standard orthography. The morphological features can be lexicalized, like lemmas and diacritized forms, or non-lexicalized, like gender, number, and part-of-speech tags, among others. Joint modeling of the lexicalized and non-lexicalized features can identify more intricate morphological patterns, which provide better context modeling, and further disambiguate ambiguous lexical choices. However, the different modeling granularity can make joint modeling more difficult. Our approach models the different features jointly, whether lexicalized (on the character-level), or non-lexicalized (on the word-level). We use Arabic as a test case, and achieve state-of-the-art results for Modern Standard Arabic with 20% relative error reduction, and Egyptian Arabic with 11% relative error reduction.
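Relative error reduction, as reported above, compares error rates rather than accuracies: it is the share of the baseline's errors that the new system removes. A minimal sketch follows, with a hypothetical function name and hypothetical accuracy values that are not taken from the paper.

```python
def relative_error_reduction(baseline_acc: float, system_acc: float) -> float:
    """Relative error reduction from two accuracies given in percent."""
    baseline_err = 100.0 - baseline_acc
    if baseline_err <= 0:
        raise ValueError("baseline has no errors to reduce")
    system_err = 100.0 - system_acc
    # Share of the baseline's errors eliminated by the new system.
    return (baseline_err - system_err) / baseline_err


# Hypothetical accuracies: the baseline error of 10 points drops to 8 points.
rer = relative_error_reduction(90.0, 92.0)
```

In this hypothetical case a 2-point absolute accuracy gain corresponds to a 20% relative error reduction, which shows why the two kinds of figures should not be compared directly.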
Time-Offset Interaction Applications (TOIAs) are systems that simulate face-to-face conversations between humans and digital human avatars recorded in the past. Developing a well-functioning TOIA involves several research areas: artificial intelligence, human-computer interaction, natural language processing, question answering, and dialogue systems. The first challenges are to define a sensible methodology for data collection and to create useful data sets for training the system to retrieve the best answer to a user’s question. In this paper, we present three main contributions: a methodology for creating the knowledge base for a TOIA, a dialogue corpus, and baselines for single-turn answer retrieval. We develop the methodology using a two-step strategy. First, we let the avatar maker list pairs by intuition, guessing what possible questions a user may ask to the avatar. Second, we record actual dialogues between random individuals and the avatar-maker. We make the Margarita Dialogue Corpus available to the research community. This corpus comprises the knowledge base in text format, the video clips for each answer, and the annotated dialogues.
We present a large-scale 26,000-lemma leveled readability lexicon for Modern Standard Arabic. The lexicon was manually annotated in triplicate by language professionals from three regions in the Arab world. The annotations show a high degree of agreement, and major differences were limited to regional variations. Comparing lemma readability levels with their frequencies provided good insights into the benefits and pitfalls of frequency-based readability approaches. The lexicon will be publicly available.
In this paper we present the first full morphological analysis and disambiguation system for Gulf Arabic. We use an existing state-of-the-art morphological disambiguation system to investigate the effects of different data sizes and different combinations of morphological analyzers for Modern Standard Arabic, Egyptian Arabic, and Gulf Arabic. We find that in very low-resource settings, morphological analyzers help boost the performance of the full morphological disambiguation task. However, as the size of resources increases, the value of the morphological analyzers decreases.
Arabic dialects are the non-standard varieties of Arabic commonly spoken – and increasingly written on social media – across the Arab world. Arabic dialects do not have standard orthographies, a challenge for natural language processing applications. In this paper, we present the MADAR CODA Corpus, a collection of 10,000 sentences from five Arabic city dialects (Beirut, Cairo, Doha, Rabat, and Tunis) represented in the Conventional Orthography for Dialectal Arabic (CODA) in parallel with their raw original form. The sentences come from the Multi-Arabic Dialect Applications and Resources (MADAR) Project and are in parallel across the cities (2,000 sentences from each city). This publicly available resource is intended to support research on spelling correction and text normalization for Arabic dialects. We present results on a bootstrapping technique we use to speed up the CODA annotation, as well as on the degree of similarity across the dialects before and after CODA annotation.
We present CAMeL Tools, a collection of open-source tools for Arabic natural language processing in Python. CAMeL Tools currently provides utilities for pre-processing, morphological modeling, Dialect Identification, Named Entity Recognition and Sentiment Analysis. In this paper, we describe the design of CAMeL Tools and the functionalities it provides.
In this paper we present a parsing model for projective dependency trees that takes advantage of the existence of complementary dependency annotations, as is the case for Arabic with the availability of the CATiB and UD treebanks. Our system performs syntactic parsing according to both annotation types jointly as a sequence of arc-creating operations, and partially created trees for one annotation are also available to the other as features for the score function. This method gives an error reduction of 9.9% on CATiB and 6.1% on UD compared to a strong baseline, and ablation tests show that the main contribution to this reduction comes from sharing tree representation between tasks, and not simply from sharing BiLSTM layers as is often done in NLP multitask systems.
In this paper we present a character-level sequence-to-sequence lemmatization model, utilizing several subword features in multiple configurations. In addition to generic n-gram embeddings (using FastText), we experiment with concatenative (stems) and templatic (roots and patterns) morphological subwords. We present several architectures that embed these features directly at the encoder side, or learn them jointly at the decoder side with a multitask learning architecture. The results indicate that using the generic n-gram embeddings (through FastText) outperforms using the other, linguistically driven subwords. We use Modern Standard Arabic and Egyptian Arabic as test cases, with up to 22% and 13% relative error reduction, respectively, from a strong baseline. An error analysis shows that our best system is even able to handle word/lemma pairs that are both unseen in the training data.
This demo paper introduces the online Readability Leveled Arabic Thesaurus interface. For a given user input word, this interface provides the word’s possible lemmas, roots, English glosses, related Arabic words and phrases, and readability on a five-level readability scale. This interface builds on and connects multiple existing Arabic resources and processing tools. This one-of-a-kind system enables Arabic speakers and learners to benefit from advances in Arabic computational linguistics technologies. Feedback from users of the system will help the developers to identify lexical coverage gaps and errors. A live link to the demo is available at: http://samer.camel-lab.com/.
This demo paper describes ADIDA, a web-based system for automatic dialect identification for Arabic text. The system distinguishes among the dialects of 25 Arab cities (from Rabat to Muscat) in addition to Modern Standard Arabic. The results are presented with either a point map or a heat map visualizing the automatic identification probabilities over a geographical map of the Arab World.
The impressive progress in many Natural Language Processing (NLP) applications has increased the awareness of some of the biases these NLP systems have with regard to gender identities. In this paper, we propose an approach to extend biased single-output gender-blind NLP systems with gender-specific alternative reinflections. We focus on Arabic, a gender-marking morphologically rich language, in the context of machine translation (MT) from English, and for first-person-singular constructions only. Our contributions are the development of a system-independent gender-awareness wrapper, and the building of a corpus for training and evaluating first-person-singular gender identification and reinflection in Arabic. Our results successfully demonstrate the viability of this approach with an 8% relative increase in BLEU score for first-person-singular feminine, and a comparable 5.3% increase for first-person-singular masculine, on top of a state-of-the-art gender-blind MT system on a held-out test set.
We present de-lexical segmentation, a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input. Our technique involves creating a small grammar of closed-class affixes, which can be written in a few hours. The grammar overgenerates analyses for word forms attested in a raw corpus, and these analyses are disambiguated based on features of the linguistic base proposed for each form. Extending the grammar to cover orthographic, morpho-syntactic or lexical variation is simple, making it an ideal solution for challenging corpora with noisy, dialect-inconsistent, or otherwise non-standard content. In two evaluations, we consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art supervised models trained on large amounts of data, providing evidence for the value of linguistic input during preprocessing.
We present a collection of morphologically annotated corpora for seven Arabic dialects: Taizi Yemeni, Sanaani Yemeni, Najdi, Jordanian, Syrian, Iraqi and Moroccan Arabic. The corpora collectively cover over 200,000 words, and are all manually annotated in a common set of standards for orthography, diacritized lemmas, tokenization, morphological units and English glosses. These corpora will be publicly available to serve as benchmarks for training and evaluating systems for Arabic dialect morphological analysis and disambiguation.
In this paper, we present the results and findings of the MADAR Shared Task on Arabic Fine-Grained Dialect Identification. This shared task was organized as part of The Fourth Arabic Natural Language Processing Workshop, co-located with ACL 2019. The shared task includes two subtasks: the MADAR Travel Domain Dialect Identification subtask (Subtask 1) and the MADAR Twitter User Dialect Identification subtask (Subtask 2). This shared task is the first to target a large set of dialect labels at the city and country levels. The data for the shared task was created or collected under the Multi-Arabic Dialect Applications and Resources (MADAR) project. A total of 21 teams from 15 countries participated in the shared task.
Morphological tagging is challenging for morphologically rich languages due to the large target space and the need for more training data to minimize model sparsity. Dialectal variants of morphologically rich languages suffer more, as they tend to be noisier and have fewer resources. In this paper we explore the use of multitask learning and adversarial training to address morphological richness and dialectal variations in the context of full morphological tagging. We use multitask learning for joint morphological modeling of the features within two dialects, and as a knowledge-transfer scheme for cross-dialectal modeling. We use adversarial training to learn dialect-invariant features that can help the knowledge-transfer scheme from the high-resource to the low-resource variant. We work with two dialectal variants as a case study: Modern Standard Arabic (high-resource “dialect”) and Egyptian Arabic (low-resource dialect). Our models achieve state-of-the-art results for both. Furthermore, adversarial training provides a more significant improvement when using smaller training datasets.
Hypernymy modeling has largely been separated according to two paradigms, pattern-based methods and distributional methods. However, recent works utilizing a mix of these strategies have yielded state-of-the-art results. This paper evaluates the contribution of both paradigms to hybrid success by evaluating the benefits of hybrid treatment of baseline models from each paradigm. Even with a simple methodology for each individual system, utilizing a hybrid approach establishes new state-of-the-art results on two domain-specific English hypernym discovery tasks and outperforms all non-hybrid approaches in a general English hypernym discovery task.
Previous work on the problem of Arabic Dialect Identification typically targeted coarse-grained five dialect classes plus Standard Arabic (6-way classification). This paper presents the first results on a fine-grained dialect classification task covering 25 specific cities from across the Arab World, in addition to Standard Arabic – a very challenging task. We build several classification systems and explore a large space of features. Our results show that we can identify the exact city of a speaker at an accuracy of 67.9% for sentences with an average length of 7 words (a 9% relative error reduction over the state-of-the-art technique for Arabic dialect identification) and reach more than 90% when we consider 16 words. We also report on additional insights from a data analysis of similarity and difference across Arabic dialects.
We present Qutr (Query Translator), a smart cross-lingual communication application for the travel domain. Qutr is a real-time messaging app that automatically translates conversations while supporting keyword-to-sentence matching. Qutr relies on querying a database that holds commonly used pre-translated travel-domain phrases and phrase templates in different languages with the use of keywords. The query matching supports paraphrases, incomplete keywords and some input spelling errors. The application addresses common cross-lingual communication issues such as translation accuracy, speed, privacy, and personalization.
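The keyword-to-sentence matching described for Qutr can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the phrasebook entries, the function names `normalize_keyword` and `match`, the use of prefix matching for incomplete keywords, and the use of `difflib` similarity for spelling errors are not taken from the paper, and the translation step itself is omitted.

```python
import difflib

# Hypothetical phrasebook: a set of keywords paired with a stored phrase.
PHRASEBOOK = [
    ({"where", "station"}, "Where is the train station?"),
    ({"how", "much", "ticket"}, "How much is a ticket?"),
    ({"need", "doctor"}, "I need a doctor."),
]
VOCAB = sorted({kw for kws, _ in PHRASEBOOK for kw in kws})


def normalize_keyword(word):
    """Map a possibly incomplete or misspelled word to a known keyword, or None."""
    word = word.lower()
    if word in VOCAB:
        return word
    # Incomplete keyword: accept it only if it is a prefix of exactly one keyword.
    prefixed = [v for v in VOCAB if v.startswith(word)]
    if len(prefixed) == 1:
        return prefixed[0]
    # Spelling error: fall back to the closest keyword above a similarity cutoff.
    close = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.75)
    return close[0] if close else None


def match(query):
    """Return the stored phrase sharing the most keywords with the query, or None."""
    keys = {normalize_keyword(w) for w in query.split()} - {None}
    best_keywords, best_phrase = max(PHRASEBOOK, key=lambda e: len(e[0] & keys))
    return best_phrase if best_keywords & keys else None
```

In this sketch, `match("wher stat")` resolves both truncated keywords by prefix, while `match("docter")` relies on the similarity fallback. A real system would also need the paraphrase support and phrase templates mentioned in the abstract.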
Text normalization is an important enabling technology for several NLP tasks. Recently, neural-network-based approaches have outperformed well-established models in this task. However, in languages other than English, there has been little exploration in this direction. Both the scarcity of annotated data and the complexity of the language increase the difficulty of the problem. To address these challenges, we use a sequence-to-sequence model with character-based attention, which in addition to its self-learned character embeddings, uses word embeddings pre-trained with an approach that also models subword information. This provides the neural model with access to more linguistic information especially suitable for text normalization, without large parallel corpora. We show that providing the model with word-level features bridges the gap for the neural network approach to achieve a state-of-the-art F1 score on a standard Arabic language correction shared task dataset.
User-generated text tends to be noisy with many lexical and orthographic inconsistencies, making natural language processing (NLP) tasks more challenging. The challenging nature of noisy text processing is exacerbated for dialectal content, where in addition to spelling and lexical differences, dialectal text is characterized by morpho-syntactic and phonetic variations. These issues increase sparsity in NLP models and reduce accuracy. We present a neural morphological tagging and disambiguation model for Egyptian Arabic, with various extensions to handle noisy and inconsistent content. Our models achieve about 5% relative error reduction (1.1% absolute improvement) for full morphological analysis, and around 22% relative error reduction (1.8% absolute improvement) for part-of-speech tagging, over a state-of-the-art baseline.
Advances in automatic readability assessment can impact the way people consume information in a number of domains. Arabic, being a low-resource and morphologically complex language, presents numerous challenges to the task of automatic readability assessment. In this paper, we present the largest and most in-depth computational readability study for Arabic to date. We study a large set of features with varying depths, from shallow words to syntactic trees, for both L1 and L2 readability tasks. Our best L1 readability accuracy result is 94.8% (75% error reduction from a commonly used baseline). The comparable results for L2 are 72.4% (45% error reduction). We also demonstrate the added value of leveraging L1 features for L2 readability prediction.
Automatic syntactic parsing for question constructions is a challenging task due to the paucity of training examples in most treebanks. The near absence of question constructions is due to the dominance of the news domain in treebanking efforts. In this paper, we compare two synthetic low-cost question treebank creation methods with a conventional manual high-cost annotation method in the context of three domains (news questions, political talk shows, and chatbots) for Modern Standard Arabic, a language with relatively low resources and rich morphology. Our results show that synthetic methods can be effective at significantly reducing parsing errors for a target domain without having to invest large resources on manual annotation; and the combination of manual and synthetic methods is our best domain-independent performer.
This demonstration paper presents a bilingual (Arabic-English) interactive human avatar dialogue system. The system is named TOIA (time-offset interaction application), as it simulates face-to-face conversations between humans using digital human avatars recorded in the past. TOIA is a conversational agent, similar to a chat bot, except that it is based on an actual human being and can be used to preserve and tell stories. The system is designed to allow anybody, simply using a laptop, to create an avatar of themselves, thus facilitating cross-cultural and cross-generational sharing of narratives to wider audiences. The system currently supports monolingual and cross-lingual dialogues in Arabic and English, but can be extended to other languages.
Morphologically rich languages are challenging for natural language processing tasks due to data sparsity. This can be addressed either by introducing out-of-context morphological knowledge, or by developing machine learning architectures that specifically target data sparsity and/or morphological information. We find these approaches to complement each other in a morphological paradigm modeling task in Modern Standard Arabic, which, in addition to being morphologically complex, features ubiquitous ambiguity, exacerbating sparsity with noise. Given a small number of out-of-context rules describing closed class morphology, we combine them with word embeddings leveraging subword strings and noise reduction techniques. The combination outperforms both approaches individually by about 20% absolute. While morphological resources already exist for Modern Standard Arabic, our results inform how comparable resources might be constructed for non-standard dialects or any morphologically rich, low resourced language, given scarcity of time and funding.
We introduce CALIMA-Star, a very rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality and much more. This tool includes a fast engine that can be easily integrated into other systems, as well as an easy-to-use API and a web interface. CALIMA-Star also supports morphological reinflection. We evaluate CALIMA-Star against four commonly used analyzers for Arabic in terms of speed and morphological content.
Word embeddings are crucial to many natural language processing tasks. The quality of embeddings relies on large non-noisy corpora. Arabic dialects lack large corpora and are noisy, being linguistically disparate with no standardized spelling. We make three contributions to address this noise. First, we describe simple but effective adaptations to word embedding tools to maximize the informative content leveraged in each training sentence. Second, we analyze methods for representing disparate dialects in one embedding space, either by mapping individual dialects into a shared space or learning a joint model of all dialects. Finally, we evaluate via dictionary induction, showing that two metrics not typically reported in the task enable us to analyze our contributions’ effects on low and high frequency words. In addition to boosting performance by 2-53%, we specifically improve on noisy, low frequency forms without compromising accuracy on high frequency forms.
We present CALIMA-GLF, a Gulf Arabic morphological analyzer currently covering over 2,600 verbal lemmas. We describe in detail the process of building the analyzer, starting from phonetic dictionary entries and proceeding to fully inflected orthographic paradigms and the associated lexicon and orthographic variants. We evaluate the coverage of CALIMA-GLF against Modern Standard Arabic and Egyptian Arabic analyzers on part of a Gulf Arabic novel. In verb analysis, CALIMA-GLF’s token recall for identifying the correct POS tag exceeds that of the Modern Standard Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute, respectively.
Opinion mining in Arabic is a challenging task given the rich morphology of the language. The task becomes more challenging when it is applied to Twitter data, which contains additional sources of noise, such as the use of unstandardized dialectal variations, the nonconformation to grammatical rules, the use of Arabizi and code-switching, and the use of non-text objects such as images and URLs to express opinion. In this paper, we perform an analytical study to observe how such linguistic phenomena vary across different Arab regions. This study of Arabic Twitter characterization aims at providing better understanding of Arabic Tweets, and fostering advanced research on the topic. Furthermore, we explore the performance of the two schools of machine learning on Arabic Twitter, namely the feature engineering approach and the deep learning approach. We consider models that have achieved state-of-the-art performance for opinion mining in English. Results highlight the advantages of using deep learning-based models, and confirm the importance of using morphological abstractions to address Arabic’s complex morphology.
We present the MultiScript Phonetic Search algorithm to address the problem of language learners looking up unfamiliar words that they heard. We apply it to Arabic dictionary lookup with noisy queries done using both the Arabic and Roman scripts. Our algorithm is based on a computational phonetic distance metric that can be optionally machine learned. To benchmark our performance, we created the ArabScribe dataset, containing 10,000 noisy transcriptions of random Arabic dictionary words. Our algorithm outperforms Google Translate’s “did you mean” feature, as well as the Yamli smart Arabic keyboard.
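As an illustration of what a phonetic distance metric for noisy lookup can look like, the following is a minimal sketch of a weighted edit distance in which substituting phonetically similar letters is cheaper than substituting unrelated ones. The phonetic classes, the 0.3 substitution cost, the Roman-script-only input, and the names `phonetic_distance` and `lookup` are all assumptions for illustration; the abstract does not specify the actual metric, its weights, or how they are learned.

```python
# Hypothetical groups of Roman-script letters treated as phonetically close.
PHONETIC_CLASSES = [
    set("bpfv"), set("td"), set("sz"), set("kgq"),
    set("aeiou"), set("mn"), set("lr"),
]


def sub_cost(a, b):
    """Substitution cost: free if identical, cheap within a class, 1.0 otherwise."""
    if a == b:
        return 0.0
    for cls in PHONETIC_CLASSES:
        if a in cls and b in cls:
            return 0.3  # assumed weight; could instead be learned from data
    return 1.0


def phonetic_distance(x, y):
    """Weighted Levenshtein distance with unit insertion and deletion costs."""
    prev = [float(j) for j in range(len(y) + 1)]
    for i, a in enumerate(x, 1):
        cur = [float(i)]
        for j, b in enumerate(y, 1):
            cur.append(min(
                prev[j] + 1.0,               # delete a
                cur[j - 1] + 1.0,            # insert b
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = cur
    return prev[-1]


def lookup(query, lexicon, k=3):
    """Return the k lexicon entries closest to the noisy query."""
    return sorted(lexicon, key=lambda w: phonetic_distance(query, w))[:k]
```

With these assumed classes, a query such as "gitab" ranks "kitab" first among ["kitab", "katib", "maktab"], because the g/k substitution is cheap. Handling Arabic-script queries would require a mapping between the two scripts, which this sketch leaves out.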
We describe the process of creating NUDAR, a Universal Dependency treebank for Arabic. We present the conversion from the Penn Arabic Treebank to the Universal Dependency syntactic representation through an intermediate dependency representation. We discuss the challenges faced in the conversion of the trees, the decisions we made to solve them, and the validation of our conversion. We also present initial parsing results on NUDAR.
This paper presents a model for Arabic morphological disambiguation based on Recurrent Neural Networks (RNN). We train Long Short-Term Memory (LSTM) cells in several configurations and embedding levels to model the various morphological features. Our experiments show that these models outperform state-of-the-art systems without explicit use of feature engineering. However, adding learning features from a morphological analyzer to model the space of possible analyses provides additional improvement. We make use of the resulting morphological models for scoring and ranking the analyses of the morphological analyzer for morphological disambiguation. The results show significant gains in accuracy across several evaluation metrics. Our system results in 4.4% absolute increase over the state-of-the-art in full morphological analysis accuracy (30.6% relative error reduction), and 10.6% (31.5% relative error reduction) for out-of-vocabulary words.
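The relationship between the absolute increase and the relative error reduction quoted above can be checked with a short calculation. This is a sketch only: the function names are hypothetical, and the baseline accuracy it derives is implied by the two rounded figures in the abstract rather than reported there, so it should be read as approximate.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Fraction of the baseline's errors removed by the system (accuracies in %)."""
    baseline_err = 100.0 - baseline_acc
    system_err = 100.0 - system_acc
    return (baseline_err - system_err) / baseline_err


def implied_baseline_accuracy(absolute_gain, relative_reduction):
    """Back out the baseline accuracy (%) from an absolute gain (in points)
    and a relative error reduction (as a fraction)."""
    baseline_err = absolute_gain / relative_reduction
    return 100.0 - baseline_err


# Full morphological analysis: 4.4 points absolute, 30.6% relative error reduction.
baseline = implied_baseline_accuracy(4.4, 0.306)
system = baseline + 4.4
```

Under these figures the implied baseline error is 4.4 / 0.306, or roughly 14.4%, which corresponds to accuracies of about 85.6% for the baseline and about 90.0% for the system. The same calculation applies to the out-of-vocabulary figures (10.6 points and 31.5%).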
The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.
While sentiment analysis in English has achieved significant progress, it remains a challenging task in Arabic given the rich morphology of the language. It becomes more challenging when applied to Twitter data that comes with additional sources of noise including dialects, misspellings, grammatical mistakes, code switching and the use of non-textual objects to express sentiments. This paper describes the “OMAM” systems that we developed as part of SemEval-2017 task 4. We evaluate English state-of-the-art methods on Arabic tweets for subtask A. As for the remaining subtasks, we introduce a topic-based approach that accounts for topic specificities by predicting topics or domains of upcoming tweets, and then using this information to predict their sentiment. Results indicate that applying the English state-of-the-art method to Arabic has achieved solid results without significant enhancements. Furthermore, the topic-based method ranked 1st in subtasks C and E, and 2nd in subtask D.
We describe a supervised system that uses optimized Conditional Random Fields and lexical features to predict the sentiment of a tweet. The system was submitted to the English version of all subtasks in SemEval-2017 Task 4.
We present Arab-Acquis, a large publicly available dataset for evaluating machine translation between 22 European languages and Arabic. Arab-Acquis consists of over 12,000 sentences from the JRC-Acquis (Acquis Communautaire) corpus translated twice by professional translators, once from English and once from French, and totaling over 600,000 words. The corpus follows previous data splits in the literature for tuning, development, and testing. We describe the corpus and how it was created. We also present the first benchmarking results on translating to and from Arabic for 22 European languages.
Dialectal Arabic (DA) poses serious challenges for Natural Language Processing (NLP). The number and sophistication of tools and datasets in DA are very limited in comparison to Modern Standard Arabic (MSA) and other languages. MSA tools do not effectively model DA which makes the direct use of MSA NLP tools for handling dialects impractical. This is particularly a challenge for the creation of tools to support learning Arabic as a living language on the web, where authentic material can be found in both MSA and DA. In this paper, we present the Dialectal Arabic Linguistic Learning Assistant (DALILA), a Chrome extension that utilizes cutting-edge Arabic dialect NLP research to assist learners and non-native speakers in understanding text written in either MSA or DA. DALILA provides dialectal word analysis and English gloss corresponding to each word.
We present new language resources for Moroccan and Sanaani Yemeni Arabic. The resources include corpora for each dialect which have been morphologically annotated, and morphological analyzers for each dialect which are derived from these corpora. These are the first sets of resources for Moroccan and Yemeni Arabic. The resources will be made available to the public.
We present our guidelines and annotation procedure for creating a human-corrected, post-edited corpus of machine translations into Modern Standard Arabic. Our overarching goal is to use the annotated corpus to develop automatic machine translation post-editing systems for Arabic that can be used to help accelerate the human revision process of translated texts. The creation of any manually annotated corpus usually presents many challenges. In order to address these challenges, we created comprehensive and simplified annotation guidelines, which were used by a team of five annotators and one lead annotator. In order to ensure high agreement between the annotators, multiple training sessions were held and inter-annotator agreement was measured regularly to check the annotation quality. The resulting corpus of manually post-edited English-to-Arabic translations of articles is the largest to date for this language pair.
The goal of the cognitive machine translation (MT) evaluation approach is to build classifiers which assign post-editing effort scores to new texts. The approach helps estimate fair compensation for post-editors in the translation industry by evaluating the cognitive difficulty of post-editing MT output. The approach counts the number of errors classified in different categories on the basis of how much cognitive effort they require in order to be corrected. In this paper, we present the results of applying an existing cognitive evaluation approach to Modern Standard Arabic (MSA). We provide a comparison of the number of errors and categories of errors in three MSA texts of different MT quality (without any language-specific adaptation), as well as a comparison between MSA texts and texts from three Indo-European languages (Russian, Spanish, and Bulgarian), taken from a previous experiment. The results show how the error distributions change passing from the MSA texts of worse MT quality to MSA texts of better MT quality, as well as a similarity in distinguishing the texts of better MT quality for all four languages.
Text preprocessing is an important and necessary task for all NLP applications. A simple variation in any preprocessing step may drastically affect the final results. Moreover, replicability and comparability, as much as feasible, are among the goals of our scientific enterprise; building systems that ensure consistency across our various pipelines would therefore contribute significantly to these goals. The problem has become quite pronounced as more and more NLP tools become available, with differing levels of specification. In this paper, we present SPLIT, a dynamic unified preprocessing framework and tool that is highly configurable based on user requirements and serves as a preprocessing tool for several tools at once. SPLIT aims to standardize the implementations of the most important preprocessing steps by providing a unified API that can be exchanged among researchers to ensure complete transparency in replication. The user is able to select the required preprocessing tasks from a long list of preprocessing steps. The user is also able to specify the order of execution, which in turn affects the final preprocessing output.
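To make concrete why the order of execution matters, here is a minimal sketch of a configurable preprocessing pipeline in which the user selects steps by name and fixes their order. The step names, the `STEPS` registry, and the `preprocess` function are hypothetical and are not SPLIT's actual API, which the abstract does not describe.

```python
import re


def strip_punctuation(text):
    """Remove every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", text)


def split_on_hyphen(text):
    """Turn hyphens into spaces so hyphenated compounds become separate tokens."""
    return text.replace("-", " ")


def collapse_whitespace(text):
    """Squeeze runs of whitespace into single spaces."""
    return " ".join(text.split())


# Hypothetical registry mapping configuration names to step functions.
STEPS = {
    "strip_punctuation": strip_punctuation,
    "split_on_hyphen": split_on_hyphen,
    "collapse_whitespace": collapse_whitespace,
}


def preprocess(text, step_names):
    """Apply the selected steps in exactly the order the user listed them."""
    for name in step_names:
        text = STEPS[name](text)
    return text


text = "State-of-the-art  NLP!"
hyphen_first = preprocess(
    text, ["split_on_hyphen", "strip_punctuation", "collapse_whitespace"])
punct_first = preprocess(
    text, ["strip_punctuation", "split_on_hyphen", "collapse_whitespace"])
```

Here `hyphen_first` keeps "State of the art" as four tokens, whereas `punct_first` deletes the hyphens before they can be split on and yields the single token "Stateoftheart". Sharing the ordered list of step names alongside the results is one way such a configuration could support replication.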
Most Arabic natural language processing tools and resources are developed to serve Modern Standard Arabic (MSA), which is the official written language in the Arab World. Some Dialectal Arabic varieties, notably Egyptian Arabic, have received some attention lately and have a growing collection of resources that include annotated corpora and morphological analyzers and taggers. Gulf Arabic, however, lags behind in that respect. In this paper, we present the Gumar Corpus, a large-scale corpus of Gulf Arabic consisting of 110 million words from 1,200 forum novels. We annotate the corpus for sub-dialect information at the document level. We also present results of a preliminary study in the morphological annotation of Gulf Arabic which includes developing guidelines for a conventional orthography. The text of the corpus is publicly browsable through a web interface we developed for it.
We present a novel technique for Arabic morphological annotation. The technique utilizes diacritization to produce morphological annotations of quality comparable to human annotators. Although Arabic text is generally written without diacritics, diacritization is already available for large corpora of Arabic text in several genres. Furthermore, diacritization can be generated at a low cost for new text as it does not require specialized training beyond what educated Arabic typists know. The basic approach is to enrich the input to a state-of-the-art Arabic morphological analyzer with word diacritics (full or partial) to enhance its performance. When applied to fully diacritized text, our approach produces annotations with an accuracy of over 97% on lemma, part-of-speech, and tokenization combined.
A significant portion of data generated on blogging and microblogging websites is non-credible, as shown in many recent studies. To filter out such non-credible information, machine learning can be deployed to build automatic credibility classifiers. However, as is the case with most supervised machine learning approaches, sufficiently large and accurate training data must be available. In this paper, we focus on building a public Arabic corpus of blogs and microblogs that can be used for credibility classification. We focus on Arabic due to the recent popularity of blogs and microblogs in the Arab World and due to the lack of any such public corpora in Arabic. We discuss our data acquisition approach and annotation process, provide a rigorous analysis of the annotated data, and finally report some results on the effectiveness of our data for credibility classification.
Much research in education has been done on the study of different language teaching methods. However, there has been little investigation using computational analysis to compare such methods in terms of readability or complexity progression. In this paper, we make use of existing readability scoring techniques and our own classifiers to analyze the textbooks used in two very different teaching methods for English as a Second Language – the grammar-based and the communicative methods. Our analysis indicates that the grammar-based curriculum shows a more coherent readability progression compared to the communicative curriculum. This finding is consistent with expectations about the differences between these two methods and validates our approach’s value in comparing different teaching methods quantitatively.
Evaluation of machine translation (MT) into morphologically rich languages (MRL) has not been well studied despite posing many challenges. In this paper, we explore the use of embeddings obtained from different levels of lexical and morpho-syntactic linguistic analysis and show that they improve MT evaluation into an MRL. Specifically, we report on Arabic, a language with complex and rich morphology. Our results show that using a neural-network model with different input representations produces results that clearly outperform the state-of-the-art for MT evaluation into Arabic, with an increase of roughly 75% in correlation with human judgments on the pairwise MT evaluation quality task. More importantly, we demonstrate the usefulness of morpho-syntactic representations to model sentence similarity for MT evaluation and address complex linguistic phenomena of Arabic.
Arabic dialects present a special problem for natural language processing because there are few resources, they have no standard orthography, and have not been studied much. However, as more and more written dialectal Arabic is found in social media, NLP for Arabic dialects becomes an important goal. We present a methodology for creating a morphological analyzer and a morphological tagger for dialectal Arabic, and we illustrate it on Egyptian and Levantine Arabic. To our knowledge, these are the first analyzer and tagger for Levantine.
This paper presents BOTTA, the first Arabic dialect chatbot. We explore the challenges of creating a conversational agent that aims to simulate friendly conversations using the Egyptian Arabic dialect. We present a number of solutions and describe the different components of the BOTTA chatbot. The BOTTA database files are publicly available for researchers working on Arabic chatbot technologies. The BOTTA chatbot is also publicly available for any users who want to chat with it online.
In this paper, we present YAMAMA, a multi-dialect Arabic morphological analyzer and disambiguator. Our system is almost five times faster than the state-of-the-art MADAMIRA system, at slightly lower quality. In addition to speed, YAMAMA outputs a rich representation which allows for a wider spectrum of use. In this regard, YAMAMA transcends other systems, such as FARASA, which is faster but provides specific outputs catering to specific applications.
In this paper, we present CamelParser, a state-of-the-art system for Arabic syntactic dependency analysis aligned with contextually disambiguated morphological features. CamelParser uses a state-of-the-art morphological disambiguator and improves its results using syntactically driven features. The system offers a number of output formats that include basic dependency with morphological features, two tree visualization modes, and traditional Arabic grammatical analysis.
This tutorial introduces the different challenges and current solutions to the automatic processing of Arabic and its dialects. The tutorial has two parts: First, we present a discussion of generic issues relevant to Arabic NLP and detail dialectal linguistic issues and the challenges they pose for NLP. In the second part, we review the state-of-the-art in Arabic processing covering several enabling technologies and applications, e.g., dialect identification, morphological processing (analysis, disambiguation, tokenization, POS tagging), parsing, and machine translation.
This paper describes the parallel development of an Egyptian Arabic Treebank and a morphological analyzer for Egyptian Arabic (CALIMA). By the very nature of Egyptian Arabic, the data collected is informal, for example Discussion Forum text, which we use for the treebank discussed here. In addition, Egyptian Arabic, like other Arabic dialects, is sufficiently different from Modern Standard Arabic (MSA) that tools and techniques developed for MSA cannot simply be transferred to Egyptian Arabic. In particular, a morphological analyzer for Egyptian Arabic is needed to mediate between the written text and the segmented, vocalized form used for the syntactic trees. This led to the necessity of a feedback loop between the treebank team and the analyzer team, as improvements in each area were fed to the other. Therefore, by necessity, there needed to be close cooperation between the annotation team and the tool development team, which was to their mutual benefit. Collaboration on this type of challenge, where tools and resources are limited, proved to be remarkably synergistic and opens the way to further fruitful work on Arabic dialects.
We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwa's creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from both monolingual and bilingual existing resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both Theoretical Linguistics and Computational Linguistics research.
Tunisian Arabic is a dialect of the Arabic language spoken in Tunisia. Tunisian Arabic is an under-resourced language. It has neither a standard orthography nor large collections of written text and dictionaries. Actually, there is no strict separation between Modern Standard Arabic, the official language of the government, media and education, and Tunisian Arabic; the two exist on a continuum dominated by mixed forms. In this paper, we present a conventional orthography for Tunisian Arabic, following a previous effort on developing a conventional orthography for Dialectal Arabic (or CODA) demonstrated for Egyptian Arabic. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Tunisian Arabic.
In this paper we describe an effort to create a corpus and phonetic dictionary for Tunisian Arabic Automatic Speech Recognition (ASR). The corpus, named TARIC (Tunisian Arabic Railway Interaction Corpus) has a collection of audio recordings and transcriptions from dialogues in the Tunisian Railway Transport Network. The phonetic (or pronunciation) dictionary is an important ASR component that serves as an intermediary between acoustic models and language models in ASR systems. The method proposed in this paper, to automatically generate a phonetic dictionary, is rule based. For that reason, we define a set of pronunciation rules and a lexicon of exceptions. To determine the performance of our phonetic rules, we chose to evaluate our pronunciation dictionary on two types of corpora. The word error rate of word grapheme-to-phoneme mapping is around 9%.
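To make the rules-plus-exceptions design concrete, the following is a minimal sketch of a rule-based grapheme-to-phoneme converter. The rules, the exception entry, the phoneme symbols, and the Latin transliteration are all hypothetical and chosen only for illustration; they are not the pronunciation rules or the exception lexicon of the paper.

```python
# Toy rule-based grapheme-to-phoneme conversion with an exceptions lexicon.
# Every rule and entry below is hypothetical, not taken from the TARIC work.

# Words the rules would mispronounce are listed explicitly.
EXCEPTIONS = {
    "allah": ["a", "l", "l", "a:", "h"],
}

# Multi-character graphemes mapped to one phoneme symbol each.
# They are tried before the single-letter fallback.
RULES = [
    ("sh", "S"),
    ("kh", "x"),
    ("gh", "G"),
]


def to_phonemes(word):
    """Return a list of phoneme symbols for a transliterated word."""
    # The exceptions lexicon takes priority over the rules.
    if word in EXCEPTIONS:
        return list(EXCEPTIONS[word])
    phonemes = []
    i = 0
    while i < len(word):
        for grapheme, phoneme in RULES:
            if word.startswith(grapheme, i):
                phonemes.append(phoneme)
                i += len(grapheme)
                break
        else:
            # No rule matched: the letter maps to itself.
            phonemes.append(word[i])
            i += 1
    return phonemes


if __name__ == "__main__":
    for w in ("shams", "khubz", "allah"):
        print(w, to_phonemes(w))
```

Checking the exceptions lexicon first keeps the rule set small: irregular words are handled by lookup rather than by adding special-case rules.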
The daily spoken variety of Arabic is often termed the colloquial or dialect form of Arabic. There are many Arabic dialects across the Arab World and within other Arabic speaking communities. These dialects vary widely from region to region and to a lesser extent from city to city in each region. The dialects are not standardized, they are not taught, and they do not have official status. However they are the primary vehicles of communication (face-to-face and recently, online) and have a large presence in the arts as well. In this paper, we present the first multidialectal Arabic parallel corpus, a collection of 2,000 sentences in Standard Arabic, Egyptian, Tunisian, Jordanian, Palestinian and Syrian Arabic, in addition to English. Such parallel data does not exist naturally, which makes this corpus a very valuable resource that has many potential applications such as Arabic dialect identification and machine translation.
In this paper, we present MADAMIRA, a system for morphological analysis and disambiguation of Arabic that combines some of the best aspects of two previously commonly used systems for Arabic processing, MADA (Habash and Rambow, 2005; Habash et al., 2009; Habash et al., 2013) and AMIRA (Diab et al., 2007). MADAMIRA improves upon the two systems with a more streamlined Java implementation that is more robust, portable, and extensible, and that is faster than its ancestors by more than an order of magnitude. We also discuss an online demo (see http://nlp.ldeo.columbia.edu/madamira/) that highlights these aspects.
We present annotation guidelines and a web-based annotation framework developed as part of an effort to create a manually annotated Arabic corpus of errors and corrections for various text types. Such a corpus will be invaluable for developing Arabic error correction tools, both for training models and as a gold standard for evaluating error correction algorithms. We summarize the guidelines we created. We also describe issues encountered during the training of the annotators, as well as problems that are specific to the Arabic language that arose during the annotation process. Finally, we present the annotation tool that was developed as part of this project, the annotation pipeline, and the quality of the resulting annotations.
Dialectal Arabic (DA) refers to the day-to-day vernaculars spoken in the Arab world. DA lives side-by-side with the official language, Modern Standard Arabic (MSA). DA differs from MSA on all levels of linguistic representation, from phonology and morphology to lexicon and syntax. Unlike MSA, DA has no standard orthography since there are no Arabic dialect academies, nor is there a large edited body of dialectal literature that follows the same spelling standard. In this paper, we present CODA, a conventional orthography for dialectal Arabic; it is designed primarily for the purpose of developing computational models of Arabic dialects. We explain the design principles of CODA and provide a detailed description of its guidelines as applied to Egyptian Arabic.
In an ideal cross-lingual information retrieval (CLIR) system, a user query would generate a search over documents in a different language and the relevant results would be presented in the user’s language. In practice, CLIR systems are typically evaluated by judging result relevance in the document language, to factor out the effects of translating the results using machine translation (MT). In this paper, we investigate the influence of four different approaches for integrating MT and CLIR on both retrieval accuracy and user judgment of relevancy. We create a corpus with relevance judgments for both human and machine translated results, and use it to quantify the effect that MT quality has on end-to-end relevance. We find that MT errors result in a 16-39% decrease in mean average precision over the ground truth system that uses human translations. MT errors also caused relevant sentences to appear irrelevant – 5-19% of sentences were relevant in human translation, but were judged irrelevant in MT. To counter this degradation, we present two hybrid retrieval models and two automatic MT post-editing techniques and show that these approaches substantially mitigate the errors and improve the end-to-end relevance.
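For readers unfamiliar with the metric, the following sketch shows the standard computation of mean average precision over ranked result lists with binary relevance labels. It is not the paper's evaluation code; the function names are illustrative, and it assumes every relevant item for a query appears somewhere in that query's ranked list.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance labels.

    Assumes all relevant items for the query occur in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the per-query average precision values."""
    return sum(average_precision(r) for r in runs) / len(runs)


if __name__ == "__main__":
    # Hypothetical relevance labels for two queries.
    print(mean_average_precision([[1, 1, 0], [1, 0, 1]]))
```

Because precision is accumulated only at relevant ranks, a translation error that pushes a relevant sentence lower in the ranking, or makes it look irrelevant, lowers the score directly.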
Arabic poses many interesting challenges to machine translation: ambiguous orthography, rich morphology, complex morpho-syntactic behavior, and numerous dialects. In this tutorial, we introduce the most important themes of challenges and solutions for people working on translation from/to Arabic or any of its dialects. The tutorial is intended for researchers and developers working on MT. The discussion of linguistic issues and how they are addressed in MT will help linguists and professional translators understand the issues machine translation faces when dealing with Arabic and other morphologically rich languages. The tutorial does not expect the attendees to be able to speak/read/write Arabic.
Hebrew and Arabic are related but mutually incomprehensible languages with complex morphology and scarce parallel corpora. Machine translation between the two languages is therefore interesting and challenging. We discuss similarities and differences between Hebrew and Arabic, the benefits and challenges that they induce, respectively, and their implications for machine translation. We highlight the shortcomings of using English as a pivot language and advocate a direct, transfer-based and linguistically-informed (but still statistical, and hence scalable) approach. We report preliminary results of such a system that we are currently developing.
The Quranic Arabic Corpus (http://corpus.quran.com) is an annotated linguistic resource with multiple layers of annotation including morphological segmentation, part-of-speech tagging, and syntactic analysis using dependency grammar. The motivation behind this work is to produce a resource that enables further analysis of the Quran, the 1,400 year old central religious text of Islam. This paper describes a new approach to morphological annotation of Quranic Arabic, a genre difficult to compare with other forms of Arabic. Processing Quranic Arabic is a unique challenge from a computational point of view, since the vocabulary and spelling differ from Modern Standard Arabic. The Quranic Arabic Corpus differs from other Arabic computational resources in adopting a tagset that closely follows traditional Arabic grammar. We made this decision in order to leverage a large body of existing historical grammatical analysis, and to encourage online collaborative annotation. In this paper, we discuss how the unique challenge of morphological annotation of Quranic Arabic is solved using a multi-stage approach. The different stages include automatic morphological tagging using diacritic edit-distance, two-pass manual verification, and online collaborative annotation. This process is evaluated to validate the appropriateness of the chosen methodology.
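One plausible reading of tagging by diacritic edit-distance is that, for each fully diacritized word, the candidate analysis whose diacritized form is closest to the observed word is selected. The sketch below illustrates that idea with plain Levenshtein distance; the selection strategy, the function names, and the Latin-transliterated examples are assumptions for illustration, not the procedure described in the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_analysis(observed, candidates):
    """Pick the candidate form with the smallest distance to the observed word."""
    return min(candidates, key=lambda form: edit_distance(observed, form))


if __name__ == "__main__":
    # Hypothetical diacritized candidates in Latin transliteration.
    print(closest_analysis("kataba", ["kutiba", "kataba", "kutub"]))
```

Ties are resolved in favour of the first candidate listed, which is one reason a manual verification stage remains useful after automatic selection.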
MAGEAD is a morphological analyzer and generator for Modern Standard Arabic (MSA) and its dialects. We introduced MAGEAD in previous work with an implementation of MSA and Levantine Arabic verbs. In this paper, we port that system to MSA nominals (nouns and adjectives), which are far more complex to model than verbs. Our system is a functional morphological analyzer and generator, i.e., it analyzes to and generates from a representation consisting of a lexeme and linguistic feature-value pairs, where the features are syntactically (and perhaps semantically) meaningful, rather than just morphologically. A detailed evaluation of the current implementation comparing it to a commonly used morphological analyzer shows that it has good morphological coverage with precision and recall scores in the 90s. An error analysis reveals that the majority of recall and precision errors are problems in the gold standard or a result of the discrepancy between different models of form-based/functional morphology.
Much of the work on Statistical Machine Translation (SMT) from morphologically rich languages has shown that morphological tokenization and orthographic normalization help improve SMT quality because of the sparsity reduction they contribute. In this paper, we study the effect of these processes on SMT when translating into a morphologically rich language, namely Arabic. We explore a space of tokenization schemes and normalization options. We only evaluate on detokenized and orthographically correct (enriched) output. Our results show that the best performing tokenization scheme is that of the Penn Arabic Treebank. Additionally, training on orthographically normalized (reduced) text then jointly enriching and detokenizing the output outperforms training on enriched text.
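As an illustration of what orthographic normalization to a reduced form can look like, the sketch below collapses Hamzated and Madda forms of Alif to bare Alif and maps Alif-Maqsura to Ya. The abstract does not list which characters are normalized, so this particular mapping is an assumption based on common practice in Arabic NLP, and the function name is hypothetical.

```python
# Character-level reduction of common Arabic spelling variants.
# The exact mapping is an assumption, not taken from the paper.
REDUCE = str.maketrans({
    "\u0623": "\u0627",  # ALEF WITH HAMZA ABOVE -> ALEF
    "\u0625": "\u0627",  # ALEF WITH HAMZA BELOW -> ALEF
    "\u0622": "\u0627",  # ALEF WITH MADDA ABOVE -> ALEF
    "\u0649": "\u064A",  # ALEF MAKSURA -> YEH
})


def reduce_orthography(text):
    """Return text with the variants above collapsed to one form each."""
    return text.translate(REDUCE)


if __name__ == "__main__":
    word = "\u0625\u0644\u0649"  # the preposition meaning "to"
    print(reduce_orthography(word))
```

The mapping is many-to-one, so it cannot be undone character by character; restoring the enriched form on the output side needs a separate, context-aware step.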
We improve our recently proposed technique for integrating Arabic verb-subject constructions in SMT word alignment (Carpuat et al., 2010) by distinguishing between matrix (or main clause) and non-matrix Arabic verb-subject constructions. In gold translations, most matrix VS (main clause verb-subject) constructions are translated in inverted SV order, while non-matrix (subordinate clause) VS constructions are inverted in only half the cases. In addition, while detecting verbs and their subjects is a hard task, our syntactic parser detects VS constructions better in matrix than in non-matrix clauses. As a result, reordering only matrix VS for word alignment consistently improves translation quality over a phrase-based SMT baseline, and over reordering all VS constructions, in both medium- and large-scale settings. In fact, the improvements obtained by reordering matrix VS on the medium-scale setting remarkably represent 44% of the gain in BLEU and 51% of the gain in TER obtained with a word alignment training bitext that is 5 times larger.
We discuss a named entity recognition system for Arabic, and show how we incorporated the information provided by MADA, a full morphological tagger which uses a morphological analyzer. Surprisingly, the relevant features used are the capitalization of the English gloss chosen by the tagger, and the fact that an analysis is returned (that a word is not OOV to the morphological analyzer). The use of the tagger also improves over a third system which just uses a morphological analyzer, yielding a 14% reduction in error over the baseline. We conduct a thorough error analysis to identify sources of success and failure among the variations, and show that by combining the systems in simple ways we can significantly influence the precision-recall trade-off.
In this paper, we define the task of Number Identification in natural context. We present and validate a language-independent semi-automatic approach to quickly building a gold standard for evaluating number identification systems by exploiting hand-aligned parallel data. We also present and extensively evaluate a robust rule-based system for number identification in natural context for Arabic for a variety of number formats and types. The system is shown to have strong performance, achieving, on a blind test, a 94.8% F-score for the task of correctly identifying number expression spans in natural text, and a 92.1% F-score for the task of correctly determining the core numerical value.
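The span-level F-score reported above can be illustrated as follows. This is a sketch that assumes exact-match scoring, where a predicted span counts as correct only if both of its boundaries match a gold span; the abstract does not state the matching criterion, and the function name and example spans are hypothetical.

```python
def span_prf(gold, predicted):
    """Precision, recall and F1 over (start, end) spans, exact match only."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical token-offset spans of number expressions.
    gold = [(0, 2), (5, 6), (9, 11)]
    predicted = [(0, 2), (5, 7), (9, 11)]
    print(span_prf(gold, predicted))
```

Under exact matching, a span that is off by one token counts as both a false positive and a false negative, so boundary errors are penalized twice.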
We present an approach for online handling of Out-of-Vocabulary (OOV) terms in Urdu-English MT. Since Urdu is morphologically richer than English, we expect a large portion of the OOV terms to be Urdu morphological variations that are irrelevant to English. We describe an approach to automatically learn English-irrelevant (target-irrelevant) Urdu (source) morphological variation rules from standard phrase tables. These rules are learned in an unsupervised (or lightly supervised) manner by exploiting redundancy in Urdu and collocation with English translations. We use these rules to hypothesize in-vocabulary alternatives to the OOV terms. Our results show that we reduce the OOV rate from a standard baseline average of 2.6% to an average of 0.3% (or 89% relative decrease). We also increase the BLEU score by 0.45 (absolute) and 2.8% (relative) on a standard test set. A manual error analysis shows that 28% of handled OOV cases produce acceptable translations in context.
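The relationship between the absolute and relative figures above can be checked with simple arithmetic. The sketch below uses only the rounded values from the abstract, so it yields a relative OOV decrease of about 88.5%, close to the reported 89%; the gap presumably comes from rounding of the averages. The baseline BLEU it derives is an inference from the two reported gains, not a number stated in the abstract.

```python
def relative_change(before, after):
    """Signed change from before to after, as a fraction of before."""
    return (after - before) / before


# OOV rate: 2.6% baseline average, 0.3% after handling (rounded values).
oov_change = relative_change(2.6, 0.3)

# BLEU: a gain of 0.45 absolute that equals 2.8% relative implies
# a baseline of roughly 0.45 / 0.028 (an inference, not a reported score).
implied_bleu_baseline = 0.45 / 0.028

if __name__ == "__main__":
    print(f"relative OOV change: {oov_change:.1%}")
    print(f"implied baseline BLEU: {implied_bleu_baseline:.1f}")
```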
In this paper, we describe the methodological procedures and issues that emerged from the development of a pilot Levantine Arabic Treebank (LATB) at the Linguistic Data Consortium (LDC) and its use at the Johns Hopkins University (JHU) Center for Language and Speech Processing workshop on Parsing Arabic Dialects (PAD). This pilot, consisting of morphological and syntactic annotation of approximately 26,000 words of Levantine Arabic conversational telephone speech, was developed under severe time constraints; hence the LDC team drew on their experience in treebanking Modern Standard Arabic (MSA) text. The resulting Levantine dialect treebanked corpus was used by the PAD team to develop and evaluate parsers for Levantine dialect texts. The parsers were trained on MSA resources and adapted using dialect-MSA lexical resources (some developed especially for this task) and existing linguistic knowledge about syntactic differences between MSA and dialect. The use of the LATB for development and evaluation of syntactic parsers allowed the PAD team to provide feedback to the LDC treebank developers. In this paper, we describe the creation of resources for this corpus, as well as transformations on the corpus to eliminate speech effects and lessen the gap between our pre-existing MSA resources and the new dialectal corpus.
Six sites participated in the Interlingual Annotation of Multilingual Text Corpora (IAMTC) project (Dorr et al., 2004; Farwell et al., 2004; Mitamura et al., 2004). Parsed versions of English translations of news articles in Arabic, French, Hindi, Japanese, Korean and Spanish were annotated by up to ten annotators. Their task was to match open-class lexical items (nouns, verbs, adjectives, adverbs) to one or more concepts taken from the Omega ontology (Philpot et al., 2003), and to identify theta roles for verb arguments. The annotated corpus is intended to be a resource for meaning-based approaches to machine translation. Here we discuss inter-annotator agreement for the corpus. The annotation task is characterized by annotators' freedom to select multiple concepts or roles per lexical item. As a result, the annotation categories are sets, the number of which is bounded only by the number of distinct annotator-lexical item pairs. We use a reliability metric designed to handle partial agreement between sets. The best results pertain to the part of the ontology derived from WordNet. We examine change over the course of the project, differences among annotators, and differences across parts of speech. Our results suggest a strong learning effect early in the project.
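To show what partial agreement between set-valued annotations means in practice, the sketch below scores two annotators' concept sets with Jaccard distance. The abstract does not name the metric used, so Jaccard distance is only an illustrative stand-in, and the concept labels are hypothetical.

```python
def jaccard_distance(a, b):
    """Distance between two label sets: 0.0 if identical, 1.0 if disjoint."""
    a, b = set(a), set(b)
    if not a and not b:
        # Two empty annotations are treated as full agreement.
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


if __name__ == "__main__":
    # Hypothetical concept sets chosen by two annotators for one lexical item.
    annotator_1 = {"concept_a", "concept_b"}
    annotator_2 = {"concept_b", "concept_c"}
    print(jaccard_distance(annotator_1, annotator_2))
```

A graded distance of this kind gives credit for overlapping selections, whereas an all-or-nothing comparison would treat any two non-identical sets as complete disagreement.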
This paper describes an effort to investigate the incrementally deepening development of an interlingua notation, validated by human annotation of texts in English plus six languages. We begin with deep syntactic annotation, and in this paper present a series of annotation manuals for six different languages at the deep-syntactic level of representation. Many syntactic differences between languages are removed in the proposed syntactic annotation, making them useful resources for multilingual NLP projects with semantic components.
This paper describes the issues involved in extending a trans-lingual lexicon, the TextWise Conceptual Interlingua (CI), with Arabic terms. The Conceptual Interlingua is based on the Princeton English WordNet (Fellbaum, 1998). It is a central component in the cross-lingual information retrieval (CLIR) system CINDOR (Conceptual INterlingua for DOcument Retrieval). Arabic has a rich morphological system combining templatic and affixational paradigms for both inflectional and derivational morphology. This rich morphology poses a major challenge to the design and building of the Arabic CI and also its validation. This is because the available resources for Arabic, whether manually constructed bilingual lexicons or lexicons automatically derived from bilingual parallel corpora, exist at different levels of morphological representation. We describe here the issues and decisions made in the design and construction of the Arabic-English CI using different types of manual and automatic resources. We also present the results of an extensive validation of the Arabic CI and briefly discuss the evaluation of its use for CLIR on the TREC Arabic Benchmark collection.
The research context of this paper is developing hybrid machine translation (MT) systems that exploit the advantages of linguistic rule-based and statistical MT systems. Arabic, as a morphologically rich language, is especially challenging even without addressing the hybridization question. In this paper, we describe the challenges in building an Arabic-English generation-heavy machine translation (GHMT) system and boosting it with statistical machine translation (SMT) components. We present an extensive evaluation of multiple system variants and report positive results on the advantages of hybridization.
An adaptable statistical or hybrid MT system relies heavily on the quality of word-level alignments of real-world data. Statistical alignment approaches provide a reasonable initial estimate for word alignment. However, they cannot handle certain types of linguistic phenomena such as long-distance dependencies and structural differences between languages. We address this issue in Multi-Align, a new framework for incremental testing of different alignment algorithms and their combinations. Our design allows users to tune their systems to the properties of a particular genre/domain while still benefiting from general linguistic knowledge associated with a language pair. We demonstrate that a combination of statistical and linguistically-informed alignments can resolve translation divergences during the alignment process.
MT systems that use only superficial representations, including the current generation of statistical MT systems, have been successful and useful. However, they will experience a plateau in quality, much like other “silver bullet” approaches to MT. We pursue work on the development of interlingual representations for use in symbolic or hybrid MT systems. In this paper, we describe the creation of an interlingua and the development of a corpus of semantically annotated text, to be validated in six languages and evaluated in several ways. We have established a distributed, well-functioning research methodology, designed a preliminary interlingua notation, created annotation manuals and tools, developed a test collection in six languages with associated English translations, annotated some 150 translations, and designed and applied various annotation metrics. We describe the data sets being annotated and the interlingual (IL) representation language which uses two ontologies and a systematic theta-role list. We present the annotation tools built and outline the annotation process. Following this, we describe our evaluation methodology and conclude with a summary of issues that have arisen.
This paper describes and evaluates Matador, an implemented large-scale Spanish-English MT system built in the Generation-Heavy Hybrid Machine Translation (GHMT) approach. An extensive evaluation shows that Matador has a higher degree of robustness and superior output quality, in terms of grammaticality and accuracy, when compared to a primarily statistical approach.
We present a new large-scale database called “CatVar” (Habash and Dorr, 2003) which contains categorial variations of English lexemes. Due to the prevalence of cross-language categorial variation in multilingual applications, our categorial-variation resource may serve as an integral part of a diverse range of natural language applications. Thus, the research reported herein overlaps heavily with that of the machine-translation, lexicon-construction, and information-retrieval communities. We demonstrate this database, embedded in a graphical interface; we also show a GUI for user input of corrections to the database.
The frequent occurrence of divergences (structural differences between languages) presents a great challenge for statistical word-level alignment. In this paper, we introduce DUSTer, a method for systematically identifying common divergence types and transforming an English sentence structure to bear a closer resemblance to that of another language. Our ultimate goal is to enable more accurate alignment and projection of dependency trees in another language without requiring any training on dependency-tree data in that language. We present an empirical analysis comparing the complexities of performing word-level alignments with and without divergence handling. Our results suggest that our approach facilitates word-level alignment, particularly for sentence pairs containing divergences.
This paper describes a novel approach to handling translation divergences in a Generation-Heavy Hybrid Machine Translation (GHMT) system. The translation divergence problem is usually reserved for Transfer and Interlingual MT because it requires a large combination of complex lexical and structural mappings. A major requirement of these approaches is the accessibility of large amounts of explicit symmetric knowledge for both source and target languages. This limitation renders Transfer and Interlingual approaches ineffective in the face of structurally-divergent language pairs with asymmetric resources. GHMT addresses the more common form of this problem, source-poor/target-rich, by fully exploiting symbolic and statistical target-language resources. This non-interlingual non-transfer approach is accomplished by using target-language lexical semantics, categorial variations and subcategorization frames to overgenerate multiple lexico-structural variations from a target-glossed syntactic dependency of the source-language sentence. The symbolic overgeneration, which accounts for different possible translation divergences, is constrained by a statistical target-language model.
This paper describes a language independent linearization engine, oxyGen. This system compiles target language grammars into programs that take feature graphs as inputs and generate word lattices that can be passed along to the statistical extraction module of the generation system Nitrogen. The grammars are written using a flexible and powerful language, oxyL, that has the power of a programming language but focuses on natural language realization. This engine has been used successfully in creating an English linearization program that is currently employed as part of a Chinese-English machine translation system.
This paper describes an implemented algorithm for syntactic realization of a target-language sentence from an interlingual representation called Lexical Conceptual Structure (LCS). We provide a mapping between LCS thematic roles and Abstract Meaning Representation (AMR) relations; these relations serve as input to an off-the-shelf generator (Nitrogen). There are two contributions of this work: (1) the development of a thematic hierarchy that provides ordering information for realization of arguments in their surface positions; (2) the provision of a diagnostic tool for detecting inconsistencies in an existing online LCS-based lexicon that allows us to enhance principles for thematic-role assignment.