<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="1300">
    <title>Proceedings of the Third Arabic Natural Language Processing Workshop</title>
    <editor>Nizar Habash</editor>
    <editor>Mona Diab</editor>
    <editor>Kareem Darwish</editor>
    <editor>Wassim El-Hajj</editor>
    <editor>Hend Al-Khalifa</editor>
    <editor>Houda Bouamor</editor>
    <editor>Nadi Tomeh</editor>
    <editor>Mahmoud El-Haj</editor>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-13</url>
    <bibtype>book</bibtype>
    <bibkey>W17-13:2017</bibkey>
  </paper>

  <paper id="1301">
    <title>Identification of Languages in Algerian Arabic Multilingual Documents</title>
    <author><first>Wafia</first><last>Adouane</last></author>
    <author><first>Simon</first><last>Dobnik</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;8</pages>
    <url>http://www.aclweb.org/anthology/W17-1301</url>
    <abstract>This paper presents a language identification system designed to detect the
	language of each word, in its context, in multilingual documents as generated
	in social media by bilingual/multilingual communities, in our case speakers of
	Algerian Arabic. We frame the task as a sequence tagging problem and use
	supervised machine learning with standard methods like HMM and n-gram
	classification tagging. We also experiment with a lexicon-based method.
	Combining all the methods in a fall-back mechanism and introducing some
	linguistic rules to deal with unseen tokens and ambiguous words gives an
	overall accuracy of 93.14%. Finally, we introduce rules for language
	identification from sequences of recognised words.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>adouane-dobnik:2017:W17-13</bibkey>
  </paper>

  <paper id="1302">
    <title>Arabic Diacritization: Stats, Rules, and Hacks</title>
    <author><first>Kareem</first><last>Darwish</last></author>
    <author><first>Hamdy</first><last>Mubarak</last></author>
    <author><first>Ahmed</first><last>Abdelali</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>9&#8211;17</pages>
    <url>http://www.aclweb.org/anthology/W17-1302</url>
    <abstract>In this paper, we present a new and fast state-of-the-art Arabic diacritizer
	that guesses the diacritics of words and then their case endings. We employ a
	Viterbi decoder at word-level with back-off to stem, morphological patterns,
	and transliteration and sequence labeling based diacritization of named
	entities. For case endings, we use Support Vector Machine (SVM) based ranking
	coupled with morphological patterns and linguistic rules to properly guess case
	endings. We achieve low word-level diacritization error rates of 3.29% and
	12.77% without and with case endings, respectively, on a new multi-genre,
	copyright-free test set. We are making the diacritizer available for free for
	research purposes.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>darwish-mubarak-abdelali:2017:W17-13</bibkey>
  </paper>

  <paper id="1303">
    <title>Semantic Similarity of Arabic Sentences with Word Embeddings</title>
    <author><first>El Moatez Billah</first><last>Nagoudi</last></author>
    <author><first>Didier</first><last>Schwab</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>18&#8211;24</pages>
    <url>http://www.aclweb.org/anthology/W17-1303</url>
    <abstract>Semantic textual similarity is the basis of countless applications and plays an
	important role in diverse areas, such as information retrieval, plagiarism
	detection, information extraction and machine translation. This article
	proposes an innovative word embedding-based system designed to calculate the
	semantic similarity of Arabic sentences. The main idea is to exploit vectors as
	word representations in a multidimensional space in order to capture the
	semantic and syntactic properties of words. IDF weighting and part-of-speech
	tagging are applied to the examined sentences to support the identification of
	words that are highly descriptive in each sentence. The performance of our
	proposed system is confirmed through the Pearson correlation between our
	assigned semantic similarity scores and human judgments.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nagoudi-schwab:2017:W17-13</bibkey>
  </paper>

  <paper id="1304">
    <title>Morphological Analysis for the Maltese Language: The challenges of a hybrid system</title>
    <author><first>Claudia</first><last>Borg</last></author>
    <author><first>Albert</first><last>Gatt</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>25&#8211;34</pages>
    <url>http://www.aclweb.org/anthology/W17-1304</url>
    <abstract>Maltese is a morphologically rich language with a hybrid morphological system
	which features both concatenative and non-concatenative processes. This paper
	analyses the impact of this hybridity on the performance of machine learning
	techniques for morphological labelling and clustering. In particular, we
	analyse a dataset of morphologically related word clusters to evaluate the
	difference in results for concatenative and non-concatenative clusters. We also
	describe research carried out in morphological labelling, with a particular
	focus on the verb category. Two evaluations were carried out, one using an
	unseen dataset, and another one using a gold standard dataset which was
	manually labelled. The gold standard dataset was split into concatenative and
	non-concatenative subsets to analyse the difference in results between the two
	morphological systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>borg-gatt:2017:W17-13</bibkey>
  </paper>

  <paper id="1305">
    <title>A Morphological Analyzer for Gulf Arabic Verbs</title>
    <author><first>Salam</first><last>Khalifa</last></author>
    <author><first>Sara</first><last>Hassan</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>35&#8211;45</pages>
    <url>http://www.aclweb.org/anthology/W17-1305</url>
    <abstract>We present CALIMA-GLF, a Gulf Arabic morphological analyzer currently covering
	over 2,600 verbal lemmas. We describe in detail the process of building the
	analyzer, starting from phonetic dictionary entries to fully inflected
	orthographic paradigms and the associated lexicon and orthographic variants. We
	evaluate the coverage of CALIMA-GLF against Modern Standard Arabic and Egyptian
	Arabic analyzers on part of a Gulf Arabic novel. CALIMA-GLF verb analysis token
	recall for identifying the correct POS tag outperforms both the Modern Standard
	Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute,
	respectively.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>khalifa-hassan-habash:2017:W17-13</bibkey>
  </paper>

  <paper id="1306">
    <title>A Neural Architecture for Dialectal Arabic Segmentation</title>
    <author><first>Younes</first><last>Samih</last></author>
    <author><first>Mohammed</first><last>Attia</last></author>
    <author><first>Mohamed</first><last>Eldesouki</last></author>
    <author><first>Ahmed</first><last>Abdelali</last></author>
    <author><first>Hamdy</first><last>Mubarak</last></author>
    <author><first>Laura</first><last>Kallmeyer</last></author>
    <author><first>Kareem</first><last>Darwish</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>46&#8211;54</pages>
    <url>http://www.aclweb.org/anthology/W17-1306</url>
    <abstract>The automated processing of Arabic Dialects is challenging due to the lack of
	spelling standards and to the scarcity of annotated data and resources in
	general. Segmentation of words into their constituent parts is an important
	processing building block. In this paper, we show how a segmenter can be
	trained on only 350 annotated tweets using neural networks, without any
	normalization or use of lexical features or lexical resources. We deal with
	segmentation as a sequence labeling problem at the character level. We show
	experimentally that our model can rival state-of-the-art methods that rely on
	additional resources.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>samih-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1307">
    <title>Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments</title>
    <author><first>Salima</first><last>Medhaffar</last></author>
    <author><first>Fethi</first><last>Bougares</last></author>
    <author><first>Yannick</first><last>Est&#232;ve</last></author>
    <author><first>Lamia</first><last>Hadrich-Belguith</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>55&#8211;61</pages>
    <url>http://www.aclweb.org/anthology/W17-1307</url>
    <abstract>Dialectal Arabic (DA) is significantly different from the Arabic
	language taught in schools and used in written communication and formal speech
	(broadcast news, religion, politics, etc.). There is much existing research
	in the field of Arabic Sentiment Analysis (SA); however, it is
	generally restricted to Modern Standard Arabic (MSA) or some dialects of
	economic or political interest. In this paper we are interested in the SA of
	the Tunisian Dialect. We utilize Machine Learning techniques to determine the
	polarity of comments written in Tunisian Dialect. First, we evaluate the
	performance of SA systems with models trained using freely available MSA and
	multi-dialectal data sets. We then collect and annotate a Tunisian Dialect
	corpus of 17,000 comments from Facebook. This corpus allows us to achieve a
	significant accuracy improvement compared to the best model trained on other
	Arabic dialects or MSA data.
	We believe that this first freely available corpus will be valuable to
	researchers working in the field of Tunisian Sentiment Analysis and similar
	areas.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>medhaffar-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1308">
    <title>CAT: Credibility Analysis of Arabic Content on Twitter</title>
    <author><first>Rim</first><last>El Ballouli</last></author>
    <author><first>Wassim</first><last>El-Hajj</last></author>
    <author><first>Ahmad</first><last>Ghandour</last></author>
    <author><first>Shady</first><last>Elbassuoni</last></author>
    <author><first>Hazem</first><last>Hajj</last></author>
    <author><first>Khaled</first><last>Shaban</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>62&#8211;71</pages>
    <url>http://www.aclweb.org/anthology/W17-1308</url>
    <abstract>Data generated on Twitter has become a rich source for various data mining
	tasks. Those data analysis tasks that are dependent on the tweet semantics,
	such as sentiment analysis, emotion mining, and rumor detection among others,
	suffer considerably if the tweet is not credible, not real, or spam. In this
	paper, we perform an extensive analysis on credibility of Arabic content on
	Twitter. We also build a classification model (CAT) to automatically predict
	the credibility of a given Arabic tweet. Of particular originality is the
	inclusion of features extracted directly or indirectly from the author's
	profile and timeline. To train and test CAT, we annotated for credibility a
	data set of 9,000 Arabic tweets that are topic independent. CAT achieved
	consistent improvements in predicting the credibility of the tweets when
	compared to several baselines and when compared to the state-of-the-art
	approach with an improvement of 21% in weighted average F-measure. We also
	conducted experiments to highlight the importance of the user-based features
	as opposed to the content-based features. We conclude our work with a feature
	reduction experiment that highlights the best indicative features of
	credibility.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>elballouli-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1309">
    <title>A New Error Annotation for Dyslexic texts in Arabic</title>
    <author><first>Maha</first><last>Alamri</last></author>
    <author><first>William J.</first><last>Teahan</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>72&#8211;78</pages>
    <url>http://www.aclweb.org/anthology/W17-1309</url>
    <abstract>This paper aims to develop a new classification of errors made in Arabic by
	those suffering from dyslexia to be used in the annotation of the Arabic
	dyslexia corpus (BDAC). The dyslexic error classification for Arabic texts
	(DECA) comprises a list of spelling errors extracted from previous studies and
	a collection of texts written by people with dyslexia that can provide a
	framework to help analyse specific errors committed by dyslexic writers. The
	classification comprises 37 types of errors, grouped into nine categories. The
	paper also discusses building a corpus of dyslexic Arabic texts that uses the
	error annotation scheme and provides an analysis of the errors that were found
	in the texts.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>alamri-teahan:2017:W17-13</bibkey>
  </paper>

  <paper id="1310">
    <title>An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems</title>
    <author><first>Hany</first><last>Ahmed</last></author>
    <author><first>Mohamed</first><last>Elaraby</last></author>
    <author><first>Abdullah</first><last>M. Mousa</last></author>
    <author><first>Mostafa</first><last>Elhosiny</last></author>
    <author><first>Sherif</first><last>Abdou</last></author>
    <author><first>Mohsen</first><last>Rashwan</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>79&#8211;83</pages>
    <url>http://www.aclweb.org/anthology/W17-1310</url>
    <abstract>In this paper, we introduce an enhancement for speech recognition systems using
	an unsupervised speaker clustering technique. The proposed technique is mainly
	based on I-vectors and a Self-Organizing Map (SOM) neural network. The input to
	the proposed algorithm is a set of speech utterances. For each utterance, we
	extract a 100-dimensional I-vector, and then SOM is used to group the utterances
	into different speakers. In our experiments, we compared our technique with
	Normalized Cross Likelihood Ratio (NCLR) clustering. Results show that the
	proposed technique reduces the speaker error rate in comparison with NCLR.
	Finally, we have examined the effect of speaker clustering on Speaker
	Adaptive Training (SAT) in a speech recognition system implemented to test the
	performance of the proposed technique. It was noted that the proposed technique
	reduced the WER over clustering speakers with NCLR.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ahmed-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1311">
    <title>SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts</title>
    <author><first>Amany</first><last>Fashwan</last></author>
    <author><first>Sameh</first><last>Alansary</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>84&#8211;93</pages>
    <url>http://www.aclweb.org/anthology/W17-1311</url>
    <abstract>This paper sheds light on a system (SHAKKIL) that is able to diacritize Arabic
	texts automatically. In this system, the diacritization problem is handled at
	two levels: the morphological and syntactic processing levels. The adopted
	morphological disambiguation algorithm depends on four layers: a
	uni-morphological form layer, a rule-based morphological disambiguation layer,
	a statistics-based disambiguation layer and an Out Of Vocabulary (OOV) layer.
	The adopted syntactic disambiguation algorithm is concerned with detecting the
	case ending diacritics using a rule-based approach simulating the shallow
	parsing technique. This is achieved using an annotated corpus for extracting
	the Arabic linguistic rules, building the language models and testing the
	system output. This system demonstrates the interaction between the rule-based
	and statistical approaches, where the rules can help the statistics in
	detecting the right diacritization and vice versa. At this point, the
	morphological Word Error Rate (WER) is 4.56%, the morphological Diacritic Error
	Rate (DER) is 1.88% and the syntactic WER is 9.36%. The best WER is 14.78%,
	compared to the best published results of 11.68% (Abandah, 2015), 12.90%
	(Rashwan et al., 2015) and 13.70% (Metwally, Rashwan, &#38; Atiya,
	2016).</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>fashwan-alansary:2017:W17-13</bibkey>
  </paper>

  <paper id="1312">
    <title>Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach</title>
    <author><first>Fahad</first><last>Albogamy</last></author>
    <author><first>Allan</first><last>Ramsay</last></author>
    <author><first>Hanady</first><last>Ahmed</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>94&#8211;99</pages>
    <url>http://www.aclweb.org/anthology/W17-1312</url>
    <abstract>In this paper, we propose using a "bootstrapping" method for constructing a
	dependency treebank of Arabic tweets. This method uses a rule-based parser to
	create a small treebank of one thousand Arabic tweets and a data-driven parser
	to create a larger treebank by using the small treebank as a seed training set.
	We are able to create a dependency treebank from unlabelled tweets without any
	manual intervention. Experimental results show that this method can improve the
	speed of training the parser and the accuracy of the resulting parsers.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>albogamy-ramsay-ahmed:2017:W17-13</bibkey>
  </paper>

  <paper id="1313">
    <title>Identifying Effective Translations for Cross-lingual Arabic-to-English User-generated Speech Search</title>
    <author><first>Ahmad</first><last>Khwileh</last></author>
    <author><first>Haithem</first><last>Afli</last></author>
    <author><first>Gareth</first><last>Jones</last></author>
    <author><first>Andy</first><last>Way</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>100&#8211;109</pages>
    <url>http://www.aclweb.org/anthology/W17-1313</url>
    <abstract>Cross Language Information Retrieval (CLIR) systems are a valuable tool to
	enable speakers of one language to search for content of interest expressed in
	a different language. A group for whom this is of particular interest is
	bilingual Arabic speakers who wish to search for English language content using
	information needs expressed in Arabic queries. A key challenge in CLIR is
	crossing the language barrier between the query and the documents. The most
	common approach to bridging this gap is automated query translation, which can
	be unreliable for vague or short queries. In this work, we examine the
	potential for improving CLIR effectiveness by predicting the translation
	effectiveness using Query Performance Prediction (QPP) techniques. We propose a
	novel QPP method to estimate the quality of translation for an Arabic-English
	Cross-lingual User-generated Speech Search (CLUGS) task. We present an
	empirical evaluation that demonstrates the quality of our method on alternative
	translation outputs extracted from an Arabic-to-English Machine Translation
	system developed for this task. Finally, we show how this framework can be
	integrated in CLUGS to find relevant translations for improved retrieval
	performance.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>khwileh-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1314">
    <title>A Characterization Study of Arabic Twitter Data with a Benchmarking for State-of-the-Art Opinion Mining Models</title>
    <author><first>Ramy</first><last>Baly</last></author>
    <author><first>Gilbert</first><last>Badaro</last></author>
    <author><first>Georges</first><last>El-Khoury</last></author>
    <author><first>Rawan</first><last>Moukalled</last></author>
    <author><first>Rita</first><last>Aoun</last></author>
    <author><first>Hazem</first><last>Hajj</last></author>
    <author><first>Wassim</first><last>El-Hajj</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <author><first>Khaled</first><last>Shaban</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>110&#8211;118</pages>
    <url>http://www.aclweb.org/anthology/W17-1314</url>
    <abstract>Opinion mining in Arabic is a challenging task given the rich morphology of
	the language. The task becomes more challenging when it is applied to Twitter
	data, which contains additional sources of noise, such as the use of
	unstandardized dialectal variations, the non-conformance to grammatical rules,
	the use of Arabizi and code-switching, and the use of non-text objects such as
	images and URLs to express opinion. In this paper, we perform an analytical
	study to observe how such linguistic phenomena
	vary across different Arab regions. This study of Arabic Twitter
	characterization aims at providing better understanding of Arabic Tweets, and
	fostering advanced research on the topic. Furthermore, we explore the
	performance of the two schools of machine learning on Arabic Twitter, namely
	the feature engineering approach and the deep learning approach. We consider
	models that have achieved state-of-the-art performance for opinion mining in
	English. Results highlight the advantages of using deep learning-based models,
	and confirm the importance of using morphological abstractions to address
	Arabic’s complex morphology.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>baly-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1315">
    <title>Robust Dictionary Lookup in Multiple Noisy Orthographies</title>
    <author><first>Lingliang</first><last>Zhang</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <author><first>Godfried</first><last>Toussaint</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>119&#8211;129</pages>
    <url>http://www.aclweb.org/anthology/W17-1315</url>
    <abstract>We present the MultiScript Phonetic Search algorithm to address the problem of
	language learners looking up unfamiliar words that they heard. We apply it to
	Arabic dictionary lookup with noisy queries done using both the Arabic and
	Roman scripts. Our algorithm is based on a computational phonetic distance
	metric that can be optionally machine learned. To benchmark our performance, we
	created the ArabScribe dataset, containing 10,000 noisy transcriptions of
	random Arabic dictionary words. Our algorithm outperforms Google Translate's
	&#x201c;did you mean&#x201d; feature, as well as the Yamli smart Arabic keyboard.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zhang-habash-toussaint:2017:W17-13</bibkey>
  </paper>

  <paper id="1316">
    <title>Arabic POS Tagging: Don't Abandon Feature Engineering Just Yet</title>
    <author><first>Kareem</first><last>Darwish</last></author>
    <author><first>Hamdy</first><last>Mubarak</last></author>
    <author><first>Ahmed</first><last>Abdelali</last></author>
    <author><first>Mohamed</first><last>Eldesouki</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>130&#8211;137</pages>
    <url>http://www.aclweb.org/anthology/W17-1316</url>
    <abstract>This paper compares using Support Vector Machine based
	ranking (SVM-Rank) and bidirectional Long Short-Term Memory (bi-LSTM)
	neural-network based sequence labeling in building a state-of-the-art Arabic
	part-of-speech tagging system. Using SVM-Rank leads to state-of-the-art
	results, but with a fair amount of feature engineering. Using bi-LSTM,
	particularly when combined with word embeddings, may lead to competitive
	POS-tagging results by automatically deducing latent linguistic features.
	However, we show that augmenting bi-LSTM sequence labeling with some of the
	features that we used for the SVM-Rank based tagger yields further
	improvements. We also show that the gains realized by using embeddings may not
	be additive with the gains achieved by the features. We are open-sourcing both
	the SVM-Rank and the bi-LSTM based systems for free.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>darwish-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1317">
    <title>Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties</title>
    <author><first>Soumia</first><last>Bougrine</last></author>
    <author><first>Aicha</first><last>Chorana</last></author>
    <author><first>Abdallah</first><last>Lakhdari</last></author>
    <author><first>Hadda</first><last>Cherroun</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>138&#8211;146</pages>
    <url>http://www.aclweb.org/anthology/W17-1317</url>
    <abstract>The success of machine learning for automatic speech processing has raised the
	need for large scale datasets. However, collecting such data is often a
	challenging task as it implies a significant investment of time and money. In
	this paper, we devise a recipe for building large-scale speech corpora by
	harnessing Web resources, namely YouTube, other social media, online radio and
	TV. We illustrate our methodology by building KALAM’DZ, an Arabic spoken corpus
	dedicated to Algerian dialectal varieties. The preliminary version of our
	dataset covers all major Algerian dialects. In addition, we make sure that this
	material takes into account numerous aspects that foster its richness. In fact,
	we have targeted various speech topics. Some automatic and manual annotations
	are provided. They gather useful information related to the speakers and
	sub-dialect information at the utterance level. Our corpus encompasses the 8
	major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours
	segmented into utterances of at least 6 seconds.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bougrine-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1318">
    <title>Not All Segments are Created Equal: Syntactically Motivated Sentiment Analysis in Lexical Space</title>
    <author><first>Muhammad</first><last>Abdul-Mageed</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>147&#8211;156</pages>
    <url>http://www.aclweb.org/anthology/W17-1318</url>
    <abstract>Although there is by now a considerable amount of research on subjectivity and
	sentiment analysis on morphologically-rich languages, it is still unclear how
	lexical information can best be modeled in these languages. To bridge this gap,
	we build effective models exploiting exclusively gold- and machine-segmented
	lexical input and successfully employ syntactically motivated feature selection
	to improve classification. Our best models perform significantly above the
	baselines, with 67.93% and 69.37% accuracy for subjectivity and sentiment
	classification, respectively.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>abdulmageed:2017:W17-13</bibkey>
  </paper>

  <paper id="1319">
    <title>An enhanced automatic speech recognition system for Arabic</title>
    <author><first>Mohamed Amine</first><last>Menacer</last></author>
    <author><first>Odile</first><last>Mella</last></author>
    <author><first>Dominique</first><last>Fohr</last></author>
    <author><first>Denis</first><last>Jouvet</last></author>
    <author><first>David</first><last>Langlois</last></author>
    <author><first>Kamel</first><last>Smaili</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>157&#8211;165</pages>
    <url>http://www.aclweb.org/anthology/W17-1319</url>
    <abstract>Automatic speech recognition for Arabic is a very challenging task. Despite all
	the classical techniques for Automatic Speech Recognition (ASR), which can be
	efficiently applied to Arabic speech recognition, it is essential to take the
	specificities of the language into consideration to improve system performance.
	In this article, we focus on Modern Standard Arabic (MSA) speech recognition.
	We introduce the challenges related to the Arabic language, namely the complex
	morphological nature of the language and the absence of short vowels in written
	text, which leads to several potential vowelizations for each grapheme, which
	are often conflicting. We develop an ASR system for MSA using the Kaldi toolkit.
	Several acoustic and language models are trained. We obtain a Word Error Rate
	(WER) of 14.42 for the baseline system and 12.2 relative improvement by
	rescoring the lattice and by rewriting the output with the right hamza above or
	below Alif.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>menacer-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1320">
    <title>Universal Dependencies for Arabic</title>
    <author><first>Dima</first><last>Taji</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <author><first>Daniel</first><last>Zeman</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>166&#8211;176</pages>
    <url>http://www.aclweb.org/anthology/W17-1320</url>
    <abstract>We describe the process of creating NUDAR, a Universal Dependency treebank for
	Arabic. We present the conversion from the Penn Arabic Treebank to the
	Universal Dependency syntactic representation through an intermediate
	dependency representation. We discuss the challenges faced in the conversion of
	the trees, the decisions we made to solve them, and the validation of our
	conversion. We also present initial parsing results on NUDAR.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>taji-habash-zeman:2017:W17-13</bibkey>
  </paper>

  <paper id="1321">
    <title>A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic</title>
    <author><first>Mohamed</first><last>Al-Badrashiny</last></author>
    <author><first>Abdelati</first><last>Hawwari</last></author>
    <author><first>Mona</first><last>Diab</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>177&#8211;184</pages>
    <url>http://www.aclweb.org/anthology/W17-1321</url>
    <abstract>In this paper we present a system for automatic Arabic text diacritization
	using three levels of analysis granularity in a layered back-off manner. We
	build and exploit diacritized language models (LM) for each of three different
	levels of granularity: surface form, morphologically segmented into
	prefix/stem/suffix, and character level.  For each of the passes, we use
	Viterbi search to pick the most probable diacritization per word in the input.
	We start with the surface form LM, followed by the morphological level, then
	finally we leverage the character level LM. Our system outperforms all of the
	published systems evaluated against the same training and test data. It
	achieves a 10.87% WER for complete full diacritization including lexical and
	syntactic diacritization, and 3.0% WER for lexical diacritization, ignoring
	syntactic diacritization.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>albadrashiny-hawwari-diab:2017:W17-13</bibkey>
  </paper>

  <paper id="1322">
    <title>Arabic Textual Entailment with Word Embeddings</title>
    <author><first>Nada</first><last>Almarwani</last></author>
    <author><first>Mona</first><last>Diab</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>185&#8211;190</pages>
    <url>http://www.aclweb.org/anthology/W17-1322</url>
    <abstract>Determining the textual entailment between texts is important in many NLP
	tasks, such as summarization, question answering, and information extraction
	and retrieval. Various methods have been suggested based on external knowledge
	sources; however, such resources are not always available in all languages and
	their acquisition is typically laborious and very costly. Distributional word
	representations such as word embeddings learned over large corpora have been
	shown to capture syntactic and semantic word relationships. Such models have
	contributed to improving the performance of several NLP tasks. In this paper,
	we address the problem of textual entailment in Arabic. We employ both
	traditional features and distributional representations. Crucially, we do not
	depend on any external resources in the process. Our suggested approach yields
	state-of-the-art performance on a standard data set, ArbTE, achieving an
	accuracy of 76.2% compared to the state of the art of 69.3%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>almarwani-diab:2017:W17-13</bibkey>
  </paper>

</volume>

