<?xml version="1.0" encoding="UTF-8" ?>
<volume id="W17">
  <paper id="1300">
    <title>Proceedings of the Third Arabic Natural Language Processing Workshop</title>
    <editor>Nizar Habash</editor>
    <editor>Mona Diab</editor>
    <editor>Kareem Darwish</editor>
    <editor>Wassim El-Hajj</editor>
    <editor>Hend Al-Khalifa</editor>
    <editor>Houda Bouamor</editor>
    <editor>Nadi Tomeh</editor>
    <editor>Mahmoud El-Haj</editor>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <url>http://www.aclweb.org/anthology/W17-13</url>
    <bibtype>book</bibtype>
    <bibkey>W17-13:2017</bibkey>
  </paper>

  <paper id="1301">
    <title>Identification of Languages in Algerian Arabic Multilingual Documents</title>
    <author><first>Wafia</first><last>Adouane</last></author>
    <author><first>Simon</first><last>Dobnik</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>1&#8211;8</pages>
    <url>http://www.aclweb.org/anthology/W17-1301</url>
    <abstract>This paper presents a language identification system designed to detect the
	language of each word, in its context, in multilingual documents as generated
	in social media by bilingual/multilingual communities, in our case speakers of
	Algerian Arabic. We frame the task as a sequence tagging problem and use
	supervised machine learning with standard methods like HMM and n-gram
	classification tagging. We also experiment with a lexicon-based method.
	Combining all the methods in a fall-back mechanism and introducing some
	linguistic rules to deal with unseen tokens and ambiguous words gives an
	overall accuracy of 93.14%. Finally, we introduce rules for language
	identification from sequences of recognised words.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>adouane-dobnik:2017:W17-13</bibkey>
  </paper>

  <paper id="1302">
    <title>Arabic Diacritization: Stats, Rules, and Hacks</title>
    <author><first>Kareem</first><last>Darwish</last></author>
    <author><first>Hamdy</first><last>Mubarak</last></author>
    <author><first>Ahmed</first><last>Abdelali</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>9&#8211;17</pages>
    <url>http://www.aclweb.org/anthology/W17-1302</url>
    <abstract>In this paper, we present a new and fast state-of-the-art Arabic diacritizer
	that guesses the diacritics of words and then their case endings. We employ a
	Viterbi decoder at word-level with back-off to stem, morphological patterns,
	and transliteration and sequence labeling based diacritization of named
	entities. For case endings, we use Support Vector Machine (SVM) based ranking
	coupled with morphological patterns and linguistic rules to properly guess case
	endings. We achieve low word-level diacritization error rates of 3.29% and
	12.77% without and with case endings, respectively, on a new multi-genre,
	copyright-free test set. We are making the diacritizer available for free for
	research purposes.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>darwish-mubarak-abdelali:2017:W17-13</bibkey>
  </paper>

  <paper id="1303">
    <title>Semantic Similarity of Arabic Sentences with Word Embeddings</title>
    <author><first>El Moatez Billah</first><last>Nagoudi</last></author>
    <author><first>Didier</first><last>Schwab</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>18&#8211;24</pages>
    <url>http://www.aclweb.org/anthology/W17-1303</url>
    <abstract>Semantic textual similarity is the basis of countless applications and plays an
	important role in diverse areas, such as information retrieval, plagiarism
	detection, information extraction and machine translation. This article
	proposes an innovative word embedding-based system designed to calculate the
	semantic similarity of Arabic sentences. The main idea is to exploit vectors as
	word representations in a multidimensional space in order to capture the
	semantic and syntactic properties of words. IDF weighting and part-of-speech
	tagging are applied to the examined sentences to support the identification of
	words that are highly descriptive in each sentence. The performance of our
	proposed system is confirmed through the Pearson correlation between our
	assigned semantic similarity scores and human judgments.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>nagoudi-schwab:2017:W17-13</bibkey>
  </paper>

  <paper id="1304">
    <title>Morphological Analysis for the Maltese Language: The challenges of a hybrid system</title>
    <author><first>Claudia</first><last>Borg</last></author>
    <author><first>Albert</first><last>Gatt</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>25&#8211;34</pages>
    <url>http://www.aclweb.org/anthology/W17-1304</url>
    <abstract>Maltese is a morphologically rich language with a hybrid morphological system
	which features both concatenative and non-concatenative processes. This paper
	analyses the impact of this hybridity on the performance of machine learning
	techniques for morphological labelling and clustering. In particular, we
	analyse a dataset of morphologically related word clusters to evaluate the
	difference in results for concatenative and non-concatenative clusters. We also
	describe research carried out in morphological labelling, with a particular
	focus on the verb category. Two evaluations were carried out, one using an
	unseen dataset, and another one using a gold standard dataset which was
	manually labelled. The gold standard dataset was split into concatenative and
	non-concatenative subsets to analyse the difference in results between the two
	morphological systems.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>borg-gatt:2017:W17-13</bibkey>
  </paper>

  <paper id="1305">
    <title>A Morphological Analyzer for Gulf Arabic Verbs</title>
    <author><first>Salam</first><last>Khalifa</last></author>
    <author><first>Sara</first><last>Hassan</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>35&#8211;45</pages>
    <url>http://www.aclweb.org/anthology/W17-1305</url>
    <abstract>We present CALIMA-GLF, a Gulf Arabic morphological analyzer currently covering
	over 2,600 verbal lemmas. We describe in detail the process of building the
	analyzer, starting from phonetic dictionary entries to fully inflected
	orthographic paradigms and the associated lexicon and orthographic variants. We
	evaluate the coverage of CALIMA-GLF against Modern Standard Arabic and Egyptian
	Arabic analyzers on part of a Gulf Arabic novel. CALIMA-GLF verb analysis token
	recall for identifying the correct POS tag outperforms both the Modern Standard
	Arabic and Egyptian Arabic analyzers by over 27.4% and 16.9% absolute,
	respectively.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>khalifa-hassan-habash:2017:W17-13</bibkey>
  </paper>

  <paper id="1306">
    <title>A Neural Architecture for Dialectal Arabic Segmentation</title>
    <author><first>Younes</first><last>Samih</last></author>
    <author><first>Mohammed</first><last>Attia</last></author>
    <author><first>Mohamed</first><last>Eldesouki</last></author>
    <author><first>Ahmed</first><last>Abdelali</last></author>
    <author><first>Hamdy</first><last>Mubarak</last></author>
    <author><first>Laura</first><last>Kallmeyer</last></author>
    <author><first>Kareem</first><last>Darwish</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>46&#8211;54</pages>
    <url>http://www.aclweb.org/anthology/W17-1306</url>
    <abstract>The automated processing of Arabic Dialects is challenging due to the lack of
	spelling standards and to the scarcity of annotated data and resources in
	general. Segmentation of words into their constituent parts is an important
	processing building block. In this paper, we show how a segmenter can be
	trained on only 350 annotated tweets using neural networks, without any
	normalization or use of lexical features or lexical resources. We deal with
	segmentation as a sequence labeling problem at the character level. We show
	experimentally that our model can rival state-of-the-art methods that rely on
	additional resources.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>samih-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1307">
    <title>Sentiment Analysis of Tunisian Dialects: Linguistic Ressources and Experiments</title>
    <author><first>Salima</first><last>Medhaffar</last></author>
    <author><first>Fethi</first><last>Bougares</last></author>
    <author><first>Yannick</first><last>Est&#232;ve</last></author>
    <author><first>Lamia</first><last>Hadrich-Belguith</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>55&#8211;61</pages>
    <url>http://www.aclweb.org/anthology/W17-1307</url>
    <abstract>Dialectal Arabic (DA) is significantly different from the Arabic
	language taught in schools and used in written communication and formal speech
	(broadcast news, religion, politics, etc.). There is much existing research
	in the field of Arabic Sentiment Analysis (SA); however, it is
	generally restricted to Modern Standard Arabic (MSA) or some dialects of
	economic or political interest. In this paper we are interested in the SA of
	the Tunisian Dialect. We utilize Machine Learning techniques to determine the
	polarity of comments written in Tunisian Dialect. First, we evaluate the
	performance of SA systems with models trained using freely available MSA and
	multi-dialectal data sets. We then collect and annotate a Tunisian Dialect
	corpus of 17,000 comments from Facebook. This corpus allows us to achieve a
	significant accuracy improvement compared to the best model trained on other
	Arabic dialects or MSA data.
	We believe that this first freely available corpus will be valuable to
	researchers working in the field of Tunisian Sentiment Analysis and similar
	areas.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>medhaffar-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1308">
    <title>CAT: Credibility Analysis of Arabic Content on Twitter</title>
    <author><first>Rim</first><last>El Ballouli</last></author>
    <author><first>Wassim</first><last>El-Hajj</last></author>
    <author><first>Ahmad</first><last>Ghandour</last></author>
    <author><first>Shady</first><last>Elbassuoni</last></author>
    <author><first>Hazem</first><last>Hajj</last></author>
    <author><first>Khaled</first><last>Shaban</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>62&#8211;71</pages>
    <url>http://www.aclweb.org/anthology/W17-1308</url>
    <abstract>Data generated on Twitter has become a rich source for various data mining
	tasks. Those data analysis tasks that are dependent on the tweet semantics,
	such as sentiment analysis, emotion mining, and rumor detection among others,
	suffer considerably if the tweet is not credible, not real, or spam. In this
	paper, we perform an extensive analysis on credibility of Arabic content on
	Twitter. We also build a classification model (CAT) to automatically predict
	the credibility of a given Arabic tweet. Of particular originality is the
	inclusion of features extracted directly or indirectly from the author's
	profile and timeline. To train and test CAT, we annotated for credibility a
	data set of 9,000 Arabic tweets that are topic independent. CAT achieved
	consistent improvements in predicting the credibility of the tweets when
	compared to several baselines and when compared to the state-of-the-art
	approach with an improvement of 21% in weighted average F-measure. We also
	conducted experiments to highlight the importance of the user-based features
	as opposed to the content-based features. We conclude our work with a feature
	reduction experiment that highlights the best indicative features of
	credibility.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>elballouli-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1309">
    <title>A New Error Annotation for Dyslexic texts in Arabic</title>
    <author><first>Maha</first><last>Alamri</last></author>
    <author><first>William J.</first><last>Teahan</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>72&#8211;78</pages>
    <url>http://www.aclweb.org/anthology/W17-1309</url>
    <abstract>This paper aims to develop a new classification of errors made in Arabic by
	those suffering from dyslexia to be used in the annotation of the Arabic
	dyslexia corpus (BDAC). The dyslexic error classification for Arabic texts
	(DECA) comprises a list of spelling errors extracted from previous studies and
	a collection of texts written by people with dyslexia that can provide a
	framework to help analyse specific errors committed by dyslexic writers. The
	classification comprises 37 types of errors, grouped into nine categories. The
	paper also discusses building a corpus of dyslexic Arabic texts that uses the
	error annotation scheme and provides an analysis of the errors that were found
	in the texts.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>alamri-teahan:2017:W17-13</bibkey>
  </paper>

  <paper id="1310">
    <title>An Unsupervised Speaker Clustering Technique based on SOM and I-vectors for Speech Recognition Systems</title>
    <author><first>Hany</first><last>Ahmed</last></author>
    <author><first>Mohamed</first><last>Elaraby</last></author>
    <author><first>Abdullah</first><last>M. Mousa</last></author>
    <author><first>Mostafa</first><last>Elhosiny</last></author>
    <author><first>Sherif</first><last>Abdou</last></author>
    <author><first>Mohsen</first><last>Rashwan</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>79&#8211;83</pages>
    <url>http://www.aclweb.org/anthology/W17-1310</url>
    <abstract>In this paper, we introduce an enhancement for speech recognition systems using
	an unsupervised speaker clustering technique. The proposed technique is mainly
	based on I-vectors and a Self-Organizing Map (SOM) neural network. The input to
	the proposed algorithm is a set of speech utterances. For each utterance, we
	extract a 100-dimensional I-vector, and then SOM is used to group the utterances
	into different speakers. In our experiments, we compared our technique with
	Normalized Cross Likelihood Ratio (NCLR) clustering. Results show that the
	proposed technique reduces the speaker error rate in comparison with NCLR.
	Finally, we have examined the effect of speaker clustering on Speaker
	Adaptive Training (SAT) in a speech recognition system implemented to test the
	performance of the proposed technique. It was noted that the proposed technique
	reduced the WER over clustering speakers with NCLR.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>ahmed-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1311">
    <title>SHAKKIL: An Automatic Diacritization System for Modern Standard Arabic Texts</title>
    <author><first>Amany</first><last>Fashwan</last></author>
    <author><first>Sameh</first><last>Alansary</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>84&#8211;93</pages>
    <url>http://www.aclweb.org/anthology/W17-1311</url>
    <abstract>This paper sheds light on a system (SHAKKIL) that is able to diacritize Arabic
	texts automatically. In this system, the diacritization problem is handled at
	two levels: the morphological and syntactic processing levels. The adopted
	morphological disambiguation algorithm depends on four layers: a
	uni-morphological form layer, a rule-based morphological disambiguation layer,
	a statistics-based disambiguation layer and an Out Of Vocabulary (OOV) layer.
	The adopted syntactic disambiguation algorithm is concerned with detecting the
	case ending diacritics using a rule-based approach simulating the shallow
	parsing technique. This is achieved using an annotated corpus for extracting
	the Arabic linguistic rules, building the language models and testing the
	system output. This system demonstrates the interaction between the rule-based
	and statistical approaches, where the rules can help the statistics in
	detecting the right diacritization and vice versa. At this point, the
	morphological Word Error Rate (WER) is 4.56%, the morphological Diacritic Error
	Rate (DER) is 1.88% and the syntactic WER is 9.36%. The best WER is 14.78%,
	compared to the best published results of 11.68% (Abandah, 2015), 12.90%
	(Rashwan et al., 2015) and 13.70% (Metwally, Rashwan, &#38; Atiya,
	2016).</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>fashwan-alansary:2017:W17-13</bibkey>
  </paper>

  <paper id="1312">
    <title>Arabic Tweets Treebanking and Parsing: A Bootstrapping Approach</title>
    <author><first>Fahad</first><last>Albogamy</last></author>
    <author><first>Allan</first><last>Ramsay</last></author>
    <author><first>Hanady</first><last>Ahmed</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>94&#8211;99</pages>
    <url>http://www.aclweb.org/anthology/W17-1312</url>
    <abstract>In this paper, we propose using a "bootstrapping" method for constructing a
	dependency treebank of Arabic tweets. This method uses a rule-based parser to
	create a small treebank of one thousand Arabic tweets and a data-driven parser
	to create a larger treebank by using the small treebank as a seed training set.
	We are able to create a dependency treebank from unlabelled tweets without any
	manual intervention. Experimental results show that this method can improve the
	speed of training the parser and the accuracy of the resulting parsers.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>albogamy-ramsay-ahmed:2017:W17-13</bibkey>
  </paper>

  <paper id="1313">
    <title>Identifying Effective Translations for Cross-lingual Arabic-to-English User-generated Speech Search</title>
    <author><first>Ahmad</first><last>Khwileh</last></author>
    <author><first>Haithem</first><last>Afli</last></author>
    <author><first>Gareth</first><last>Jones</last></author>
    <author><first>Andy</first><last>Way</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>100&#8211;109</pages>
    <url>http://www.aclweb.org/anthology/W17-1313</url>
    <abstract>Cross Language Information Retrieval (CLIR) systems are a valuable tool to
	enable speakers of one language to search for content of interest expressed in
	a different language. A group for whom this is of particular interest is
	bilingual Arabic speakers who wish to search for English language content using
	information needs expressed in Arabic queries. A key challenge in CLIR is
	crossing the language barrier between the query and the documents. The most
	common approach to bridging this gap is automated query translation, which can
	be unreliable for vague or short queries. In this work, we examine the
	potential for improving CLIR effectiveness by predicting the translation
	effectiveness using Query Performance Prediction (QPP) techniques. We propose a
	novel QPP method to estimate the quality of translation for an Arabic-English
	Cross-lingual User-generated Speech Search (CLUGS) task. We present an
	empirical evaluation that demonstrates the quality of our method on alternative
	translation outputs extracted from an Arabic-to-English Machine Translation
	system developed for this task. Finally, we show how this framework can be
	integrated in CLUGS to find relevant translations for improved retrieval
	performance.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>khwileh-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1314">
    <title>A Characterization Study of Arabic Twitter Data with a Benchmarking for State-of-the-Art Opinion Mining Models</title>
    <author><first>Ramy</first><last>Baly</last></author>
    <author><first>Gilbert</first><last>Badaro</last></author>
    <author><first>Georges</first><last>El-Khoury</last></author>
    <author><first>Rawan</first><last>Moukalled</last></author>
    <author><first>Rita</first><last>Aoun</last></author>
    <author><first>Hazem</first><last>Hajj</last></author>
    <author><first>Wassim</first><last>El-Hajj</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <author><first>Khaled</first><last>Shaban</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>110&#8211;118</pages>
    <url>http://www.aclweb.org/anthology/W17-1314</url>
    <abstract>Opinion mining in Arabic is a challenging task given the rich morphology of
	the language. The task becomes more challenging when it is applied to Twitter
	data, which contains additional sources of noise, such as the use of
	unstandardized dialectal variations, the non-conformance to grammatical rules,
	the use of Arabizi and code-switching, and the use of non-text objects such as
	images and URLs to express opinion. In this paper, we perform an analytical
	study to observe how such linguistic phenomena
	vary across different Arab regions. This study of Arabic Twitter
	characterization aims at providing better understanding of Arabic Tweets, and
	fostering advanced research on the topic. Furthermore, we explore the
	performance of the two schools of machine learning on Arabic Twitter, namely
	the feature engineering approach and the deep learning approach. We consider
	models that have achieved state-of-the-art performance for opinion mining in
	English. Results highlight the advantages of using deep learning-based models,
	and confirm the importance of using morphological abstractions to address
	Arabic’s complex morphology.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>baly-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1315">
    <title>Robust Dictionary Lookup in Multiple Noisy Orthographies</title>
    <author><first>Lingliang</first><last>Zhang</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <author><first>Godfried</first><last>Toussaint</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>119&#8211;129</pages>
    <url>http://www.aclweb.org/anthology/W17-1315</url>
    <abstract>We present the MultiScript Phonetic Search algorithm to address the problem of
	language learners looking up unfamiliar words that they heard. We apply it to
	Arabic dictionary lookup with noisy queries done using both the Arabic and
	Roman scripts. Our algorithm is based on a computational phonetic distance
	metric that can be optionally machine learned. To benchmark our performance, we
	created the ArabScribe dataset, containing 10,000 noisy transcriptions of
	random Arabic dictionary words. Our algorithm outperforms Google Translate's
	&#x201c;did you mean&#x201d; feature, as well as the Yamli smart Arabic keyboard.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>zhang-habash-toussaint:2017:W17-13</bibkey>
  </paper>

  <paper id="1316">
    <title>Arabic POS Tagging: Don't Abandon Feature Engineering Just Yet</title>
    <author><first>Kareem</first><last>Darwish</last></author>
    <author><first>Hamdy</first><last>Mubarak</last></author>
    <author><first>Ahmed</first><last>Abdelali</last></author>
    <author><first>Mohamed</first><last>Eldesouki</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>130&#8211;137</pages>
    <url>http://www.aclweb.org/anthology/W17-1316</url>
    <abstract>This paper compares using Support Vector Machine based
	ranking (SVM-Rank) and bidirectional Long Short-Term Memory (bi-LSTM)
	neural-network based sequence labeling in building a state-of-the-art Arabic
	part-of-speech tagging system. Using SVM-Rank leads to state-of-the-art
	results, but with a fair amount of feature engineering. Using bi-LSTM,
	particularly when combined with word embeddings, may lead to competitive
	POS-tagging results by automatically deducing latent linguistic features.
	However, we show that augmenting bi-LSTM sequence labeling with some of the
	features that we used for the SVM-Rank based tagger yields further
	improvements. We also show that the gains realized by using embeddings may not
	be additive with the gains achieved by the features. We are open-sourcing both
	the SVM-Rank and the bi-LSTM based systems for free.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>darwish-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1317">
    <title>Toward a Web-based Speech Corpus for Algerian Dialectal Arabic Varieties</title>
    <author><first>Soumia</first><last>Bougrine</last></author>
    <author><first>Aicha</first><last>Chorana</last></author>
    <author><first>Abdallah</first><last>Lakhdari</last></author>
    <author><first>Hadda</first><last>Cherroun</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>138&#8211;146</pages>
    <url>http://www.aclweb.org/anthology/W17-1317</url>
    <abstract>The success of machine learning for automatic speech processing has raised the
	need for large scale datasets. However, collecting such data is often a
	challenging task as it implies a significant investment of time and money. In
	this paper, we devise a recipe for building large-scale speech corpora by
	harnessing Web resources, namely YouTube, other social media, online radio and
	TV. We illustrate our methodology by building KALAM’DZ, an Arabic spoken corpus
	dedicated to Algerian dialectal varieties. The preliminary version of our
	dataset covers all major Algerian dialects. In addition, we make sure that this
	material takes into account numerous aspects that foster its richness. In fact,
	we have targeted various speech topics. Some automatic and manual annotations
	are provided. They gather useful information related to the speakers and
	sub-dialect information at the utterance level. Our corpus encompasses the 8
	major Algerian Arabic sub-dialects with 4881 speakers and more than 104.4 hours
	segmented into utterances of at least 6 seconds.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>bougrine-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1318">
    <title>Not All Segments are Created Equal: Syntactically Motivated Sentiment Analysis in Lexical Space</title>
    <author><first>Muhammad</first><last>Abdul-Mageed</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>147&#8211;156</pages>
    <url>http://www.aclweb.org/anthology/W17-1318</url>
    <abstract>Although there is by now a considerable amount of research on subjectivity and
	sentiment analysis on morphologically-rich languages, it is still unclear how
	lexical information can best be modeled in these languages. To bridge this gap,
	we build effective models exploiting exclusively gold- and machine-segmented
	lexical input and successfully employ syntactically motivated feature selection
	to improve classification. Our best models perform significantly above the
	baselines, with 67.93% and 69.37% accuracy for subjectivity and sentiment
	classification, respectively.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>abdulmageed:2017:W17-13</bibkey>
  </paper>

  <paper id="1319">
    <title>An enhanced automatic speech recognition system for Arabic</title>
    <author><first>Mohamed Amine</first><last>Menacer</last></author>
    <author><first>Odile</first><last>Mella</last></author>
    <author><first>Dominique</first><last>Fohr</last></author>
    <author><first>Denis</first><last>Jouvet</last></author>
    <author><first>David</first><last>Langlois</last></author>
    <author><first>Kamel</first><last>Smaili</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>157&#8211;165</pages>
    <url>http://www.aclweb.org/anthology/W17-1319</url>
    <abstract>Automatic speech recognition for Arabic is a very challenging task. Despite all
	the classical techniques for Automatic Speech Recognition (ASR), which can be
	efficiently applied to Arabic speech recognition, it is essential to take the
	specificities of the language into consideration to improve system performance.
	In this article, we focus on Modern Standard Arabic (MSA) speech recognition.
	We introduce the challenges related to the Arabic language, namely the complex
	morphological nature of the language and the absence of short vowels in written
	text, which leads to several potential vowelizations for each grapheme, which
	are often conflicting. We develop an ASR system for MSA using the Kaldi toolkit.
	Several acoustic and language models are trained. We obtain a Word Error Rate
	(WER) of 14.42 for the baseline system and 12.2 relative improvement by
	rescoring the lattice and by rewriting the output with the right hamza above or
	below Alif.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>menacer-EtAl:2017:W17-13</bibkey>
  </paper>

  <paper id="1320">
    <title>Universal Dependencies for Arabic</title>
    <author><first>Dima</first><last>Taji</last></author>
    <author><first>Nizar</first><last>Habash</last></author>
    <author><first>Daniel</first><last>Zeman</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>166&#8211;176</pages>
    <url>http://www.aclweb.org/anthology/W17-1320</url>
    <abstract>We describe the process of creating NUDAR, a Universal Dependency treebank for
	Arabic. We present the conversion from the Penn Arabic Treebank to the
	Universal Dependency syntactic representation through an intermediate
	dependency representation. We discuss the challenges faced in the conversion of
	the trees, the decisions we made to solve them, and the validation of our
	conversion. We also present initial parsing results on NUDAR.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>taji-habash-zeman:2017:W17-13</bibkey>
  </paper>

  <paper id="1321">
    <title>A Layered Language Model based Hybrid Approach to Automatic Full Diacritization of Arabic</title>
    <author><first>Mohamed</first><last>Al-Badrashiny</last></author>
    <author><first>Abdelati</first><last>Hawwari</last></author>
    <author><first>Mona</first><last>Diab</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>177&#8211;184</pages>
    <url>http://www.aclweb.org/anthology/W17-1321</url>
    <abstract>In this paper we present a system for automatic Arabic text diacritization
	using three levels of analysis granularity in a layered back-off manner. We
	build and exploit diacritized language models (LM) for each of three different
	levels of granularity: surface form, morphologically segmented into
	prefix/stem/suffix, and character level.  For each of the passes, we use
	Viterbi search to pick the most probable diacritization per word in the input.
	We start with the surface form LM, followed by the morphological level, then
	finally we leverage the character level LM. Our system outperforms all of the
	published systems evaluated against the same training and test data. It
	achieves a 10.87% WER for complete full diacritization including lexical and
	syntactic diacritization, and 3.0% WER for lexical diacritization, ignoring
	syntactic diacritization.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>albadrashiny-hawwari-diab:2017:W17-13</bibkey>
  </paper>

  <paper id="1322">
    <title>Arabic Textual Entailment with Word Embeddings</title>
    <author><first>Nada</first><last>Almarwani</last></author>
    <author><first>Mona</first><last>Diab</last></author>
    <booktitle>Proceedings of the Third Arabic Natural Language Processing Workshop</booktitle>
    <month>April</month>
    <year>2017</year>
    <address>Valencia, Spain</address>
    <publisher>Association for Computational Linguistics</publisher>
    <pages>185&#8211;190</pages>
    <url>http://www.aclweb.org/anthology/W17-1322</url>
    <abstract>Determining the textual entailment between texts is important in many NLP
	tasks, such as summarization, question answering, and information extraction
	and retrieval. Various methods have been suggested based on external knowledge
	sources; however, such resources are not always available in all languages and
	their acquisition is typically laborious and very costly. Distributional word
	representations such as word embeddings learned over large corpora have been
	shown to capture syntactic and semantic word relationships. Such models have
	contributed to improving the performance of several NLP tasks. In this paper,
	we address the problem of textual entailment in Arabic. We employ both
	traditional features and distributional representations. Crucially, we do not
	depend on any external resources in the process. Our suggested approach yields
	state-of-the-art performance on a standard data set, ArbTE, achieving an
	accuracy of 76.2% compared to the state of the art of 69.3%.</abstract>
    <bibtype>inproceedings</bibtype>
    <bibkey>almarwani-diab:2017:W17-13</bibkey>
  </paper>

</volume>

