Statistical Measures for Readability Assessment
Mohammed Attia | Younes Samih | Yo Ehara
Proceedings of the Joint 3rd International Conference on Natural Language Processing for Digital Humanities and 8th International Workshop on Computational Linguistics for Uralic Languages

Neural models and deep learning techniques have predominantly been used in many tasks of natural language processing (NLP), including automatic readability assessment (ARA). They apply deep transfer learning and enjoy high accuracy. However, most of the models still cannot leverage long-range dependencies such as inter-sentential topic-level or document-level information because of their structure and computational cost. Moreover, neural models usually have low interpretability. In this paper, we propose a generalization of passage-level, corpus-level, document-level and topic-level features. In our experiments, we show the effectiveness of “Statistical Lexical Spread (SLS)” features when combined with IDF (inverse document frequency) and TF-IDF (term frequency–inverse document frequency), which adds a topological perspective (inter-document) to readability to complement the typological approaches (intra-document) used in traditional readability formulas. Interestingly, simply adding these features to BERT models outperformed state-of-the-art systems trained on a large number of hand-crafted features derived from heavy linguistic processing. In our analysis, we show that SLS is also easy to interpret because it computes lexical features, which appear explicitly in texts, in contrast to the parameters of neural models.
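As a rough illustration of the inter-document signal that IDF and TF-IDF contribute, the following sketch computes both from tokenized documents. It does not implement SLS, which this abstract does not define; the function names and the toy corpus are hypothetical, and the unsmoothed log(N/df) variant of IDF is an assumption.

```python
import math
from collections import Counter


def idf_scores(docs):
    """Return log(N / df) for every term, where df counts documents containing it."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tf_idf(doc, idf):
    """Return relative term frequency times IDF for each term of one document."""
    counts = Counter(doc)
    total = len(doc)
    return {term: (count / total) * idf[term] for term, count in counts.items()}


# Toy corpus (made up for illustration only).
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran", "home"],
]
idf = idf_scores(docs)
weights = tf_idf(docs[0], idf)
```

A term that occurs in every document (here "the") gets an IDF of zero, so only terms that are unevenly spread across documents contribute to the weights.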


POS Tagging for Improving Code-Switching Identification in Arabic
Mohammed Attia | Younes Samih | Ali Elkahky | Hamdy Mubarak | Ahmed Abdelali | Kareem Darwish
Proceedings of the Fourth Arabic Natural Language Processing Workshop

When speakers code-switch between their native language and a second language or language variant, they follow a syntactic pattern where words and phrases from the embedded language are inserted into the matrix language. This paper explores the possibility of utilizing this pattern in improving code-switching identification between Modern Standard Arabic (MSA) and Egyptian Arabic (EA). We try to answer the question of how strong the POS signal is in word-level code-switching identification. We build a deep learning model enriched with linguistic features (including POS tags) that outperforms the state-of-the-art results by 1.9% on the development set and 1.0% on the test set. We also show that in intra-sentential code-switching, the selection of lexical items is constrained by POS categories, where function words tend to come more often from the dialectal language while the majority of content words come from the standard language.

Segmentation for Domain Adaptation in Arabic
Mohammed Attia | Ali Elkahky
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Segmentation serves as an integral part in many NLP applications including Machine Translation, Parsing, and Information Retrieval. When a model trained on the standard language is applied to dialects, the accuracy drops dramatically. However, there are more lexical items shared by the standard language and dialects than can be found by mere surface word matching. This shared lexicon is obscured by a lot of cliticization, gemination, and character repetition. In this paper, we prove that segmentation and base normalization of dialects can help in domain adaptation by reducing data sparseness. Segmentation improves a system's performance by reducing the number of OOVs, helping isolate the differences, and allowing better utilization of the commonalities. We show that adding a small amount of dialectal segmentation training data reduces OOVs by 5% and remarkably improves POS tagging for dialects by 7.37% f-score, even though no dialect-specific POS training data is included.

QC-GO Submission for MADAR Shared Task: Arabic Fine-Grained Dialect Identification
Younes Samih | Hamdy Mubarak | Ahmed Abdelali | Mohammed Attia | Mohamed Eldesouki | Kareem Darwish
Proceedings of the Fourth Arabic Natural Language Processing Workshop

This paper describes the QC-GO team submission to the MADAR Shared Task Subtask 1 (travel domain dialect identification) and Subtask 2 (Twitter user location identification). In our participation in both subtasks, we explored a number of approaches and system combinations to obtain the best performance for both tasks. These include deep neural nets and heuristics. Since individual approaches suffer from various shortcomings, the combination of different approaches was able to fill some of these gaps. Our system achieves F1-Scores of 66.1% and 67.0% on the development sets for Subtasks 1 and 2 respectively.


Multi-Dialect Arabic POS Tagging: A CRF Approach
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali | Mohamed Eldesouki | Younes Samih | Randah Alharbi | Mohammed Attia | Walid Magdy | Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Multilingual Multi-class Sentiment Classification Using Convolutional Neural Networks
Mohammed Attia | Younes Samih | Ali Elkahky | Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

The Morpho-syntactic Annotation of Animacy for a Dependency Parser
Mohammed Attia | Vitaly Nikolaev | Ali Elkahky
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

GHH at SemEval-2018 Task 10: Discovering Discriminative Attributes in Distributional Semantics
Mohammed Attia | Younes Samih | Manaal Faruqui | Wolfgang Maier
Proceedings of the 12th International Workshop on Semantic Evaluation

This paper describes our system submission to the SemEval 2018 Task 10 on Capturing Discriminative Attributes. Given two concepts and an attribute, the task is to determine whether the attribute is semantically related to one concept and not the other. In this work we assume that discriminative attributes can be detected by discovering the association (or lack of association) between a pair of words. The hypothesis we test in this contribution is whether the semantic difference between two pairs of concepts can be treated in terms of measuring the distance between words in a vector space, or can simply be obtained as a by-product of word co-occurrence counts.
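The vector-space side of this hypothesis can be sketched as a comparison of cosine similarities: an attribute counts as discriminative when it is noticeably closer to the first concept than to the second. This is an illustrative sketch only, not the submitted system; the function names, the `margin` value and the three-dimensional toy vectors are all made up for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def is_discriminative(vec_c1, vec_c2, vec_attr, margin=0.2):
    """True if the attribute is closer to concept 1 than to concept 2 by `margin`.

    `margin` is a hypothetical threshold that would need tuning on real data.
    """
    return cosine(vec_c1, vec_attr) - cosine(vec_c2, vec_attr) > margin


# Toy vectors invented for illustration; real vectors would come from
# pre-trained word embeddings.
vectors = {
    "banana": [0.9, 0.1, 0.0],
    "apple": [0.1, 0.9, 0.0],
    "yellow": [0.8, 0.2, 0.1],
}
result = is_discriminative(vectors["banana"], vectors["apple"], vectors["yellow"])
```

The co-occurrence alternative mentioned in the abstract would replace `cosine` with a count-based association score while keeping the same comparison.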

GHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks
Mohammed Attia | Younes Samih | Wolfgang Maier
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a Deep Neural Network that combines word and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such as pre-trained embeddings, Brown clusters and named entity gazetteers. Our system is ranked second among those participating in the shared task, achieving an FB1 average of 70.09%.


Learning from Relatives: Unified Dialectal Arabic Segmentation
Younes Samih | Mohamed Eldesouki | Mohammed Attia | Kareem Darwish | Ahmed Abdelali | Hamdy Mubarak | Laura Kallmeyer
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

Arabic dialects do not just share a common koiné, but there are shared pan-dialectal linguistic phenomena that allow computational models for dialects to learn from each other. In this paper we build a unified segmentation model where the training data for different dialects are combined and a single model is trained. The model yields higher accuracies than dialect-specific models, eliminating the need for dialect identification before segmentation. We also measure the degree of relatedness between four major Arabic dialects by testing how a segmentation model trained on one dialect performs on the other dialects. We found that linguistic relatedness is contingent on geographical proximity. In our experiments we use SVM-based ranking and bi-LSTM-CRF sequence labeling.

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

A Neural Architecture for Dialectal Arabic Segmentation
Younes Samih | Mohammed Attia | Mohamed Eldesouki | Ahmed Abdelali | Hamdy Mubarak | Laura Kallmeyer | Kareem Darwish
Proceedings of the Third Arabic Natural Language Processing Workshop

The automated processing of Arabic Dialects is challenging due to the lack of spelling standards and to the scarcity of annotated data and resources in general. Segmentation of words into their constituent parts is an important processing building block. In this paper, we show how a neural network segmenter can be trained on only 350 annotated tweets, without any normalization or use of lexical features or lexical resources. We deal with segmentation as a sequence labeling problem at the character level. We show experimentally that our model can rival state-of-the-art methods that rely on additional resources.
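Treating segmentation as character-level sequence labeling means each character receives a tag that encodes its position inside a segment. The sketch below shows one way to produce such tags from a gold segmentation and to turn predicted tags back into segments. The B/M/E/S tag set, the "+" separator and the transliterated example word are assumptions for illustration and may differ from the scheme used in the paper.

```python
def char_labels(segmented, sep="+"):
    """Map a segmented word to (character, tag) pairs.

    Tags: B = first character of a segment, M = middle, E = last,
    S = segment consisting of a single character.
    """
    labels = []
    for segment in segmented.split(sep):
        if not segment:
            continue  # ignore empty pieces from doubled separators
        if len(segment) == 1:
            labels.append((segment, "S"))
        else:
            labels.append((segment[0], "B"))
            labels.extend((ch, "M") for ch in segment[1:-1])
            labels.append((segment[-1], "E"))
    return labels


def decode(chars, tags, sep="+"):
    """Rebuild the segmented word from characters and their tags."""
    pieces = []
    for ch, tag in zip(chars, tags):
        pieces.append(ch)
        if tag in ("E", "S"):  # a segment ends after this character
            pieces.append(sep)
    return "".join(pieces).rstrip(sep)


# Hypothetical transliterated word with a proclitic and an enclitic.
pairs = char_labels("w+ktb+hA")
tags = [tag for _, tag in pairs]
restored = decode("wktbhA", tags)
```

A sequence labeler then only has to predict one tag per character, and `decode` recovers the segment boundaries from its output.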


Explicit Fine grained Syntactic and Semantic Annotation of the Idafa Construction in Arabic
Abdelati Hawwari | Mohammed Attia | Mahmoud Ghoneim | Mona Diab
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Idafa in traditional Arabic grammar is an umbrella construction that covers several phenomena including what is expressed in English as noun-noun compounds and Saxon and Norman genitives. Additionally, Idafa participates in some other constructions, such as quantifiers, quasi-prepositions, and adjectives. Identifying the various types of the Idafa construction (IC) is of importance to Natural Language Processing (NLP) applications. Noun-noun compounds exhibit special behavior in most languages, impacting their semantic interpretation. Hence distinguishing them could have an impact on downstream NLP applications. The most comprehensive syntactic representation of the Arabic language is the LDC Arabic Treebank (ATB). In the ATB, ICs are not explicitly labeled and furthermore, there is no distinction between ICs of noun-noun relations and other traditional ICs. Hence, we devise a detailed syntactic and semantic typification process of the IC phenomenon in Arabic. We target the ATB as a platform for this classification. We render the ATB annotated not only with explicit IC labels but also with further semantic characterization, which is useful for syntactic, semantic and cross-language processing. Our typification of IC comprises 3 main syntactic IC types: FIC, GIC, and TIC, and they are further divided into 10 syntactic subclasses. The TIC group is further classified into semantic relations. We devise a method for automatic IC labeling and compare its yield against the CATiB treebank. Our evaluation shows that we achieve the same level of accuracy, but with the additional fine-grained classification into the various syntactic and semantic types.

The Power of Language Music: Arabic Lemmatization through Patterns
Mohammed Attia | Ayah Zirikly | Mona Diab
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

The interaction between roots and patterns in Arabic has intrigued lexicographers and morphologists for centuries. While roots provide the consonantal building blocks, patterns provide the syllabic vocalic moulds. While roots provide abstract semantic classes, patterns realize these classes in specific instances. In this way both roots and patterns are indispensable for understanding the derivational, morphological and, to some extent, the cognitive aspects of the Arabic language. In this paper we perform lemmatization (a high-level lexical processing) without relying on a lookup dictionary. We use a hybrid approach that consists of a machine learning classifier to predict the lemma pattern for a given stem, and mapping rules to convert stems to their respective lemmas with the vocalization defined by the pattern.

CogALex-V Shared Task: GHHH - Detecting Semantic Relations via Word Embeddings
Mohammed Attia | Suraj Maharjan | Younes Samih | Laura Kallmeyer | Thamar Solorio
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

This paper describes our system submission to the CogALex-2016 Shared Task on Corpus-Based Identification of Semantic Relations. Our system won first place for Task-1 and second place for Task-2. The evaluation results of our system on the test set are 88.1% (79.0% for TRUE only) f-measure for Task-1 on detecting semantic similarity, and 76.0% (42.3% when excluding RANDOM) for Task-2 on identifying finer-grained semantic relations. In our experiments, we try word analogy, linear regression, and multi-task Convolutional Neural Networks (CNNs) with word embeddings from publicly available word vectors. We found that linear regression performs better in the binary classification (Task-1), while CNNs have better performance in the multi-class semantic classification (Task-2). We assume that word analogy is more suited for deterministic answers rather than handling the ambiguity of one-to-many and many-to-many relationships. We also show that classifier performance could benefit from balancing the distribution of labels in the training data.

Multilingual Code-switching Identification via LSTM Recurrent Neural Networks
Younes Samih | Suraj Maharjan | Mohammed Attia | Laura Kallmeyer | Thamar Solorio
Proceedings of the Second Workshop on Computational Approaches to Code Switching


GWU-HASP-2015@QALB-2015 Shared Task: Priming Spelling Candidates with Probability
Mohammed Attia | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the Second Workshop on Arabic Natural Language Processing


Tharwa: A Large Scale Dialectal Arabic - Standard Arabic - English Lexicon
Mona Diab | Mohamed Al-Badrashiny | Maryam Aminian | Mohammed Attia | Heba Elfardy | Nizar Habash | Abdelati Hawwari | Wael Salloum | Pradeep Dasigi | Ramy Eskander
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

We introduce an electronic three-way lexicon, Tharwa, comprising Dialectal Arabic, Modern Standard Arabic and English correspondents. The paper focuses on Egyptian Arabic as the first pilot dialect for the resource, with plans to expand to other dialects of Arabic in later phases of the project. We describe Tharwa’s creation process and report on its current status. The lexical entries are augmented with various elements of linguistic information such as POS, gender, rationality, number, and root and pattern information. The lexicon is based on a compilation of information from both monolingual and bilingual existing resources such as paper dictionaries and electronic, corpus-based dictionaries. Multiple levels of quality checks are performed on the output of each step in the creation process. The importance of this lexicon lies in the fact that it is the first resource of its kind bridging multiple variants of Arabic with English. Furthermore, it is a wide-coverage lexical resource containing over 73,000 Egyptian entries. Tharwa is publicly available. We believe it will have a significant impact on both Theoretical Linguistics and Computational Linguistics research.

A Framework for the Classification and Annotation of Multiword Expressions in Dialectal Arabic
Abdelati Hawwari | Mohammed Attia | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

GWU-HASP: Hybrid Arabic Spelling and Punctuation Corrector
Mohammed Attia | Mohamed Al-Badrashiny | Mona Diab
Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)


The Floating Arabic Dictionary: An Automatic Method for Updating a Lexical Database through the Detection and Lemmatization of Unknown Words
Mohammed Attia | Younes Samih | Khaled Shaalan | Josef van Genabith
Proceedings of COLING 2012

Improved Spelling Error Detection and Correction for Arabic
Mohammed Attia | Pavel Pecina | Younes Samih | Khaled Shaalan | Josef van Genabith
Proceedings of COLING 2012: Posters

Arabic Word Generation and Modelling for Spell Checking
Khaled Shaalan | Mohammed Attia | Pavel Pecina | Younes Samih | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

Arabic is a language known for its rich and complex morphology. Although many research projects have focused on the problem of Arabic morphological analysis using different techniques and approaches, very few have addressed the issue of generation of fully inflected words for the purpose of text authoring. Available open-source spell checking resources for Arabic are too small and inadequate. Ayaspell, for example, the official resource used with OpenOffice applications, contains only 300,000 fully inflected words. We try to bridge this critical gap by creating an adequate, open-source and large-coverage word list for Arabic containing 9,000,000 fully inflected surface words. Furthermore, from a large list of valid forms and invalid forms we create a character-based tri-gram language model to approximate knowledge about permissible character clusters in Arabic, creating a novel method for detecting spelling errors. Testing of this language model gives a precision of 98.2% at a recall of 100%. We take our research a step further by creating a context-independent spelling correction tool using a finite-state automaton that measures the edit distance between input words and candidate corrections, the Noisy Channel Model, and knowledge-based rules. Our system performs significantly better than Hunspell in choosing the best solution, but it is still below the MS Spell Checker.
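The idea of using character trigrams to approximate permissible character clusters can be illustrated with a much simpler stand-in: collect every trigram seen in a list of valid words and flag any word that contains an unseen one. This sketch is not the paper's probabilistic language model and does not use invalid forms; the function names, the "#" padding symbol and the two-word toy list are assumptions for illustration.

```python
def char_trigrams(word, pad="#"):
    """Return all character trigrams of a word, padded at both ends."""
    padded = pad * 2 + word + pad * 2
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def train(valid_words):
    """Collect the set of trigrams attested in a list of valid words."""
    seen = set()
    for word in valid_words:
        seen.update(char_trigrams(word))
    return seen


def looks_misspelled(word, seen):
    """Flag a word if any of its trigrams never occurred in the valid list."""
    return any(trigram not in seen for trigram in char_trigrams(word))


# Toy transliterated word list, invented for illustration.
seen = train(["kitab", "katib"])
flag_valid = looks_misspelled("kitab", seen)
flag_invalid = looks_misspelled("kxtab", seen)
```

A set-membership check like this can only detect errors; ranking candidate corrections would still need the edit-distance and noisy-channel components the abstract describes.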

Automatic Extraction and Evaluation of Arabic LFG Resources
Mohammed Attia | Khaled Shaalan | Lamia Tounsi | Josef van Genabith
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper presents the results of an approach to automatically acquire large-scale, probabilistic Lexical-Functional Grammar (LFG) resources for Arabic from the Penn Arabic Treebank (ATB). Our starting point is the earlier work of Tounsi et al. (2009) on automatic LFG f(eature)-structure annotation for Arabic using the ATB. They exploit tree configuration, POS categories, functional tags, local heads and trace information to annotate nodes with LFG feature-structure equations. We utilize this annotation to automatically acquire grammatical function (dependency) based subcategorization frames and paths linking long-distance dependencies (LDDs). Many state-of-the-art treebank-based probabilistic parsing approaches are scalable and robust but often also shallow: they do not capture LDDs and represent only local information. Subcategorization frames and LDD paths can be used to recover LDDs from such parser output to capture deep linguistic information. Automatic acquisition of language resources from existing treebanks saves time and effort involved in creating such resources by hand. Moreover, data-driven automatic acquisition naturally associates probabilistic information with subcategorization frames and LDD paths. Finally, based on the statistical distribution of LDD path types, we propose empirical bounds on traditional regular expression based functional uncertainty equations used to handle LDDs in LFG.

Handling Unknown Words in Arabic FST Morphology
Khaled Shaalan | Mohammed Attia
Proceedings of the 10th International Workshop on Finite State Methods and Natural Language Processing


An Open-Source Finite State Morphological Transducer for Modern Standard Arabic
Mohammed Attia | Pavel Pecina | Antonio Toral | Lamia Tounsi | Josef van Genabith
Proceedings of the 9th International Workshop on Finite State Methods and Natural Language Processing


An Automatically Built Named Entity Lexicon for Arabic
Mohammed Attia | Antonio Toral | Lamia Tounsi | Monica Monachini | Josef van Genabith
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We have adapted and extended the automatic Multilingual, Interoperable Named Entity Lexicon approach to Arabic, using Arabic WordNet (AWN) and Arabic Wikipedia (AWK). First, we extract AWN’s instantiable nouns and identify the corresponding categories and hyponym subcategories in AWK. Then, we exploit Wikipedia inter-lingual links to locate correspondences between articles in ten different languages in order to identify Named Entities (NEs). We apply keyword search on AWK abstracts to provide for Arabic articles that do not have a correspondence in any of the other languages. In addition, we perform a post-processing step to fetch further NEs from AWK not reachable through AWN. Finally, we investigate diacritization using matching with geonames databases, MADA-TOKAN tools and different heuristics for restoring vowel marks of Arabic NEs. Using this methodology, we have extracted approximately 45,000 Arabic NEs and built, to the best of our knowledge, the largest, most mature and well-structured Arabic NE lexical resource to date. We have stored and organised this lexicon following the LMF ISO standard. We conduct a quantitative and qualitative evaluation against a manually annotated gold standard and achieve precision scores from 95.83% (with 66.13% recall) to 99.31% (with 61.45% recall) according to different values of a threshold.

Handling Unknown Words in Statistical Latent-Variable Parsing Models for Arabic, English and French
Mohammed Attia | Jennifer Foster | Deirdre Hogan | Joseph Le Roux | Lamia Tounsi | Josef van Genabith
Proceedings of the NAACL HLT 2010 First Workshop on Statistical Parsing of Morphologically-Rich Languages

Automatic Extraction of Arabic Multiword Expressions
Mohammed Attia | Antonio Toral | Lamia Tounsi | Pavel Pecina | Josef van Genabith
Proceedings of the 2010 Workshop on Multiword Expressions: from Theory to Applications


Automatic Treebank-Based Acquisition of Arabic LFG Dependency Structures
Lamia Tounsi | Mohammed Attia | Josef van Genabith
Proceedings of the EACL 2009 Workshop on Computational Approaches to Semitic Languages


Arabic Tokenization System
Mohammed Attia
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources

