Emad Mohamed


2020

A First Dataset for Film Age Appropriateness Investigation
Emad Mohamed | Le An Ha
Proceedings of the 12th Language Resources and Evaluation Conference

Film age appropriateness classification is an important problem with significant societal impact that has so far received little attention from Natural Language Processing and Machine Learning researchers. To this end, we have collected a corpus of 17,000 films along with their age ratings. We use the textual contents in an experiment to predict the correct age classification for the United States (G, PG, PG-13, R and NC-17) and the United Kingdom (U, PG, 12A, 15, 18 and R18). Our experiments indicate that gradient boosting machines beat FastText and various Deep Learning architectures. For the US ratings, we reach an overall accuracy of 79.3%, compared to a projected superhuman accuracy of 84%. For the UK ratings, we reach an overall accuracy of 65.3%, compared to a projected superhuman accuracy of 80.0%.

Fake or Real? A Study of Arabic Satirical Fake News
Hadeel Saadany | Constantin Orasan | Emad Mohamed
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)

One very common type of fake news is satire, which comes in the form of a news website or an online platform that parodies reputable real news agencies to create a sarcastic version of reality. This type of fake news is often disseminated by individuals on their online platforms because it delivers criticism far more effectively than a straightforward message. However, when the satirical text is disseminated via social media without mention of its source, it can be mistaken for real news. This study conducts several exploratory analyses to identify the linguistic properties of Arabic fake news with satirical content. It shows that although it parodies real news, Arabic satirical news has distinguishing features on the lexico-grammatical level. We exploit these features to build a number of machine learning models capable of identifying satirical fake news with an accuracy of up to 98.6%. The study introduces a new fake news dataset of 3185 articles scraped from two Arabic satirical news websites (‘Al-Hudood’ and ‘Al-Ahram Al-Mexici’). The real news dataset consists of 3710 articles collected from three official news sites: ‘BBC-Arabic’, ‘CNN-Arabic’ and ‘Al-Jazeera news’. Both datasets are concerned with political issues related to the Middle East.

2013

Pre-processing and Language Analysis for Arabic to French Statistical Machine Translation (Traduction automatique statistique pour l’arabe-français améliorée par le prétraitement et l’analyse de la langue) [in French]
Fatiha Sadat | Emad Mohamed
Proceedings of TALN 2013 (Volume 2: Short Papers)

2012

Annotating and Learning Morphological Segmentation of Egyptian Colloquial Arabic
Emad Mohamed | Behrang Mohit | Kemal Oflazer
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present an annotation and morphological segmentation scheme for Egyptian Colloquial Arabic (ECA) in which we annotate user-generated content that deviates significantly from the orthographic and grammatical rules of Modern Standard Arabic (MSA) and thus cannot be processed by the commonly used MSA tools. A per-letter classification scheme, in which each letter is classified as either a segment boundary or not, combined with a memory-based classifier that uses only word-internal context, proves effective and achieves 92% exact-match accuracy at the word level. The well-known MADA system achieves 81%, while the per-letter classification scheme using the ATB achieves 82%. Error analysis shows that the major problem is character ambiguity: ECA orthography overloads characters that would otherwise be more specific in MSA. For example, y (ي) and Y (ى) are collapsed to y (ي), and A (ا), > (أ), and < (إ) are collapsed to A (ا), or the characters are even totally confused and used interchangeably. While normalization helps alleviate orthographic inconsistencies, it aggravates the problem of ambiguity.
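The per-letter framing described in this abstract can be illustrated with a minimal sketch. It shows only how a segmented word might be encoded as one boundary label per letter, decoded back, and given a word-internal context window as features. The function names, the window size, the padding symbol, and the transliterated example word are assumptions made for illustration; they are not taken from the paper, and the memory-based classifier itself is not implemented here.

```python
def to_boundary_labels(segments):
    """Encode a segmented word as (letter, label) pairs.

    Label 1 marks a letter that ends a segment inside the word;
    label 0 marks every other letter, including the word-final one.
    """
    pairs = []
    last_segment = len(segments) - 1
    for i, seg in enumerate(segments):
        for j, ch in enumerate(seg):
            ends_segment = j == len(seg) - 1
            pairs.append((ch, int(ends_segment and i != last_segment)))
    return pairs


def from_boundary_labels(letters, labels):
    """Rebuild the segments from letters and their boundary labels."""
    segments, current = [], ""
    for ch, label in zip(letters, labels):
        current += ch
        if label == 1:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


def letter_features(word, i, window=2):
    """Word-internal context: the letter at index i plus `window`
    letters on each side, padded with "_" at the word edges
    (window size and padding symbol are assumed here)."""
    padded = "_" * window + word + "_" * window
    return tuple(padded[i : i + 2 * window + 1])


# Hypothetical transliterated word split into three segments.
pairs = to_boundary_labels(["w", "ktb", "hA"])
letters = "".join(ch for ch, _ in pairs)
labels = [label for _, label in pairs]
restored = from_boundary_labels(letters, labels)
```

Under this encoding, any classifier that predicts one binary label per letter from `letter_features` yields a segmentation, and word-level exact match can be checked by comparing the decoded segments with the gold ones.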

Morphological Segmentation and Part of Speech Tagging for Religious Arabic
Emad Mohamed
Fourth Workshop on Computational Approaches to Arabic-Script-based Languages

We annotate a small corpus of religious Arabic with morphological segmentation boundaries and fine-grained segment-based part of speech tags. Experiments on both segmentation and POS tagging show that the segmenter and POS tagger trained on the religious corpus outperform those trained on the Arabic Treebank, although the latter is 21 times as big, which shows the need for building religious Arabic linguistic resources. The small corpus we annotate improves segmentation accuracy by 5% absolute (from 90.84% to 95.70%), and POS tagging by 9% absolute (from 82.22% to 91.26%) when using gold standard segmentation, and by 9.6% absolute (from 78.62% to 88.22%) when using automatic segmentation.

Transforming Standard Arabic to Colloquial Arabic
Emad Mohamed | Behrang Mohit | Kemal Oflazer
Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2011

The Effect of Automatic Tokenization, Vocalization, Stemming, and POS Tagging on Arabic Dependency Parsing
Emad Mohamed
Proceedings of the Fifteenth Conference on Computational Natural Language Learning

2010

Is Arabic Part of Speech Tagging Feasible Without Word Segmentation?
Emad Mohamed | Sandra Kübler
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

Arabic Part of Speech Tagging
Emad Mohamed | Sandra Kübler
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

Arabic is a morphologically rich language, which presents a challenge for part of speech tagging. In this paper, we compare two novel methods for POS tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags that describe full words and does not require any word segmentation. The second approach is segmentation-based, using a machine learning segmenter. In this approach, the words are first segmented, then the segments are annotated with POS tags. Because of the word-based approach, we evaluate full word accuracy rather than segment accuracy. Word-based POS tagging yields better results than segment-based tagging (93.93% vs. 93.41%). Word-based tagging also gives the best results on known words, while the segmentation-based approach gives better results on unknown words. Combining both methods results in a word accuracy of 94.37%, which is very close to the result obtained by using gold standard segmentation (94.91%).

2009

Diacritization for Real-World Arabic Texts
Emad Mohamed | Sandra Kübler
Proceedings of the International Conference RANLP-2009