Walid Magdy


2021

Chandler: An Explainable Sarcastic Response Generator
Silviu Oprea | Steven Wilson | Walid Magdy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We introduce Chandler, a system that generates sarcastic responses to a given utterance. Previous sarcasm generators assume the intended meaning that sarcasm conceals is the opposite of the literal meaning. We argue that this traditional theory of sarcasm provides a grounding that is neither necessary, nor sufficient, for sarcasm to occur. Instead, we ground our generation process on a formal theory that specifies conditions that unambiguously differentiate sarcasm from non-sarcasm. Furthermore, Chandler not only generates sarcastic responses, but also explanations for why each response is sarcastic. This provides accountability, crucial for avoiding miscommunication between humans and conversational agents, particularly considering that sarcastic communication can be offensive. In human evaluation, Chandler achieves comparable or higher sarcasm scores, compared to state-of-the-art generators, while generating more diverse responses, that are more specific and more coherent to the input.

Proceedings of the Sixth Arabic Natural Language Processing Workshop
Nizar Habash | Houda Bouamor | Hazem Hajj | Walid Magdy | Wajdi Zaghouani | Fethi Bougares | Nadi Tomeh | Ibrahim Abu Farha | Samia Touileb
Proceedings of the Sixth Arabic Natural Language Processing Workshop

Benchmarking Transformer-based Language Models for Arabic Sentiment and Sarcasm Detection
Ibrahim Abu Farha | Walid Magdy
Proceedings of the Sixth Arabic Natural Language Processing Workshop

The introduction of transformer-based language models has been a revolutionary step for natural language processing (NLP) research. These models, such as BERT, GPT and ELECTRA, led to state-of-the-art performance in many NLP tasks. Most of these models were initially developed for English and other languages followed later. Recently, several Arabic-specific models started emerging. However, there are limited direct comparisons between these models. In this paper, we evaluate the performance of 24 of these models on Arabic sentiment and sarcasm detection. Our results show that the models achieving the best performance are those that are trained on only Arabic data, including dialectal Arabic, and use a larger number of parameters, such as the recently released MARBERT. However, we noticed that AraELECTRA is one of the top performing models while being much more efficient in its computational cost. Finally, the experiments on AraGPT2 variants showed low performance compared to BERT models, which indicates that it might not be suitable for classification tasks.

Overview of the WANLP 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic
Ibrahim Abu Farha | Wajdi Zaghouani | Walid Magdy
Proceedings of the Sixth Arabic Natural Language Processing Workshop

This paper provides an overview of the WANLP 2021 shared task on sarcasm and sentiment detection in Arabic. The shared task has two subtasks: sarcasm detection (subtask 1) and sentiment analysis (subtask 2). This shared task aims to promote and bring attention to Arabic sarcasm detection, which is crucial to improve the performance in other tasks such as sentiment analysis. The dataset used in this shared task, namely ArSarcasm-v2, consists of 15,548 tweets labelled for sarcasm, sentiment and dialect. We received 27 and 22 submissions for subtasks 1 and 2 respectively. Most of the approaches relied on using and fine-tuning pre-trained language models such as AraBERT and MARBERT. The top achieved results for the sarcasm detection and sentiment analysis tasks were 0.6225 F1-score and 0.748 F1-PN respectively.

SemEval 2021 Task 7: HaHackathon, Detecting and Rating Humor and Offense
J. A. Meaney | Steven Wilson | Luis Chiruzzo | Adam Lopez | Walid Magdy
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

SemEval 2021 Task 7, HaHackathon, was the first shared task to combine the previously separate domains of humor detection and offense detection. We collected 10,000 texts from Twitter and the Kaggle Short Jokes dataset, and had each annotated for humor and offense by 20 annotators aged 18-70. Our subtasks were binary humor detection, prediction of humor and offense ratings, and a novel controversy task: to predict if the variance in the humor ratings was higher than a specific threshold. The subtasks attracted 36-58 submissions, with most of the participants choosing to use pre-trained language models. Many of the highest performing teams also implemented additional optimization techniques, including task-adaptive training and adversarial training. The results suggest that the participating systems are well suited to humor detection, but that humor controversy is a more challenging task. We discuss which models excel in this task, which auxiliary techniques boost their performance, and analyze the errors which were not captured by the best systems.
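
The controversy label described above can be illustrated with a minimal sketch. It assumes population variance and an arbitrary threshold of 1.0; the abstract does not state which variance estimator or threshold the task used, and the function name `is_controversial` and the sample ratings are hypothetical.

```python
from statistics import pvariance


def is_controversial(ratings, threshold):
    """Return True when the variance of the humor ratings exceeds the threshold.

    Assumption: population variance; the task's actual estimator is not given here.
    """
    return pvariance(ratings) > threshold


# Hypothetical ratings from five annotators for two texts.
agreeing = [3, 3, 3, 4, 3]   # annotators largely agree
split = [0, 5, 0, 5, 1]      # annotators disagree strongly

print(is_controversial(agreeing, threshold=1.0))
print(is_controversial(split, threshold=1.0))
```

A text on which annotators agree yields low variance and is not flagged, while a text that splits annotators between very low and very high ratings is flagged.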

2020

Embedding Structured Dictionary Entries
Steven Wilson | Walid Magdy | Barbara McGillivray | Gareth Tyson
Proceedings of the First Workshop on Insights from Negative Results in NLP

Previous work has shown how to effectively use external resources such as dictionaries to improve English-language word embeddings, either by manipulating the training process or by applying post-hoc adjustments to the embedding space. We experiment with a multi-task learning approach for explicitly incorporating the structured elements of dictionary entries, such as user-assigned tags and usage examples, when learning embeddings for dictionary headwords. Our work generalizes several existing models for learning word embeddings from dictionaries. However, we find that the most effective representations overall are learned by simply training with a skip-gram objective over the concatenated text of all entries in the dictionary, giving no particular focus to the structure of the entries.

Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
Hend Al-Khalifa | Walid Magdy | Kareem Darwish | Tamer Elsayed | Hamdy Mubarak
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

From Arabic Sentiment Analysis to Sarcasm Detection: The ArSarcasm Dataset
Ibrahim Abu Farha | Walid Magdy
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

Sarcasm is one of the main challenges for sentiment analysis systems. Its complexity comes from the expression of opinion using implicit indirect phrasing. In this paper, we present ArSarcasm, an Arabic sarcasm detection dataset, which was created through the reannotation of available Arabic sentiment analysis datasets. The dataset contains 10,547 tweets, 16% of which are sarcastic. In addition to sarcasm the data was annotated for sentiment and dialects. Our analysis shows the highly subjective nature of these tasks, which is demonstrated by the shift in sentiment labels based on annotators’ biases. Experiments show the degradation of state-of-the-art sentiment analysers when faced with sarcastic content. Finally, we train a deep learning model for sarcasm detection using BiLSTM. The model achieves an F1 score of 0.46, which shows the challenging nature of the task, and should act as a basic baseline for future research on our dataset.

Overview of OSACT4 Arabic Offensive Language Detection Shared Task
Hamdy Mubarak | Kareem Darwish | Walid Magdy | Tamer Elsayed | Hend Al-Khalifa
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

This paper provides an overview of the offensive language detection shared task at the 4th workshop on Open-Source Arabic Corpora and Processing Tools (OSACT4). There were two subtasks, namely: Subtask A, involving the detection of offensive language, which contains unacceptable or vulgar content in addition to any kind of explicit or implicit insults or attacks against individuals or groups; and Subtask B, involving the detection of hate speech, which contains insults or threats targeting a group based on their nationality, ethnicity, race, gender, political or sport affiliation, religious belief, or other common characteristics. In total, 40 teams signed up to participate in Subtask A, and 14 of them submitted test runs. For Subtask B, 33 teams signed up to participate and 13 of them submitted runs. We present and analyze all submissions in this paper.

Multitask Learning for Arabic Offensive Language and Hate-Speech Detection
Ibrahim Abu Farha | Walid Magdy
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

Offensive language and hate-speech are phenomena that spread with the rising popularity of social media. Detecting such content is crucial for understanding and predicting conflicts, understanding polarisation among communities and providing means and tools to filter or block inappropriate content. This paper describes the SMASH team submission to OSACT4’s shared task on hate-speech and offensive language detection, where we explore different approaches to perform these tasks. The experiments cover a variety of approaches that include deep learning, transfer learning and multitask learning. We also explore the utilisation of sentiment information to perform the previous task. Our best model is a multitask learning architecture, based on CNN-BiLSTM, that was trained to detect hate-speech and offensive language and predict sentiment.

Emoji and Self-Identity in Twitter Bios
Jinhang Li | Giorgos Longinos | Steven Wilson | Walid Magdy
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

Emoji are widely used to express emotions and concepts on social media, and prior work has shown that users’ choice of emoji reflects the way that they wish to present themselves to the world. Emoji usage is typically studied in the context of posts made by users, and this view has provided important insights into phenomena such as emotional expression and self-representation. In addition to making posts, however, social media platforms like Twitter allow for users to provide a short bio, which is an opportunity to briefly describe their account as a whole. In this work, we focus on the use of emoji in these bio statements. We explore the ways in which users include emoji in these self-descriptions, finding different patterns than those observed around emoji usage in tweets. We examine the relationships between emoji used in bios and the content of users’ tweets, showing that the topics and even the average sentiment of tweets varies for users with different emoji in their bios. Lastly, we confirm that homophily effects exist with respect to the types of emoji that are included in bios of users and their followers.

iSarcasm: A Dataset of Intended Sarcasm
Silviu Oprea | Walid Magdy
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We consider the distinction between intended and perceived sarcasm in the context of textual sarcasm detection. The former occurs when an utterance is sarcastic from the perspective of its author, while the latter occurs when the utterance is interpreted as sarcastic by the audience. We show the limitations of previous labelling methods in capturing intended sarcasm and introduce the iSarcasm dataset of tweets labeled for sarcasm directly by their authors. Examining the state-of-the-art sarcasm detection models on our dataset showed low performance compared to previously studied datasets, which indicates that these datasets might be biased or obvious and sarcasm could be a phenomenon under-studied computationally thus far. By providing the iSarcasm dataset, we aim to encourage future NLP research to develop methods for detecting sarcasm in text as intended by the authors of the text, not as labeled under assumptions that we demonstrate to be sub-optimal.

Smash at SemEval-2020 Task 7: Optimizing the Hyperparameters of ERNIE 2.0 for Humor Ranking and Rating
J. A. Meaney | Steven Wilson | Walid Magdy
Proceedings of the Fourteenth Workshop on Semantic Evaluation

The use of pre-trained language models such as BERT and ULMFiT has become increasingly popular in shared tasks, due to their powerful language modelling capabilities. Our entry to SemEval uses ERNIE 2.0, a language model which is pre-trained on a large number of tasks to enrich the semantic and syntactic information learned. ERNIE’s knowledge masking pre-training task is a unique method for learning about named entities, and we hypothesise that it may be of use in a dataset which is built on news headlines and which contains many named entities. We optimize the hyperparameters in a regression and classification model and find that the hyperparameters we selected helped to make bigger gains in the classification model than the regression model.

Urban Dictionary Embeddings for Slang NLP Applications
Steven Wilson | Walid Magdy | Barbara McGillivray | Kiran Garimella | Gareth Tyson
Proceedings of the 12th Language Resources and Evaluation Conference

The choice of the corpus on which word embeddings are trained can have a sizable effect on the learned representations, the types of analyses that can be performed with them, and their utility as features for machine learning models. To contribute to the existing sets of pre-trained word embeddings, we introduce and release the first set of word embeddings trained on the content of Urban Dictionary, a crowd-sourced dictionary for slang words and phrases. We show that although these embeddings are trained on fewer total tokens (by at least an order of magnitude compared to most popular pre-trained embeddings), they have high performance across a range of common word embedding evaluations, ranging from semantic similarity to word clustering tasks. Further, for some extrinsic tasks such as sentiment analysis and sarcasm detection where we expect to require some knowledge of colloquial language on social media data, initializing classifiers with the Urban Dictionary Embeddings resulted in improved performance compared to initializing with a range of other well-known, pre-trained embeddings that are an order of magnitude larger in size.

2019

Exploring Author Context for Detecting Intended vs Perceived Sarcasm
Silviu Oprea | Walid Magdy
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We investigate the impact of using author context on textual sarcasm detection. We define author context as the embedded representation of their historical posts on Twitter and suggest neural models that extract these representations. We experiment with two tweet datasets, one labelled manually for sarcasm, and the other via tag-based distant supervision. We achieve state-of-the-art performance on the second dataset, but not on the one labelled manually, indicating a difference between intended sarcasm, captured by distant supervision, and perceived sarcasm, captured by manual labelling.

Proceedings of the Fourth Arabic Natural Language Processing Workshop
Wassim El-Hajj | Lamia Hadrich Belguith | Fethi Bougares | Walid Magdy | Imed Zitouni | Nadi Tomeh | Mahmoud El-Haj | Wajdi Zaghouani
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Arabic Tweet-Act: Speech Act Recognition for Arabic Asynchronous Conversations
Bushra Algotiml | AbdelRahim Elmadany | Walid Magdy
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Speech acts are the actions that a speaker intends when performing an utterance within conversations. In this paper, we proposed speech act classification for asynchronous conversations on Twitter using multiple machine learning methods including SVM and deep neural networks. We applied the proposed methods on the ArSAS tweets dataset. The obtained results show the superiority of deep learning methods compared to SVMs, where Bi-LSTM managed to achieve an accuracy of 87.5% and a macro-averaged F1 score of 61.5%. We believe that our results are the first to be reported on the task of speech-act recognition for asynchronous conversations on Arabic Twitter.

Mazajak: An Online Arabic Sentiment Analyser
Ibrahim Abu Farha | Walid Magdy
Proceedings of the Fourth Arabic Natural Language Processing Workshop

Sentiment analysis (SA) is one of the most useful natural language processing applications. The literature is flooded with papers and systems addressing this task, but most of the work is focused on English. In this paper, we present “Mazajak”, an online system for Arabic SA. The system is based on a deep learning model, which achieves state-of-the-art results on many Arabic dialect datasets including SemEval 2017 and ASTD. The availability of such a system should assist various applications and research that rely on sentiment analysis as a tool.

2018

Multi-Dialect Arabic POS Tagging: A CRF Approach
Kareem Darwish | Hamdy Mubarak | Ahmed Abdelali | Mohamed Eldesouki | Younes Samih | Randah Alharbi | Mohammed Attia | Walid Magdy | Laura Kallmeyer
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Part-of-Speech Tagging for Arabic Gulf Dialect Using Bi-LSTM
Randah Alharbi | Walid Magdy | Kareem Darwish | Ahmed AbdelAli | Hamdy Mubarak
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

Abusive Language Detection on Arabic Social Media
Hamdy Mubarak | Kareem Darwish | Walid Magdy
Proceedings of the First Workshop on Abusive Language Online

In this paper, we present our work on detecting abusive language on Arabic social media. We extract a list of obscene words and hashtags using common patterns used in offensive and rude communications. We also classify Twitter users according to whether they use any of these words or not in their tweets. We expand the list of obscene words using this classification, and we report results on a newly created dataset of classified Arabic tweets (obscene, offensive, and clean). We make this dataset freely available for research, in addition to the list of obscene words and hashtags. We are also publicly releasing a large corpus of classified user comments that were deleted from a popular Arabic news site due to violations of the site’s rules and guidelines.

2016

SemEval-2016 Task 3: Community Question Answering
Preslav Nakov | Lluís Màrquez | Alessandro Moschitti | Walid Magdy | Hamdy Mubarak | Abed Alhakim Freihat | Jim Glass | Bilal Randeree
Proceedings of the 10th International Workshop on Semantic Evaluation (SemEval-2016)

2015

QCRI: Answer Selection for Community Question Answering - Experiments for Arabic and English
Massimo Nicosia | Simone Filice | Alberto Barrón-Cedeño | Iman Saleh | Hamdy Mubarak | Wei Gao | Preslav Nakov | Giovanni Da San Martino | Alessandro Moschitti | Kareem Darwish | Lluís Màrquez | Shafiq Joty | Walid Magdy
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

SemEval-2015 Task 3: Answer Selection in Community Question Answering
Preslav Nakov | Lluís Màrquez | Walid Magdy | Alessandro Moschitti | Jim Glass | Bilal Randeree
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

Multi-Reference Evaluation for Dialectal Speech Recognition System: A Study for Egyptian ASR
Ahmed Ali | Walid Magdy | Steve Renals
Proceedings of the Second Workshop on Arabic Natural Language Processing

2010

Building a Domain-specific Document Collection for Evaluating Metadata Effects on Information Retrieval
Walid Magdy | Jinming Min | Johannes Leveling | Gareth J. F. Jones
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes the development of a structured document collection containing user-generated text and numerical metadata for exploring the exploitation of metadata in information retrieval (IR). The collection consists of more than 61,000 documents extracted from YouTube video pages on basketball in general and NBA (National Basketball Association) in particular, together with a set of 40 topics and their relevance judgements. In addition, a collection of nearly 250,000 user profiles related to the NBA collection is available. Several baseline IR experiments report the effect of using video-associated metadata on retrieval effectiveness. The results surprisingly show that searching the video titles only performs significantly better than searching additional metadata text fields of the videos such as the tags or the description.

2007

Arabic Cross-Document Person Name Normalization
Walid Magdy | Kareem Darwish | Ossama Emam | Hany Hassan
Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages: Common Issues and Resources

2006

Building a Heterogeneous Information Retrieval Collection of Printed Arabic Documents
Abdelrahim Abdelsapor | Noha Adly | Kareem Darwish | Ossama Emam | Walid Magdy | Magdi Nagi
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the development of an Arabic document image collection containing 34,651 documents from 1,378 different books and 25 topics with their relevance judgments. The books from which the collection is obtained are a part of a larger collection of 75,000 books being scanned for archival and retrieval at the Bibliotheca Alexandrina (BA). The documents in the collection vary widely in topics, fonts, and degradation levels. Initial baseline experiments were performed to examine the effectiveness of different index terms, with and without blind relevance feedback, on Arabic OCR degraded text.

Arabic OCR Error Correction Using Character Segment Correction, Language Modeling, and Shallow Morphology
Walid Magdy | Kareem Darwish
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing