Joni Salminen


2020

pdf bib
Improving Arabic Text Categorization Using Transformer Training Diversification
Shammur Absar Chowdhury | Ahmed Abdelali | Kareem Darwish | Jung Soon-Gyo | Joni Salminen | Bernard J. Jansen
Proceedings of the Fifth Arabic Natural Language Processing Workshop

Automatic categorization of short texts, such as news headlines and social media posts, has many applications ranging from content analysis to recommendation systems. In this paper, we use such text categorization i.e., labeling the social media posts to categories like ‘sports’, ‘politics’, ‘human-rights’ among others, to showcase the efficacy of models across different sources and varieties of Arabic. In doing so, we show that diversifying the training data, whether by using diverse training data for the specific task (an increase of 21% macro F1) or using diverse data to pre-train a BERT model (26% macro F1), leads to overall improvements in classification effectiveness. In our work, we also introduce two new Arabic text categorization datasets, where the first is composed of social media posts from a popular Arabic news channel that cover Twitter, Facebook, and YouTube, and the second is composed of tweets from popular Arabic accounts. The posts in the former are nearly exclusively authored in modern standard Arabic (MSA), while the tweets in the latter contain both MSA and dialectal Arabic.

pdf bib
A Multi-Platform Arabic News Comment Dataset for Offensive Language Detection
Shammur Absar Chowdhury | Hamdy Mubarak | Ahmed Abdelali | Soon-gyo Jung | Bernard J. Jansen | Joni Salminen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Access to social media often enables users to engage in conversation with limited accountability. This allows a user to share their opinions and ideology, especially regarding public content, occasionally adopting offensive language. This may encourage hate crimes or cause mental harm to targeted individuals or groups. Hence, it is important to detect offensive comments in social media platforms. Typically, most studies focus on offensive commenting in one platform only, even though the problem of offensive language is observed across multiple platforms. Therefore, in this paper, we introduce and make publicly available a new dialectal Arabic news comment dataset, collected from multiple social media platforms, including Twitter, Facebook, and YouTube. We follow two-step crowd-annotator selection criteria for low-representative language annotation task in a crowdsourcing platform. Furthermore, we analyze the distinctive lexical content along with the use of emojis in offensive comments. We train and evaluate the classifiers using the annotated multi-platform dataset along with other publicly available data. Our results highlight the importance of multiple platform dataset for (a) cross-platform, (b) cross-domain, and (c) cross-dialect generalization of classifier performance.