Kawarith: an Arabic Twitter Corpus for Crisis Events

Alaa Alharbi, Mark Lee


Abstract
Social media (SM) platforms such as Twitter provide large quantities of real-time data that can be leveraged during mass emergencies. Developing tools to support crisis-affected communities requires available datasets, which often do not exist for low-resource languages. This paper introduces Kawarith, a multi-dialect Arabic Twitter corpus for crisis events, comprising more than a million Arabic tweets collected during 22 crises that occurred between 2018 and 2020 and involved several types of hazard. Exploration of this content revealed the most discussed topics and information types, and the paper presents a labelled dataset from seven emergency events that serves as a gold standard for several tasks in crisis informatics research. Using annotated data from the same event, a BERT model is fine-tuned to classify tweets into different categories in the multi-label setting. Results show that BERT-based models yield good performance on this task even with small amounts of task-specific training data.
Anthology ID:
2021.wanlp-1.5
Volume:
Proceedings of the Sixth Arabic Natural Language Processing Workshop
Month:
April
Year:
2021
Address:
Kyiv, Ukraine (Virtual)
Editors:
Nizar Habash, Houda Bouamor, Hazem Hajj, Walid Magdy, Wajdi Zaghouani, Fethi Bougares, Nadi Tomeh, Ibrahim Abu Farha, Samia Touileb
Venue:
WANLP
Publisher:
Association for Computational Linguistics
Pages:
42–52
URL:
https://aclanthology.org/2021.wanlp-1.5
Cite (ACL):
Alaa Alharbi and Mark Lee. 2021. Kawarith: an Arabic Twitter Corpus for Crisis Events. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 42–52, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
Cite (Informal):
Kawarith: an Arabic Twitter Corpus for Crisis Events (Alharbi & Lee, WANLP 2021)
PDF:
https://aclanthology.org/2021.wanlp-1.5.pdf
Code:
alaa-a-a/multi-dialect-arabic-stop-words