Finding Romanized Arabic Dialect in Code-Mixed Tweets

Clare Voss; Stephen Tratz; Jamal Laoudi; Douglas Briesch

Abstract

Recent computational work on Arabic dialect identification has focused primarily on building and annotating corpora written in Arabic script. Arabic dialects, however, also appear in Roman script, especially in social media. This paper describes our recent work developing tweet corpora and a token-level classifier that identifies a Romanized Arabic dialect and distinguishes it from French and English in tweets. We focus on Moroccan Darija, one of several spoken vernaculars in the family of Maghrebi Arabic dialects. Even on noisy, code-mixed tweets, the classifier achieved token-level recall of 93.2% on Romanized Arabic dialect, 83.2% on English, and 90.1% on French. The classifier, now integrated into our tweet conversation annotation tool (Tratz et al. 2013), has semi-automated the construction of a Romanized Arabic-dialect lexicon. Two datasets extracted from our tweet corpus collection will be made available in the LRE MAP: a full list of Moroccan Darija surface token forms, and a table of lexical entries derived from this list with spelling variants.
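The token-level recall figures above are reported per language. As a rough illustration of how such a metric is computed, and not the authors' evaluation code, the sketch below counts, for each gold label, the fraction of tokens that received the same predicted label. The function name `per_class_recall`, the label strings, and the toy token labels are all hypothetical.

```python
from collections import Counter


def per_class_recall(gold, predicted):
    """Return {label: recall}, where recall is the share of gold tokens
    with that label that were predicted with the same label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    gold_counts = Counter(gold)
    # Count only the positions where the prediction matches the gold label.
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {label: correct[label] / n for label, n in gold_counts.items()}


# Toy example with made-up token labels (not data from the paper).
gold = ["darija", "darija", "french", "english", "darija", "french"]
predicted = ["darija", "french", "french", "english", "darija", "english"]

recall = per_class_recall(gold, predicted)
for label, value in sorted(recall.items()):
    print(f"{label}: {value:.3f}")
```

Reporting recall separately for each language, rather than a single accuracy figure, keeps a frequent class from masking weak performance on a rarer one in code-mixed text.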

Anthology ID: L14-1086
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 2249–2253
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/1116_Paper.pdf
Cite (ACL): Clare Voss, Stephen Tratz, Jamal Laoudi, and Douglas Briesch. 2014. Finding Romanized Arabic Dialect in Code-Mixed Tweets. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2249–2253, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): Finding Romanized Arabic Dialect in Code-Mixed Tweets (Voss et al., LREC 2014)
