Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning

Wafia Adouane, Nasredine Semmar, Richard Johansson


Abstract
Identifying the language of a text or speech input is the first step in any language-dependent natural language processing. The task is called Automatic Language Identification (ALI). The field has been studied since the early 1960s, and various methods have been applied to many standard languages. Standard ALI methods require training datasets and use character-based or word-based n-gram models. However, social media and new technologies have contributed to the rise of informal and minority languages on the Web, and state-of-the-art automatic language identifiers fail to properly identify many of them. Romanized Arabic (RA) and Romanized Berber (RB) are two such informal, under-resourced languages. The goal of this paper is twofold: to detect RA and RB as separate languages at the document level, and to distinguish between them, since they coexist in North Africa. We treat the task as a classification problem and use supervised machine learning to solve it. For both languages, character-based 5-grams combined with additional lexicons perform best, with F-scores of 99.75% for RB and 97.77% for RA.
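To make the idea of document-level classification over character 5-grams concrete, here is a minimal sketch in Python. It is not the authors' system: the abstract does not name a classifier, so a multinomial Naive Bayes with add-one smoothing is assumed here, the lexicon features are left out, and the names `char_ngrams` and `NgramNaiveBayes`, the labels `lang_a` and `lang_b`, and the training strings are all made up for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=5):
    """Return overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (assumed classifier)."""

    def __init__(self, n=5):
        self.n = n
        self.counts = {}             # label -> Counter of n-gram frequencies
        self.doc_counts = Counter()  # label -> number of training documents
        self.vocab = set()           # all n-grams seen during training

    def fit(self, docs, labels):
        for doc, label in zip(docs, labels):
            grams = char_ngrams(doc, self.n)
            self.counts.setdefault(label, Counter()).update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, doc):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, grams in self.counts.items():
            total = sum(grams.values())
            # Log prior plus add-one smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in char_ngrams(doc, self.n):
                score += math.log((grams[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Synthetic strings, not real language data.
model = NgramNaiveBayes(n=5).fit(
    ["xaxaxa xaxaxa", "momomo momomo"],
    ["lang_a", "lang_b"],
)
prediction = model.predict("xaxaxa")
```

In a realistic setting the training documents would come from labeled corpora, and lexicon matches could be added as extra features alongside the n-gram counts.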
Anthology ID:
W16-4807
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
53–61
URL:
https://aclanthology.org/W16-4807/
Cite (ACL):
Wafia Adouane, Nasredine Semmar, and Richard Johansson. 2016. Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 53–61, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning (Adouane et al., VarDial 2016)
PDF:
https://aclanthology.org/W16-4807.pdf