Identification of Languages in Algerian Arabic Multilingual Documents

Wafia Adouane, Simon Dobnik


Abstract
This paper presents a language identification system designed to detect the language of each word, in its context, in multilingual documents as generated in social media by bilingual/multilingual communities, in our case speakers of Algerian Arabic. We frame the task as a sequence tagging problem and use supervised machine learning with standard methods like HMM and n-gram classification tagging. We also experiment with a lexicon-based method. Combining all the methods in a fall-back mechanism and introducing some linguistic rules, to deal with unseen tokens and ambiguous words, gives an overall accuracy of 93.14%. Finally, we introduce rules for language identification from sequences of recognised words.
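The fall-back idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tagger functions, the label names (FRA, ENG, DIG, UNK), the toy lexicon, and the "copy the previous label" rule are all hypothetical stand-ins, chosen only to show how word-level taggers might be chained so that each one is consulted only when the earlier ones give no answer.

```python
from typing import Callable, Dict, List, Optional

# A tagger sees the tokens, the current position, and the labels assigned
# so far; it returns a label, or None to pass the token to the next tagger.
Tagger = Callable[[List[str], int, List[str]], Optional[str]]


def make_lexicon_tagger(lexicon: Dict[str, str]) -> Tagger:
    """Build a tagger that looks each lower-cased token up in a word list."""
    def tag(tokens: List[str], i: int, labels: List[str]) -> Optional[str]:
        return lexicon.get(tokens[i].lower())
    return tag


def digit_rule(tokens: List[str], i: int, labels: List[str]) -> Optional[str]:
    """Hypothetical rule: purely numeric tokens get their own label."""
    return "DIG" if tokens[i].isdigit() else None


def previous_label_rule(tokens: List[str], i: int, labels: List[str]) -> Optional[str]:
    """Hypothetical context rule: an unseen token inherits the previous label."""
    return labels[-1] if labels else None


def tag_sequence(tokens: List[str], taggers: List[Tagger], default: str = "UNK") -> List[str]:
    """Label each token with the first non-None answer in the tagger chain."""
    labels: List[str] = []
    for i in range(len(tokens)):
        label = None
        for tagger in taggers:
            label = tagger(tokens, i, labels)
            if label is not None:
                break  # an earlier tagger answered; skip the fall-backs
        labels.append(label if label is not None else default)
    return labels


# Toy lexicon for illustration only; it is not data from the paper.
toy_lexicon = {"bonjour": "FRA", "hello": "ENG"}
chain = [make_lexicon_tagger(toy_lexicon), digit_rule, previous_label_rule]
result = tag_sequence(["bonjour", "xyz", "hello", "42"], chain)
```

In a fuller system the first positions in the chain would presumably be held by trained models such as an HMM or n-gram tagger, with the lexicon and rules further down. The order of the chain is a design decision this sketch does not attempt to reproduce.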
Anthology ID:
W17-1301
Volume:
Proceedings of the Third Arabic Natural Language Processing Workshop
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Nizar Habash, Mona Diab, Kareem Darwish, Wassim El-Hajj, Hend Al-Khalifa, Houda Bouamor, Nadi Tomeh, Mahmoud El-Haj, Wajdi Zaghouani
Venue:
WANLP
SIG:
SEMITIC
Publisher:
Association for Computational Linguistics
Pages:
1–8
URL:
https://aclanthology.org/W17-1301
DOI:
10.18653/v1/W17-1301
Cite (ACL):
Wafia Adouane and Simon Dobnik. 2017. Identification of Languages in Algerian Arabic Multilingual Documents. In Proceedings of the Third Arabic Natural Language Processing Workshop, pages 1–8, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Identification of Languages in Algerian Arabic Multilingual Documents (Adouane & Dobnik, WANLP 2017)
PDF:
https://aclanthology.org/W17-1301.pdf