A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts

Wafia Adouane, Simon Dobnik, Jean-Philippe Bernardy, Nasredine Semmar


Abstract
This paper examines the effect of including background knowledge, in the form of a pre-trained character-level neural language model (LM), and of data bootstrapping to overcome the problem of unbalanced, limited resources. As a test case, we explore language identification in short, unedited, mixed-language texts involving an under-resourced language, namely Algerian Arabic, for which both labelled and unlabelled data are limited. We compare the performance of two traditional machine learning methods and a deep neural network (DNN) model. The results show that, overall, DNNs perform better on labelled data for the majority categories but struggle with the minority ones. While the effect of the untokenised, unlabelled data encoded as an LM differs for each category, bootstrapping improves the performance of all systems on all categories. These methods are language independent and could be generalised to other under-resourced languages for which a small labelled dataset and a larger unlabelled dataset are available.
Anthology ID:
W18-1203
Volume:
Proceedings of the Second Workshop on Subword/Character LEvel Models
Month:
June
Year:
2018
Address:
New Orleans
Editors:
Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, Yadollah Yaghoobzadeh
Venue:
SCLeM
Publisher:
Association for Computational Linguistics
Pages:
22–31
URL:
https://aclanthology.org/W18-1203
DOI:
10.18653/v1/W18-1203
Cite (ACL):
Wafia Adouane, Simon Dobnik, Jean-Philippe Bernardy, and Nasredine Semmar. 2018. A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts. In Proceedings of the Second Workshop on Subword/Character LEvel Models, pages 22–31, New Orleans. Association for Computational Linguistics.
Cite (Informal):
A Comparison of Character Neural Language Model and Bootstrapping for Language Identification in Multilingual Noisy Texts (Adouane et al., SCLeM 2018)
PDF:
https://aclanthology.org/W18-1203.pdf