Naive Bayes and BiLSTM Ensemble for Discriminating between Mainland and Taiwan Variation of Mandarin Chinese

Li Yang, Yang Xiang


Abstract
Automatic dialect identification is a more challenging task than language identification, as it requires the ability to discriminate between varieties of one language. In this paper, we propose an ensemble-based system, which combines traditional machine learning models trained on bag-of-n-gram features with deep learning models trained on word embeddings, to solve the Discriminating between Mainland and Taiwan Variation of Mandarin Chinese (DMT) shared task at VarDial 2019. Our experiments show that a Naive Bayes classifier based on a combination of character bigrams and trigrams is a very strong model for identifying varieties of Mandarin Chinese. Through a further ensemble of Naive Bayes and BiLSTM, our system (team: itsalexyang) achieved macro-averaged F1 scores of 0.8530 and 0.8687 in the two tracks.
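To illustrate the kind of model the abstract describes, the following is a minimal sketch of a multinomial Naive Bayes classifier over combined character bigram and trigram counts. It is not the authors' implementation: the class name CharNgramNaiveBayes, the helper char_ngrams, the add-one smoothing, and the four toy training sentences are all assumptions made for illustration, and the BiLSTM and ensemble components are not shown.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_values=(2, 3)):
    """Return all character n-grams of the given sizes (bigrams, then trigrams)."""
    return [text[i:i + n] for n in n_values for i in range(len(text) - n + 1)]


class CharNgramNaiveBayes:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing constant (an assumed default)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        # Total n-gram count per class, used in the smoothed likelihood.
        self.totals = {c: sum(self.feature_counts[c].values())
                       for c in self.class_counts}
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for one text."""
        n_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for c in self.class_counts:
            score = math.log(self.class_counts[c] / n_docs)  # log prior
            denom = self.totals[c] + self.alpha * vocab_size
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams unseen in training are skipped
                    count = self.feature_counts[c][gram]
                    score += math.log((count + self.alpha) / denom)
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy data invented for this sketch; "M" = Mainland, "T" = Taiwan.
train_texts = [
    "我用软件处理信息",
    "出租车司机使用网络",
    "我用软体处理资讯",
    "计程车司机使用网路",
]
train_labels = ["M", "M", "T", "T"]

model = CharNgramNaiveBayes().fit(train_texts, train_labels)
print(model.predict("这个软件的信息"))
print(model.predict("计程车上的资讯"))
```

Working at the character level avoids the need for word segmentation, and short n-grams are enough to pick up region-specific vocabulary such as the differing terms for "software" or "taxi" in the toy sentences.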
Anthology ID:
W19-1412
Volume:
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
June
Year:
2019
Address:
Ann Arbor, Michigan
Editors:
Marcos Zampieri, Preslav Nakov, Shervin Malmasi, Nikola Ljubešić, Jörg Tiedemann, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
120–127
URL:
https://aclanthology.org/W19-1412/
DOI:
10.18653/v1/W19-1412
Cite (ACL):
Li Yang and Yang Xiang. 2019. Naive Bayes and BiLSTM Ensemble for Discriminating between Mainland and Taiwan Variation of Mandarin Chinese. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 120–127, Ann Arbor, Michigan. Association for Computational Linguistics.
Cite (Informal):
Naive Bayes and BiLSTM Ensemble for Discriminating between Mainland and Taiwan Variation of Mandarin Chinese (Yang & Xiang, VarDial 2019)
PDF:
https://aclanthology.org/W19-1412.pdf