An Unsupervised Morphological Criterion for Discriminating Similar Languages

Adrien Barbaresi


Abstract
In this study, conducted on the occasion of the Discriminating between Similar Languages shared task, I introduce an additional decision factor focusing on the token and subtoken level. The motivation behind this submission is to test whether a morphologically informed criterion can add linguistically relevant information to global categorization and thus improve performance. The contributions of this paper are (1) a description of the unsupervised, low-resource method; (2) an evaluation and analysis of its raw performance; and (3) an assessment of its impact within a model comprising common indicators used in language identification. I present and discuss the systems used in task A, a 12-way language identification task comprising varieties of five main language groups. Additionally, I introduce a new off-the-shelf Naive Bayes classifier using a contrastive word and subword n-gram model (“Bayesline”), which outperforms the best submissions.
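To make the idea of a Naive Bayes classifier over word and subword n-grams more concrete, here is a minimal sketch in Python. It is not the author's “Bayesline” system: the feature scheme (word unigrams plus character n-grams of length 2 to 4 inside each token), the add-alpha smoothing, and all names (`extract_features`, `NaiveBayesLanguageID`) are assumptions chosen for illustration only.

```python
import math
from collections import Counter, defaultdict


def extract_features(text, n_min=2, n_max=4):
    """Return word unigrams plus character n-grams taken inside each token.

    The n-gram range is an illustrative assumption, not a value from the paper.
    """
    features = []
    for token in text.lower().split():
        features.append("w:" + token)
        padded = "_" + token + "_"  # underscores mark token boundaries
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                features.append("c:" + padded[i:i + n])
    return features


class NaiveBayesLanguageID:
    """Multinomial Naive Bayes with add-alpha smoothing (hypothetical sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.feature_counts = defaultdict(Counter)  # label -> feature counts
        self.doc_counts = Counter()                 # label -> number of texts
        self.vocabulary = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            feats = extract_features(text)
            self.feature_counts[label].update(feats)
            self.doc_counts[label] += 1
            self.vocabulary.update(feats)
        return self

    def predict(self, text):
        feats = extract_features(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, counts in self.feature_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + self.alpha * vocab_size
            for feat in feats:
                score += math.log((counts[feat] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A classifier of this kind would be trained with `NaiveBayesLanguageID().fit(texts, labels)` and queried with `predict(text)`. Restricting character n-grams to the inside of tokens keeps the subword features focused on affixes and stems, which is one plausible way to expose token-level and subtoken-level cues to the model.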
Anthology ID:
W16-4827
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
212–220
URL:
https://aclanthology.org/W16-4827/
Cite (ACL):
Adrien Barbaresi. 2016. An Unsupervised Morphological Criterion for Discriminating Similar Languages. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 212–220, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
An Unsupervised Morphological Criterion for Discriminating Similar Languages (Barbaresi, VarDial 2016)
PDF:
https://aclanthology.org/W16-4827.pdf