Advances in Ngram-based Discrimination of Similar Languages

Cyril Goutte, Serge Léger


Abstract
We describe the systems entered by the National Research Council in the 2016 shared task on discriminating similar languages. As in previous years, we relied on character ngram features and a mixture of discriminative and generative statistical classifiers. We mostly investigated how the amount of data affects performance in the open task, and compared the two-stage approach (predicting language/group, then variant) to a flat approach. Results suggest that ngrams are still state-of-the-art for language and variant identification, and that additional data has a small but decisive impact.
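The contrast between the flat and two-stage approaches mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not specify the classifiers, so a small add-one-smoothed naive Bayes model over character ngram counts stands in for them, and the names `char_ngrams`, `NgramNB`, `fit_two_stage`, `predict_two_stage`, the toy sentences, and the variant labels are all hypothetical rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=3):
    """All character n-grams of length 1..n_max, shortest first."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramNB:
    """Multinomial naive Bayes over character n-gram counts (add-one smoothing).

    Illustrative stand-in only; not the classifier used by the authors.
    """

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.gram_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.gram_counts[label].update(char_ngrams(text))
        self.vocab_size = len({g for c in self.gram_counts.values() for g in c})
        return self

    def predict(self, text):
        grams = char_ngrams(text)
        n_docs = sum(self.doc_counts.values())

        def log_score(label):
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.vocab_size
            prior = math.log(self.doc_counts[label] / n_docs)
            return prior + sum(math.log((counts[g] + 1) / denom) for g in grams)

        return max(self.doc_counts, key=log_score)


def fit_two_stage(texts, variants, group_of):
    """Train one group-level model, then one variant model per group."""
    groups = [group_of[v] for v in variants]
    group_clf = NgramNB().fit(texts, groups)
    variant_clfs = {}
    for group in set(groups):
        rows = [i for i, g in enumerate(groups) if g == group]
        variant_clfs[group] = NgramNB().fit([texts[i] for i in rows],
                                            [variants[i] for i in rows])
    return group_clf, variant_clfs


def predict_two_stage(model, text):
    """Predict the group first, then the variant within that group."""
    group_clf, variant_clfs = model
    return variant_clfs[group_clf.predict(text)].predict(text)


# Hypothetical toy data: four invented sentences, one per variant label.
GROUP_OF = {"es-ES": "es", "es-AR": "es", "pt-BR": "pt", "pt-PT": "pt"}
TEXTS = ["el ordenador está en el coche", "la computadora está en el auto",
         "o ônibus chegou ao trem", "o autocarro chegou ao comboio"]
VARIANTS = ["es-ES", "es-AR", "pt-BR", "pt-PT"]

flat = NgramNB().fit(TEXTS, VARIANTS)                   # one model over all variants
two_stage = fit_two_stage(TEXTS, VARIANTS, GROUP_OF)    # group model + per-group models
query = "el auto está en la calle"
print("flat:", flat.predict(query))
print("two-stage:", predict_two_stage(two_stage, query))
```

In this sketch the flat model chooses among all variants at once, while the two-stage model can only return a variant from the group it predicted first, so an error at the group stage cannot be recovered at the variant stage.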
Anthology ID:
W16-4823
Volume:
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)
Month:
December
Year:
2016
Address:
Osaka, Japan
Editors:
Preslav Nakov, Marcos Zampieri, Liling Tan, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi
Venue:
VarDial
Publisher:
The COLING 2016 Organizing Committee
Pages:
178–184
URL:
https://aclanthology.org/W16-4823
Cite (ACL):
Cyril Goutte and Serge Léger. 2016. Advances in Ngram-based Discrimination of Similar Languages. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 178–184, Osaka, Japan. The COLING 2016 Organizing Committee.
Cite (Informal):
Advances in Ngram-based Discrimination of Similar Languages (Goutte & Léger, VarDial 2016)
PDF:
https://aclanthology.org/W16-4823.pdf