When Simple n-gram Models Outperform Syntactic Approaches: Discriminating between Dutch and Flemish

Martin Kroon, Masha Medvedeva, Barbara Plank


Abstract
In this paper we present the results of our participation in the VarDial 2018 shared task on Discriminating between Dutch and Flemish in Subtitles. We apply techniques proven to work well for discriminating between language varieties and explore the potential of syntactic features, namely hierarchical syntactic subtrees, experimenting with different combinations of features. Discriminating between these two languages turned out to be a very hard task, and not only for a machine: human performance is only around 0.51 F1. Our best system is a simple Naive Bayes model with word unigrams and bigrams, which achieved a macro F1 score of 0.62 and ranked 4th in the shared task.
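
To make the best-performing setup concrete, the following is a minimal sketch of a Naive Bayes classifier over word unigrams and bigrams, scored with macro F1. It is not the authors' implementation: the multinomial formulation, the add-one smoothing, the whitespace tokenisation, the labels `NL` and `BE`, and the placeholder training strings are all assumptions made for illustration, and the toy data is not drawn from the shared-task corpus.

```python
import math
from collections import Counter, defaultdict


def ngrams(text):
    """Return word unigrams and bigrams of a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)  # label -> n-gram counts
        self.docs = Counter(labels)         # label -> number of documents
        for text, label in zip(texts, labels):
            self.counts[label].update(ngrams(text))
        self.vocab = {f for c in self.counts.values() for f in c}
        return self

    def predict(self, text):
        total_docs = sum(self.docs.values())
        best_label, best_score = None, -math.inf
        for label in sorted(self.docs):
            total = sum(self.counts[label].values()) + len(self.vocab)
            score = math.log(self.docs[label] / total_docs)  # class prior
            for feat in ngrams(text):
                if feat in self.vocab:  # skip n-grams unseen in training
                    score += math.log((self.counts[label][feat] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for label in sorted(set(gold)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder tokens stand in for subtitle text; labels are hypothetical.
train_texts = ["alpha beta gamma", "alpha beta delta",
               "omega beta gamma", "omega beta delta"]
train_labels = ["NL", "NL", "BE", "BE"]
model = NaiveBayes().fit(train_texts, train_labels)

test_texts = ["alpha beta", "omega beta"]
predictions = [model.predict(t) for t in test_texts]
print(predictions, macro_f1(["NL", "BE"], predictions))
```

Macro averaging weights both classes equally regardless of how many examples each has, which is why the sketch computes F1 per class before taking the mean.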
Anthology ID:
W18-3928
Volume:
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)
Month:
August
Year:
2018
Address:
Santa Fe, New Mexico, USA
Editors:
Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
244–253
URL:
https://aclanthology.org/W18-3928/
Cite (ACL):
Martin Kroon, Masha Medvedeva, and Barbara Plank. 2018. When Simple n-gram Models Outperform Syntactic Approaches: Discriminating between Dutch and Flemish. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 244–253, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
Cite (Informal):
When Simple n-gram Models Outperform Syntactic Approaches: Discriminating between Dutch and Flemish (Kroon et al., VarDial 2018)
PDF:
https://aclanthology.org/W18-3928.pdf