Exploring Lexical and Syntactic Features for Language Variety Identification

Chris van der Lee, Antal van den Bosch


Abstract
We present a method to discriminate between texts written in either the Netherlandic or the Flemish variant of the Dutch language. The method draws on a feature bundle representing text statistics, syntactic features, and word n-grams. Text statistics include average word length and sentence length, while syntactic features include ratios of function words and part-of-speech n-grams. The effectiveness of the classifier was measured by classifying Dutch subtitles developed for either Dutch or Flemish television. Several machine learning algorithms and feature combination methods were compared to find the best generalization performance. A machine-learning meta classifier based on AdaBoost attained the best F-score of 0.92.
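As a rough illustration of the text-statistic and function-word features the abstract mentions, the following Python sketch computes average word length, average sentence length, and a function-word ratio for a short text. It is not the authors' implementation: the tokenization is a naive regular-expression split, and the `FUNCTION_WORDS` list and the `text_statistics` name are hypothetical choices made only for this example.

```python
import re

# Hypothetical, tiny function-word list for illustration only;
# the abstract does not specify which list the authors used.
FUNCTION_WORDS = {"de", "het", "een", "en", "van", "in", "dat", "is", "op", "te"}


def text_statistics(text):
    """Return average word length, average sentence length (in words),
    and the ratio of function words, using naive regex tokenization."""
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are runs of Unicode letters, lowercased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    if not words or not sentences:
        return {"avg_word_len": 0.0, "avg_sent_len": 0.0, "function_word_ratio": 0.0}
    return {
        "avg_word_len": sum(len(w) for w in words) / len(words),
        "avg_sent_len": len(words) / len(sentences),
        "function_word_ratio": sum(w in FUNCTION_WORDS for w in words) / len(words),
    }


if __name__ == "__main__":
    print(text_statistics("Dat is een boek. Het is van de man."))
```

Values like these could be concatenated with n-gram counts into a single feature vector before being passed to a classifier; how the paper actually combines its features is described in the full text, not here.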
Anthology ID:
W17-1224
Volume:
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Preslav Nakov, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
190–199
URL:
https://aclanthology.org/W17-1224
DOI:
10.18653/v1/W17-1224
Cite (ACL):
Chris van der Lee and Antal van den Bosch. 2017. Exploring Lexical and Syntactic Features for Language Variety Identification. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 190–199, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Exploring Lexical and Syntactic Features for Language Variety Identification (van der Lee & van den Bosch, VarDial 2017)
PDF:
https://aclanthology.org/W17-1224.pdf