Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets

Yves Bestgen


Abstract
This paper describes the system developed by the Centre for English Corpus Linguistics (CECL) to discriminate between similar languages, language varieties and dialects. Based on an SVM with character and POStag n-grams as features and the BM25 weighting scheme, it achieved 92.7% accuracy in the Discriminating between Similar Languages (DSL) task, ranking first among eleven systems but with a lead over the next three teams of only 0.2%. A simpler version of the system ranked second in the German Dialect Identification (GDI) task thanks to several ad hoc postprocessing steps. Complementary analyses carried out by a cross-validation procedure suggest that the BM25 weighting scheme could be competitive in this type of task, at least in comparison with the sublinear TF-IDF. POStag n-grams also improved the system performance.
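To make the contrast between the two weighting schemes named in the abstract concrete, the following is a minimal sketch of BM25 and sublinear TF-IDF weights computed over character n-grams. It is not the system described in the paper: the function names, the n-gram range, the parameter values `k1=1.2` and `b=0.75`, and the smoothed IDF variant are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max (range is an assumption)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def doc_frequencies(tfs):
    """Number of documents in which each n-gram occurs."""
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())
    return df


def bm25_weights(docs, k1=1.2, b=0.75, n_min=1, n_max=3):
    """Okapi-style BM25 weights per document (hypothetical parameter values)."""
    tfs = [char_ngrams(d, n_min, n_max) for d in docs]
    lengths = [sum(tf.values()) for tf in tfs]
    avg_len = sum(lengths) / len(docs)
    df = doc_frequencies(tfs)
    n_docs = len(docs)
    weighted = []
    for tf, length in zip(tfs, lengths):
        # Length normalisation: longer-than-average documents are damped.
        norm = k1 * (1 - b + b * length / avg_len)
        vec = {}
        for gram, f in tf.items():
            # Smoothed IDF (the "+ 1" keeps it positive); an assumed variant.
            idf = math.log((n_docs - df[gram] + 0.5) / (df[gram] + 0.5) + 1.0)
            # The term frequency saturates: it can never exceed idf * (k1 + 1).
            vec[gram] = idf * f * (k1 + 1) / (f + norm)
        weighted.append(vec)
    return weighted


def sublinear_tfidf_weights(docs, n_min=1, n_max=3):
    """Sublinear TF-IDF: (1 + log tf) * log(N / df), without normalisation."""
    tfs = [char_ngrams(d, n_min, n_max) for d in docs]
    df = doc_frequencies(tfs)
    n_docs = len(docs)
    return [
        {g: (1 + math.log(f)) * math.log(n_docs / df[g]) for g, f in tf.items()}
        for tf in tfs
    ]
```

In a pipeline of the kind the abstract describes, vectors like these would be the input features of an SVM classifier. The main practical difference is that BM25 bounds the contribution of a repeated n-gram and adjusts for document length, whereas sublinear TF-IDF only dampens repetition logarithmically.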
Anthology ID:
W17-1214
Volume:
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Month:
April
Year:
2017
Address:
Valencia, Spain
Editors:
Preslav Nakov, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann, Shevin Malmasi, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
115–123
URL:
https://aclanthology.org/W17-1214
DOI:
10.18653/v1/W17-1214
Cite (ACL):
Yves Bestgen. 2017. Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 115–123, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal):
Improving the Character Ngram Model for the DSL Task with BM25 Weighting and Less Frequently Used Feature Sets (Bestgen, VarDial 2017)
PDF:
https://aclanthology.org/W17-1214.pdf