Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words

Helena Gomez; Ilia Markov; Jorge Baptista; Grigori Sidorov; David Pinto

doi:10.18653/v1/W17-1217

Abstract

This paper presents the cic_ualg system that took part in the Discriminating between Similar Languages (DSL) shared task, held at the VarDial 2017 Workshop. This year’s task aims at identifying 14 languages across 6 language groups using a corpus of excerpts from journalistic texts. Two classification approaches were compared: a single-step approach (all languages at once) and a two-step approach (first the language group, then the language within the group). The features exploited include lexical features (word unigrams) and character n-grams. Besides traditional (untyped) character n-grams, we introduce typed character n-grams in the DSL task. Experiments were carried out with different feature representation methods (binary and raw term frequency), frequency threshold values, and machine-learning algorithms: Support Vector Machines (SVM) and Multinomial Naive Bayes (MNB). Our best run in the DSL task achieved 91.46% accuracy.
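
The two-step approach and the feature representations named in the abstract can be illustrated with a small sketch. This is not the authors' system: it uses only the Python standard library, a hand-written Multinomial Naive Bayes with add-one smoothing (the SVM is omitted), and covers only word unigrams and untyped character 3-grams, since the abstract does not define the typed n-gram categories. The function names, the `GROUP_OF` mapping, the language labels, and the toy sentences are all hypothetical and chosen for illustration only.

```python
import math
from collections import Counter, defaultdict

def extract_features(text, n_values=(3,)):
    """Word unigrams plus untyped character n-grams, as prefixed strings."""
    text = text.lower()
    feats = ["w:" + w for w in text.split()]
    for n in n_values:
        feats += ["c%d:%s" % (n, text[i:i + n]) for i in range(len(text) - n + 1)]
    return feats

def build_vocab(texts, min_freq=2):
    """Frequency threshold: keep features seen at least min_freq times."""
    totals = Counter(f for t in texts for f in extract_features(t))
    return {f for f, c in totals.items() if c >= min_freq}

def vectorize(text, vocab, binary=False):
    """Raw term frequency by default; binary=True records presence only."""
    counts = Counter(f for f in extract_features(text) if f in vocab)
    return {f: 1 for f in counts} if binary else dict(counts)

class TinyMNB:
    """Minimal Multinomial Naive Bayes with add-one smoothing (illustrative)."""
    def fit(self, vectors, labels, vocab):
        self.vocab_size = len(vocab)
        self.doc_counts = Counter(labels)
        self.feat_counts = defaultdict(Counter)
        for vec, y in zip(vectors, labels):
            self.feat_counts[y].update(vec)
        self.totals = {y: sum(c.values()) for y, c in self.feat_counts.items()}
        return self

    def predict(self, vec):
        n_docs = sum(self.doc_counts.values())
        def score(y):
            denom = self.totals[y] + self.vocab_size
            prior = math.log(self.doc_counts[y] / n_docs)
            return prior + sum(v * math.log((self.feat_counts[y][f] + 1) / denom)
                               for f, v in vec.items())
        return max(self.doc_counts, key=score)

def train_two_step(texts, labels, group_of, min_freq=2, binary=False):
    """Step 1 predicts the language group; step 2 the language within it."""
    vocab = build_vocab(texts, min_freq)
    vecs = [vectorize(t, vocab, binary) for t in texts]
    group_clf = TinyMNB().fit(vecs, [group_of[y] for y in labels], vocab)
    lang_clfs = {}
    for g in set(group_of.values()):
        idx = [i for i, y in enumerate(labels) if group_of[y] == g]
        lang_clfs[g] = TinyMNB().fit([vecs[i] for i in idx],
                                     [labels[i] for i in idx], vocab)
    def predict(text):
        vec = vectorize(text, vocab, binary)
        return lang_clfs[group_clf.predict(vec)].predict(vec)
    return predict

# Hypothetical toy data (not the DSL corpus): two groups, two varieties each.
GROUP_OF = {"pt-BR": "pt", "pt-PT": "pt", "es-ES": "es", "es-AR": "es"}
TRAIN = [
    ("o ônibus chegou cedo ao trem", "pt-BR"),
    ("o ônibus parou perto do trem", "pt-BR"),
    ("o autocarro chegou cedo ao comboio", "pt-PT"),
    ("o autocarro parou perto do comboio", "pt-PT"),
    ("el ordenador está en el coche", "es-ES"),
    ("el ordenador nuevo y el coche", "es-ES"),
    ("la computadora está en el auto", "es-AR"),
    ("la computadora nueva y el auto", "es-AR"),
]
predict = train_two_step([t for t, _ in TRAIN], [y for _, y in TRAIN], GROUP_OF)
print(predict("o autocarro chegou"))
print(predict("la computadora nueva"))
```

A single-step variant would fit one `TinyMNB` directly on the language labels; passing `binary=True` or a different `min_freq` switches the feature representation and the frequency threshold, which are the settings the experiments varied.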

Anthology ID: W17-1217
Volume: Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)
Month: April
Year: 2017
Address: Valencia, Spain
Editors: Preslav Nakov, Marcos Zampieri, Nikola Ljubešić, Jörg Tiedemann, Shervin Malmasi, Ahmed Ali
Venue: VarDial
Publisher: Association for Computational Linguistics
Pages: 137–145
URL: https://aclanthology.org/W17-1217/
DOI: 10.18653/v1/W17-1217
Cite (ACL): Helena Gomez, Ilia Markov, Jorge Baptista, Grigori Sidorov, and David Pinto. 2017. Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words. In Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial), pages 137–145, Valencia, Spain. Association for Computational Linguistics.
Cite (Informal): Discriminating between Similar Languages Using a Combination of Typed and Untyped Character N-grams and Words (Gomez et al., VarDial 2017)
PDF: https://aclanthology.org/W17-1217.pdf
