Fewer features perform well at Native Language Identification task

Taraka Rama, Çağrı Çöltekin


Abstract
This paper describes our results at the NLI shared task 2017. We participated in the essay, speech, and fusion tracks, which use text, speech, and i-vectors to identify the native language of the given input. In the essay track, a linear SVM system using word bigrams and character 7-grams performed the best. In the speech track, an LDA classifier based only on i-vectors performed better than a combination system using text features from speech transcriptions and i-vectors. In the fusion track, we experimented with systems that used a combination of i-vectors with higher-order n-gram features, a combination of i-vectors with word unigrams, a mean probability ensemble, and a stacked ensemble system. Our finding is that word unigrams in combination with i-vectors achieve a higher score than systems trained with a larger number of n-gram features. Our best-performing systems achieved F1-scores of 87.16%, 83.33%, and 91.75% on the essay track, the speech track, and the fusion track, respectively.
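As a rough illustration of two ideas named in the abstract, the sketch below extracts word bigrams and character 7-grams from a text and combines several systems' outputs by averaging their class probabilities. It is not the authors' implementation: the abstract does not specify tokenization, feature weighting, or how the ensemble was built, so whitespace tokenization, raw counts, and an unweighted mean are assumptions here. All function names, the feature prefixes `w2:` and `c7:`, and the labels `GER` and `TUR` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence


def word_ngrams(text: str, n: int) -> List[str]:
    """Return word n-grams from whitespace-tokenized text (tokenization is an assumption)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    """Return overlapping character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def essay_features(text: str) -> Counter:
    """Count word bigrams and character 7-grams.

    The 'w2:' and 'c7:' prefixes are hypothetical; they only keep the two
    feature families from colliding in one dictionary.
    """
    feats: Counter = Counter()
    feats.update("w2:" + g for g in word_ngrams(text, 2))
    feats.update("c7:" + g for g in char_ngrams(text, 7))
    return feats


def mean_probability_ensemble(predictions: Sequence[Dict[str, float]]) -> str:
    """Average per-class probabilities across systems and return the top label.

    Assumes every system reports a probability for the same set of labels.
    """
    if not predictions:
        raise ValueError("need predictions from at least one system")
    labels = predictions[0].keys()
    mean = {lab: sum(p[lab] for p in predictions) / len(predictions) for lab in labels}
    return max(mean, key=mean.get)
```

In use, the counts returned by `essay_features` would be vectorized and passed to a linear classifier, and `mean_probability_ensemble` would take one probability dictionary per system for a single test item, for example one from a text-based system and one from an i-vector-based system.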
Anthology ID:
W17-5028
Volume:
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications
Month:
September
Year:
2017
Address:
Copenhagen, Denmark
Editors:
Joel Tetreault, Jill Burstein, Claudia Leacock, Helen Yannakoudakis
Venue:
BEA
SIG:
SIGEDU
Publisher:
Association for Computational Linguistics
Pages:
255–260
URL:
https://aclanthology.org/W17-5028
DOI:
10.18653/v1/W17-5028
Cite (ACL):
Taraka Rama and Çağrı Çöltekin. 2017. Fewer features perform well at Native Language Identification task. In Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications, pages 255–260, Copenhagen, Denmark. Association for Computational Linguistics.
Cite (Informal):
Fewer features perform well at Native Language Identification task (Rama & Çöltekin, BEA 2017)
PDF:
https://aclanthology.org/W17-5028.pdf
Data
italki NLI