Cross-corpus Native Language Identification via Statistical Embedding

Francisco Rangel, Paolo Rosso, Julian Brooke, Alexandra Uitdenbogerd


Abstract
In this paper, we approach the task of native language identification in a realistic cross-corpus scenario, where a model is trained on the available data and has to predict the native language from data of a different corpus. The motivation behind this study is to investigate native language identification in the Australian academic scenario, where a majority of students come from China, Indonesia, and Arabic-speaking nations. We propose a statistical embedding representation that reports a significant improvement over common single-layer state-of-the-art approaches when identifying Chinese, Arabic, and Indonesian in a cross-corpus scenario. The proposed approach is shown to be competitive even when the data is scarce and imbalanced.
Anthology ID:
W18-1605
Volume:
Proceedings of the Second Workshop on Stylistic Variation
Month:
June
Year:
2018
Address:
New Orleans
Venues:
NAACL | Style-Var | WS
Publisher:
Association for Computational Linguistics
Pages:
39–43
URL:
https://aclanthology.org/W18-1605
DOI:
10.18653/v1/W18-1605
Cite (ACL):
Francisco Rangel, Paolo Rosso, Julian Brooke, and Alexandra Uitdenbogerd. 2018. Cross-corpus Native Language Identification via Statistical Embedding. In Proceedings of the Second Workshop on Stylistic Variation, pages 39–43, New Orleans. Association for Computational Linguistics.
Cite (Informal):
Cross-corpus Native Language Identification via Statistical Embedding (Rangel et al., 2018)
PDF:
https://aclanthology.org/W18-1605.pdf