Native-Language Identification with Attention

Stian Steinbakken, Björn Gambäck


Abstract
The paper explores how an attention-based approach can improve performance on the task of native-language identification (NLI), i.e., identifying an author's first language from text written in a second language. Previously, Support Vector Machines had consistently outperformed deep learning-based methods on the TOEFL11 data set, the de facto standard for evaluating NLI systems. The attention-based system BERT (Bidirectional Encoder Representations from Transformers) was first tested in isolation on the TOEFL11 data set, and then used in a meta-classifier stack together with traditional techniques, reaching an accuracy of 0.853. Since more labelled NLI data has become available, BERT was also trained on the much larger Reddit-L2 data set, which contains 50 times as many examples as previously used for English NLI, giving an accuracy of 0.902 in the Reddit-L2 in-domain test scenario and improving the state of the art by 21.2 percentage points.
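
The central step described in the abstract, fine-tuning BERT as an 11-way sequence classifier over the TOEFL11 first-language labels, can be sketched roughly as follows. This is a minimal illustration assuming the Hugging Face transformers and PyTorch APIs; the model name, hyperparameters, and data handling are assumptions for the sketch, not the authors' exact configuration.

import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# TOEFL11 distinguishes 11 first languages (Arabic, Chinese, ..., Turkish).
NUM_L1_CLASSES = 11

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-cased", num_labels=NUM_L1_CLASSES
)

def fine_tune_step(texts, labels, optimizer):
    """One gradient step on a batch of essays and their L1 label indices."""
    enc = tokenizer(
        texts, truncation=True, padding=True, max_length=512, return_tensors="pt"
    )
    out = model(**enc, labels=torch.tensor(labels))
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

optimizer = AdamW(model.parameters(), lr=2e-5)
# Example batch: an English-learner essay paired with its author's L1 index.
loss = fine_tune_step(["I am agree with the statement ..."], [3], optimizer)

In the meta-classifier setup mentioned above, the per-class probabilities from such a fine-tuned model would be combined with the outputs of traditional systems (e.g. an SVM over character n-grams) and passed to a second-level classifier; that stacking step is not shown in this sketch.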
Anthology ID:
2020.icon-main.35
Volume:
Proceedings of the 17th International Conference on Natural Language Processing (ICON)
Month:
December
Year:
2020
Address:
Indian Institute of Technology Patna, Patna, India
Editors:
Pushpak Bhattacharyya, Dipti Misra Sharma, Rajeev Sangal
Venue:
ICON
Publisher:
NLP Association of India (NLPAI)
Pages:
261–271
URL:
https://aclanthology.org/2020.icon-main.35
Cite (ACL):
Stian Steinbakken and Björn Gambäck. 2020. Native-Language Identification with Attention. In Proceedings of the 17th International Conference on Natural Language Processing (ICON), pages 261–271, Indian Institute of Technology Patna, Patna, India. NLP Association of India (NLPAI).
Cite (Informal):
Native-Language Identification with Attention (Steinbakken & Gambäck, ICON 2020)
PDF:
https://aclanthology.org/2020.icon-main.35.pdf