BigNLI: Native Language Identification with Big Bird Embeddings

Sergey Kramp, Giovanni Cassani, Chris Emmery


Abstract
Native Language Identification (NLI) aims to classify an author’s native language based on their writing in another language. Historically, the task has relied heavily on time-consuming linguistic feature engineering, and NLI transformer models have thus far failed to offer effective, practical alternatives. The current work shows that input size is a limiting factor, and that classifiers trained using Big Bird embeddings outperform linguistic feature engineering models (for which we reproduce previous work) by a large margin on the Reddit-L2 dataset. Additionally, we provide further insight into input length dependencies, show consistent out-of-sample (Europe subreddit) and out-of-domain (TOEFL-11) performance, and qualitatively analyze the embedding space. Given the effectiveness and computational efficiency of this method, we believe it offers a promising avenue for future NLI work.
Anthology ID:
2024.lrec-main.212
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
2375–2382
URL:
https://aclanthology.org/2024.lrec-main.212
Cite (ACL):
Sergey Kramp, Giovanni Cassani, and Chris Emmery. 2024. BigNLI: Native Language Identification with Big Bird Embeddings. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2375–2382, Torino, Italia. ELRA and ICCL.
Cite (Informal):
BigNLI: Native Language Identification with Big Bird Embeddings (Kramp et al., LREC-COLING 2024)
PDF:
https://aclanthology.org/2024.lrec-main.212.pdf