@inproceedings{vanterpool-aharodnik-2025-handcrafted,
    title = "From Handcrafted Features to {LLM}s: A Comparative Study in Native Language Identification",
    author = "Vanterpool, Aliyah C.  and
      Aharodnik, Katsiaryna",
    editor = "Picazo-Izquierdo, Alicia  and
      Estevanell-Valladares, Ernesto Luis  and
      Mitkov, Ruslan  and
      Guillena, Rafael Mu{\~n}oz  and
      Cerd{\'a}, Ra{\'u}l Garc{\'i}a",
    booktitle = "Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models",
    month = sep,
    year = "2025",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd., Shoumen, Bulgaria",
    url = "https://aclanthology.org/2025.r2lm-1.15/",
    pages = "144--153",
abstract = "This study compares a traditional machine learning feature-engineering approach to a large language models (LLMs) fine-tuning method for Native Language Identification (NLI). We explored the COREFL corpus, which consists of L2 English narratives produced by Spanish and German L1 speakers with lower-advanced English proficiency (C1) (Lozano et al., 2020). For the feature-engineering approach, we extracted language productivity, linguistic diversity, and n-gram features for Support Vector Machine (SVM) classification. We also looked at sentence embeddings with SVM and logistic regression. For the LLM approach, we evaluated BERT-like models and GPT-4. The feature-engineering approach, particularly n-grams, outperformed the LLMs. Sentence-BERT embeddings with SVM achieved the second-highest accuracy (93{\%}), while GPT-4 reached an average accuracy of 90.4{\%} across three runs when prompted with labels. These findings suggest that feature engineering remains a robust method for NLI, especially for smaller datasets with subtle linguistic differences between classes. This study contributes to the comparative analysis of traditional machine learning and transformer-based LLMs, highlighting current LLM limitations in handling domain-specific data and their need for larger training resources."
}
<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="vanterpool-aharodnik-2025-handcrafted">
    <titleInfo>
        <title>From Handcrafted Features to LLMs: A Comparative Study in Native Language Identification</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Aliyah</namePart>
        <namePart type="given">C</namePart>
        <namePart type="family">Vanterpool</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Katsiaryna</namePart>
        <namePart type="family">Aharodnik</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2025-09</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <relatedItem type="host">
        <titleInfo>
            <title>Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models</title>
        </titleInfo>
        <name type="personal">
            <namePart type="given">Alicia</namePart>
            <namePart type="family">Picazo-Izquierdo</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ernesto</namePart>
            <namePart type="given">Luis</namePart>
            <namePart type="family">Estevanell-Valladares</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Ruslan</namePart>
            <namePart type="family">Mitkov</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Rafael</namePart>
            <namePart type="given">Muñoz</namePart>
            <namePart type="family">Guillena</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <name type="personal">
            <namePart type="given">Raúl</namePart>
            <namePart type="given">García</namePart>
            <namePart type="family">Cerdá</namePart>
            <role>
                <roleTerm authority="marcrelator" type="text">editor</roleTerm>
            </role>
        </name>
        <originInfo>
            <publisher>INCOMA Ltd., Shoumen, Bulgaria</publisher>
            <place>
                <placeTerm type="text">Varna, Bulgaria</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">conference publication</genre>
    </relatedItem>
    <abstract>This study compares a traditional machine learning feature-engineering approach to a large language model (LLM) fine-tuning method for Native Language Identification (NLI). We explored the COREFL corpus, which consists of L2 English narratives produced by Spanish and German L1 speakers with lower-advanced English proficiency (C1) (Lozano et al., 2020). For the feature-engineering approach, we extracted language productivity, linguistic diversity, and n-gram features for Support Vector Machine (SVM) classification. We also examined sentence embeddings with SVM and logistic regression. For the LLM approach, we evaluated BERT-like models and GPT-4. The feature-engineering approach, particularly n-grams, outperformed the LLMs. Sentence-BERT embeddings with SVM achieved the second-highest accuracy (93%), while GPT-4 reached an average accuracy of 90.4% across three runs when prompted with labels. These findings suggest that feature engineering remains a robust method for NLI, especially for smaller datasets with subtle linguistic differences between classes. This study contributes to the comparative analysis of traditional machine learning and transformer-based LLMs, highlighting current LLM limitations in handling domain-specific data and their need for larger training resources.</abstract>
<identifier type="citekey">vanterpool-aharodnik-2025-handcrafted</identifier>
<location>
<url>https://aclanthology.org/2025.r2lm-1.15/</url>
</location>
<part>
<date>2025-09</date>
<extent unit="page">
<start>144</start>
<end>153</end>
</extent>
</part>
</mods>
</modsCollection>
%0 Conference Proceedings
%T From Handcrafted Features to LLMs: A Comparative Study in Native Language Identification
%A Vanterpool, Aliyah C.
%A Aharodnik, Katsiaryna
%Y Picazo-Izquierdo, Alicia
%Y Estevanell-Valladares, Ernesto Luis
%Y Mitkov, Ruslan
%Y Guillena, Rafael Muñoz
%Y Cerdá, Raúl García
%S Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models
%D 2025
%8 September
%I INCOMA Ltd., Shoumen, Bulgaria
%C Varna, Bulgaria
%F vanterpool-aharodnik-2025-handcrafted
%X This study compares a traditional machine learning feature-engineering approach to a large language model (LLM) fine-tuning method for Native Language Identification (NLI). We explored the COREFL corpus, which consists of L2 English narratives produced by Spanish and German L1 speakers with lower-advanced English proficiency (C1) (Lozano et al., 2020). For the feature-engineering approach, we extracted language productivity, linguistic diversity, and n-gram features for Support Vector Machine (SVM) classification. We also examined sentence embeddings with SVM and logistic regression. For the LLM approach, we evaluated BERT-like models and GPT-4. The feature-engineering approach, particularly n-grams, outperformed the LLMs. Sentence-BERT embeddings with SVM achieved the second-highest accuracy (93%), while GPT-4 reached an average accuracy of 90.4% across three runs when prompted with labels. These findings suggest that feature engineering remains a robust method for NLI, especially for smaller datasets with subtle linguistic differences between classes. This study contributes to the comparative analysis of traditional machine learning and transformer-based LLMs, highlighting current LLM limitations in handling domain-specific data and their need for larger training resources.
%U https://aclanthology.org/2025.r2lm-1.15/
%P 144-153
Markdown (Informal)
[From Handcrafted Features to LLMs: A Comparative Study in Native Language Identification](https://aclanthology.org/2025.r2lm-1.15/) (Vanterpool & Aharodnik, R2LM 2025)
ACL
Aliyah C. Vanterpool and Katsiaryna Aharodnik. 2025. From Handcrafted Features to LLMs: A Comparative Study in Native Language Identification. In Proceedings of the First Workshop on Comparative Performance Evaluation: From Rules to Language Models, pages 144–153, Varna, Bulgaria. INCOMA Ltd., Shoumen, Bulgaria.