Investigating Linguistic Features for Arabic NLI

Yasmeen Bassas, Sandra Kübler


Abstract
Native Language Identification (NLI) is concerned with predicting the native language of an author writing in a second language. We investigate NLI for Arabic, with a focus on the types of linguistic information used, given that Arabic is morphologically rich. We use the Arabic Learner Corpus (ALC) for training and testing along with a linear SVM. We explore lexical, morpho-syntactic, and syntactic features. Results show that the best single type of information is character n-grams of length 2 to 6. Using this model, we achieve an accuracy of 61.84%, thus outperforming previous results (Ionesco, 2015) by 11.74 percentage points even though we use two additional L1s. However, when using prefix and suffix sequences, we reach an accuracy of 53.95%, showing that an approximation of unlexicalized features still reaches solid results.
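To make the best-performing feature type concrete, the following is a minimal sketch of character n-gram extraction for lengths 2 to 6. It is an illustration only, not the authors' implementation: the function name `char_ngrams`, the choice to count raw code points (including spaces), and the sample word are all assumptions, and the abstract does not describe the actual preprocessing.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 6) -> Counter:
    """Count every character n-gram of length n_min..n_max in `text`.

    Hypothetical helper: n-grams are taken over the raw string, so
    spaces and punctuation are treated like any other character.
    """
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        # A string shorter than n yields an empty range, so no n-grams.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


# Illustrative input only (the Arabic word for "book"), not corpus data.
features = char_ngrams("كتاب")
```

Counts like these would typically be turned into a sparse vector per text and passed to a linear classifier such as the linear SVM named in the abstract; that step is omitted here because the abstract gives no details about weighting or normalization.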
Anthology ID:
2024.arabicnlp-1.17
Volume:
Proceedings of The Second Arabic Natural Language Processing Conference
Month:
August
Year:
2024
Address:
Bangkok, Thailand
Editors:
Nizar Habash, Houda Bouamor, Ramy Eskander, Nadi Tomeh, Ibrahim Abu Farha, Ahmed Abdelali, Samia Touileb, Injy Hamed, Yaser Onaizan, Bashar Alhafni, Wissam Antoun, Salam Khalifa, Hatem Haddad, Imed Zitouni, Badr AlKhamissi, Rawan Almatham, Khalil Mrini
Venues:
ArabicNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
183–192
URL:
https://aclanthology.org/2024.arabicnlp-1.17
DOI:
10.18653/v1/2024.arabicnlp-1.17
Cite (ACL):
Yasmeen Bassas and Sandra Kübler. 2024. Investigating Linguistic Features for Arabic NLI. In Proceedings of The Second Arabic Natural Language Processing Conference, pages 183–192, Bangkok, Thailand. Association for Computational Linguistics.
Cite (Informal):
Investigating Linguistic Features for Arabic NLI (Bassas & Kübler, ArabicNLP-WS 2024)
PDF:
https://aclanthology.org/2024.arabicnlp-1.17.pdf