PALI: A Language Identification Benchmark for Perso-Arabic Scripts

Sina Ahmadi, Milind Agarwal, Antonios Anastasopoulos


Abstract
Perso-Arabic scripts are a family of scripts widely used by linguistic communities around the globe. Identifying the languages written in these scripts is crucial to language technologies and challenging in low-resource setups. This paper sheds light on the challenges of detecting languages written in Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address these challenges, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that the classifiers confuse most often. Our experimental results indicate the effectiveness of our solutions.
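The two-stage idea in the abstract (a general classifier, followed by a dedicated model for each cluster of easily confused languages) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the character n-gram naive Bayes classifier, the class names `NgramNB` and `HierarchicalLID`, the placeholder labels `lang_a`, `lang_b`, `lang_c`, the cluster name `ab`, and the synthetic training strings. The abstract does not state which features or classifiers the paper uses.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=3):
    """Character 1..n_max-grams of a space-padded sentence."""
    padded = f" {text} "
    return [padded[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(padded) - n + 1)]


class NgramNB:
    """Multinomial naive Bayes over character n-grams (add-one smoothing)."""

    def fit(self, texts, labels):
        self.counts = defaultdict(Counter)   # label -> n-gram counts
        self.docs = Counter(labels)          # label -> number of sentences
        for text, label in zip(texts, labels):
            self.counts[label].update(char_ngrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        self.totals = {y: sum(c.values()) for y, c in self.counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.docs.values())

        def score(label):
            s = math.log(self.docs[label] / n_docs)
            denom = self.totals[label] + len(self.vocab)
            for g in char_ngrams(text):
                s += math.log((self.counts[label][g] + 1) / denom)
            return s

        return max(self.docs, key=score)


class HierarchicalLID:
    """Root model predicts a cluster or a language; leaf models split clusters."""

    def __init__(self, clusters):
        # clusters: hypothetical mapping, cluster name -> set of language labels
        self.lang2cluster = {l: c for c, langs in clusters.items() for l in langs}

    def fit(self, texts, labels):
        # Stage 1: languages inside a cluster are merged into one coarse label.
        coarse = [self.lang2cluster.get(y, y) for y in labels]
        self.root = NgramNB().fit(texts, coarse)
        # Stage 2: one classifier per cluster, trained only on its members.
        self.leaf = {}
        for cluster in set(self.lang2cluster.values()):
            pairs = [(t, y) for t, y in zip(texts, labels)
                     if self.lang2cluster.get(y) == cluster]
            if pairs:
                self.leaf[cluster] = NgramNB().fit(*zip(*pairs))
        return self

    def predict(self, text):
        top = self.root.predict(text)
        return self.leaf[top].predict(text) if top in self.leaf else top


# Synthetic placeholder data: lang_a and lang_b share an alphabet and are
# treated as one confusable cluster; lang_c is easy to separate.
train = [
    ("aaab aaba", "lang_a"), ("abaa aaab", "lang_a"),
    ("bbba bbab", "lang_b"), ("babb abbb", "lang_b"),
    ("xyzz zyxx", "lang_c"), ("zxyx yzzx", "lang_c"),
]
texts, labels = zip(*train)
model = HierarchicalLID({"ab": {"lang_a", "lang_b"}}).fit(texts, labels)
print(model.predict("aaaa aaaa"), model.predict("zzxy yxz"))
```

In practice the clusters would be derived from the confusion matrix of the flat classifier on held-out data; in this sketch they are supplied by hand.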
Anthology ID:
2023.vardial-1.8
Volume:
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)
Month:
May
Year:
2023
Address:
Dubrovnik, Croatia
Editors:
Yves Scherrer, Tommi Jauhiainen, Nikola Ljubešić, Preslav Nakov, Jörg Tiedemann, Marcos Zampieri
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
78–90
URL:
https://aclanthology.org/2023.vardial-1.8
DOI:
10.18653/v1/2023.vardial-1.8
Cite (ACL):
Sina Ahmadi, Milind Agarwal, and Antonios Anastasopoulos. 2023. PALI: A Language Identification Benchmark for Perso-Arabic Scripts. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 78–90, Dubrovnik, Croatia. Association for Computational Linguistics.
Cite (Informal):
PALI: A Language Identification Benchmark for Perso-Arabic Scripts (Ahmadi et al., VarDial 2023)
PDF:
https://aclanthology.org/2023.vardial-1.8.pdf
Video:
https://aclanthology.org/2023.vardial-1.8.mp4