NLP for Arbëresh: How an Endangered Language Learns to Write in the 21st Century

Giulio Cusenza, Çağrı Çöltekin


Abstract
Societies are becoming more and more connected, and minority languages often find themselves helpless against the advent of the digital age, with their speakers having to regularly turn to other languages for written communication. This work introduces the case of Arbëresh, a southern Italian language related to Albanian. It presents the very first machine-readable Arbëresh data, collected through a web campaign, and describes a set of tools developed to enable the Arbëresh people to learn how to write their language, including a spellchecker, a conjugator, a numeral generator, and an interactive platform to learn Arbëresh spelling. A comprehensive web application was set up to make these tools available to the public, as well as to collect further data through them. This method can be replicated to help revive other minority languages in a situation similar to Arbëresh’s. The main challenges of the process were the extremely low-resource setting and the variability of Arbëresh dialects.
Anthology ID:
2024.sigul-1.30
Volume:
Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Maite Melero, Sakriani Sakti, Claudia Soria
Venues:
SIGUL | WS
Publisher:
ELRA and ICCL
Pages:
252–256
URL:
https://aclanthology.org/2024.sigul-1.30
Cite (ACL):
Giulio Cusenza and Çağrı Çöltekin. 2024. NLP for Arbëresh: How an Endangered Language Learns to Write in the 21st Century. In Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under-resourced Languages @ LREC-COLING 2024, pages 252–256, Torino, Italia. ELRA and ICCL.
Cite (Informal):
NLP for Arbëresh: How an Endangered Language Learns to Write in the 21st Century (Cusenza & Çöltekin, SIGUL-WS 2024)
PDF:
https://aclanthology.org/2024.sigul-1.30.pdf