TuniFra: A Tunisian Arabic Speech Corpus with Orthographic Transcriptions and French Translations

Alex Choux, Marko Avila, Josep Crego, Fethi Bougares, Antoine Laurent


Abstract
We introduce TuniFra, a corpus developed to advance research in Automatic Speech Recognition (ASR) and Speech-to-Text Translation (STT) for Tunisian Arabic, a low-resource language variety. The corpus comprises 15 hours of native Tunisian Arabic speech, carefully transcribed and manually translated into French. While ASR and STT systems for major languages are supported by extensive datasets, low-resource varieties such as Tunisian Arabic are held back by limited training data, particularly for speech technologies. TuniFra addresses this gap by providing a resource suited to both ASR and STT tasks in the Tunisian dialect. We describe our methodology for data collection, transcription, and annotation, and present initial baseline results for both Tunisian Arabic speech recognition and Tunisian Arabic–French speech translation.
Anthology ID: 2025.arabicnlp-main.5
Volume: Proceedings of The Third Arabic Natural Language Processing Conference
Month: November
Year: 2025
Address: Suzhou, China
Editors: Kareem Darwish, Ahmed Ali, Ibrahim Abu Farha, Samia Touileb, Imed Zitouni, Ahmed Abdelali, Sharefah Al-Ghamdi, Sakhar Alkhereyf, Wajdi Zaghouani, Salam Khalifa, Badr AlKhamissi, Rawan Almatham, Injy Hamed, Zaid Alyafeai, Areeb Alowisheq, Go Inoue, Khalil Mrini, Waad Alshammari
Venue: ArabicNLP
Publisher: Association for Computational Linguistics
Pages: 64–68
URL: https://aclanthology.org/2025.arabicnlp-main.5/
Cite (ACL): Alex Choux, Marko Avila, Josep Crego, Fethi Bougares, and Antoine Laurent. 2025. TuniFra: A Tunisian Arabic Speech Corpus with Orthographic Transcriptions and French Translations. In Proceedings of The Third Arabic Natural Language Processing Conference, pages 64–68, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): TuniFra: A Tunisian Arabic Speech Corpus with Orthographic Transcriptions and French Translations (Choux et al., ArabicNLP 2025)
PDF: https://aclanthology.org/2025.arabicnlp-main.5.pdf