Alex Choux
2025
TuniFra: A Tunisian Arabic Speech Corpus with Orthographic Transcriptions and French Translations
Alex Choux, Marko Avila, Josep Crego, Fethi Bougares, Antoine Laurent
Proceedings of The Third Arabic Natural Language Processing Conference
We introduce TuniFra, a novel and comprehensive corpus developed to advance research in Automatic Speech Recognition (ASR) and Speech-to-Text Translation (STT) for Tunisian Arabic, a notably low-resourced language variety. The TuniFra corpus comprises 15 hours of native Tunisian Arabic speech, carefully transcribed and manually translated into French. While the development of ASR and STT systems for major languages is supported by extensive datasets, low-resource languages such as Tunisian Arabic face significant challenges due to limited training data, particularly for speech technologies. TuniFra addresses this gap by offering a valuable resource tailored for both ASR and STT tasks in the Tunisian dialect. We describe our methodology for data collection, transcription, and annotation, and present initial baseline results for both Tunisian Arabic speech recognition and Tunisian Arabic–French speech translation.