Speech Resources in the Tamasheq Language

Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, Yannick Estève


Abstract
In this paper, we present two datasets for Tamasheq, a developing language mainly spoken in Mali and Niger. These two datasets were made available for the IWSLT 2022 low-resource speech translation track, and they consist of collections of radio recordings from daily broadcast news in Niger (Studio Kalangou) and Mali (Studio Tamani). We share (i) a massive amount of unlabeled audio data (671 hours) in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma, and (ii) a smaller 17-hour parallel corpus of audio recordings in Tamasheq, with utterance-level translations in French. All of this data is shared under the Creative Commons BY-NC-ND 3.0 license. We hope these resources will inspire the speech community to develop and benchmark models using the Tamasheq language.
Anthology ID:
2022.lrec-1.222
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2066–2071
URL:
https://aclanthology.org/2022.lrec-1.222
Cite (ACL):
Marcely Zanon Boito, Fethi Bougares, Florentin Barbier, Souhir Gahbiche, Loïc Barrault, Mickael Rouvier, and Yannick Estève. 2022. Speech Resources in the Tamasheq Language. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2066–2071, Marseille, France. European Language Resources Association.
Cite (Informal):
Speech Resources in the Tamasheq Language (Zanon Boito et al., LREC 2022)
PDF:
https://aclanthology.org/2022.lrec-1.222.pdf
Code:
mzboito/iwslt2022_tamasheq_data