2024
pdf
bib
abs
Performance Analysis of Speech Encoders for Low-Resource SLU and ASR in Tunisian Dialect
Salima Mdhaffar
|
Haroun Elleuch
|
Fethi Bougares
|
Yannick Estève
Proceedings of The Second Arabic Natural Language Processing Conference
Speech encoders pretrained through self-supervised learning (SSL) have demonstrated remarkable performance in various downstream tasks, including Spoken Language Understanding (SLU) and Automatic Speech Recognition (ASR). For instance, fine-tuning SSL models for such tasks has shown significant potential, leading to improvements in the SOTA performance across challenging datasets.In contrast to existing research, this paper contributes by comparing the effectiveness of SSL approaches in the context of (i) the low-resource Spoken Tunisian Arabic Dialect and (ii) its combination with a low-resource SLU and ASR scenario, where only a few semantic annotations are available for fine-tuning. We conducted experiments using many SSL speech encoders on the TARIC-SLU dataset. We used speech encoders that were pre-trained on either monolingual or multilingual speech data. Some of them have also been refined without in-domain nor Tunisian data through a multimodal supervised teacher-student learning. The study made in this paper yields numerous significant findings that we will discuss in the paper.
pdf
bib
abs
Sonos Voice Control Bias Assessment Dataset: A Methodology for Demographic Bias Assessment in Voice Assistants
Chloe Sekkat
|
Fanny Leroy
|
Salima Mdhaffar
|
Blake Perry Smith
|
Yannick Estève
|
Joseph Dureau
|
Alice Coucke
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Recent works demonstrate that voice assistants do not perform equally well for everyone, but research on demographic robustness of speech technologies is still scarce. This is mainly due to the rarity of large datasets with controlled demographic tags. This paper introduces the Sonos Voice Control Bias Assessment Dataset, an open dataset composed of voice assistant requests for North American English in the music domain (1,038 speakers, 166 hours, 170k audio samples, with 9,040 unique labelled transcripts) with a controlled demographic diversity (gender, age, dialectal region and ethnicity). We also release a statistical demographic bias assessment methodology, at the univariate and multivariate levels, tailored to this specific use case and leveraging spoken language understanding metrics rather than transcription accuracy, which we believe is a better proxy for user experience. To demonstrate the capabilities of this dataset and statistical method to detect demographic bias, we consider a pair of state-of-the-art Automatic Speech Recognition and Spoken Language Understanding models. Results show statistically significant differences in performance across age, dialectal region and ethnicity. Multivariate tests are crucial to shed light on mixed effects between dialectal region, gender and age.
pdf
bib
abs
TARIC-SLU: A Tunisian Benchmark Dataset for Spoken Language Understanding
Salima Mdhaffar
|
Fethi Bougares
|
Renato de Mori
|
Salah Zaiem
|
Mirco Ravanelli
|
Yannick Estève
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
In recent years, there has been a significant increase in interest in developing Spoken Language Understanding (SLU) systems. SLU involves extracting a list of semantic information from the speech signal. A major issue for SLU systems is the lack of sufficient amount of bi-modal (audio and textual semantic annotation) training data. Existing SLU resources are mainly available in high-resource languages such as English, Mandarin and French. However, one of the current challenges concerning low-resourced languages is data collection and annotation. In this work, we present a new freely available corpus, named TARIC-SLU, composed of railway transport conversations in Tunisian dialect that is continuously annotated in dialogue acts and slots. We describe the semantic model of the dataset, the data and experiments conducted to build ASR-based and SLU-based baseline models. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and will be integrated to SpeechBrain, a popular open-source conversational AI toolkit based on PyTorch.
2023
pdf
bib
abs
ON-TRAC Consortium Systems for the IWSLT 2023 Dialectal and Low-resource Speech Translation Tasks
Antoine Laurent
|
Souhir Gahbiche
|
Ha Nguyen
|
Haroun Elleuch
|
Fethi Bougares
|
Antoine Thiol
|
Hugo Riguidel
|
Salima Mdhaffar
|
Gaëlle Laperrière
|
Lucas Maison
|
Sameer Khurana
|
Yannick Estève
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)
This paper describes the ON-TRAC consortium speech translation systems developed for IWSLT 2023 evaluation campaign. Overall, we participated in three speech translation tracks featured in the low-resource and dialect speech translation shared tasks, namely; i) spoken Tamasheq to written French, ii) spoken Pashto to written French, and iii) spoken Tunisian to written English. All our primary submissions are based on the end-to-end speech-to-text neural architecture using a pretrained SAMU-XLSR model as a speech encoder and a mbart model as a decoder. The SAMU-XLSR model is built from the XLS-R 128 in order to generate language agnostic sentence-level embeddings. This building is driven by the LaBSE model trained on multilingual text dataset. This architecture allows us to improve the input speech representations and achieve significant improvements compared to conventional end-to-end speech translation systems.
2022
pdf
bib
abs
The Spoken Language Understanding MEDIA Benchmark Dataset in the Era of Deep Learning: data updates, training and evaluation tools
Gaëlle Laperrière
|
Valentin Pelloin
|
Antoine Caubrière
|
Salima Mdhaffar
|
Nathalie Camelin
|
Sahar Ghannay
|
Bassam Jabaian
|
Yannick Estève
Proceedings of the Thirteenth Language Resources and Evaluation Conference
With the emergence of neural end-to-end approaches for spoken language understanding (SLU), a growing number of studies have been presented during these last three years on this topic. The major part of these works addresses the spoken language understanding domain through a simple task like speech intent detection. In this context, new benchmark datasets have also been produced and shared with the community related to this task. In this paper, we focus on the French MEDIA SLU dataset, distributed since 2005 and used as a benchmark dataset for a large number of research works. This dataset has been shown as being the most challenging one among those accessible to the research community. Distributed by ELRA, this corpus is free for academic research since 2019. Unfortunately, the MEDIA dataset is not really used beyond the French research community. To facilitate its use, a complete recipe, including data preparation, training and evaluation scripts, has been built and integrated to SpeechBrain, an already popular open-source and all-in-one conversational AI toolkit based on PyTorch. This recipe is presented in this paper. In addition, based on the feedback of some researchers who have worked on this dataset for several years, some corrections have been brought to the initial manual annotation: the new version of the data will also be integrated into the ELRA catalogue, as the original one. More, a significant amount of data collected during the construction of the MEDIA corpus in the 2000s was never used until now: we present the first results reached on this subset — also included in the MEDIA SpeechBrain recipe — , that will be used for now as the MEDIA test2. Last, we discuss evaluation issues.
pdf
bib
abs
Impact Analysis of the Use of Speech and Language Models Pretrained by Self-Supersivion for Spoken Language Understanding
Salima Mdhaffar
|
Valentin Pelloin
|
Antoine Caubrière
|
Gaëlle Laperriere
|
Sahar Ghannay
|
Bassam Jabaian
|
Nathalie Camelin
|
Yannick Estève
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Pretrained models through self-supervised learning have been recently introduced for both acoustic and language modeling. Applied to spoken language understanding tasks, these models have shown their great potential by improving the state-of-the-art performances on challenging benchmark datasets. In this paper, we present an error analysis reached by the use of such models on the French MEDIA benchmark dataset, known as being one of the most challenging benchmarks for the slot filling task among all the benchmarks accessible to the entire research community. One year ago, the state-of-art system reached a Concept Error Rate (CER) of 13.6% through the use of a end-to-end neural architecture. Some months later, a cascade approach based on the sequential use of a fine-tuned wav2vec2.0 model and a fine-tuned BERT model reaches a CER of 11.2%. This significant improvement raises questions about the type of errors that remain difficult to treat, but also about those that have been corrected using these models pre-trained through self-supervision learning on a large amount of data. This study brings some answers in order to better understand the limits of such models and open new perspectives to continue improving the performance.
2021
pdf
bib
abs
ON-TRAC’ systems for the IWSLT 2021 low-resource speech translation and multilingual speech translation shared tasks
Hang Le
|
Florentin Barbier
|
Ha Nguyen
|
Natalia Tomashenko
|
Salima Mdhaffar
|
Souhir Gabiche Gahbiche
|
Benjamin Lecouteux
|
Didier Schwab
|
Yannick Estève
Proceedings of the 18th International Conference on Spoken Language Translation (IWSLT 2021)
This paper describes the ON-TRAC Consortium translation systems developed for two challenge tracks featured in the Evaluation Campaign of IWSLT 2021, low-resource speech translation and multilingual speech translation. The ON-TRAC Consortium is composed of researchers from three French academic laboratories and an industrial partner: LIA (Avignon Université), LIG (Université Grenoble Alpes), LIUM (Le Mans Université), and researchers from Airbus. A pipeline approach was explored for the low-resource speech translation task, using a hybrid HMM/TDNN automatic speech recognition system fed by wav2vec features, coupled to an NMT system. For the multilingual speech translation task, we investigated the us of a dual-decoder Transformer that jointly transcribes and translates an input speech. This model was trained in order to translate from multiple source languages to multiple target ones.
2020
pdf
bib
abs
A Multimodal Educational Corpus of Oral Courses: Annotation, Analysis and Case Study
Salima Mdhaffar
|
Yannick Estève
|
Antoine Laurent
|
Nicolas Hernandez
|
Richard Dufour
|
Delphine Charlet
|
Geraldine Damnati
|
Solen Quiniou
|
Nathalie Camelin
Proceedings of the Twelfth Language Resources and Evaluation Conference
This corpus is part of the PASTEL (Performing Automated Speech Transcription for Enhancing Learning) project aiming to explore the potential of synchronous speech transcription and application in specific teaching situations. It includes 10 hours of different lectures, manually transcribed and segmented. The main interest of this corpus lies in its multimodal aspect: in addition to speech, the courses were filmed and the written presentation supports (slides) are made available. The dataset may then serve researches in multiple fields, from speech and language to image and video processing. The dataset will be freely available to the research community. In this paper, we first describe in details the annotation protocol, including a detailed analysis of the manually labeled data. Then, we propose some possible use cases of the corpus with baseline results. The use cases concern scientific fields from both speech and text processing, with language model adaptation, thematic segmentation and transcription to slide alignment.
2019
pdf
bib
abs
Apport de l’adaptation automatique des modèles de langage pour la reconnaissance de la parole: évaluation qualitative extrinsèque dans un contexte de traitement de cours magistraux (Contribution of automatic adaptation of language models for speech recognition : extrinsic qualitative evaluation in a context of educational courses)
Salima Mdhaffar
|
Yannick Estève
|
Nicolas Hernandez
|
Antoine Laurent
|
Solen Quiniou
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts
Malgré les faiblesses connues de cette métrique, les performances de différents systèmes de reconnaissance automatique de la parole sont généralement comparées à l’aide du taux d’erreur sur les mots. Les transcriptions automatiques de ces systèmes sont de plus en plus exploitables et utilisées dans des systèmes complexes de traitement automatique du langage naturel, par exemple pour la traduction automatique, l’indexation, la recherche documentaire... Des études récentes ont proposé des métriques permettant de comparer la qualité des transcriptions automatiques de différents systèmes en fonction de la tâche visée. Dans cette étude nous souhaitons mesurer, qualitativement, l’apport de l’adaptation automatique des modèles de langage au domaine visé par un cours magistral. Les transcriptions du discours de l’enseignant peuvent servir de support à la navigation dans le document vidéo du cours magistral ou permettre l’enrichissement de son contenu pédagogique. C’est à-travers le prisme de ces deux tâches que nous évaluons l’apport de l’adaptation du modèle de langage. Les expériences ont été menées sur un corpus de cours magistraux et montrent combien le taux d’erreur sur les mots est une métrique insuffisante qui masque les apports effectifs de l’adaptation des modèles de langage.
2018
pdf
bib
abs
Le corpus PASTEL pour le traitement automatique de cours magistraux (PASTEL corpus for automatic processing of lectures)
Salima Mdhaffar
|
Antoine Laurent
|
Yannick Estève
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN
Le projet PASTEL étudie l’acceptabilité et l’utilisabilité des transcriptions automatiques dans le cadre d’enseignements magistraux. Il s’agit d’outiller les apprenants pour enrichir de manière synchrone et automatique les informations auxquelles ils peuvent avoir accès durant la séance. Cet enrichissement s’appuie sur des traitements automatiques du langage naturel effectués sur les transcriptions automatiques. Nous présentons dans cet article un travail portant sur l’annotation d’enregistrements de cours magistraux enregistrés dans le cadre du projet CominOpenCourseware. Ces annotations visent à effectuer des expériences de transcription automatique, segmentation thématique, appariement automatique en temps réel avec des ressources externes... Ce corpus comprend plus de neuf heures de parole annotées. Nous présentons également des expériences préliminaires réalisées pour évaluer l’adaptation automatique de notre système de reconnaissance de la parole.