2024
Audiocite.net : un grand corpus d’enregistrements vocaux de lecture en français (Audiocite.net: A Large Corpus of Read Speech Recordings in French)
Soline Felice
|
Solène Evain
|
Solange Rossato
|
François Portet
Actes des 35èmes Journées d'Études sur la Parole
The arrival of self-supervised learning (SSL) in speech processing has made it possible to use large unlabeled corpora to obtain pre-trained models that serve as speech signal encoders for many tasks. However, applying these SSL methods to languages such as French has proved difficult because of the limited amount of publicly accessible French speech corpora. With this aim, we present the Audiocite.net corpus, which comprises 6,682 hours of read speech recorded by 130 male and female speakers. The corpus is built from audiobooks from the audiocite.net website. In addition to describing the creation process and the resulting statistics, we also show the impact of this corpus on the models of the LeBenchmark project in their 14k versions for speech processing tasks.
A Corpus of Spontaneous L2 English Speech for Real-situation Speaking Assessment
Sylvain Coulange
|
Marie-Hélène Fries
|
Monica Masperi
|
Solange Rossato
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
When assessing second language (L2) proficiency, evaluation of spontaneous speech performance is crucial. This paper presents a corpus of spontaneous L2 English speech, focusing on the speech performance of B1 and B2 proficiency speakers. Two hundred and sixty university students were recorded during a speaking task as part of a French national certificate in English. This task entailed a 10-minute role-play among 2 or 3 candidates, arguing about a controversial topic in order to reach a negotiated compromise. Each student’s performance was evaluated by two experts, who categorized it as B2, B1 or below B1 speaking proficiency. Automatic diarization, transcription, and alignment at the word level were performed on the recorded conversations, in order to analyse lexical stress realisation in polysyllabic plain words of B1 and B2 proficiency students. Results showed that only 35.4% of the 6,350 targeted words had stress detected on the expected syllable, revealing a common stress shift to the final syllable. Besides substantial inter-speaker variability (0% to 68.4%), B2 speakers demonstrated a slightly higher stress accuracy (36%) compared to B1 speakers (29.6%). Those with accurate stress placement used F0 and intensity to mark syllable prominence, while speakers with lower accuracy tended to lengthen words on their last syllables, with minimal changes in other dimensions.
Audiocite.net : A Large Spoken Read Dataset in French
Soline Felice
|
Solene Virginie Evain
|
Solange Rossato
|
François Portet
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
The advent of self-supervised learning (SSL) in speech processing has allowed the use of large unlabeled datasets to learn pre-trained models, serving as powerful encoders for various downstream tasks. However, the application of these SSL methods to languages such as French has proved difficult due to the scarcity of large French speech datasets. To advance the emergence of pre-trained models for French speech, we present the Audiocite.net corpus, composed of 6,682 hours of recordings from 130 readers. The corpus is built from audiobooks shared by these readers on the audiocite.net website. In addition to describing the creation process and final statistics, we also show how this dataset impacted the models of the LeBenchmark project in its 14k version for speech processing downstream tasks.
Unraveling Spontaneous Speech Dimensions for Cross-Corpus ASR System Evaluation for French
Solene Virginie Evain
|
Solange Rossato
|
François Portet
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Many papers on speech processing use the term ‘spontaneous speech’ as a catch-all term for situations like speaking with a friend, being interviewed on radio/TV or giving a lecture. However, the performance of Automatic Speech Recognition (ASR) systems seems to vary on this type of speech: the more spontaneous the speech, the higher the WER (Word Error Rate). Our study focuses on better understanding the elements influencing the levels of spontaneity, in order to evaluate the relation between categories of spontaneity and ASR system performance and to improve recognition on those categories. We first analyzed the literature, listed and unraveled those elements, and finally identified four axes: the situation of communication, the level of intimacy between speakers, the channel and the type of communication. Then, we trained ASR systems and measured the impact on WER of instances of face-to-face interaction labeled with the previous dimensions (different levels of spontaneity). We varied two axes and found that both dimensions have an impact on the WER. The situation of communication seems to have the biggest impact on spontaneity: ASR systems give better results for situations like an interview than for friends having a conversation at home.
2022
Fine-tuning pre-trained models for Automatic Speech Recognition, experiments on a fieldwork corpus of Japhug (Trans-Himalayan family)
Séverine Guillaume
|
Guillaume Wisniewski
|
Cécile Macaire
|
Guillaume Jacques
|
Alexis Michaud
|
Benjamin Galliot
|
Maximin Coavoux
|
Solange Rossato
|
Minh-Châu Nguyên
|
Maxime Fily
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
This is a report on results obtained in the development of speech recognition tools intended to support linguistic documentation efforts. The test case is an extensive fieldwork corpus of Japhug, an endangered language of the Trans-Himalayan (Sino-Tibetan) family. The goal is to reduce the transcription workload of field linguists. The method used is a deep learning approach based on the language-specific tuning of a generic pre-trained representation model, XLS-R, using a Transformer architecture. We note difficulties in implementation, in terms of learning stability. But this approach brings significant improvements nonetheless. The quality of phonemic transcription is improved over earlier experiments; and most significantly, the new approach allows for reaching the stage of automatic word recognition. Subjective evaluation of the tool by the author of the training data confirms the usefulness of this approach.
2021
Investigating the Impact of Gender Representation in ASR Training Data: a Case Study on Librispeech
Mahault Garnerin
|
Solange Rossato
|
Laurent Besacier
Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing
In this paper we question the impact of gender representation in training data on the performance of an end-to-end ASR system. We create an experiment based on the Librispeech corpus and build 3 different training corpora varying only the proportion of data produced by each gender category. We observe that although our system is overall robust to gender balance or imbalance in the training data, it is nonetheless dependent on the adequacy between the individuals present in the training and testing sets.
2020
Rhythmic Proximity Between Natives And Learners Of French - Evaluation of a metric based on the CEFC corpus
Sylvain Coulange
|
Solange Rossato
Proceedings of the Twelfth Language Resources and Evaluation Conference
This work aims to better understand the role of rhythm in foreign accent, and its modelling. We built a model of rhythm in French that takes its variability into account, using the Corpus pour l’Étude du Français Contemporain (CEFC), which contains up to 300 hours of speech covering a wide variety of speaker profiles and situations. Sixteen parameters were computed, each of them based on segment duration, such as voicing and intersyllabic timing. All the parameters are detected fully automatically from the signal, without ASR or transcription. A Gaussian mixture model was trained on 1,340 native speakers of French; any speech sample of at least 30 seconds can be scored to obtain the probability that it belongs to this model. We tested it with 146 native test speakers (NS), 37 non-native speakers (NNS) from the same corpus, and 29 non-native Japanese learners of French (JpNNS) from an independent corpus. NNS showed only a tendency towards lower log-likelihood than NS (p=.067), possibly due to the heterogeneous French proficiency of these speakers; the difference was much clearer for JpNNS (p<.0001), all of whom were at A2 level. An eta-squared test showed that the most efficient parameters were intersyllabic mean duration and variation coefficient, along with speech rate, for NNS; and speech rate and phonation ratio for JpNNS.
Gender Representation in Open Source Speech Resources
Mahault Garnerin
|
Solange Rossato
|
Laurent Besacier
Proceedings of the Twelfth Language Resources and Evaluation Conference
With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the question of ethics, transparency and fairness of AI systems has become a central concern within the research community. We address transparency and fairness in spoken language systems by proposing a study of gender representation in speech resources available through the Open Speech and Language Resource platform. We show that finding gender information in open source corpora is not straightforward and that gender balance depends on other corpus characteristics (elicited/non elicited speech, low/high resource language, speech task targeted). The paper ends with recommendations about metadata and gender information for researchers in order to ensure better transparency of the speech systems built using such corpora.
Proximité rythmique entre apprenants et natifs du français Évaluation d’une métrique basée sur le CEFC (Rhythmic Proximity Between Natives And Learners Of French – Evaluation of a metric based on the CEFC corpus)
Sylvain Coulange
|
Solange Rossato
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole
This study aims to propose a quantification of foreign accent based on rhythmic measures. We used the Corpus pour l’Étude du Français Contemporain, which offers more than 300 hours of speech covering varied speaker profiles and situations. We focused on 16 temporal parameters estimated from voicing and syllable durations. A Gaussian mixture model was trained on data from 1,340 native speakers of French, then tested on excerpts from 146 randomly selected native speakers (NS), on those of the 37 non-native speakers present in the corpus (NNS), and on recordings of 29 Japanese learners at A2 level from another corpus. The probability that the NNS have a lower log-likelihood than the NS does not go beyond a tendency (p = 0.067), but that for the Japanese learners is much more significant (p < 0.0001). The study of how the parameters are distributed across the groups highlights the importance of speech rate and voicing durations.
Représentation du genre dans des données open source de parole (Gender representation in open source speech resources)
Mahault Garnerin
|
Solange Rossato
|
Laurent Besacier
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 1 : Journées d'Études sur la Parole
With the rise of artificial intelligence (AI) and the growing use of deep-learning architectures, the question of ethics and transparency in AI systems has become a central concern within the research community. In this paper, we propose a study of gender representation in the speech resources available on the Open Speech and Language Resource platform. A very first result is the difficulty of accessing information about speaker gender. We then show that the balance between gender categories depends on various corpus characteristics (elicited or non-elicited speech, task addressed). Building on previous work, we restate some principles concerning metadata with a view to ensuring better transparency of the speech systems built using these corpora.
Pratiques d’évaluation en ASR et biais de performance (Evaluation methodology in ASR and performance bias)
Mahault Garnerin
|
Solange Rossato
|
Laurent Besacier
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). 2e atelier Éthique et TRaitemeNt Automatique des Langues (ETeRNAL)
We offer a reflection on evaluation practices for automatic speech recognition (ASR) systems. After defining the notion of discrimination from a legal standpoint and the notion of fairness in artificial intelligence systems, we look at current practices in major evaluation campaigns. We observe that speech variability, and more particularly that of the individual, is not taken into account in current evaluation protocols, which makes it impossible to study potential biases in these systems.
2016
Acquisition et reconnaissance automatique d’expressions et d’appels vocaux dans un habitat. (Acquisition and recognition of expressions and vocal calls in a smart home)
Michel Vacher
|
Benjamin Lecouteux
|
Frédéric Aman
|
François Portet
|
Solange Rossato
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP
This paper presents a system able to recognize calls for help from elderly people living at home in order to provide them with assistance. The system uses Automatic Speech Recognition (ASR) technology that must work under distant-speech conditions and with expressive speech. To preserve privacy, the system runs locally and recognizes only predefined sentences. The system was evaluated by 17 participants acting out scenarios that included falls in a Living lab reproducing a living room. The detection error rate obtained, 29%, is encouraging and underlines the challenges to be overcome for this task.
FABIOLE, a Speech Database for Forensic Speaker Comparison
Moez Ajili
|
Jean-François Bonastre
|
Juliette Kahn
|
Solange Rossato
|
Guillaume Bernard
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
A speech database has been collected to highlight the importance of the “speaker factor” in forensic voice comparison. FABIOLE was created during the FABIOLE project, funded by the French Research Agency (ANR) from 2013 to 2016. The corpus consists of more than 3,000 excerpts spoken by 130 native French male speakers. The speakers are divided into two categories: 30 target speakers, each with 100 excerpts, and 100 “impostors”, each with only one excerpt. The data were collected from 10 different French radio and television shows; each speech turn has a minimum duration of 30 s and good speech quality. The data set is mainly used for investigating the speaker factor in forensic voice comparison and for interpreting some unsolved issues, such as the relationship between speaker characteristics and system behavior. In this paper, we present the FABIOLE database. Then, preliminary experiments are performed to evaluate the effect of the “speaker factor” and of the show on the behavior of a voice comparison system.
The CIRDO Corpus: Comprehensive Audio/Video Database of Domestic Falls of Elderly People
Michel Vacher
|
Saïda Bouakaz
|
Marc-Eric Bobillier Chaumon
|
Frédéric Aman
|
R. A. Khan
|
Slima Bekkadja
|
François Portet
|
Erwan Guillou
|
Solange Rossato
|
Benjamin Lecouteux
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Ambient Assisted Living aims at enhancing the quality of life of older and disabled people at home thanks to Smart Homes. In particular, for elderly people living alone at home, the detection of distress situations after a fall is very important to reassure this population. However, many studies do not include tests in real settings, because data collection in this domain is very expensive and challenging and because few data sets are available. The CIRDO corpus is a dataset recorded in realistic conditions in DOMUS, a fully equipped Smart Home with microphones and home automation sensors, in which participants performed scenarios including real falls on a carpet and calls for help. These scenarios were elaborated thanks to a field study involving elderly persons. Experiments are presented, first on real-time distress detection using audio and speech analysis and second on fall detection using video analysis. Results show the difficulty of the task. The database can be used as a standardized database by researchers to evaluate and compare their systems for assisting elderly people.
2015
Recognition of Distress Calls in Distant Speech Setting: a Preliminary Experiment in a Smart Home
Michel Vacher
|
Benjamin Lecouteux
|
Frédéric Aman
|
Solange Rossato
|
François Portet
Proceedings of SLPAT 2015: 6th Workshop on Speech and Language Processing for Assistive Technologies
2013
Analyzing the Performance of Automatic Speech Recognition for Ageing Voice: Does it Correlate with Dependency Level?
Frédéric Aman
|
Michel Vacher
|
Solange Rossato
|
François Portet
Proceedings of the Fourth Workshop on Speech and Language Processing for Assistive Technologies
2012
Etude de la coarticulation CV chez des adultes bègues italiens (Study of the coarticulation CV within Italian adult stutterers) [in French]
Marine Verdurand
|
Lionel Granjon
|
Daria Balbo
|
Solange Rossato
|
Claudio Zmarich
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP
Vérification du locuteur : variations de performance (Speaker verification : results variation) [in French]
Juliette Kahn
|
Nicolas Scheffer
|
Solange Rossato
|
Jean-François Bonastre
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP
Etude de la performance des modèles acoustiques pour des voix de personnes âgées en vue de l’adaptation des systèmes de RAP (Assessment of the acoustic models performance in the ageing voice case for ASR system adaptation) [in French]
Frédéric Aman
|
Michel Vacher
|
Solange Rossato
|
Remus Dugheanu
|
François Portet
|
Juline le Grand
|
Yuko Sasa
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 1: JEP
Les technologies de la parole et du TALN pour l’assistance à domicile des personnes âgées : un rapide tour d’horizon (Quick tour of NLP and speech technologies for ambient assisted living) [in French]
François Portet
|
Michel Vacher
|
Solange Rossato
JEP-TALN-RECITAL 2012, Workshop ILADI 2012: Interactions Langagières pour personnes Agées Dans les habitats Intelligents (ILADI 2012: Language Interaction for Elderly in Smart Homes)
Contribution à l’étude de la variabilité de la voix des personnes âgées en reconnaissance automatique de la parole (Contribution to the study of elderly people’s voice variability in automatic speech recognition) [in French]
Frédéric Aman
|
Michel Vacher
|
Solange Rossato
|
François Portet
JEP-TALN-RECITAL 2012, Workshop ILADI 2012: Interactions Langagières pour personnes Agées Dans les habitats Intelligents (ILADI 2012: Language Interaction for Elderly in Smart Homes)