Speech characteristics vary from speaker to speaker. While some variation phenomena are due to the overall communication setting, others are due to diastratic factors such as gender, provenance, age, and social background. The analysis of these factors, although relevant for both linguistic and speech technology communities, is hampered by the need to annotate existing corpora or to recruit, categorise, and record volunteers as a function of targeted profiles. This paper presents a methodology that uses a knowledge base to provide speaker-specific information. This can facilitate the enrichment of existing corpora with new annotations extracted from the knowledge base. The method also helps the large scale analysis by automatically extracting instances of speech variation to correlate with diastratic features. We apply our method to an over 120-hour corpus of broadcast speech in French and investigate variation patterns linked to reduction phenomena and/or specific to connected speech such as disfluencies. We find significant differences in speech rate, the use of filler words, and the rate of non-canonical realisations of frequent segments as a function of different professional categories and age groups.
This paper builds upon recent work in leveraging the corpora and tools originally used to develop speech technologies for corpus-based linguistic studies. We address the non-canonical realization of consonants in connected speech and we focus on voicing alternation phenomena of stops in 5 standard varieties of Romance languages (French, Italian, Spanish, Portuguese, Romanian). For these languages, both large scale corpora and speech recognition systems were available for the study. We use forced alignment with pronunciation variants and machine learning techniques to examine to what extent such frequent phenomena characterize languages and what are the most triggering factors. The results confirm that voicing alternations occur in all Romance languages. Automatic classification underlines that surrounding contexts and segment duration are recurring contributing factors for modeling voicing alternation. The results of this study also demonstrate the new role that machine learning techniques such as classification algorithms can play in helping to extract linguistic knowledge from speech and to suggest interesting research directions.
L’exploration automatisée de grands corpus permet d’analyser plus finement la relation entre motifs de variation phonétique synchronique et changements diachroniques : les erreurs dans les transcriptions automatiques sont riches d’enseignements sur la variation contextuelle en parole continue et sur les possibles mutations systémiques sur le point d’apparaître. Dès lors, il est intéressant de se pencher sur des phénomènes phonologiques largement attestés dans les langues en diachronie comme en synchronie pour établir leur émergence ou non dans des langues qui n’y sont pas encore sujettes. La présente étude propose donc d’utiliser l’alignement forcé avec variantes de prononciation pour observer les alternances de voisement en coda finale de mot dans deux langues romanes : le français et le roumain. Il sera mis en évidence, notamment, que voisement et dévoisement non-canoniques des codas françaises comme roumaines ne sont pas le fruit du hasard mais bien des instances de dévoisement final et d’assimilation régressive de trait laryngal, qu’il s’agisse de voisement ou de non-voisement.
The present paper aims at providing a first study of lenition- and fortition-type phenomena in coda position in Romanian, a language that can be considered as less-resourced. Our data show that there are two contexts for devoicing in Romanian: before a voiceless obstruent, which means that there is regressive voicelessness assimilation in the language, and before pause, which means that there is a tendency towards final devoicing proper. The data also show that non-canonical voicing is an instance of voicing assimilation, as it is observed mainly before voiced consonants (voiced obstruents and sonorants alike). Two conclusions can be drawn from our analyses. First, from a phonetic point of view, the two devoicing phenomena exhibit the same behavior regarding place of articulation of the coda, while voicing assimilation displays the reverse tendency. In particular, alveolars, which tend to devoice the most, also voice the least. Second, the two assimilation processes have similarities that could distinguish them from final devoicing as such. Final devoicing seems to be sensitive to speech style and gender of the speaker, while assimilation processes do not. This may indicate that the two kinds of processes are phonologized at two different degrees in the language, assimilation being more accepted and generalized than final devoicing.
Cet article est dédié à l’analyse acoustique des voyelles du roumain : des productions en parole continue sont comparées à des prononciations “de laboratoire”. Les objectifs sont : (1) décrire les traits acoustiques des voyelles en fonction du style de parole ; (2) estimer la relation entre traits acoustiques et contrastes phonémiques de la langue ; (3) estimer dans quelle mesure l’étude de l’oral apporte des éclairages au sujet des attributs phonémiques des voyelles centrales  et , dont le statut (phonèmes vs allophones) est controversé. Nous montrons que les traits acoustiques sont comparables pour la parole journalistique vs contrôlée pour l’ensemble de l’inventaire sauf  et . Dans la parole contrôlée  et  sont distinctes, mais confondues en faveur du timbre  à l’oral. La confusion de timbres n’est pas source d’inintelligibilité car  et  sont en distribution quasicomplémentaire. Ce résultat apporte des éclairages sur la question du contraste phonémique graduel et marginal (Goldsmith, 1995; Scobbie & Stuart-Smith, 2008; Hall, 2013).
This paper documents the systems developed by LIMSI for the IWSLT 2014 speech translation task (English→French). The main objective of this participation was twofold: adapting different components of the ASR baseline system to the peculiarities of TED talks and improving the machine translation quality on the automatic speech recognition output data. For the latter task, various techniques have been considered: punctuation and number normalization, adaptation to ASR errors, as well as the use of structured output layer neural network models for speech data.
Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe’s under-described languages. This is due to the fact that the written production remains relatively low, and linguistic knowledge and resources, such as lexica and pronunciation dictionaries, are sparse. The speakers or writers will frequently switch between Luxembourgish, German, and French, on a per-sentence basis, as well as on a sub-sentence level. In order to build resources like lexicons, and especially pronunciation lexicons, or language models needed for natural language processing tasks such as automatic speech recognition, language used in text corpora should be identified. In this paper, we present the design of a manually annotated corpus of mixed language sentences as well as the tools used to select these sentences. This corpus of difficult sentences was used to test a word-based language identification system. This language identification system was used to select textual data extracted from the web, in order to build a lexicon and language models. This lexicon and language model were used in an Automatic Speech Recognition system for the Luxembourgish language which obtain a 25\% WER on the Quaero development data.
This paper is concerned with human assessments of the severity of errors in ASR outputs. We did not design any guidelines so that each annotator involved in the study could consider the “seriousness” of an ASR error using their own scientific background. Eight human annotators were involved in an annotation task on three distinct corpora, one of the corpora being annotated twice, hiding this annotation in duplicate to the annotators. None of the computed results (inter-annotator agreement, edit distance, majority annotation) allow any strong correlation between the considered criteria and the level of seriousness to be shown, which underlines the difficulty for a human to determine whether a ASR error is serious or not.
It is well-known that human listeners significantly outperform machines when it comes to transcribing speech. This paper presents a progress report of the joint research in the automatic vs human speech transcription and of the perceptual experiments developed at LIMSI that aims to increase our understanding of automatic speech recognition errors. Two paradigms are described here in which human listeners are asked to transcribe speech segments containing words that are frequently misrecognized by the system. In particular, we sought to gain information about the impact of increased context to help humans disambiguate problematic lexical items, typically homophone or near-homophone words. The long-term aim of this research is to improve the modeling of ambiguous contexts so as to reduce automatic transcription errors.
This paper describes the speech-to-text systems used to provide automatic transcriptions used in the Quaero 2010 evaluation of Machine Translation from speech. Quaero (www.quaero.org) is a large research and industrial innovation program focusing on technologies for automatic analysis and classification of multimedia and multilingual documents. The ASR transcript is the result of a Rover combination of systems from three teams ( KIT, RWTH, LIMSI+VR) for the French and German languages. The casesensitive word error rates (WER) of the combined systems were respectively 20.8% and 18.1% on the 2010 evaluation data, relative WER reductions of 14.6% and 17.4% respectively over the best component system.
Question Answering (QA) technology aims at providing relevant answers to natural language questions. Most Question Answering research has focused on mining document collections containing written texts to answer written questions. In addition to written sources, a large (and growing) amount of potentially interesting information appears in spoken documents, such as broadcast news, speeches, seminars, meetings or telephone conversations. The QAST track (Question-Answering on Speech Transcripts) was introduced in CLEF to investigate the problem of question answering in such audio documents. This paper describes in detail the evaluation protocol and tools designed and developed for the CLEF-QAST evaluation campaigns that have taken place between 2007 and 2009. We first remind the data, question sets, and submission procedures that were produced or set up during these three campaigns. As for the evaluation procedure, the interface that was developed to ease the assessors work is described. In addition, this paper introduces a methodology for a semi-automatic evaluation of QAST systems based on time slot comparisons. Finally, the QAST Evaluation Package 2007-2009 resulting from these evaluation campaigns is also introduced.
This paper reports on the QAST track of CLEF aiming to evaluate Question Answering on Speech Transcriptions. Accessing information in spoken documents provides additional challenges to those of text-based QA, needing to address the characteristics of spoken language, as well as errors in the case of automatic transcriptions of spontaneous speech. The framework and results of the pilot QAst evaluation held as part of CLEF 2007 is described, illustrating some of the additional challenges posed by QA in spoken documents relative to written ones. The current plans for future multiple-language and multiple-task QAst evaluations are described.
Being the clients first interface, call centres worldwide contain a huge amount of information of all kind under the form of conversational speech. If accessible, this information can be used to detect eg. major events and organizational flaws, improve customer relations and marketing strategies. An efficient way to exploit the unstructured data of telephone calls is data-mining, but current techniques apply on text only. The CallSurf project gathers a number of academic and industrial partners covering the complete platform, from automatic transcription to information retrieval and data mining. This paper concentrates on the speech recognition module as it discusses the collection, the manual transcription of the training corpus and the techniques used to build the language model. The NLP techniques used to pre-process the transcribed corpus for data mining are POS tagging, lemmatization, noun group and named entity recognition. Some of them have been especially adapted to the conversational speech characteristics. POS tagging and preliminary data mining results obtained on the manually transcribed corpus are briefly discussed.
The pronunciation lexicon is a fundamental element in an automatic speech transcription system. It associates each lexical entry (usually a grapheme), with one or more phonemic or phone-like forms, the pronunciation variants. Thorough knowledge of the target language is a priori necessary to establish the pronunciation baseforms and variants. The reliance on human expertise can pose difficulties in developing a system for a language where such knowledge may not be readily available. In this article a speech recognizer is used to help select pronunciation variants in Amharic, the official language of Ethiopia, focusing on alternate choices for vowels. This study is carried out using an audio corpus composed of 37 hours of speech from radio broadcasts which were orthographically transcribed by native speakers. Since the corpus is relatively small for estimating pronunciation variants, a first set of studies were carried out at a syllabic level. Word lexica were then constructed based on the observed syllable occurences. Automatic alignments were compared for lexica containing different vowel variants, with both context-independent and context-dependent acoustic models sets. The variant2+ measure proposed in (Adda-Decker and Lamel, 1999) is used to assess the potential need for pronunciation variants.
Dans cet article, nous présentons un gestionnaire de dialogue pour un système de demande d’informations à reconnaissance vocale. Le gestionnaire de dialogue dispose de différentes sources de connaissance, des connaissances statiques et des connaissances dynamiques. Ces connaissances sont gérées et utilisées par le gestionnaire de dialogue via des stratégies. Elles sont mises en oeuvre et organisées en fonction des objectifs concernant le système de dialogue et en fonction des choix ergonomiques que nous avons retenus. Le gestionnaire de dialogue utilise un modèle de dialogue fondé sur la détermination de phases et un modèle de la tâche dynamique. Il augmente les possibilités d’adaptation de la stratégie en fonction des historiques et de l’état du dialogue. Ce gestionnaire de dialogue, implémenté et évalué lors de la dernière campagne d’évaluation du projet LE-3 ARISE, a permi une amélioration du taux de succès de dialogue (de 53% à 85%).