2020
pdf
bib
abs
Building a Time-Aligned Cross-Linguistic Reference Corpus from Language Documentation Data (DoReCo)
Ludger Paschen
|
François Delafontaine
|
Christoph Draxler
|
Susanne Fuchs
|
Matthew Stave
|
Frank Seifart
Proceedings of the Twelfth Language Resources and Evaluation Conference
Natural speech data on many languages have been collected by language documentation projects aiming to preserve lingustic and cultural traditions in audivisual records. These data hold great potential for large-scale cross-linguistic research into phonetics and language processing. Major obstacles to utilizing such data for typological studies include the non-homogenous nature of file formats and annotation conventions found both across and within archived collections. Moreover, time-aligned audio transcriptions are typically only available at the level of broad (multi-word) phrases but not at the word and segment levels. We report on solutions developed for these issues within the DoReCo (DOcumentation REference COrpus) project. DoReCo aims at providing time-aligned transcriptions for at least 50 collections of under-resourced languages. This paper gives a preliminary overview of the current state of the project and details our workflow, in particular standardization of formats and conventions, the addition of segmental alignments with WebMAUS, and DoReCo’s applicability for subsequent research programs. By making the data accessible to the scientific community, DoReCo is designed to bridge the gap between language documentation and linguistic inquiry.
pdf
bib
abs
Automatic Period Segmentation of Oral French
Natalia Kalashnikova
|
Loïc Grobol
|
Iris Eshkol-Taravella
|
François Delafontaine
Proceedings of the Twelfth Language Resources and Evaluation Conference
Natural Language Processing in oral speech segmentation is still looking for a minimal unit to analyze. In this work, we present a comparison of two automatic segmentation methods of macro-syntactic periods which allows to take into account syntactic and prosodic components of speech. We compare the performances of an existing tool Analor (Avanzi, Lacheret-Dujour, Victorri, 2008) developed for automatic segmentation of prosodic periods and of CRF models relying on syntactic and / or prosodic features. We find that Analor tends to divide speech into smaller segments and that CRF models detect larger segments rather than macro-syntactic periods. However, in general CRF models perform better results than Analor in terms of F-measure.
pdf
bib
abs
Segmentation automatique en périodes pour le français parlé (Automatic Period Segmentation of Oral French)
Natalia Kalashnikova
|
Iris Eshkol-Taravella
|
Loïc Grobol
|
François Delafontaine
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles
Nous proposons la comparaison de deux méthodes de segmentation automatique du français parlé en périodes macro-syntaxiques, qui permettent d’analyser la syntaxe et la prosodie du discours. Nous comparons l’outil Analor (Avanzi et al., 2008) qui a été développé pour la segmentation des périodes prosodiques et les modèles de segmentations utilisant des CRF et des traits prosodiques et / ou morphosyntaxiques. Les résultats montrent qu’Analor divise le discours en plus petits segments prosodiques tandis que les modèles CRF détectent des segments plus larges que les périodes macro-syntaxiques. Cependant, les modèles CRF ont de meilleurs résultats qu’Analor en termes de F-mesure.