Cet article présente deux ressources récemment développées pour explorer l’interface prosodie-syntaxe en pidgin nigérian, une langue à faibles ressources d’Afrique de l’Ouest. La première est un treebank intonosyntaxique dans laquelle chaque token est associé à une série de caractéristiques prosodiques au niveau de la syllabe, ce qui permet d’analyser diverses structures syntaxiques et prosodiques en utilisant une même interface. La seconde est un système de synthèse de la parole entraîné sur le même ensemble de données, conçu pour permettre un contrôle direct sur les contours intonatifs de la parole générée. Cet outil a été développé pour nous permettre de tester les hypothèses formulées à partir de l’exploration du treebank. Cet article est largement une adaptation de deux publications récentes présentant chaque outil, avec un accent sur leur interconnexion dans notre recherche en cours.
This paper presents a new phonetic resource for Nigerian Pidgin, a low-resource language of West Africa. Aiming to provide a new tool for research on intonosyntax, we have augmented an existing syntactic treebank of Nigerian Pidgin, associating each orthographically transcribed token with a series of syllable-level alignments and phonetizations. Syllables are further described using a set of continuous and discrete prosodic features. This new approach provides a simple tool for researchers to explore the prosodic characteristics of various syntactic phenomena. In this paper, we present the format of the corpus, the various features added, and several explorations that can be performed using an online interface. We also present a prosodically specified lexicon extracted using this resource. In it, each orthographic form is accompanied by the frequency of its phoneme-level variants, as well as the suprasegmental features that most frequently accompany each syllable. Finally, we present several additional case studies on how this corpus can used in the study of the language’s prosody.
We present a French corpus of political interviews labeled at the utterance level according to expressive dimensions such as Arousal. This corpus consists of 7.5 hours of high-quality audio-visual recordings with transcription. At the time of this publication, 1 hour of speech was segmented into short utterances, each manually annotated in Arousal. Our segmentation approach differs from similar corpora and allows us to perform an automatic Arousal prediction baseline by building a speech-based classification model. Although this paper focuses on the acoustic expression of Arousal, it paves the way for future work on conflictual and hostile expression recognition as well as multimodal architectures.
The automatic stance detection task consists in determining the attitude expressed in a text toward a target (text, claim, or entity). This is a typical intermediate task for the fake news detection or analysis, which is a considerably widespread and a particularly difficult issue to overcome. This work aims at the creation of a human-annotated corpus for the automatic stance detection of tweets written in French. It exploits a corpus of tweets collected during July and August 2018. To the best of our knowledge, this is the first freely available stance annotated tweet corpus in the French language. The four classes broadly adopted by the community were chosen for the annotation: support, deny, query, and comment with the addition of the ignore class. This paper presents the corpus along with the tools used to build it, its construction, an analysis of the inter-rater reliability, as well as the challenges and questions that were raised during the building process.