2024
Daily auditory environments in French-speaking infants: A longitudinal dataset
Estelle Hervé
|
Clément François
|
Laurent Prevot
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Babies’ daily auditory environment plays a crucial role in language development. Most previous research estimating the quantitative and qualitative aspects of early speech input has focused on English- and Spanish-speaking families. In addition, validation studies of analysis tools for daylong recordings are scarce for French data sets. In this paper, we present a French corpus of daylong audio recordings longitudinally collected with the LENA (Language ENvironment Analysis) system from infants aged 3 to 24 months. We conduct a thorough exploration of this data set, which serves as a quality check for both the data and the analysis tools. We evaluate the reliability of LENA metrics by systematically comparing them with those obtained from the ChildProject set of tools and by checking the known dynamics of the metrics with age. These metrics are also used to replicate, on our data set, the findings of Warlaumont et al. (2014) about the increase with age of infants’ speech vocalizations and of temporal contingencies between infants and caregivers.
Experimenting with Discourse Segmentation of Taiwan Southern Min Spontaneous Speech
Laurent Prévot
|
Sheng-Fu Wang
Proceedings of the 5th Workshop on Computational Approaches to Discourse (CODI 2024)
Discourse segmentation has received increased attention in recent years; however, the majority of studies have focused on written genres and high-resource languages. This paper investigates discourse segmentation of a Taiwan Southern Min spontaneous speech corpus. We compare two approaches to fine-tuning a Large Language Model (LLM): supervised, thanks to a high-quality annotated dataset, and weakly supervised, requiring only a small amount of manual labeling. The corpus used here is transcribed with both Chinese characters and romanized transcription. This allows us to compare the impact of the written form on the discourse segmentation task. Additionally, the dataset includes manual labeling of prosodic breaks, allowing an exploration of the role prosody can play in contemporary discourse segmentation systems grounded in LLMs. In our study, the supervised approach outperforms weak supervision; the character-based version achieves better scores than the romanized version; and prosodic information proves to be a useful source for improving discourse segmentation performance.
MEETING: A corpus of French meeting-style conversations
Julie Hunter
|
Hiroyoshi Yamasaki
|
Océane Granier
|
Jérôme Louradour
|
Roxane Bertrand
|
Kate Thompson
|
Laurent Prévot
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
We present the MEETING corpus, a dataset of roughly 95 hours of spontaneous meeting-style conversations in French. The corpus is designed to serve as a foundation for downstream tasks such as meeting summarization. In its current state, it offers 25 hours of manually corrected transcripts that are aligned with the audio signal, making it a valuable resource for evaluating ASR and speaker recognition systems. It also includes automatic transcripts and alignments of the whole corpus which can be used for downstream NLP tasks. The aim of this paper is to describe the conception, production and annotation of the corpus up to the transcription level as well as to provide statistics that shed light on the main linguistic features of the corpus.
ChiCA: un corpus de conversations face-à-face vs. Zoom entre enfants et parents [ChiCA: a corpus of face-to-face vs. Zoom conversations between children and parents]
Dhia Elhak Goumri
|
Abhishek Agrawal
|
Mitja Nikolaus
|
Hong Duc Thang Vu
|
Kübra Bodur
|
Elias Semmar
|
Cassandre Armand
|
Chiara Mazzocconi
|
Shreejata Gupta
|
Laurent Prévot
|
Benoit Favre
|
Leonor Becerra-Bonache
|
Abdellah Fourtassi
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 2 : traductions d'articles publiés
Existing studies of speech in natural interaction have mainly focused on the two ends of the developmental spectrum, i.e., early childhood and adulthood, leaving a gap in our knowledge of how development unfolds, especially during school age (6 to 11 years). The current work helps fill this gap by introducing a developmental corpus of child-parent conversations at home, involving groups of children aged 7, 9, and 11 whose native language is French. Each dyad was recorded twice: once face-to-face and once using computer-mediated video calls. For the face-to-face setting, we capitalized on recent advances in mobile eye-tracking and head motion detection technology to optimize the naturalness of the recordings, allowing us to obtain data that are both precise and ecologically valid. In addition, we circumvented the difficulties of manual annotation by relying, as far as possible, on automatic speech processing and computer vision tools. Finally, to demonstrate the richness of this corpus for the study of children’s communicative development, we provide preliminary analyses comparing several measures of child-parent conversational dynamics across age, modality, and communicative medium. We hope that the current work will pave the way for future discoveries about the properties and mechanisms of multimodal communicative development during the school-age years.
CHICA: A Developmental Corpus of Child-Caregiver’s Face-to-face vs. Video Call Conversations in Middle Childhood
Dhia Elhak Goumri
|
Abhishek Agrawal
|
Mitja Nikolaus
|
Hong Duc Thang Vu
|
Kübra Bodur
|
Elias Semmar
|
Cassandre Armand
|
Chiara Mazzocconi
|
Shreejata Gupta
|
Laurent Prévot
|
Benoit Favre
|
Leonor Becerra-Bonache
|
Abdellah Fourtassi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Existing studies of naturally occurring language-in-interaction have largely focused on the two ends of the developmental spectrum, i.e., early childhood and adulthood, leaving a gap in our knowledge about how development unfolds, especially across middle childhood. The current work contributes to filling this gap by introducing CHICA (for Child Interpersonal Communication Analysis), a developmental corpus of child-caregiver conversations at home, involving groups of French-speaking children aged 7, 9, and 11 years old. Each dyad was recorded twice: once in a face-to-face setting and once using computer-mediated video calls. For the face-to-face settings, we capitalized on recent advances in mobile, lightweight eye-tracking and head motion detection technology to optimize the naturalness of the recordings, allowing us to obtain both precise and ecologically valid data. Further, we mitigated the challenges of manual annotation by relying – to the extent possible – on automatic tools in speech processing and computer vision. Finally, to demonstrate the richness of this corpus for the study of child communicative development, we provide preliminary analyses comparing several measures of child-caregiver conversational dynamics across developmental age, modality, and communicative medium. We hope the current corpus will allow new discoveries into the properties and mechanisms of multimodal communicative development across middle childhood.
Conversational Feedback in Scripted versus Spontaneous Dialogues: A Comparative Analysis
Ildiko Pilan
|
Laurent Prévot
|
Hendrik Buschmeier
|
Pierre Lison
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Scripted dialogues such as movie and TV subtitles constitute a widespread source of training data for conversational NLP models. However, there are notable linguistic differences between these dialogues and spontaneous interactions, especially regarding the occurrence of communicative feedback such as backchannels, acknowledgments, or clarification requests. This paper presents a quantitative analysis of such feedback phenomena in both subtitles and spontaneous conversations. Based on conversational data spanning eight languages and multiple genres, we extract lexical statistics, classifications from a dialogue act tagger, expert annotations and labels derived from a fine-tuned Large Language Model (LLM). Our main empirical findings are that (1) communicative feedback is markedly less frequent in subtitles than in spontaneous dialogues and (2) subtitles contain a higher proportion of negative feedback. We also show that dialogues generated by standard LLMs lie much closer to scripted dialogues than to spontaneous interactions in terms of communicative feedback.
2023
Comparing Methods for Segmenting Elementary Discourse Units in a French Conversational Corpus
Laurent Prevot
|
Julie Hunter
|
Philippe Muller
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
While discourse parsing has made considerable progress in recent years, discourse segmentation of conversational speech remains a difficult issue. In this paper, we exploit a French data set that has been manually segmented into discourse units to compare two approaches to discourse segmentation: fine-tuning existing systems on manual segmentation vs. using hand-crafted labelling rules to develop a weakly supervised segmenter. Our results show that both approaches yield similar performance in terms of F-score, while data programming requires less manual annotation work. In a second experiment, we vary the amount of training data used for fine-tuning systems and show that a small amount of hand-labelled data is enough to obtain good results (although significantly lower than in the first experiment, which used all the annotated data available).
2022
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni
|
Nora Hollenstein
|
Cassandra Jacobs
|
Yohei Oseki
|
Laurent Prévot
|
Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
CMCL 2022 Shared Task on Multilingual and Crosslingual Prediction of Human Reading Behavior
Nora Hollenstein
|
Emmanuele Chersoni
|
Cassandra Jacobs
|
Yohei Oseki
|
Laurent Prévot
|
Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
We present the second shared task on eye-tracking data prediction of the Cognitive Modeling and Computational Linguistics Workshop (CMCL). Unlike in the previous edition, participating teams are asked to predict eye-tracking features from multiple languages, including a surprise language for which no training data were available. Moreover, the task also includes the prediction of standard deviations of feature values in order to account for individual differences between readers. A total of six teams registered for the task. For the first subtask on multilingual prediction, the winning team proposed a regression model based on lexical features, while for the second subtask on cross-lingual prediction, the winning team used a hybrid model based on multilingual transformer embeddings as well as statistical features.
2021
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni
|
Nora Hollenstein
|
Cassandra Jacobs
|
Yohei Oseki
|
Laurent Prévot
|
Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
CMCL 2021 Shared Task on Eye-Tracking Prediction
Nora Hollenstein
|
Emmanuele Chersoni
|
Cassandra L. Jacobs
|
Yohei Oseki
|
Laurent Prévot
|
Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Eye-tracking data from reading represent an important resource for both linguistics and natural language processing. The ability to accurately model gaze features is crucial to advance our understanding of language processing. This paper describes the Shared Task on Eye-Tracking Data Prediction, jointly organized with the eleventh edition of the Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2021). The goal of the task is to predict 5 different token-level eye-tracking metrics of the Zurich Cognitive Language Processing Corpus (ZuCo). Eye-tracking data were recorded during natural reading of English sentences. In total, we received submissions from 13 registered teams, whose systems include boosting algorithms with handcrafted features, neural models leveraging transformer language models, or hybrid approaches. The winning system used a range of linguistic and psychometric features in a gradient boosting framework.
2020
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni
|
Cassandra Jacobs
|
Yohei Oseki
|
Laurent Prévot
|
Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
The ISO Standard for Dialogue Act Annotation, Second Edition
Harry Bunt
|
Volha Petukhova
|
Emer Gilmartin
|
Catherine Pelachaud
|
Alex Fang
|
Simon Keizer
|
Laurent Prévot
Proceedings of the Twelfth Language Resources and Evaluation Conference
ISO standard 24617-2 for dialogue act annotation, established in 2012, has in the past few years been used both in corpus annotation and in the design of components for spoken and multimodal dialogue systems. This has brought some inaccuracies and undesirable limitations of the standard to light, which are addressed in a proposed second edition. This second edition allows a more accurate annotation of dependence relations and rhetorical relations in dialogue. Following the ISO 24617-4 principles of semantic annotation, and borrowing ideas from EmotionML, a triple-layered plug-in mechanism is introduced which allows dialogue act descriptions to be enriched with information about their semantic content, about accompanying emotions, and other information, and allows the annotation scheme to be customised by adding application-specific dialogue act types.
Multimodal Corpus of Bidirectional Conversation of Human-human and Human-robot Interaction during fMRI Scanning
Birgit Rauchbauer
|
Youssef Hmamouche
|
Brigitte Bigi
|
Laurent Prévot
|
Magalie Ochs
|
Thierry Chaminade
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper we present an investigation of real-life, bidirectional conversations. We introduce the multimodal corpus derived from these natural conversations alternating between human-human and human-robot interactions. The human-robot interactions were used as a control condition for the social nature of the human-human conversations. The experimental setup consisted of conversations between the participant in a functional magnetic resonance imaging (fMRI) scanner and a human confederate or conversational robot outside the scanner room, connected via bidirectional audio and unidirectional videoconferencing (from outside to inside the scanner). A cover story provided a framework for natural, real-life conversations about images of an advertisement campaign. During the conversations we collected a multimodal corpus for a comprehensive characterization of bidirectional conversations. In this paper we introduce this multimodal corpus, which includes neural data from functional magnetic resonance imaging (fMRI), physiological data (blood flow pulse and respiration), transcribed conversational data, as well as face and eye-tracking recordings. Thus, we present a unique corpus to study human conversations including neural, physiological and behavioral data.
BrainPredict: a Tool for Predicting and Visualising Local Brain Activity
Youssef Hmamouche
|
Laurent Prévot
|
Magalie Ochs
|
Thierry Chaminade
Proceedings of the Twelfth Language Resources and Evaluation Conference
In this paper, we present a tool allowing dynamic prediction and visualization of an individual’s local brain activity during a conversation. The prediction module of this tool is based on classifiers trained using a corpus of human-human and human-robot conversations including fMRI recordings. More precisely, the module takes as input behavioral features computed from raw data, mainly the speech of the participant and the interlocutor, but also the participant’s visual input and eye movements. The visualisation module shows in real time the dynamics of active brain areas synchronised with the behavioral raw data. In addition, it shows which integrated behavioral features are used to predict the activity in individual brain areas.
Exploiting weak-supervision for classifying Non-Sentential Utterances in Mandarin Conversations
Xin-Yi Chen
|
Laurent Prévot
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation
Filtering conversations through dialogue acts labels for improving corpus-based convergence studies
Simone Fuscone
|
Benoit Favre
|
Laurent Prévot
Proceedings of the 21st Annual Meeting of the Special Interest Group on Discourse and Dialogue
Cognitive models of conversation and research on user adaptation in dialogue systems involve a better understanding of speakers’ convergence in conversation. Convergence effects have been established on controlled data sets, for various acoustic and linguistic variables. Tracking interpersonal dynamics on generic corpora has provided positive but more mixed outcomes. We propose here to enrich large conversational corpora with dialogue act (DA) information. We use DA labels as filters in order to create data subsets featuring homogeneous conversational activity. Those data sets allow a more precise comparison between speakers’ speech variables. Our experiments consist of comparing convergence on low-level variables (energy, pitch, speech rate) measured on raw data sets with that measured on manually and automatically DA-labelled data sets. We found that such filtering does help in observing convergence, suggesting that studies on interpersonal dynamics should consider such high-level dialogue activity types and their related NLP topics as important ingredients of their toolboxes.
Comparaison linguistique et neuro-physiologique de conversations humain humain et humain robot [Linguistic and neuro-physiological comparison of human-human and human-robot conversations]
Charlie Hallart
|
Juliette Maes
|
Nicolas Spatola
|
Laurent Prévot
|
Thierry Chaminade
Traitement Automatique des Langues, Volume 61, Numéro 3 : Dialogue et systèmes de dialogue [Dialogue and dialogue systems]
2019
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni
|
Cassandra Jacobs
|
Alessandro Lenci
|
Tal Linzen
|
Laurent Prévot
|
Enrico Santus
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
2018
Downward Compatible Revision of Dialogue Annotation
Harry Bunt
|
Emer Gilmartin
|
Simon Keizer
|
Catherine Pelachaud
|
Volha Petukhova
|
Laurent Prévot
|
Mariët Theune
Proceedings of the 14th Joint ACL-ISO Workshop on Interoperable Semantic Annotation
2016
LexFr: Adapting the LexIt Framework to Build a Corpus-based French Subcategorization Lexicon
Giulia Rambelli
|
Gianluca Lebani
|
Laurent Prévot
|
Alessandro Lenci
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
This paper introduces LexFr, a corpus-based French lexical resource built by adapting the framework LexIt, originally developed to describe the combinatorial potential of Italian predicates. As in the original framework, the behavior of a group of target predicates is characterized by a set of syntactic (i.e., subcategorization frames) and semantic (i.e., selectional preferences) statistical information (a.k.a. distributional profiles) whose extraction process is mostly unsupervised. The first release of LexFr includes information for 2,493 verbs, 7,939 nouns and 2,628 adjectives. In these pages we describe the adaptation process and evaluate the final resource by comparing the information collected for 20 test verbs against the information available in a gold standard dictionary. In the best performing setting, we obtained 0.74 precision, 0.66 recall and 0.70 F-measure.
4Couv: A New Treebank for French
Philippe Blache
|
Grégoire de Montcheuil
|
Laurent Prévot
|
Stéphane Rauzy
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
The type of text used as primary data in treebanks is of some importance. First, it has an influence at the discourse level: an article is not organized in the same way as a novel or a technical document. Moreover, it also has consequences in terms of semantic interpretation: some types of texts can be easier to interpret than others. We present in this paper a new type of treebank designed to meet the specific needs of experimental linguistics. It is made of short texts (book back covers) that are strongly coherent in their organization and can be rapidly interpreted. This type of text is suited to short reading sessions, making it easy to acquire physiological data (e.g. eye movements, electroencephalography). Such a resource offers reliable data when looking for correlations between computational models and human language processing.
A CUP of CoFee: A large Collection of feedback Utterances Provided with communicative function annotations
Laurent Prévot
|
Jan Gorisch
|
Roxane Bertrand
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
There have previously been several attempts to annotate the communicative functions of verbal feedback utterances in English. Here, we propose an annotation scheme for verbal and non-verbal feedback utterances in French, including the categories base, attitude, previous and visual. The data comprise conversations, map tasks and negotiations, from which we extracted ca. 13,000 candidate feedback utterances and gestures. Twelve students were recruited for the annotation campaign of ca. 9,500 instances. Each instance was annotated by between 2 and 7 raters. The evaluation of the annotation agreement resulted in an average best-pair kappa of 0.6. While the base category, with the values acknowledgement, evaluation, answer and elicit, achieves good agreement, this is not the case for the other main categories. The data sets, which also include automatic extractions of lexical, positional and acoustic features, are freely available and will further be used for machine learning classification experiments to analyse the form-function relationship of feedback.
2015
A SIP of CoFee : A Sample of Interesting Productions of Conversational Feedback
Laurent Prévot
|
Jan Gorisch
|
Roxane Bertrand
|
Emilien Gorène
|
Brigitte Bigi
Proceedings of the 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue
Annotation and Classification of French Feedback Communicative Functions
Laurent Prévot
|
Jan Gorisch
|
Sankar Mukherjee
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation
2014
Skillex: a graph-based lexical score for measuring the semantic efficiency of used verbs by human subjects describing actions
Bruno Gaume
|
Karine Duvignau
|
Emmanuel Navarro
|
Yann Desalle
|
Hintat Cheung
|
Shu-Kai Hsieh
|
Pierre Magistry
|
Laurent Prévot
Traitement Automatique des Langues, Volume 55, Numéro 3 : Traitement automatique du langage naturel et sciences cognitives [Natural Language Processing and Cognitive Sciences]
Representing Multimodal Linguistic Annotated data
Brigitte Bigi
|
Tatsuya Watanabe
|
Laurent Prévot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
The question of interoperability for linguistically annotated resources covers different aspects. First, it requires a representation framework making it possible to compare, and eventually merge, different annotation schemas. In this paper, a general description level representing multimodal linguistic annotations is proposed. It focuses on time representation and on data content representation: the paper reconsiders and enhances the current, generalized representation of annotations. An XML schema of such annotations is proposed, along with a Python API. This framework is implemented in multi-platform software and distributed under the terms of the GNU Public License.
Aix Map Task corpus: The French multimodal corpus of task-oriented dialogue
Jan Gorisch
|
Corine Astésano
|
Ellen Gurman Bard
|
Brigitte Bigi
|
Laurent Prévot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
This paper introduces the Aix Map Task corpus, a corpus of audio and video recordings of task-oriented dialogues. It was modelled after the original HCRC Map Task corpus. Lexical material was designed for the analysis of speech and prosody, as described in Astésano et al. (2007). The design of the lexical material, the protocol and some basic quantitative features of the existing corpus are presented. The corpus was collected under two communicative conditions, one audio-only condition and one face-to-face condition. The recordings took place in a studio and a sound attenuated booth respectively, with head-set microphones (and in the face-to-face condition with two video cameras). The recordings have been segmented into Inter-Pausal-Units and transcribed using transcription conventions containing actual productions and canonical forms of what was said. It is made publicly available online.
Segmentation evaluation metrics, a comparison grounded on prosodic and discourse units
Klim Peshkov
|
Laurent Prévot
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Knowledge of evaluation metrics and of best practices for using them has improved rapidly in recent years (Fort et al., 2012). However, the advances mostly concern the evaluation of classification-related tasks. Segmentation tasks have received less attention. Nevertheless, they are crucial in a large number of linguistic studies. A range of metrics is available (F-score on boundaries, F-score on units, WindowDiff (WD), Boundary Similarity (BS)), but it is still relatively difficult to interpret these metrics on various linguistic segmentation tasks, such as prosodic and discourse segmentation. In this paper, we consider real segmented datasets (introduced in Peshkov et al. (2012)) as references, which we deteriorate in different ways (random addition of boundaries, random removal of boundaries, introduction of near-miss errors). This provides us with various measures on controlled datasets and with an interesting benchmark for various linguistic segmentation tasks.
2013
Observing Features of PTT Neologisms: A Corpus-driven Study with N-gram Model
Tsun-Jui Liu
|
Shu-Kai Hsieh
|
Laurent Prevot
Proceedings of the 25th Conference on Computational Linguistics and Speech Processing (ROCLING 2013)
A quantitative view of feedback lexical markers in conversational French
Laurent Prévot
|
Brigitte Bigi
|
Roxane Bertrand
Proceedings of the SIGDIAL 2013 Conference
A Quantitative Comparative Study of Prosodic and Discourse Units, the Case of French and Taiwan Mandarin
Laurent Prévot
|
Shu-Chuan Tseng
|
Alvin Cheng-Hsien Chen
|
Klim Peshkov
Proceedings of the 27th Pacific Asia Conference on Language, Information, and Computation (PACLIC 27)
2012
An empirical resource for discovering cognitive principles of discourse organisation: the ANNODIS corpus
Stergos Afantenos
|
Nicholas Asher
|
Farah Benamara
|
Myriam Bras
|
Cécile Fabre
|
Mai Ho-dac
|
Anne Le Draoulec
|
Philippe Muller
|
Marie-Paule Péry-Woodley
|
Laurent Prévot
|
Josette Rebeyrolles
|
Ludovic Tanguy
|
Marianne Vergez-Couret
|
Laure Vieu
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)
This paper describes the ANNODIS resource, a discourse-level annotated corpus for French. The corpus combines two perspectives on discourse: a bottom-up approach and a top-down approach. The bottom-up view incrementally builds a structure from elementary discourse units, while the top-down view focuses on the selective annotation of multi-level discourse structures. The corpus is composed of texts that are diversified with respect to genre, length and type of discursive organisation. The methodology followed here involves an iterative design of annotation guidelines in order to reach satisfactory inter-annotator agreement levels. This allows us to raise a few issues relevant for the comparison of such complex objects as discourse structures. The corpus also serves as a source of empirical evidence for discourse theories. We present here two initial analyses taking advantage of this new annotated corpus: one that tested hypotheses on constraints governing discourse structure, and another that studied the variations in composition and signalling of multi-level discourse structures.
2011
Un calcul de termes typés pour la pragmatique lexicale: chemins et voyageurs fictifs dans un corpus de récits de voyage (A calculation of typed terms for lexical pragmatics: paths and fictional travellers in a travel stories corpus)
Richard Moot
|
Laurent Prévot
|
Christian Retoré
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
This work is part of the automatic analysis of a corpus of travel narratives. To this end, we refine Montague semantics to account for the phenomena by which word meaning adapts to the context in which words appear. Here, we model constructions such as ‘le chemin descend pendant une demi-heure’ (‘the path descends for half an hour’), in which the path introduces a fictional traveller who follows it, extending ideas that the last author developed with Bassac and Mery. This introduction of the traveller uses type raising so that the quantifier introducing the traveller takes scope over the whole sentence and so that the properties of the path do not become properties of the traveller, fictional though he may be. This semantic analysis (or rather its translation into lambda-DRT) is already implemented for part of the Grail lexicon.
2010
A Formal Scheme for Multimodal Grammars
Philippe Blache
|
Laurent Prévot
Coling 2010: Posters
The OTIM Formal Annotation Model: A Preliminary Step before Annotation Scheme
Philippe Blache
|
Roxane Bertrand
|
Mathilde Guardiola
|
Marie-Laure Guénot
|
Christine Meunier
|
Irina Nesterenko
|
Berthille Pallaud
|
Laurent Prévot
|
Béatrice Priego-Valverde
|
Stéphane Rauzy
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)
Large annotation projects, typically those addressing multimodal annotation, in which many different kinds of information have to be encoded, have to elaborate precise and high-level annotation schemes. Doing this first requires defining the structure of the information: the different objects and their organization. This stage has to be as independent as possible from the constraints of the coding language. This is the reason why we propose a preliminary formal annotation model, represented with typed feature structures. This representation requires a precise definition of the different objects, their properties (or features) and their relations, represented in terms of type hierarchies. This approach has been used to specify the annotation scheme of a large multimodal annotation project (OTIM) and tested in the annotation of a multimodal corpus (CID, Corpus of Interactional Data). This project aims at collecting, annotating and exploiting a dialogue video corpus in a multimodal perspective (including speech and gesture modalities). The corpus itself consists of 8 hours of dialogues, fully transcribed and richly annotated (phonetics, syntax, pragmatics, gestures, etc.).
pdf
bib
Computational Modeling of Verb Acquisition, from a Monolingual to a Bilingual Study
Laurent Prévot
|
Chun-Han Chang
|
Yann Desalle
Proceedings of the 24th Pacific Asia Conference on Language, Information and Computation
2009
pdf
bib
abs
ANNODIS: une approche outillée de l’annotation de structures discursives
Marie-Paule Péry-Woodley
|
Nicholas Asher
|
Patrice Enjalbert
|
Farah Benamara
|
Myriam Bras
|
Cécile Fabre
|
Stéphane Ferrari
|
Lydia-Mai Ho-Dac
|
Anne Le Draoulec
|
Yann Mathet
|
Philippe Muller
|
Laurent Prévot
|
Josette Rebeyrolle
|
Ludovic Tanguy
|
Marianne Vergez-Couret
|
Laure Vieu
|
Antoine Widlöcher
Actes de la 16ème conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
The ANNODIS project aims to build a corpus of texts annotated at the discourse level and to develop tools for corpus annotation and exploitation. The annotations adopt two complementary points of view: a bottom-up perspective starts from minimal discourse units and builds complex structures through a set of discourse relations; a top-down perspective considers the text as a whole and relies on pre-identified cues to detect high-level discourse structures. Corpus construction is paired with the creation of two interfaces: the first supports manual annotation of discourse relations and structures by displaying the markup produced by preprocessing; the second will be dedicated to exploiting the annotations. We present the annotation models and protocols developed to carry out the annotation campaign through the dedicated interface.
pdf
bib
Wiktionary for Natural Language Processing: Methodology and Limitations
Emmanuel Navarro
|
Franck Sajous
|
Bruno Gaume
|
Laurent Prévot
|
ShuKai Hsieh
|
Ivy Kuo
|
Pierre Magistry
|
Chu-Ren Huang
Proceedings of the 2009 Workshop on The People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (People’s Web)
pdf
bib
Using Extra-Linguistic Material for Mandarin-French Verbal Constructions Comparison
Pierre Magistry
|
Laurent Prévot
|
Hintat Cheung
|
Chien-yun Shiao
|
Yann Desalle
|
Bruno Gaume
Proceedings of the 23rd Pacific Asia Conference on Language, Information and Computation, Volume 1
2008
pdf
bib
abs
Extracting Concrete Senses of Lexicon through Measurement of Conceptual Similarity in Ontologies
Siaw-Fong Chung
|
Laurent Prévot
|
Mingwei Xu
|
Kathleen Ahrens
|
Shu-Kai Hsieh
|
Chu-Ren Huang
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The measurement of conceptual similarity in a hierarchical structure has been proposed in studies such as Wu and Palmer (1994), which are summarized and evaluated in Budanitsky and Hirst (2006). The present study applies the measurement of conceptual similarity to conceptual metaphor research by comparing the concreteness of ontological resource nodes to several prototypical concrete nodes selected by human subjects. Here, the purpose of comparing conceptual similarity between nodes is to select a concrete sense for a word that is used metaphorically. Using a WordNet-SUMO interface such as SinicaBow (Huang, Chang and Lee, 2004), the concrete senses of a lexical item are selected once its SUMO nodes have been compared in terms of conceptual similarity with the prototypical concrete nodes. This study has strong implications for the interaction of psycholinguistics and computational linguistics in conceptual metaphor research.
pdf
bib
Toward a cognitive organization for electronic dictionaries, the case for semantic proxemy
Bruno Gaume
|
Karine Duvignau
|
Laurent Prévot
|
Yann Desalle
Coling 2008: Proceedings of the Workshop on Cognitive Aspects of the Lexicon (COGALEX 2008)
2007
pdf
bib
Rethinking Chinese Word Segmentation: Tokenization, Character Classification, or Wordbreak Identification
Chu-Ren Huang
|
Petr Šimon
|
Shu-Kai Hsieh
|
Laurent Prévot
Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions
2006
pdf
bib
Infrastructure for Standardization of Asian Language Resources
Takenobu Tokunaga
|
Virach Sornlertlamvanich
|
Thatsanee Charoenporn
|
Nicoletta Calzolari
|
Monica Monachini
|
Claudia Soria
|
Chu-Ren Huang
|
YingJu Xia
|
Hao Yu
|
Laurent Prevot
|
Kiyoaki Shirai
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions
pdf
bib
Using the Swadesh list for creating a simple common taxonomy
Laurent Prévot
|
Chu-Ren Huang
|
I-Li Su
Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation
2005
pdf
bib
Interfacing Ontologies and Lexical Resources
Laurent Prevot
|
Stefano Borgo
|
Alessandro Oltramari
Proceedings of OntoLex 2005 - Ontologies and Lexical Resources