2024
Distinguishing Fictional Voices: a Study of Authorship Verification Models for Quotation Attribution
Gaspard Michel | Elena Epure | Romain Hennequin | Christophe Cerisara
Proceedings of the 8th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2024)
Recent approaches to automatically detecting the speaker of an utterance of direct speech often disregard general information about characters in favor of local information found in the context, such as surrounding mentions of entities. In this work, we explore stylistic representations of characters built by encoding their quotes with off-the-shelf pretrained Authorship Verification models in a large corpus of English novels (the Project Dialogism Novel Corpus). Results suggest that the combination of stylistic and topical information captured by some of these models accurately distinguishes characters from one another, but does not necessarily improve over semantic-only models when attributing quotes. However, these results vary across novels, and further investigation of stylometric models specifically tailored to literary texts and the study of characters is needed.
Improving Quotation Attribution with Fictional Character Embeddings
Gaspard Michel | Elena V. Epure | Romain Hennequin | Christophe Cerisara
Findings of the Association for Computational Linguistics: EMNLP 2024
Humans naturally attribute utterances of direct speech to their speaker in literary works. When attributing quotes, we process contextual information but also access mental representations of characters that we build and revise throughout the narrative. Recent methods to automatically attribute such utterances have explored simulating human logic with deterministic rules or learning new implicit rules with neural networks when processing contextual information. However, these systems inherently lack character representations, which often leads to errors on more challenging examples of attribution: anaphoric and implicit quotes. In this work, we propose to augment a popular quotation attribution system, BookNLP, with character embeddings that encode global stylistic information about characters, derived from an off-the-shelf stylometric model, Universal Authorship Representation (UAR). We create DramaCV, a corpus of English drama plays from the 15th to the 20th century that we automatically annotate for Authorship Verification of fictional characters’ utterances, and release two versions of UAR trained on DramaCV that are tailored to the analysis of literary characters. Then, through an extensive evaluation on 28 novels, we show that combining BookNLP’s contextual information with our proposed global character embeddings improves the identification of speakers for anaphoric and implicit quotes, reaching state-of-the-art performance. Code and data can be found at https://github.com/deezer/character_embeddings_qa.
2023
Participation de l’équipe TTGV à DEFT 2023 : Réponse automatique à des QCM issus d’examens en pharmacie (Participation of the TTGV team in DEFT 2023: Automatic answering of multiple-choice questions from pharmacy exams)
Andréa Blivet | Solène Degrutère | Barbara Gendron | Aurélien Renault | Cyrille Siouffi | Vanessa Gaudray Bouju | Christophe Cerisara | Hélène Flamein | Gaël Guibon | Matthieu Labeau | Tom Rousseau
Actes de CORIA-TALN 2023. Actes du Défi Fouille de Textes@TALN2023
This article presents the approach of the TTGV team in its participation in the two tasks proposed at DEFT 2023: identifying the number of supposedly correct answers to a multiple-choice question, and predicting the set of correct answers among the five proposed for a given question. The article describes the different methodologies implemented, exploring a wide range of approaches and techniques to first address the distinction between questions calling for a single answer or for several, before turning to the identification of the correct answers. We detail the different methods used, highlighting their respective advantages and limitations. We then present the results obtained for each approach. Finally, we discuss the limitations intrinsic to the tasks themselves as well as to the approaches considered in this contribution.
2022
Unsupervised multiple-choice question generation for out-of-domain Q&A fine-tuning
Guillaume Le Berre | Christophe Cerisara | Philippe Langlais | Guy Lapalme
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Pre-trained models have shown very good performance on a number of question answering benchmarks, especially when fine-tuned on multiple question answering datasets at once. In this work, we propose an approach for generating a fine-tuning dataset with a rule-based algorithm that produces questions and answers from unannotated sentences. We show that the state-of-the-art model UnifiedQA can greatly benefit from such a system on a multiple-choice benchmark about physics, biology, and chemistry that it has never been trained on. We further show that performance can be improved by selecting the most challenging distractors (wrong answers) with a dedicated ranker based on a pretrained RoBERTa model.
2018
Multi-task dialog act and sentiment recognition on Mastodon
Christophe Cerisara | Somayeh Jafaritazehjani | Adedayo Oluokun | Hoa T. Le
Proceedings of the 27th International Conference on Computational Linguistics
Because of license restrictions, it often becomes impossible to strictly reproduce most research results on Twitter data only a few months after the creation of the corpus. This situation gradually worsens as time passes and tweets become inaccessible, which is a critical issue for reproducible and accountable research on social media. We partly address this challenge by annotating a new Twitter-like corpus from an alternative large social medium whose licenses are compatible with reproducible experiments: Mastodon. We manually annotate both dialogues and sentiments on this corpus, and train a multi-task hierarchical recurrent network for joint sentiment and dialog act recognition. We experimentally demonstrate that transfer learning can be efficiently achieved between both tasks, and further analyze some specific correlations between sentiments and dialogues on social media. Both the annotated corpus and the deep network are released under an open-source license.
2016
Weakly-supervised text-to-speech alignment confidence measure
Guillaume Serrière | Christophe Cerisara | Dominique Fohr | Odile Mella
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers
This work proposes a new confidence measure for evaluating the outputs of text-to-speech alignment systems, a key component for many applications such as semi-automatic corpus anonymization, lip syncing, film dubbing, corpus preparation for speech synthesis, and the training of acoustic models for speech recognition. This confidence measure exploits deep neural networks that are trained on large corpora without direct supervision. It is evaluated on an open-source spontaneous speech corpus and outperforms a confidence score derived from a state-of-the-art text-to-speech aligner. We further show that this confidence measure can be used to fine-tune the output of this aligner and improve the quality of the resulting alignment.
2015
A Domain Agnostic Approach to Verbalizing n-ary Events without Parallel Corpora
Bikash Gyawali | Claire Gardent | Christophe Cerisara
Proceedings of the 15th European Workshop on Natural Language Generation (ENLG)
2013
Unsupervised structured semantic inference for spoken dialog reservation tasks
Alejandra Lorenzo | Lina Rojas-Barahona | Christophe Cerisara
Proceedings of the SIGDIAL 2013 Conference
2012
Unsupervised frame based Semantic Role Induction: application to French and English
Alejandra Lorenzo | Christophe Cerisara
Proceedings of the ACL 2012 Joint Workshop on Statistical Parsing and Semantic Processing of Morphologically Rich Languages
2011
Vers la détection des dislocations à gauche dans les transcriptions automatiques du Français parlé (Towards automatic recognition of left dislocation in transcriptions of Spoken French)
Corinna Anderson | Christophe Cerisara | Claire Gardent
Actes de la 18e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts
This work is part of the broader development of a syntactic parsing platform for spoken French. We describe the design of an automatic model for resolving the anaphoric link present in left dislocations in a corpus of spoken French radio broadcasts. Detecting these structures should eventually improve our syntactic parser by enriching the information taken into account by our automatic models. The anaphoric link is resolved in two steps: a first, rule-based level filters candidate configurations, and a second level relies on a model trained under the maximum-entropy criterion. An experimental evaluation carried out by cross-validation on a manually annotated corpus yields an F-measure of around 40%.