2024
Claire: Large Language Models for Spontaneous French Dialogue
Jérôme Louradour | Julie Hunter | Ismaïl Harrando | Guokan Shang | Virgile Rennard | Jean-Pierre Lorré
Actes de la 31ème Conférence sur le Traitement Automatique des Langues Naturelles, volume 1 : articles longs et prises de position
We present the Claire family of models, a collection of language models designed to improve tasks that require understanding spoken conversation, such as meeting summarization. Our models result from continuing the pretraining of two base models exclusively on conversation transcripts and theater plays. We also focus on French data in order to counterbalance the emphasis on English in most training corpora. This paper describes the corpus used, the training of the models, and their evaluation. The resulting models, data, and code are released under open licenses and shared on Hugging Face and GitHub.
2023
FREDSum: A Dialogue Summarization Corpus for French Political Debates
Virgile Rennard | Guokan Shang | Damien Grari | Julie Hunter | Michalis Vazirgiannis
Findings of the Association for Computational Linguistics: EMNLP 2023
Recent advances in deep learning, and especially the invention of encoder-decoder architectures, have significantly improved the performance of abstractive summarization systems. While the majority of research has focused on written documents, we have observed an increasing interest in the summarization of dialogues and multi-party conversations over the past few years. In this paper, we present a dataset of French political debates for the purpose of enhancing resources for multilingual dialogue summarization. Our dataset consists of manually transcribed and annotated political debates, covering a range of topics and perspectives. We highlight the importance of high-quality transcription and annotations for training accurate and effective dialogue summarization models, and emphasize the need for multilingual resources to support dialogue summarization in non-English languages. We also provide baseline experiments using state-of-the-art methods, and encourage further research in this area to advance the field of dialogue summarization. Our dataset will be made publicly available for use by the research community, enabling further advances in multilingual dialogue summarization.
Automatic Analysis of Substantiation in Scientific Peer Reviews
Yanzhu Guo | Guokan Shang | Virgile Rennard | Michalis Vazirgiannis | Chloé Clavel
Findings of the Association for Computational Linguistics: EMNLP 2023
With the increasing number of problematic peer reviews in top AI conferences, the community is urgently in need of automatic quality control measures. In this paper, we restrict our attention to substantiation — one popular quality aspect indicating whether the claims in a review are sufficiently supported by evidence — and provide a solution automating this evaluation process. To achieve this goal, we first formulate the problem as claim-evidence pair extraction in scientific peer reviews, and collect SubstanReview, the first annotated dataset for this task. SubstanReview consists of 550 reviews from NLP conferences annotated by domain experts. On the basis of this dataset, we train an argument mining system to automatically analyze the level of substantiation in peer reviews. We also perform data analysis on the SubstanReview dataset to obtain meaningful insights into peer reviewing quality in NLP conferences over recent years. The dataset is available at https://github.com/YanzhuGuo/SubstanReview.
Abstractive Meeting Summarization: A Survey
Virgile Rennard | Guokan Shang | Julie Hunter | Michalis Vazirgiannis
Transactions of the Association for Computational Linguistics, Volume 11
A system that could reliably identify and sum up the most important points of a conversation would be valuable in a wide variety of real-world contexts, from business meetings to medical consultations to customer service calls. Recent advances in deep learning, and especially the invention of encoder-decoder architectures, have significantly improved language generation systems, opening the door to improved forms of abstractive summarization — a form of summarization particularly well-suited for multi-party conversation. In this paper, we provide an overview of the challenges raised by the task of abstractive meeting summarization and of the datasets, models, and evaluation metrics that have been used to tackle these challenges.
2022
Political Communities on Twitter: Case Study of the 2022 French Presidential Election
Hadi Abdine | Yanzhu Guo | Virgile Rennard | Michalis Vazirgiannis
Proceedings of the LREC 2022 workshop on Natural Language Processing for Political Sciences
With the significant increase in users on social media platforms, a new means of political campaigning has appeared. Twitter and Facebook are now notable campaigning tools during elections. Indeed, the candidates and their parties now take to the internet to interact and spread their ideas. In this paper, we aim to identify political communities formed on Twitter during the 2022 French presidential election and analyze each respective community. We create a large-scale Twitter dataset containing 1.2 million users and 62.6 million tweets that mention keywords relevant to the election. We perform community detection on a retweet graph of users and propose an in-depth analysis of the stance of each community. Finally, we attempt to detect offensive tweets and automatic bots, comparing across communities in order to gain insight into each candidate’s supporter demographics and online campaign strategy.
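The community-detection step on a retweet graph described above can be sketched as follows. This is a minimal illustration on a toy graph, not the paper's exact pipeline: the Louvain algorithm, the edge weights, and the user names are all assumptions for demonstration purposes.

```python
# Toy retweet graph: an edge (u, v, w) means user u retweeted user v
# a total of w times. Real data in the paper covers 1.2M users.
import networkx as nx

retweets = [
    ("alice", "cand_A", 5), ("bob", "cand_A", 3), ("carol", "alice", 2),
    ("dave", "cand_B", 4), ("erin", "cand_B", 6), ("frank", "dave", 1),
]

# Louvain operates on an undirected weighted graph.
G = nx.Graph()
for u, v, w in retweets:
    G.add_edge(u, v, weight=w)

# Louvain community detection (available in NetworkX >= 2.8).
communities = nx.community.louvain_communities(G, weight="weight", seed=42)
print([sorted(c) for c in communities])
```

On this toy graph, users clustered around each candidate land in separate communities; the stance analysis and bot detection in the paper would then be run per community.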
2021
BERTweetFR : Domain Adaptation of Pre-Trained Language Models for French Tweets
Yanzhu Guo | Virgile Rennard | Christos Xypolopoulos | Michalis Vazirgiannis
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)
We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialised from CamemBERT, a general-domain French language model that follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks: offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task was created and annotated by our team, filling the gap left by the absence of such datasets in French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.