Chloé Clavel

Also published as: Chloe Clavel, Chloé Clavel

2026

Cross-lingual and cross-country approaches to argument component detection: a comparative study.
Cecilia Graiff | Chloé Clavel | Benoît Sagot
Proceedings of the First Workshop on Multilingual Multicultural Evaluation

Argument mining in multilingual settings has rarely been investigated, due to the lack of annotated resources and to the inherent difficulty of the task. We benchmark the performance of models on cross-lingual and cross-country argument component detection, focusing on political data from the US and France. To do so, we introduce FrenchPolArg, a corpus of argumentative political discourse in French, and we automatically translate already existing US-English resources. We benchmark three different cross-lingual and cross-country pipelines, and compare their results to find the best-performing one. We obtain promising results to be integrated in semi-automatic annotation workflows to reduce the time and cost of annotations.

2025

pdf bib abs

Graphically Speaking: Unmasking Abuse in Social Media with Conversation Insights
Célia Nouri | Jean-Philippe Cointet | Chloé Clavel
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Detecting abusive language in social media conversations poses significant challenges, as identifying abusiveness often depends on the conversational context, characterized by the content and topology of preceding comments. Traditional Abusive Language Detection (ALD) models often overlook this context, which can lead to unreliable performance metrics. Recent Natural Language Processing (NLP) approaches that incorporate conversational context often rely on limited or overly simplified representations of this context, leading to inconsistent and sometimes inconclusive results. In this paper, we propose a novel approach that utilizes graph neural networks (GNNs) to model social media conversations as graphs, where nodes represent comments, and edges capture reply structures. We systematically investigate various graph representations and context windows to identify the optimal configurations for ALD. Our GNN model outperforms both context-agnostic baselines and linear context-aware methods, achieving significant improvements in F1 scores. These findings demonstrate the critical role of structured conversational context and establish GNNs as a robust framework for advancing context-aware ALD.

pdf bib abs

Potentially Problematic Word Usages and How to Detect Them: A Survey
Aina Garí Soler | Matthieu Labeau | Chloé Clavel
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)

We introduce and explore the concept of potentially problematic word usages (PPWUs): word occurrences that are likely to cause communication breakdowns of a semantic nature. While much research has been devoted to lexical complexity, ambiguity, vagueness and related issues, no work has attempted to fully capture the intricate nature of PPWUs. We review linguistic factors, datasets and metrics that can be helpful for PPWU detection. We also discuss challenges to their study, such as their complexity and subjectivity, and highlight the need for future work on this phenomenon.

pdf bib abs

EmoDynamiX: Emotional Support Dialogue Strategy Prediction by Modelling MiXed Emotions and Discourse Dynamics
Chenwei Wan | Matthieu Labeau | Chloé Clavel
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Designing emotionally intelligent conversational systems to provide comfort and advice to people experiencing distress is a compelling area of research. Recently, with advancements in large language models (LLMs), end-to-end dialogue agents without explicit strategy prediction steps have become prevalent. However, implicit strategy planning lacks transparency, and recent studies show that LLMs’ inherent preference bias towards certain socio-emotional strategies hinders the delivery of high-quality emotional support. To address this challenge, we propose decoupling strategy prediction from language generation, and introduce a novel dialogue strategy prediction framework, EmoDynamiX, which models the discourse dynamics between user fine-grained emotions and system strategies using a heterogeneous graph for better performance and transparency. Experimental results on two ESC datasets show EmoDynamiX outperforms previous state-of-the-art methods with a significant margin (better proficiency and lower preference bias). Our approach also exhibits better transparency by allowing backtracing of decision making.

pdf bib abs

Understanding Social Interactions in the Era of LLMs – the Challenges of Transparency
Chloé Clavel
Proceedings of the 2nd LUHME Workshop

Research on AI and social interaction is not entirely new — it falls within the field of social and affective computing, which emerged in the late 1990s. To understand social interactions, the research community has long drawn on both artificial intelligence and social science. In recent years, however, the field has shifted toward a dominant focus on generative large language models (LLMs). These models are undeniably powerful but often opaque. In this talk, I will present our current work on developing machine learning approaches — from classical methods to LLMs — for modeling the socio-emotional layer of interaction, with a particular focus on improving model transparency. I will also briefly present some of the applications we are developing to support human skill development, particularly in the fields of education and health.

pdf bib abs

Benchmarking Linguistic Diversity of Large Language Models
Yanzhu Guo | Guokan Shang | Chloé Clavel
Transactions of the Association for Computational Linguistics, Volume 13

The development and evaluation of Large Language Models (LLMs) has primarily focused on their task-solving capabilities, with recent models even surpassing human performance in some areas. However, this focus often neglects whether machine-generated language matches the human level of diversity, in terms of vocabulary choice, syntactic construction, and expression of meaning, raising questions about whether the fundamentals of language generation have been fully addressed. This paper emphasizes the importance of examining the preservation of human linguistic richness by language models, given the concerning surge in online content produced or aided by LLMs. We adapt a comprehensive framework for evaluating LLMs from various linguistic diversity perspectives including lexical, syntactic, and semantic dimensions. Using this framework, we benchmark several state-of-the-art LLMs across all diversity dimensions, and conduct an in-depth analysis for syntactic diversity. Finally, we analyze how the design, development, and deployment choices of LLMs impact the linguistic diversity of their outputs, focusing on the creative task of story generation.

pdf bib abs

“Mm, Wat?” Detecting Other-initiated Repair Requests in Dialogue
Anh Ha Ngo | Nicolas Rollet | Catherine Pelachaud | Chloé Clavel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues, exploring multilingual and cross-context corpora to assess the robustness and generalizability.

pdf bib abs

Toward the Automatic Detection of Word Meaning Negotiation Indicators in Conversation
Aina Garí Soler | Matthieu Labeau | Chloé Clavel
Findings of the Association for Computational Linguistics: EMNLP 2025

Word Meaning Negotiations (WMN) are sequences in conversation where speakers collectively discuss and shape word meaning. These exchanges can provide insight into conversational dynamics and word-related misunderstandings, but they are hard to find in corpora. In order to facilitate data collection and speed up the WMN annotation process, we introduce the task of detecting WMN indicators – utterances where a speaker signals the need to clarify or challenge word meaning. We train a wide range of models and reveal the difficulty of the task. Our models have better precision than previous regular-expression based approaches and show some generalization abilities, but have moderate recall. However, this constitutes a promising first step toward an iterative process for obtaining more data.

pdf bib abs

EmoDynamiX : Prédiction de stratégies de dialogue pour le support émotionnel via la modélisation de mélange d’émotions et de la dynamique du discours
Chenwei Wan | Matthieu Labeau | Chloé Clavel
Actes des 32ème Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 2 : traductions d'articles publiés

Concevoir des systèmes conversationnels dotés d’une intelligence émotionnelle pour apporter du réconfort et des conseils aux personnes en détresse constitue un domaine de recherche particulièrement prometteur. Récemment, grâce aux avancées des grands modèles de langue (LLMs), les agents conversationnels entraînés de bout en bout sans étapes explicites de prédiction de stratégie de dialogue sont devenus plus courants. Cependant, la planification implicite de stratégie manque de transparence, et des études récentes montrent que la préférence inhérente des LLMs pour certaines stratégies socioémotionnelles nuit à la qualité du soutien émotionnel fourni. Pour relever ce défi, nous proposons de dissocier la prédiction de stratégies de la génération du langage et introduisons un nouveau cadre de prédiction de stratégie conversationnelle, EmoDynamiX, qui modélise la dynamique du discours entre les émotions fines du côté de l’utilisateur et les stratégies du système au moyen d’un graphe hétérogène, afin d’améliorer à la fois les performances et la transparence. Les résultats expérimentaux sur deux jeux de données de conversations pour le support émotionnel (ESC) montrent qu’EmoDynamiX surpasse de manière significative les méthodes antérieures à l’état de l’art (avec une meilleure maîtrise et un biais de préférence plus faible). Notre approche offre également une meilleure transparence en permettant de retracer le processus de prise de décision.

pdf bib abs

Décoder le pouvoir de persuasion dans les concours d’éloquence : une étude sur la capacité des modèles de langues à évaluer la prise de parole en public
Alisa Barkar | Mathieu Chollet | Matthieu Labeau | Beatrice Biancardi | Chloé Clavel
Actes de l'atelier Évaluation des modèles génératifs (LLM) et challenge 2025 (EvalLLM)

L’importance des compétences en prise de parole en public (PPP) stimule le développement de systèmes d’évaluation automatisée, mais l’intégration des grandes modèles de langue (LLMs) reste peu explorée. Nous proposons un cadre où les LLMs évaluent des critères issus de la littérature et de retours de formateurs. Nous testons trois approches : des prédictions LLM directes à zéro coup (RMSE 0, 8) par rapport à des prédictions de persuasion basées sur des caractéristiques lexicales fabriquées à la main (RMSE 0, 51) ou basées sur des critères évalués par LLM 0, 6 insérés en entrée dans ElasticNet. L’analyse des liens entre critères et caractéristiques lexicales montre que seul le critère de niveau de langue évalué par LLM est prévisible (score F1 de 0, 56) soulignant les limites actuelles des LLMs pour l’analyse de la PPP. Code source et données disponibles sur GitHub.

2024

pdf bib abs

Exploration of Human Repair Initiation in Task-oriented Dialogue: A Linguistic Feature-based Approach
Anh Ngo | Dirk Heylen | Nicolas Rollet | Catherine Pelachaud | Chloé Clavel
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

In daily conversations, people often encounter problems prompting conversational repair to enhance mutual understanding. By employing an automatic coreference solver, alongside examining repetition, we identify various linguistic features that distinguish turns when the addressee initiates repair from those when they do not. Our findings reveal distinct patterns that characterize the repair sequence and each type of repair initiation.

pdf bib abs

MAFALDA: A Benchmark and Comprehensive Study of Fallacy Detection and Classification
Chadi Helwe | Tom Calamai | Pierre-Henri Paris | Chloé Clavel | Fabian Suchanek
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

We introduce MAFALDA, a benchmark for fallacy classification that merges and unites previous fallacy datasets. It comes with a taxonomy that aligns, refines, and unifies existing classifications of fallacies. We further provide a manual annotation of a part of the dataset together with manual explanations for each annotation. We propose a new annotation scheme tailored for subjective NLP tasks, and a new evaluation method designed to handle subjectivity. We then evaluate several language models under a zero-shot learning setting and human performances on MAFALDA to assess their capability to detect and classify fallacies.

pdf bib abs

The Impact of Word Splitting on the Semantic Content of Contextualized Word Representations
Aina Garí Soler | Matthieu Labeau | Chloé Clavel
Transactions of the Association for Computational Linguistics, Volume 12

When deriving contextualized word representations from language models, a decision needs to be made on how to obtain one for out-of-vocabulary (OOV) words that are segmented into subwords. What is the best way to represent these words with a single vector, and are these representations of worse quality than those of in-vocabulary words? We carry out an intrinsic evaluation of embeddings from different models on semantic similarity tasks involving OOV words. Our analysis reveals, among other interesting findings, that the quality of representations of words that are split is often, but not always, worse than that of the embeddings of known words. Their similarity values, however, must be interpreted with caution.

pdf bib abs

The Curious Decline of Linguistic Diversity: Training Language Models on Synthetic Text
Yanzhu Guo | Guokan Shang | Michalis Vazirgiannis | Chloé Clavel
Findings of the Association for Computational Linguistics: NAACL 2024

This study investigates the consequences of training language models on synthetic data generated by their predecessors, an increasingly prevalent practice given the prominence of powerful generative models. Diverging from the usual emphasis on performance metrics, we focus on the impact of this training methodology on linguistic diversity, especially when conducted recursively over time. To assess this, we adapt and develop a set of novel metrics targeting lexical, syntactic, and semantic diversity, applying them in recursive finetuning experiments across various natural language generation tasks in English. Our findings reveal a consistent decrease in the diversity of the model outputs through successive iterations, especially remarkable for tasks demanding high levels of creativity. This trend underscores the potential risks of training language models on synthetic text, particularly concerning the preservation of linguistic richness. Our study highlights the need for careful consideration of the long-term effects of such training approaches on the linguistic capabilities of language models.

pdf bib abs

Do Language Models Enjoy Their Own Stories? Prompting Large Language Models for Automatic Story Evaluation
Cyril Chhun | Fabian M. Suchanek | Chloé Clavel
Transactions of the Association for Computational Linguistics, Volume 12

Storytelling is an integral part of human experience and plays a crucial role in social interactions. Thus, Automatic Story Evaluation (ASE) and Generation (ASG) could benefit society in multiple ways, but they are challenging tasks which require high-level human abilities such as creativity, reasoning, and deep understanding. Meanwhile, Large Language Models (LLMs) now achieve state-of-the-art performance on many NLP tasks. In this paper, we study whether LLMs can be used as substitutes for human annotators for ASE. We perform an extensive analysis of the correlations between LLM ratings, other automatic measures, and human annotations, and we explore the influence of prompting on the results and the explainability of LLM behaviour. Most notably, we find that LLMs outperform current automatic measures for system-level evaluation but still struggle at providing satisfactory explanations for their answers.

2023

pdf bib abs

When to generate hedges in peer-tutoring interactions
Alafate Abulimiti | Chloé Clavel | Justine Cassell
Proceedings of the 24th Annual Meeting of the Special Interest Group on Discourse and Dialogue

This paper explores the application of machine learning techniques to predict where hedging occurs in peer-tutoring interactions. The study uses a naturalistic face-to-face dataset annotated for natural language turns, conversational strategies, tutoring strategies, and nonverbal behaviors. These elements are processed into a vector representation of the previous turns, which serves as input to several machine learning models, including MLP and LSTM. The results show that embedding layers, capturing the semantic information of the previous turns, significantly improves the model’s performance. Additionally, the study provides insights into the importance of various features, such as interpersonal rapport and nonverbal behaviors, in predicting hedges by using Shapley values for feature explanation. We discover that the eye gaze of both the tutor and the tutee has a significant impact on hedge prediction. We further validate this observation through a follow-up ablation study.

pdf bib abs

Protocole d’annotation multi-label pour une nouvelle approche à la génération de réponse socio-émotionnelle orientée-tâche
Lorraine Vanel | Alya Yacoubi | Chloe Clavel
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 1 : travaux de recherche originaux -- articles longs

Depuis l’apparition des systèmes conversationnels, la modélisation des comportements humains constitue un axe de recherche majeur afin de renforcer l’expression des attributs émotionnels de ces systèmes. En nous intéressant aux agents conversationnels génératifs orientés-tâches, nous proposons une nouvelle approche pour rendre la réponse générée plus pertinente au contexte émotionnel de l’interlocuteur. Cette approche consiste à ajouter une étape supplémentaire de prédiction de labels pour conditionner la réponse générée et assurer sa pertinence au contexte socio-émotionnel de l’utilisateur. Nous proposons une formulation de cette nouvelle tâche de prédiction en nous appuyant sur un protocole d’annotation de données que nous avons conçu et implémenté. À travers cet article, nous apportons les contributions suivantes: la formulation de la tâche de prédiction de labels socio-émotionnels et la description du protocole d’annotation associé. Avec cette méthodologie, nous visons à développer des systèmes conversationnels socialement pertinents et indépendants.

pdf bib abs

Un mot, deux facettes : traces des opinions dans les représentations contextualisées des mots
Aina Garí Soler | Matthieu Labeau | Chloe Clavel
Actes de CORIA-TALN 2023. Actes de la 30e Conférence sur le Traitement Automatique des Langues Naturelles (TALN), volume 4 : articles déjà soumis ou acceptés en conférence internationale

La façon dont nous utilisons les mots est influencée par notre opinion. Nous cherchons à savoir si cela se reflète dans les plongements de mots contextualisés. Par exemple, la représentation d’ « animal » est-elle différente pour les gens qui voudraient abolir les zoos et ceux qui ne le voudraient pas ? Nous explorons cette question du point de vue du changement sémantique des mots. Nos expériences avec des représentations dérivées d’ensembles de données annotés avec les points de vue révèlent des différences minimes, mais significatives, entre postures opposées.

pdf bib abs

How About Kind of Generating Hedges using End-to-End Neural Models?
Alafate Abulimiti | Chloé Clavel | Justine Cassell
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Hedging is a strategy for softening the impact of a statement in conversation. In reducing the strength of an expression, it may help to avoid embarrassment (more technically, “face threat”) to one’s listener. For this reason, it is often found in contexts of instruction, such as tutoring. In this work, we develop a model of hedge generation based on i) fine-tuning state-of-the-art language models trained on human-human tutoring data, followed by ii) reranking to select the candidate that best matches the expected hedging strategy within a candidate pool using a hedge classifier. We apply this method to a natural peer-tutoring corpus containing a significant number of disfluencies, repetitions, and repairs. The results show that generation in this noisy environment is feasible with reranking. By conducting an error analysis for both approaches, we reveal the challenges faced by systems attempting to accomplish both social and task-oriented goals in conversation.

pdf bib abs

Automatic Analysis of Substantiation in Scientific Peer Reviews
Yanzhu Guo | Guokan Shang | Virgile Rennard | Michalis Vazirgiannis | Chloé Clavel
Findings of the Association for Computational Linguistics: EMNLP 2023

With the increasing amount of problematic peer reviews in top AI conferences, the community is urgently in need of automatic quality control measures. In this paper, we restrict our attention to substantiation — one popular quality aspect indicating whether the claims in a review are sufficiently supported by evidence — and provide a solution automatizing this evaluation process. To achieve this goal, we first formulate the problem as claim-evidence pair extraction in scientific peer reviews, and collect SubstanReview, the first annotated dataset for this task. SubstanReview consists of 550 reviews from NLP conferences annotated by domain experts. On the basis of this dataset, we train an argument mining system to automatically analyze the level of substantiation in peer reviews. We also perform data analysis on the SubstanReview dataset to obtain meaningful insights on peer reviewing quality in NLP conferences over recent years. The dataset is available at https://github.com/YanzhuGuo/SubstanReview.

pdf bib abs

Measuring Lexico-Semantic Alignment in Debates with Contextualized Word Representations
Aina Garí Soler | Matthieu Labeau | Chloé Clavel
Proceedings of the First Workshop on Social Influence in Conversations (SICon 2023)

Dialog participants sometimes align their linguistic styles, e.g., they use the same words and syntactic constructions as their interlocutors. We propose to investigate the notion of lexico-semantic alignment: to what extent do speakers convey the same meaning when they use the same words? We design measures of lexico-semantic alignment relying on contextualized word representations. We show that they reflect interesting semantic differences between the two sides of a debate and that they can assist in the task of debate’s winner prediction.

2022

pdf bib abs

Opinions in Interactions : New Annotations of the SEMAINE Database
Valentin Barriere | Slim Essid | Chloé Clavel
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we present the process we used in order to collect new annotations of opinions over the multimodal corpus SEMAINE composed of dyadic interactions. The dataset had already been annotated continuously in two affective dimensions related to the emotions: Valence and Arousal. We annotated the part of SEMAINE called Solid SAL composed of 79 interactions between a user and an operator playing the role of a virtual agent designed to engage a person in a sustained, emotionally colored conversation. We aligned the audio at the word level using the available high-quality manual transcriptions. The annotated dataset contains 5627 speech turns for a total of 73,944 words, corresponding to 6 hours 20 minutes of dyadic interactions. Each interaction has been labeled by three annotators at the speech turn level following a three-step process. This method allows us to obtain a precise annotation regarding the opinion of a speaker. We obtain thus a dataset dense in opinions, with more than 48% of the annotated speech turns containing at least one opinion. We then propose a new baseline for the detection of opinions in interactions improving slightly a state of the art model with RoBERTa embeddings. The obtained results on the database are promising with a F1-score at 0.72.

pdf bib abs

Studying Alignment in a Collaborative Learning Activity via Automatic Methods: The Link Between What We Say and Do
Utku Norman | Tanvi Dinkar | Barbara Bruno | Chloé Clavel
Dialogue & Discourse Volume 13

A dialogue is successful when there is alignment between the speakers, at different linguistic levels. In this work, we consider the dialogue occurring between interlocutors engaged in a collaborative learning task, where they are evaluated on how well they performed and how much they learnt. Our main contribution is to propose new automatic measures to study alignment; focusing on lexical alignment, and a new alignment context that we introduce termed as behavioural alignment (when an instruction given by one interlocutor was followed with concrete actions in a physical environment by another). Thus we propose methodologies to create a link between what was said, and what was done as a consequence. To do so, we focus on expressions related to the task in the situated activity. These expressions are minimally required by the interlocutors to make progress in the task. We then observe how these local alignment contexts build to dialogue level phenomena; success in the task. What distinguishes our approach from other works, is the treatment of alignment as a procedure that occurs in stages. Since we utilise a dataset of spontaneous speech dialogues elicited from children, a second contribution of our work is to study how spontaneous speech phenomena (such as when interlocutors say "uh", "oh" ...) are used in the process of alignment. Lastly, we make public the dataset to study alignment in educational dialogues. Our results show that all teams lexically and behaviourally align to some degree regardless of their performance and learning, and our measures capture that teams that did not succeed in the task were simply slower to collaborate. Thus we find that teams that performed better, were faster to align. Furthermore, our methodology captures a productive, collaborative period that includes the time where the interlocutors came up with their best solutions. We also find that well-performing teams verbalise the marker "oh" more when they are behaviourally aligned, compared to other times in the dialogue; showing that this marker is an important cue in alignment. To the best of our knowledge, we are the first to study the role of "oh" as an information management marker in a behavioural context (i.e. in connection to actions taken in a physical environment), compared to only a verbal one. Our measures contribute to the research in the field of educational dialogue and the intersection between dialogue and collaborative learning research.

pdf bib abs

Polysemy in Spoken Conversations and Written Texts
Aina Garí Soler | Matthieu Labeau | Chloé Clavel
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Our discourses are full of potential lexical ambiguities, due in part to the pervasive use of words having multiple senses. Sometimes, one word may even be used in more than one sense throughout a text. But, to what extent is this true for different kinds of texts? Does the use of polysemous words change when a discourse involves two people, or when speakers have time to plan what to say? We investigate these questions by comparing the polysemy level of texts of different nature, with a focus on spontaneous spoken dialogs; unlike previous work which examines solely scripted, written, monolog-like data. We compare multiple metrics that presuppose different conceptualizations of text polysemy, i.e., they consider the observed or the potential number of senses of words, or their sense distribution in a discourse. We show that the polysemy level of texts varies greatly depending on the kind of text considered, with dialog and spoken discourses having generally a higher polysemy level than written monologs. Additionally, our results emphasize the need for relaxing the popular “one sense per discourse” hypothesis.

pdf bib abs

EZCAT: an Easy Conversation Annotation Tool
Gaël Guibon | Luce Lefeuvre | Matthieu Labeau | Chloé Clavel
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Users generate content constantly, leading to new data requiring annotation. Among this data, textual conversations are created every day and come with some specificities: they are mostly private through instant messaging applications, requiring the conversational context to be labeled. These specificities led to several annotation tools dedicated to conversation, and mostly dedicated to dialogue tasks, requiring complex annotation schemata, not always customizable and not taking into account conversation-level labels. In this paper, we present EZCAT, an easy-to-use interface to annotate conversations in a two-level configurable schema, leveraging message-level labels and conversation-level labels. Our interface is characterized by the voluntary absence of a server and accounts management, enhancing its availability to anyone, and the control over data, which is crucial to confidential conversations. We also present our first usage of EZCAT along with our annotation schema we used to annotate confidential customer service conversations. EZCAT is freely available at https://gguibon.github.io/ezcat.

pdf bib abs

One Word, Two Sides: Traces of Stance in Contextualized Word Representations
Aina Garí Soler | Matthieu Labeau | Chloé Clavel
Proceedings of the 29th International Conference on Computational Linguistics

The way we use words is influenced by our opinion. We investigate whether this is reflected in contextualized word embeddings. For example, is the representation of “animal” different between people who would abolish zoos and those who would not? We explore this question from a Lexical Semantic Change standpoint. Our experiments with BERT embeddings derived from datasets with stance annotations reveal small but significant differences in word representations between opposing stances.

pdf bib abs

Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency
Yanzhu Guo | Chloé Clavel | Moussa Kamal Eddine | Michalis Vazirgiannis
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

The topic of summarization evaluation has recently attracted a surge of attention due to the rapid development of abstractive summarization systems. However, the formulation of the task is rather ambiguous, neither the linguistic nor the natural language processing communities have succeeded in giving a mutually agreed-upon definition. Due to this lack of well-defined formulation, a large number of popular abstractive summarization datasets are constructed in a manner that neither guarantees validity nor meets one of the most essential criteria of summarization: factual consistency. In this paper, we address this issue by combining state-of-the-art factual consistency models to identify the problematic instances present in popular summarization datasets. We release SummFC, a filtered summarization dataset with improved factual consistency, and demonstrate that models trained on this dataset achieve improved performance in nearly all quality aspects. We argue that our dataset should become a valid benchmark for developing and evaluating summarization systems.

pdf bib abs

LogiTorch: A PyTorch-based library for logical reasoning on natural language
Chadi Helwe | Chloé Clavel | Fabian Suchanek
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Logical reasoning on natural language is one of the most challenging tasks for deep learning models. There has been an increasing interest in developing new benchmarks to evaluate the reasoning capabilities of language models such as BERT. In parallel, new models based on transformers have emerged to achieve ever better performance on these datasets. However, there is currently no library for logical reasoning that includes such benchmarks and models. This paper introduces LogiTorch, a PyTorch-based library that includes different logical reasoning benchmarks, different models, as well as utility functions such as co-reference resolution. This makes it easy to directly use the preprocessed datasets, to run the models, or to finetune them with different hyperparameters. LogiTorch is open source and can be found on GitHub.

pdf bib

Fillers in Spoken Language Understanding: Computational and Psycholinguistic Perspectives
Tanvi Dinkar | Chloé Clavel | Ioana Vasilescu
Traitement Automatique des Langues, Volume 63, Numéro 3 : Etats de l'art en TAL [Review articles in NLP]

pdf bib abs

“You might think about slightly revising the title”: Identifying Hedges in Peer-tutoring Interactions
Yann Raphalen | Chloé Clavel | Justine Cassell
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Hedges have an important role in the management of rapport. In peer-tutoring, they are notably used by tutors in dyads experiencing low rapport to tone down the impact of instructions and negative feedback. Pursuing the objective of building a tutoring agent that manages rapport with teenagers in order to improve learning, we used a multimodal peer-tutoring dataset to construct a computational framework for identifying hedges. We compared approaches relying on pre-trained resources with others that integrate insights from the social science literature. Our best performance involved a hybrid approach that outperforms the existing baseline while being easier to interpret. We employ a model explainability tool to explore the features that characterize hedges in peer-tutoring conversations, and we identify some novel features, and the benefits of a such a hybrid model approach.

pdf bib abs

TINA: Textual Inference with Negation Augmentation
Chadi Helwe | Simon Coumes | Chloé Clavel | Fabian Suchanek
Findings of the Association for Computational Linguistics: EMNLP 2022

Transformer-based language models achieve state-of-the-art results on several natural language processing tasks. One of these is textual entailment, i.e., the task of determining whether a premise logically entails a hypothesis. However, the models perform poorly on this task when the examples contain negations. In this paper, we propose a new definition of textual entailment that captures also negation. This allows us to develop TINA (Textual Inference with Negation Augmentation), a principled technique for negated data augmentation that can be combined with the unlikelihood loss function.Our experiments with different transformer-based models show that our method can significantly improve the performance of the models on textual entailment datasets with negation – without sacrificing performance on datasets without negation.

pdf bib abs

Of Human Criteria and Automatic Metrics: A Benchmark of the Evaluation of Story Generation
Cyril Chhun | Pierre Colombo | Fabian M. Suchanek | Chloé Clavel
Proceedings of the 29th International Conference on Computational Linguistics

Research on Automatic Story Generation (ASG) relies heavily on human and automatic evaluation. However, there is no consensus on which human evaluation criteria to use, and no analysis of how well automatic criteria correlate with them. In this paper, we propose to re-evaluate ASG evaluation. We introduce a set of 6 orthogonal and comprehensive human criteria, carefully motivated by the social sciences literature. We also present HANNA, an annotated dataset of 1,056 stories produced by 10 different ASG systems. HANNA allows us to quantitatively evaluate the correlations of 72 automatic metrics with human criteria. Our analysis highlights the weaknesses of current metrics for ASG and allows us to formulate practical recommendations for ASG evaluation.

2021

pdf bib abs

Improving Multimodal fusion via Mutual Dependency Maximisation
Pierre Colombo | Emile Chapuis | Matthieu Labeau | Chloé Clavel
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Multimodal sentiment analysis is a trending area of research, and multimodal fusion is one of its most active topic. Acknowledging humans communicate through a variety of channels (i.e visual, acoustic, linguistic), multimodal systems aim at integrating different unimodal representations into a synthetic one. So far, a consequent effort has been made on developing complex architectures allowing the fusion of these modalities. However, such systems are mainly trained by minimising simple losses such as L₁ or cross-entropy. In this work, we investigate unexplored penalties and propose a set of new objectives that measure the dependency between modalities. We demonstrate that our new penalties lead to a consistent improvement (up to 4.3 on accuracy) across a large variety of state-of-the-art models on two well-known sentiment analysis datasets: CMU-MOSI and CMU-MOSEI. Our method not only achieves a new SOTA on both datasets but also produces representations that are more robust to modality drops. Finally, a by-product of our methods includes a statistical network which can be used to interpret the high dimensional representations learnt by the model.

pdf bib

Beam Search with Bidirectional Strategies for Neural Response Generation
Pierre Colombo | Chloé Clavel | Chouchang Yack | Giovanna Varni
Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021)

pdf bib abs

Few-Shot Emotion Recognition in Conversation with Sequential Prototypical Networks
Gaël Guibon | Matthieu Labeau | Hélène Flamein | Luce Lefeuvre | Chloé Clavel
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Several recent studies on dyadic human-human interactions have been done on conversations without specific business objectives. However, many companies might benefit from studies dedicated to more precise environments such as after sales services or customer satisfaction surveys. In this work, we place ourselves in the scope of a live chat customer service in which we want to detect emotions and their evolution in the conversation flow. This context leads to multiple challenges that range from exploiting restricted, small and mostly unlabeled datasets to finding and adapting methods for such context. We tackle these challenges by using Few-Shot Learning while making the hypothesis it can serve conversational emotion classification for different languages and sparse labels. We contribute by proposing a variation of Prototypical Networks for sequence labeling in conversation that we name ProtoSeq. We test this method on two datasets with different languages: daily conversations in English and customer service chat conversations in French. When applied to emotion classification in conversations, our method proved to be competitive even when compared to other ones.

pdf bib abs

Automatic Text Evaluation through the Lens of Wasserstein Barycenters
Pierre Colombo | Guillaume Staerman | Chloé Clavel | Pablo Piantanida
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

A new metric BaryScore to evaluate text generation based on deep contextualized embeddings (e.g., BERT, Roberta, ELMo) is introduced. This metric is motivated by a new framework relying on optimal transport tools, i.e., Wasserstein distance and barycenter. By modelling the layer output of deep contextualized embeddings as a probability distribution rather than by a vector embedding; this framework provides a natural way to aggregate the different outputs through the Wasserstein space topology. In addition, it provides theoretical grounds to our metric and offers an alternative to available solutions (e.g., MoverScore and BertScore). Numerical evaluation is performed on four different tasks: machine translation, summarization, data2text generation and image captioning. Our results show that BaryScore outperforms other BERT based metrics and exhibits more consistent behaviour in particular for text summarization.

pdf bib abs

Meta-learning for Classifying Previously Unseen Data Source into Previously Unseen Emotional Categories
Gaël Guibon | Matthieu Labeau | Hélène Flamein | Luce Lefeuvre | Chloé Clavel
Proceedings of the 1st Workshop on Meta Learning and Its Applications to Natural Language Processing

In this paper, we place ourselves in a classification scenario in which the target classes and data type are not accessible during training. We use a meta-learning approach to determine whether or not meta-trained information from common social network data with fine-grained emotion labels can achieve competitive performance on messages labeled with different emotion categories. We leverage few-shot learning to match with the classification scenario and consider metric learning based meta-learning by setting up Prototypical Networks with a Transformer encoder, trained in an episodic fashion. This approach proves to be effective for capturing meta-information from a source emotional tag set to predict previously unseen emotional tags. Even though shifting the data type triggers an expected performance drop, our meta-learning approach achieves decent results when compared to the fully supervised one.

pdf bib abs

Code-switched inspired losses for spoken dialog representations
Pierre Colombo | Emile Chapuis | Matthieu Labeau | Chloé Clavel
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Spoken dialogue systems need to be able to handle both multiple languages and multilinguality inside a conversation (e.g in case of code-switching). In this work, we introduce new pretraining losses tailored to learn generic multilingual spoken dialogue representations. The goal of these losses is to expose the model to code-switched language. In order to scale up training, we automatically build a pretraining corpus composed of multilingual conversations in five different languages (French, Italian, English, German and Spanish) from OpenSubtitles, a huge multilingual corpus composed of 24.3G tokens. We test the generic representations on MIAM, a new benchmark composed of five dialogue act corpora on the same aforementioned languages as well as on two novel multilingual tasks (i.e multilingual mask utterance retrieval and multilingual inconsistency identification). Our experiments show that our new losses achieve a better performance in both monolingual and multilingual settings.

pdf bib abs

A Novel Estimator of Mutual Information for Learning to Disentangle Textual Representations
Pierre Colombo | Pablo Piantanida | Chloé Clavel
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Learning disentangled representations of textual data is essential for many natural language tasks such as fair classification, style transfer and sentence generation, among others. The existent dominant approaches in the context of text data either rely on training an adversary (discriminator) that aims at making attribute values difficult to be inferred from the latent code or rely on minimising variational bounds of the mutual information between latent code and the value attribute. However, the available methods suffer of the impossibility to provide a fine-grained control of the degree (or force) of disentanglement. In contrast to adversarial methods, which are remarkably simple, although the adversary seems to be performing perfectly well during the training phase, after it is completed a fair amount of information about the undesired attribute still remains. This paper introduces a novel variational upper bound to the mutual information between an attribute and the latent code of an encoder. Our bound aims at controlling the approximation error via the Renyi’s divergence, leading to both better disentangled representations and in particular, a precise control of the desirable degree of disentanglement than state-of-the-art methods proposed for textual data. Furthermore, it does not suffer from the degeneracy of other losses in multi-class scenarios. We show the superiority of this method on fair classification and on textual style transfer tasks. Additionally, we provide new insights illustrating various trade-offs in style transfer when attempting to learn disentangled representations and quality of the generated sentence.

pdf bib

From local hesitations to global impressions of a speaker’s feeling of knowing
Tanvi Dinkar | Beatrice Biancardi | Chloé Clavel
Proceedings of the 4th International Conference on Natural Language and Speech Processing (ICNLSP 2021)

pdf bib abs

Méta-apprentissage : classification de messages en catégories émotionnelles inconnues en entraînement (Meta-learning : Classifying Messages into Unseen Emotional Categories)
Gaël Guibon | Matthieu Labeau | Hélène Flamein | Luce Lefeuvre | Chloé Clavel
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Dans cet article nous reproduisons un scénario d’apprentissage selon lequel les données cibles ne sont pas accessibles et seules des données connexes le sont. Nous utilisons une approche par méta-apprentissage afin de déterminer si les méta-informations apprises à partir de messages issus de médias sociaux, finement annotés en émotions, peuvent produire de bonnes performances une fois utilisées sur des messages issus de conversations, étiquetés en émotions avec une granularité différente. Nous mettons à profit l’apprentissage sur quelques exemples (few-shot learning) pour la mise en place de ce scénario. Cette approche se montre efficace pour capturer les méta-informations d’un jeu d’étiquettes émotionnelles pour prédire des étiquettes jusqu’alors inconnues au modèle. Bien que le fait de varier le type de données engendre une baisse de performance, notre approche par méta-apprentissage atteint des résultats décents comparés au référentiel d’apprentissage supervisé.

2020

pdf bib abs

The importance of fillers for text representations of speech transcripts
Tanvi Dinkar | Pierre Colombo | Matthieu Labeau | Chloé Clavel
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

While being an essential component of spoken language, fillers (e.g. “um” or “uh”) often remain overlooked in Spoken Language Understanding (SLU) tasks. We explore the possibility of representing them with deep contextualised embeddings, showing improvements on modelling spoken language and two downstream tasks — predicting a speaker’s stance and expressed confidence.

pdf bib abs

The POTUS Corpus, a Database of Weekly Addresses for the Study of Stance in Politics and Virtual Agents
Thomas Janssoone | Kévin Bailly | Gaël Richard | Chloé Clavel
Proceedings of the Twelfth Language Resources and Evaluation Conference

One of the main challenges in the field of Embodied Conversational Agent (ECA) is to generate socially believable agents. The common strategy for agent behaviour synthesis is to rely on dedicated corpus analysis. Such a corpus is composed of multimedia files of socio-emotional behaviors which have been annotated by external observers. The underlying idea is to identify interaction information for the agent’s socio-emotional behavior by checking whether the intended socio-emotional behavior is actually perceived by humans. Then, the annotations can be used as learning classes for machine learning algorithms applied to the social signals. This paper introduces the POTUS Corpus composed of high-quality audio-video files of political addresses to the American people. Two protagonists are present in this database. First, it includes speeches of former president Barack Obama to the American people. Secondly, it provides videos of these same speeches given by a virtual agent named Rodrigue. The ECA reproduces the original address as closely as possible using social signals automatically extracted from the original one. Both are annotated for social attitudes, providing information about the stance observed in each file. It also provides the social signals automatically extracted from Obama’s addresses used to generate Rodrigue’s ones.

pdf bib abs

Hierarchical Pre-training for Sequence Labelling in Spoken Dialog
Emile Chapuis | Pierre Colombo | Matteo Manica | Matthieu Labeau | Chloé Clavel
Findings of the Association for Computational Linguistics: EMNLP 2020

Sequence labelling tasks like Dialog Act and Emotion/Sentiment identification are a key component of spoken dialog systems. In this work, we propose a new approach to learn generic representations adapted to spoken dialog, which we evaluate on a new benchmark we call Sequence labellIng evaLuatIon benChmark fOr spoken laNguagE benchmark (SILICONE). SILICONE is model-agnostic and contains 10 different datasets of various sizes. We obtain our representations with a hierarchical encoder based on transformer architectures, for which we extend two well-known pre-training objectives. Pre-training is performed on OpenSubtitles: a large corpus of spoken dialog containing over 2.3 billion of tokens. We demonstrate how hierarchical encoders achieve competitive results with consistently fewer parameters compared to state-of-the-art models and we show their importance for both pre-training and fine-tuning.

pdf bib abs

Multimodal Analysis of Cohesion in Multi-party Interactions
Reshmashree Bangalore Kantharaju | Caroline Langlet | Mukesh Barange | Chloé Clavel | Catherine Pelachaud
Proceedings of the Twelfth Language Resources and Evaluation Conference

Group cohesion is an emergent phenomenon that describes the tendency of the group members’ shared commitment to group tasks and the interpersonal attraction among them. This paper presents a multimodal analysis of group cohesion using a corpus of multi-party interactions. We utilize 16 two-minute segments annotated with cohesion from the AMI corpus. We define three layers of modalities: non-verbal social cues, dialogue acts and interruptions. The initial analysis is performed at the individual level and later, we combine the different modalities to observe their impact on perceived level of cohesion. Results indicate that occurrence of laughter and interruption are higher in high cohesive segments. We also observe that, dialogue acts and head nods did not have an impact on the level of cohesion by itself. However, when combined there was an impact on the perceived level of cohesion. Overall, the analysis shows that multimodal cues are crucial for accurate analysis of group cohesion.

2019

pdf bib abs

From the Token to the Review: A Hierarchical Multimodal approach to Opinion Mining
Alexandre Garcia | Pierre Colombo | Florence d’Alché-Buc | Slim Essid | Chloé Clavel
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The task of predicting fine grained user opinion based on spontaneous spoken language is a key problem arising in the development of Computational Agents as well as in the development of social network based opinion miners. Unfortunately, gathering reliable data on which a model can be trained is notoriously difficult and existing works rely only on coarsely labeled opinions. In this work we aim at bridging the gap separating fine grained opinion models already developed for written language and coarse grained models developed for spontaneous multimodal opinion mining. We take advantage of the implicit hierarchical structure of opinions to build a joint fine and coarse grained opinion model that exploits different views of the opinion expression. The resulting model shares some properties with attention-based models and is shown to provide competitive results on a recently released multimodal fine grained annotated corpus.

2017

pdf bib abs

Automatic Measures to Characterise Verbal Alignment in Human-Agent Interaction
Guillaume Dubuisson Duplessis | Chloé Clavel | Frédéric Landragin
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

This work aims at characterising verbal alignment processes for improving virtual agent communicative capabilities. We propose computationally inexpensive measures of verbal alignment based on expression repetition in dyadic textual dialogues. Using these measures, we present a contrastive study between Human-Human and Human-Agent dialogues on a negotiation task. We exhibit quantitative differences in the strength and orientation of verbal alignment showing the ability of our approach to characterise important aspects of verbal alignment.

2015

pdf bib

Improving social relationships in face-to-face human-agent interactions: when the agent wants to know user’s likes and dislikes
Caroline Langlet | Chloé Clavel
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

2014

pdf bib

Modelling agent’s questions for analysing user’s affects, appreciations and judgements in human-agent interaction (Modélisation des questions de l’agent pour l’analyse des affects, jugements et appréciations de l’utilisateur dans les interactions humain-agent) [in French]
Caroline Langlet | Chloé Clavel
Proceedings of TALN 2014 (Volume 2: Short Papers)

pdf bib abs

Comparative analysis of verbal alignment in human-human and human-agent interactions
Sabrina Campano | Jessica Durand | Chloé Clavel
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

Engagement is an important feature in human-human and human-agent interaction. In this paper, we investigate lexical alignment as a cue of engagement, relying on two different corpora : CID and SEMAINE. Our final goal is to build a virtual conversational character that could use alignment strategies to maintain user’s engagement. To do so, we investigate two alignment processes : shared vocabulary and other-repetitions. A quantitative and qualitative approach is proposed to characterize these aspects in human-human (CID) and human-operator (SEMAINE) interactions. Our results show that these processes are observable in both corpora, indicating a stable pattern that can be further modelled in conversational agents.

2012

pdf bib

Quel est l’apport de la détection d’entités nommées pour l’extraction d’information en domaine restreint ? (What is the contribution of named entities detection for information extraction in restricted domain ?) [in French]
Camille Dutrey | Chloé Clavel | Sophie Rosset | Ioana Vasilescu | Martine Adda-Decker
Proceedings of the Joint Conference JEP-TALN-RECITAL 2012, volume 2: TALN

2010

pdf bib abs

L’apport des concepts métiers pour la classification des questions ouvertes d’enquête
Ludivine Kuznik | Anne-Laure Guénet | Anne Peradotto | Chloé Clavel
Actes de la 17e conférence sur le Traitement Automatique des Langues Naturelles. Articles courts

EDF utilise les techniques de Text Mining pour optimiser sa relation client, en analysant des réponses aux questions ouvertes d’enquête de satisfaction, et des retranscriptions de conversations issues des centres d’appels. Dans cet article, nous présentons les différentes contraintes applicatives liées à l’utilisation d’outils de text mining pour l’analyse de données clients. Après une analyse des différents outils présents sur le marché, nous avons identifié la technologie Skill CartridgeTM fournie par la société TEMIS comme la plus adaptée à nos besoins. Cette technologie nous permet une modélisation sémantique de concepts liés au motif d’insatisfaction. L’apport de cette modélisation est illustrée pour une tâche de classification de réponses d’enquêtes de satisfaction chargée d’évaluer la fidélité des clients EDF. La modélisation sémantique a permis une nette amélioration des scores de classification (F-mesure = 75,5%) notamment pour les catégories correspondant à la satisfaction et au mécontentement.

pdf bib

Extraction probabiliste de chaînes de mots relatives à une opinion [A probabilistic approach for extracting opinion-related word chains from texts]
Remi Lavalley | Chloe Clavel | Patrice Bellot
Traitement Automatique des Langues, Volume 51, Numéro 3 : Opinions, sentiments et jugements d’évaluation [Opinions, sentiment and evaluative language]

2006

pdf bib abs

Fear-type emotions of the SAFE Corpus: annotation issues
Chloé Clavel | Ioana Vasilescu | Laurence Devillers | Thibaut Ehrette | Gaël Richard
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

The present research focuses on annotation issues in the context of the acoustic detection of fear-type emotions for surveillance applications. The emotional speech material used for this study comes from the previously collected SAFE Database (Situation Analysis in a Fictional and Emotional Database) which consists of audio-visual sequences extracted from movie fictions. A generic annotation scheme was developed to annotate the various emotional manifestations contained in the corpus. The annotation was carried out by two labellers and the two annotations strategies are confronted. It emerges that the borderline between emotion and neutral vary according to the labeller. An acoustic validation by a third labeller allows at analysing the two strategies. Two human strategies are then observed: a first one, context-oriented which mixes audio and contextual (video) information in emotion categorization; and a second one, based mainly on audio information. The k-means clustering confirms the role of audio cues in human annotation strategies. It particularly helps in evaluating those strategies from the point of view of a detection system based on audio cues.