Auriane Boudin


2026

Audio and video tokenizers are autoencoders trained to represent the content of recordings as a sequence of vectors. They are widely used to interface large language models with non-textual modalities. While they enable advanced applications such as video generation, the extent of their limitations in the context of multimodal conversation is not known. This work focuses on backchannels, which listeners use to signal to the speaker that they are listening. This feedback is essential to maintaining the conversation flow. We evaluate, using linear probing, whether a representative set of audio and video tokenizers encode backchannels. Results show that although audio tokenizers capture the phenomenon relatively well, backchannels are not linearly separable in video tokenizer representations. However, joint representations obtained by concatenating representations from both modalities improve accuracy significantly over audio-only representations, suggesting that training multimodal tokenizers is a promising direction.
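The linear-probing protocol described above can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: the random arrays stand in for pooled token embeddings from frozen audio and video tokenizers, and the synthetic labels encode backchannel presence. Only the audio features carry signal here, so the probe's behavior mirrors the reported pattern by construction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

# Stand-ins for frozen tokenizer embeddings: in practice these would be
# pooled token vectors from pretrained audio and video tokenizers.
n, d_audio, d_video = 400, 64, 64
y = rng.integers(0, 2, size=n)                             # backchannel vs. not
audio = rng.normal(size=(n, d_audio)) + y[:, None] * 0.8   # audio carries signal
video = rng.normal(size=(n, d_video))                      # video carries none here

def probe(features, labels):
    """Fit a linear probe on frozen features; report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

acc_audio = probe(audio, y)
acc_video = probe(video, y)
# Joint representation: simple concatenation of both modalities.
acc_joint = probe(np.concatenate([audio, video], axis=1), y)
print(acc_audio, acc_video, acc_joint)
```

Because the probe is linear and the encoders stay frozen, any accuracy above chance reflects information already present in the representations rather than capacity added by the classifier.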

2024

In the realm of human communication, feedback plays a pivotal role in shaping the dynamics of conversations. This study delves into the multifaceted relationship between listener feedback, narration quality, and distraction effects. We present an analysis conducted on the SMYLE corpus, specifically enriched for this study, in which 30 dyads of participants engaged in 1) face-to-face storytelling (8.2 hours) followed by 2) a free conversation (7.8 hours). The storytelling task unfolds in two conditions, where a storyteller engages with either a “normal” or a “distracted” listener. Examining the impact of feedback on storytellers, we discover a positive correlation between the frequency of specific feedback and narration quality in the normal condition, an encouraging conclusion regarding the enhancement of interaction through specific feedback in distraction-free settings. In contrast, in the distracted condition, a negative correlation emerges, suggesting that increased specific feedback may disrupt narration quality and underscoring the complexity of feedback dynamics in human communication. The contribution of this paper is twofold: first, presenting a new and highly enriched resource for the analysis of discourse phenomena in controlled and normal conditions; second, providing new results on feedback production, its form, and its consequences for discourse quality (with direct applications in human-machine interaction).
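The condition-dependent correlation analysis described above can be illustrated with a short sketch. The numbers below are synthetic, not SMYLE data: each value stands in for one dyad's rate of specific feedback and a narration-quality rating, generated so that the trend is positive in one condition and negative in the other, mirroring the reported pattern by construction.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)

# Illustrative per-dyad values (not SMYLE data): specific-feedback rate
# per minute and a narration-quality score, one pair per condition.
feedback_rate = rng.uniform(0.5, 4.0, size=15)
quality_normal = 2.0 + 0.6 * feedback_rate + rng.normal(0, 0.3, 15)      # positive trend
quality_distracted = 5.0 - 0.6 * feedback_rate + rng.normal(0, 0.3, 15)  # negative trend

# Pearson correlation computed separately per condition.
r_normal, p_normal = pearsonr(feedback_rate, quality_normal)
r_distracted, p_distracted = pearsonr(feedback_rate, quality_distracted)
print(r_normal, r_distracted)
```

Running the correlation per condition, rather than pooled, is what lets the opposite-signed effects surface instead of canceling out.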

2022

The aim of this study is to investigate conversational feedback that contains smiles and laughs. Firstly, we propose a statistical analysis of smiles and laughs used as generic and specific feedback in a corpus of French talk-in-interaction. Our results show that smiles of low intensity are preferentially used to produce generic feedback, while high-intensity smiles and laughs are preferentially used to produce specific feedback. Secondly, based on a machine learning approach, we propose a hierarchical classification of feedback to automatically predict not only the presence or absence of a smile, but also the type of smile according to an intensity scale (low or high).
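The two-level prediction described above can be sketched as a hierarchical classifier: a first stage decides smile presence versus absence, and a second stage, trained only on smiling examples, decides low versus high intensity. This is a minimal illustration with synthetic features, not the paper's actual model or corpus; the two informative feature dimensions are assumptions made for the demo.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Toy descriptors standing in for feedback features; labels:
# 0 = no smile, 1 = low-intensity smile, 2 = high-intensity smile.
n, d = 600, 16
y = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, d))
X[:, 0] += (y > 0) * 3.0      # feature 0 separates smile vs. no smile
X[:, 1] += (y == 2) * 3.0     # feature 1 separates low vs. high intensity

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Stage 1: smile presence vs. absence.
stage1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr > 0)
# Stage 2: low vs. high intensity, trained only on smiling examples.
smile_mask = y_tr > 0
stage2 = LogisticRegression(max_iter=1000).fit(X_tr[smile_mask], y_tr[smile_mask] == 2)

# Hierarchical prediction: stage 2 applies only where stage 1 predicts a smile.
pred = np.zeros(len(y_te), dtype=int)
is_smile = stage1.predict(X_te)
pred[is_smile] = np.where(stage2.predict(X_te[is_smile]), 2, 1)

accuracy = (pred == y_te).mean()
print(accuracy)
```

Splitting the decision this way lets the intensity model train only on the subset where intensity is defined, instead of forcing one flat three-way classifier to handle both questions at once.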