Stéphane Dupont


2022

Analysis of Co-Laughter Gesture Relationship on RGB Videos in Dyadic Conversation Context
Hugo Bohy | Ahmad Hammoudeh | Antoine Maiorca | Stéphane Dupont | Thierry Dutoit
Proceedings of the Workshop on Smiling and Laughter across Contexts and the Life-span within the 13th Language Resources and Evaluation Conference

The development of virtual agents has enabled human-avatar interactions to become increasingly rich and varied. Moreover, an expressive virtual agent, i.e. one that mimics the natural expression of emotions, enhances social interaction between a user (human) and an agent (intelligent machine). The set of non-verbal behaviors of a virtual character is, therefore, an important component in the context of human-machine interaction. Laughter is not just an audio signal but an intrinsically multimodal form of non-verbal communication: in addition to audio, it includes facial expressions and body movements. Motion analysis often relies on a relevant motion capture dataset, but the main issue is that the acquisition of such a dataset is expensive and time-consuming. This work studies the relationship between laughter and body movements in dyadic conversations. The body movements were extracted from videos using a deep-learning-based pose estimation model. We found that, in the explored NDC-ME dataset, a single statistical feature (i.e., the maximum value or the maximum of the Fourier transform) of a joint movement correlates weakly (30%) with laughter intensity. However, we did not find a direct correlation between audio features and body movements. We discuss the challenges of using such a dataset for the audio-driven co-laughter motion synthesis task.
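As a rough illustration of the kind of analysis this abstract describes, the sketch below computes two single-joint statistics (the maximum value and the maximum Fourier magnitude) and a Pearson correlation against intensity labels. It is not the authors' code: the function names, the one-dimensional trajectory format, the mean removal before the transform, and the choice of Pearson correlation are all assumptions made for the example.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Magnitudes of the discrete Fourier transform (naive O(n^2) version)."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def movement_features(trajectory):
    """Two summary statistics for one joint coordinate over time.

    Hypothetical helper: assumes `trajectory` is a list of at least two
    floats, e.g. one coordinate of a joint returned by a pose estimator.
    """
    mean = sum(trajectory) / len(trajectory)
    centered = [x - mean for x in trajectory]  # drop the constant offset
    mags = dft_magnitudes(centered)
    positive = mags[1:len(mags) // 2 + 1]  # skip the DC bin, keep one half
    return {"max_value": max(trajectory), "max_fft": max(positive)}


def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither input is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

In use, one would compute `movement_features` for each laughter episode and pass the list of `max_fft` (or `max_value`) values together with the annotated intensities to `pearson`.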

Are There Any Body-movement Differences between Women and Men When They Laugh?
Ahmad Hammoudeh | Antoine Maiorca | Stéphane Dupont | Thierry Dutoit
Proceedings of the Workshop on Smiling and Laughter across Contexts and the Life-span within the 13th Language Resources and Evaluation Conference

Smiling differences between men and women have been studied in psychology. Women smile more than men, although women are not universally more expressive across all facial actions. There are also body-movement differences between women and men. For example, more open body postures were reported for men. But are there any body-movement differences between men and women when they laugh? To investigate this question, we study body-movement signals extracted from recorded laughter videos using a deep learning pose estimation model. Initial results showed a higher Fourier transform amplitude of thorax and shoulder movements for females, while males had a higher Fourier transform amplitude of elbow movement. The differences were not limited to a small frequency range but covered most of the frequency spectrum. However, further investigations are still needed.

2020

A Transformer-based joint-encoding for Emotion Recognition and Sentiment Analysis
Jean-Benoit Delbrouck | Noé Tits | Mathilde Brousmiche | Stéphane Dupont
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

Understanding expressed sentiment and emotions is crucial in human multimodal language. This paper describes a Transformer-based joint-encoding (TBJE) for the task of Emotion Recognition and Sentiment Analysis. In addition to using the Transformer architecture, our approach relies on a modular co-attention and a glimpse layer to jointly encode one or more modalities. The proposed solution has also been submitted to the ACL20: Second Grand-Challenge on Multimodal Language to be evaluated on the CMU-MOSEI dataset. The code to replicate the presented experiments is open-source.

Modulated Fusion using Transformer for Linguistic-Acoustic Emotion Recognition
Jean-Benoit Delbrouck | Noé Tits | Stéphane Dupont
Proceedings of the First International Workshop on Natural Language Processing Beyond Text

This paper aims to bring a new lightweight yet powerful solution for the task of Emotion Recognition and Sentiment Analysis. Our motivation is to propose two architectures based on Transformers and modulation that combine the linguistic and acoustic inputs from a wide range of datasets to challenge, and sometimes surpass, the state-of-the-art in the field. To demonstrate the efficiency of our models, we carefully evaluate their performance on the IEMOCAP, MOSI, MOSEI and MELD datasets. The experiments can be directly replicated and the code is fully open for future research.

2017

An empirical study on the effectiveness of images in Multimodal Neural Machine Translation
Jean-Benoit Delbrouck | Stéphane Dupont
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In state-of-the-art Neural Machine Translation (NMT), an attention mechanism is used during decoding to enhance the translation. At every step, the decoder uses this mechanism to focus on different parts of the source sentence to gather the most useful information before outputting its target word. Recently, the effectiveness of the attention mechanism has also been explored for multi-modal tasks, where it becomes possible to focus both on sentence parts and on the image regions that they describe. In this paper, we compare several attention mechanisms on the multi-modal translation task (English, image → German) and evaluate the ability of the model to make use of images to improve translation. We surpass state-of-the-art scores on the Multi30k data set; nevertheless, we identify and report different misbehaviors of the machine while translating.
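For readers unfamiliar with the decoding step described in this abstract, here is a minimal sketch of one generic soft (dot-product) attention step. It is an illustration only and not any of the mechanisms compared in the paper: the function names, the dot-product scoring, and the plain-list vector format are assumptions. The annotations could stand for source-word encodings or, in the multi-modal setting, image-region features.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_attention(decoder_state, annotations):
    """One attention step: weight each annotation by its relevance.

    Hypothetical helper: `decoder_state` is a list of floats and
    `annotations` is a non-empty list of vectors of the same length.
    Returns the attention weights and the weighted-sum context vector.
    """
    # Relevance score of each annotation: dot product with the decoder state.
    scores = [sum(q * a for q, a in zip(decoder_state, ann))
              for ann in annotations]
    weights = softmax(scores)
    dim = len(annotations[0])
    # Context vector: attention-weighted average of the annotations.
    context = [sum(w * ann[i] for w, ann in zip(weights, annotations))
               for i in range(dim)]
    return weights, context
```

The decoder would then combine the returned context vector with its own state to predict the next target word; that part is omitted here.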

2016

AVAB-DBS: an Audio-Visual Affect Bursts Database for Synthesis
Kevin El Haddad | Hüseyin Çakmak | Stéphane Dupont | Thierry Dutoit
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

It has been shown that adding expressivity and emotional expressions to an agent’s communication systems would improve the interaction quality between this agent and a human user. In this paper, we present a multimodal database of affect bursts, which are very short non-verbal expressions with facial, vocal, and gestural components that are highly synchronized and triggered by an identifiable event. This database contains motion capture and audio data of affect bursts representing disgust, startle and surprise, each recorded at three different levels of arousal. This database is to be used for synthesis purposes in order to generate affect bursts of these emotions on a continuous arousal level scale.