Dong Won Lee
2024
Global Reward to Local Rewards: Multimodal-Guided Decomposition for Improving Dialogue Agents
Dong Won Lee | Hae Won Park | Yoon Kim | Cynthia Breazeal | Louis-Philippe Morency
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We describe an approach for aligning an LLM-based dialogue agent for long-term social dialogue, where only a single global score is given by the user at the end of the session. In this paper, we propose using denser, naturally occurring multimodal communicative signals as local implicit feedback to improve turn-level utterance generation. Our approach (dubbed GELI) learns a local, turn-level reward model by decomposing the human-provided Global Explicit (GE) session-level reward, using Local Implicit (LI) multimodal reward signals to cross-modally shape the reward decomposition step. This decomposed reward model is then used as part of the RLHF pipeline to improve an LLM-based dialogue agent. We run quantitative and qualitative human studies on two large-scale datasets to evaluate the performance of our GELI approach, and find that it shows consistent improvements across various conversational metrics compared to baseline methods.
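To illustrate the decomposition idea described in the abstract, the following is a minimal, hypothetical PyTorch sketch: a small turn-level reward model is trained so that its per-turn rewards sum to the single session-level score while also tracking a per-turn multimodal signal. All names, losses, and hyperparameters here are assumptions for illustration and do not reproduce the GELI method or its actual objectives.

```python
# Hypothetical sketch of global-to-local reward decomposition shaped by a
# local multimodal signal. Illustrative only; not the GELI implementation.
import torch
import torch.nn as nn

class TurnRewardModel(nn.Module):
    """Scores each turn embedding with a scalar reward (hypothetical architecture)."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, turn_embeddings: torch.Tensor) -> torch.Tensor:
        # turn_embeddings: (num_turns, dim) -> per-turn rewards of shape (num_turns,)
        return self.net(turn_embeddings).squeeze(-1)

def decomposition_loss(local_rewards, global_score, implicit_signal, alpha=0.5):
    # Two illustrative terms:
    # (1) the per-turn rewards should add up to the single session-level score;
    # (2) each turn's reward should track a local implicit multimodal signal
    #     (e.g., a per-turn affect estimate), weighted by alpha.
    sum_constraint = (local_rewards.sum() - global_score) ** 2
    shaping = ((local_rewards - implicit_signal) ** 2).mean()
    return sum_constraint + alpha * shaping

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TurnRewardModel(dim=16)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    turns = torch.randn(8, 16)        # toy embeddings for an 8-turn session
    global_score = torch.tensor(5.0)  # single rating given at the end of the session
    implicit = torch.rand(8)          # toy per-turn multimodal (implicit) signal

    for _ in range(200):
        optimizer.zero_grad()
        loss = decomposition_loss(model(turns), global_score, implicit)
        loss.backward()
        optimizer.step()
```

The trained model's per-turn outputs could then stand in for a dense reward during RL fine-tuning, which is the role the decomposed reward model plays in the pipeline described above.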
2020
No Gestures Left Behind: Learning Relationships between Spoken Language and Freeform Gestures
Chaitanya Ahuja | Dong Won Lee | Ryo Ishii | Louis-Philippe Morency
Findings of the Association for Computational Linguistics: EMNLP 2020
We study relationships between spoken language and co-speech gestures in the context of two key challenges. First, the distributions of text and gestures are inherently skewed, making it important to model the long tail. Second, gesture predictions are made at a subword level, making it important to learn relationships between language and acoustic cues. We introduce AISLe, which combines adversarial learning with importance sampling to strike a balance between precision and coverage. We propose the use of a multimodal multiscale attention block to perform subword alignment without the need for explicit alignment between language and acoustic cues. Finally, to empirically study the importance of language in this task, we extend the dataset proposed in Ahuja et al. (2020) with automatically extracted transcripts for the audio signals. We substantiate the effectiveness of our approach through large-scale quantitative and user studies, which show that our proposed methodology significantly outperforms previous state-of-the-art approaches for gesture generation. Link to code, data and videos: https://github.com/chahuja/aisle
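The long-tail rebalancing piece of the above can be illustrated with a short, hypothetical PyTorch sketch: examples are drawn with sampling weights inversely proportional to the frequency of their gesture cluster, so rare gestures appear more often in training batches. The clustering, toy data, and weighting scheme below are assumptions for illustration and are not the AISLe implementation, which couples importance sampling with adversarial training.

```python
# Hypothetical sketch of long-tail rebalancing via weighted sampling.
# Illustrative only; not the AISLe implementation.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

# Toy data: 1000 examples, each assigned to one of 10 gesture clusters drawn
# from a heavily skewed (long-tailed) cluster distribution.
torch.manual_seed(0)
features = torch.randn(1000, 32)
cluster_probs = torch.tensor([0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.01, 0.01, 0.01])
cluster_ids = torch.multinomial(cluster_probs, 1000, replacement=True)

# Importance weights: rare clusters get proportionally higher sampling weight.
counts = torch.bincount(cluster_ids, minlength=10).float()
weights = (1.0 / counts)[cluster_ids]

sampler = WeightedRandomSampler(weights, num_samples=len(weights), replacement=True)
loader = DataLoader(TensorDataset(features, cluster_ids), batch_size=64, sampler=sampler)

# Batches drawn from `loader` cover tail clusters far more often than uniform
# sampling would, which is the precision/coverage trade-off the abstract refers to.
for batch_features, batch_clusters in loader:
    pass  # a training step (e.g., the adversarial update) would go here
```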