Shuwen Qiu


2024

MindDial: Enhancing Conversational Agents with Theory-of-Mind for Common Ground Alignment and Negotiation
Shuwen Qiu | Mingdian Liu | Hengli Li | Song-Chun Zhu | Zilong Zheng
Proceedings of the 25th Annual Meeting of the Special Interest Group on Discourse and Dialogue

In daily conversation, humans align and negotiate expressed meanings to establish common ground. Despite the impressive conversational abilities of large generative language models, they do not account for individual differences in contextual understanding within a shared situated environment. In this work, we propose MindDial, a novel conversational framework that generates situated free-form responses to align and negotiate common ground. We design an explicit mind module that tracks three levels of belief: the speaker’s belief, the speaker’s prediction of the listener’s belief, and the gap between the two. The next response is then generated to resolve the belief difference and take task-related action. Our framework is applied to both prompting- and fine-tuning-based models, and is evaluated across scenarios involving both common ground alignment and negotiation. Experiments show that models with mind modeling generate more human-like responses when aligning and negotiating common ground. An ablation study further validates that the three-level belief design aggregates information and improves task outcomes in both cooperative and negotiating settings.
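As a rough illustration of the three-level belief design described in the abstract, the sketch below tracks the speaker's belief, the speaker's prediction of the listener's belief, and the gap between the two, then conditions the next response on that gap. All names and representations here are hypothetical, not taken from the MindDial paper or its code.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a three-level belief state; names are illustrative,
# not from the MindDial implementation.
@dataclass
class MindState:
    speaker_belief: dict = field(default_factory=dict)             # what the speaker believes
    predicted_listener_belief: dict = field(default_factory=dict)  # speaker's estimate of the listener's belief

    def belief_gap(self) -> dict:
        """Facts the speaker holds that it predicts the listener does not share."""
        return {k: v for k, v in self.speaker_belief.items()
                if self.predicted_listener_belief.get(k) != v}

def next_response(mind: MindState, generate) -> str:
    """Condition generation on the belief gap so the reply resolves the difference."""
    gap = mind.belief_gap()
    prompt = f"Known only to you: {gap}. Respond so as to align common ground."
    return generate(prompt)
```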

2023

Topic and Style-aware Transformer for Multimodal Emotion Recognition
Shuwen Qiu | Nitesh Sekhar | Prateek Singhal
Findings of the Association for Computational Linguistics: ACL 2023

Understanding emotion expressed in multimodal signals is key to machines gaining a better understanding of human communication. While the language, visual, and acoustic modalities each provide clues from different perspectives, the visual modality has been shown to contribute little to performance in emotion recognition because of its high dimensionality. We therefore first leverage the strong multimodal backbone VATT to project the visual signal into a common space with the language and acoustic signals. On top of it, we propose content-oriented features, Topic and Speaking Style, to address subjectivity issues. Experiments on the benchmark dataset MOSEI show that our model outperforms SOTA results, effectively incorporating visual signals and handling subjectivity issues by serving as content “normalization”.
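The following is a minimal sketch of the fusion idea described above: a learned projection maps visual features into a common space with the language and acoustic features, after which topic and speaking-style features are concatenated before classification. Dimensions, module names, and the fusion scheme are assumptions for illustration; the paper's actual VATT-based architecture may differ.

```python
import torch
import torch.nn as nn

# Illustrative sketch only; not the paper's implementation.
class TopicStyleFusion(nn.Module):
    def __init__(self, vis_dim=1024, common_dim=768, topic_dim=50, style_dim=16, n_classes=6):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, common_dim)  # project visual features into the shared space
        self.classifier = nn.Linear(common_dim * 3 + topic_dim + style_dim, n_classes)

    def forward(self, text_feat, vis_feat, audio_feat, topic_feat, style_feat):
        vis_common = self.vis_proj(vis_feat)            # align the visual modality with text/audio
        fused = torch.cat([text_feat, vis_common, audio_feat, topic_feat, style_feat], dim=-1)
        return self.classifier(fused)
```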

2021

GRICE: A Grammar-based Dataset for Recovering Implicature and Conversational rEasoning
Zilong Zheng | Shuwen Qiu | Lifeng Fan | Yixin Zhu | Song-Chun Zhu
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021