Zi Huang


2024

Event-Content-Oriented Dialogue Generation in Short Video
Fenghua Cheng | Xue Li | Zi Huang | Jinxiang Wang | Sen Wang
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Understanding complex events across different modalities, linking them to external knowledge, and generating responses from a clear point of view remain unexplored in today’s multi-modal dialogue research. The major challenges include 1) the lack of an event-based multi-modal dialogue dataset; 2) the understanding of complex events; and 3) the heterogeneity gap between different modalities. To overcome these challenges, we first introduce a novel event-oriented video-dialogue dataset called SportsVD (Sports-domain Video-dialogue Dataset). To the best of our knowledge, SportsVD is the first dataset consisting of videos of complex events together with opinion-based conversations about the contents of those events. We also present a multi-modal dialogue generation method, VCD (Video Commentary Dialogue), which generates human-like responses according to the event contents in a video and related external knowledge. In contrast to previous video-based dialogue generation, we focus on opinion-based responses and the understanding of longer and more complex event contents. We evaluate VCD against other baselines on SportsVD under several automatic metrics. Experiments demonstrate that VCD outperforms state-of-the-art baselines. Our work is available at https://github.com/Cheng-Fenghua/SportsVD.
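The abstract does not specify VCD’s architecture, so the following is purely an illustrative sketch of one generic way to condition a response decoder on both video event features and retrieved external knowledge; the module names, feature dimensions, and attention-based fusion are all assumptions, not the paper’s actual method.

```python
import torch
import torch.nn as nn

class FusionResponder(nn.Module):
    """Hypothetical sketch: fuse video event features with retrieved
    knowledge embeddings, then condition a decoder on the fused context.
    Illustrative only; not the paper's actual VCD architecture."""
    def __init__(self, d_model: int = 256, vocab_size: int = 1000):
        super().__init__()
        self.video_proj = nn.Linear(512, d_model)   # assumed 512-d video clip features
        self.know_proj = nn.Linear(300, d_model)    # assumed 300-d knowledge vectors
        self.fuse = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, know_feats, prev_token_emb):
        v = self.video_proj(video_feats)            # (B, Tv, d)
        k = self.know_proj(know_feats)              # (B, Tk, d)
        ctx, _ = self.fuse(v, k, k)                 # video queries attend to knowledge
        # Initialize the decoder state from the pooled fused context.
        h0 = ctx.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        dec, _ = self.decoder(prev_token_emb, h0)
        return self.out(dec)                        # per-step token logits

# Toy usage with random tensors.
model = FusionResponder()
logits = model(torch.randn(2, 8, 512), torch.randn(2, 5, 300), torch.randn(2, 6, 256))
print(logits.shape)  # torch.Size([2, 6, 1000])
```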

2023

Abstract then Play: A Skill-centric Reinforcement Learning Framework for Text-based Games
Anjie Zhu | Peng-Fei Zhang | Yi Zhang | Zi Huang | Jie Shao
Findings of the Association for Computational Linguistics: ACL 2023

Text-based games present an exciting test-bed for reinforcement learning algorithms in natural language environments. In these adventure games, an agent must learn to interact with the environment through text in order to accomplish tasks, facing a large, combinatorial action space as well as partial observability. However, existing solutions fail to decompose tasks and abstract actions autonomously: they either pre-specify the subtasks or pre-train on human gameplay datasets. In this work, we introduce a novel skill-centric reinforcement learning framework that is capable of abstracting actions in an end-to-end manner. To learn more disentangled skills, we focus on the informativeness and distinguishability of each skill, in accordance with the information bottleneck principle. Specifically, we introduce a discriminator that encourages each skill to reflect its trajectory, and we push skill representations onto the unit hypersphere so that they distribute uniformly. Moreover, a self-predictive mechanism is employed to learn inverse and forward dynamics, and a self-recovery mechanism is leveraged to refine the action representation, resulting in a more comprehensive perception of dynamics and more effective representations of textual states and actions. Empirical experiments are carried out in the Jericho environment, and the results validate the framework’s superiority over state-of-the-art baselines.
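The abstract does not give the exact objective, but a standard way to push representations onto the unit hypersphere so that they distribute uniformly is the Gaussian-potential uniformity loss of Wang and Isola (2020). The minimal sketch below illustrates that idea under assumed tensor shapes; it should not be read as the paper’s actual loss.

```python
import torch
import torch.nn.functional as F

def uniformity_loss(z: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Encourage L2-normalized skill embeddings to spread uniformly on the
    unit hypersphere: log of the mean pairwise Gaussian potential."""
    z = F.normalize(z, dim=-1)                 # project onto the unit hypersphere
    sq_dists = torch.pdist(z, p=2).pow(2)      # pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log() # lower value = more uniform spread

# Toy usage: a batch of 32 skill embeddings of dimension 64 (assumed sizes).
skills = torch.randn(32, 64, requires_grad=True)
loss = uniformity_loss(skills)
loss.backward()                                # gradients push embeddings apart
```

Minimizing this term penalizes embeddings that cluster together, which matches the abstract’s stated goal of making skills distinguishable.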