Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation

We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video. The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs), which presents obstacles to exploiting the power of large-scale pre-training; and (2) the necessity of taking into account the complementarity of various modalities throughout the reasoning process. Despite remarkable progress on video-grounded dialogue generation, existing methods still fall short when it comes to integration with PLMs in a way that allows information from different modalities to complement each other. To alleviate these issues, we first propose extracting pertinent information from videos and converting it into reasoning paths that are acceptable to PLMs. Additionally, we propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities (i.e., video and dialogue context). Experimental results on two public datasets indicate that the proposed model can significantly outperform state-of-the-art models by large margins on both automatic and human evaluations.


Introduction
Conversing with computers has become a crucial step toward general artificial intelligence, and it has attracted increasing attention from AI and NLP researchers. Multi-turn dialogue response generation and multi-modal question answering are two high-profile initiatives toward this goal. Multi-turn dialogue response generation requires the agent to comprehend the key information in the dialogue context in order to provide a cohesive, fluent and informative response (Zhao et al., 2017; Tao et al., 2018). Multi-modal question answering, on the other hand, requires the agent to understand both the textual and visual contexts (Antol et al., 2015; Tapaswi et al., 2016; Jang et al., 2017). Video-grounded dialogue (Alamri et al., 2018; Pasunuru and Bansal, 2018) is a generalization of the above two tasks, in which the agent must observe multi-modal contents and engage in a conversation with the human, rather than simply responding to the last utterance or ignoring the visual contents. Compared to multi-turn dialogue response generation and multi-modal question answering, the distinctive challenges posed by video-grounded dialogue generation can be summarized as: (1) unlike traditional multi-turn dialogue, which can directly use large-scale pre-trained language models (PLMs), video-grounded dialogue cannot directly use PLMs due to their incapacity to process video input; (2) in comparison to multi-modal question answering, video-grounded dialogue necessitates reasoning on both the video and the multi-turn textual context, and there is usually a complementarity between the different modalities that should be taken into account.
Although notable progress has been made on video-grounded dialogue, existing approaches still fail to address the aforementioned challenges. On one hand, existing approaches cannot be effectively combined with PLMs, which presents obstacles to exploiting the power of state-of-the-art pre-training technology. The reasons can be summarized into two categories: (1) simply appending the video features to the text embeddings makes it hard for the model to obtain an in-depth understanding of the video (Li et al., 2020; Le and Hoi, 2020; Le et al., 2021). To investigate this problem further, we compare the performance of these models before and after removing the video from the input. As demonstrated in Table 1, most metrics only show a tiny shift, and several even increase once the video is removed; and (2) overly complex Transformer designs are difficult to transfer to PLMs (Le et al., 2020; Kim et al., 2021; Geng et al., 2021). On the other hand, multi-modal information should be used in conjunction, and reasoning on different modalities should be done collaboratively rather than independently.
Existing approaches fall short when it comes to reasoning jointly over multiple modalities, since they either separate the reasoning of different modalities (Li et al., 2020) or employ a cross-modal attention mechanism that is difficult to train without direct supervision (Le et al., 2020; Kim et al., 2021; Geng et al., 2021).
To address the aforementioned issues, we propose extracting relevant information from videos and converting it into reasoning paths, which are in the form of natural language and can be fed directly into PLMs. Besides, we propose a multi-agent reasoning framework based on multi-agent reinforcement learning (MARL). Specifically, we design a video agent and a context agent which learn to find chains of reasoning on the multi-modal semantic graphs. We further design a central communicator to make the two agents work in a collaborative manner. Our framework has the following advantages: (1) the multi-modal reasoning paths are compatible with the input of PLMs; (2) the reasoning process can be "supervised" by designing appropriate reward functions; and (3) the communication mechanism allows the information from different modalities to better complement each other. We conduct extensive experiments on two benchmark datasets for video-grounded dialogue generation, AVSD@DSTC7 (Alamri et al., 2018) and Twitch-FIFA (Pasunuru and Bansal, 2018). Experiment results show that, thanks to the multi-agent reasoning framework, our model significantly outperforms state-of-the-art methods in terms of both automatic and human evaluations.
Our contributions in this paper are three-fold: (1) we propose converting videos into reasoning paths that are directly compatible with the input of PLMs; (2) we devise a multi-agent reinforcement learning framework, equipped with a central communicator, that performs collaborative reasoning over the video and the dialogue context; and (3) we empirically verify on two benchmarks that the proposed model outperforms state-of-the-art methods.

Related Work
The majority of early works on dialogue generation use hand-crafted rules or templates to construct dialogue systems (Weizenbaum, 1966;Wallace, 2009).
A number of initiatives have been made to develop end-to-end open-domain dialogue generation models (Ritter et al., 2011; Gehring et al., 2017; Vaswani et al., 2017), inspired by developments in the field of machine translation. Following that, the vanilla encoder-decoder architecture has frequently been utilized to enhance response quality, and numerous modifications to this architecture have been made to enhance response diversity (Zhao et al., 2017; Tao et al., 2018), model the structure of conversation contexts (Zhang et al., 2019), introduce external knowledge (Dinan et al., 2019; Zhao et al., 2020) and control response attributes (Wang et al., 2018; See et al., 2019; Wang et al., 2020). The research on generating dialogue from video was started by Alamri et al. (2018). After that, Hori et al. (2019a) present an LSTM-based encoder-decoder architecture with multi-modal attention that merely combines textual and visual data via a projection matrix. A multi-modal transformer network is introduced in Le et al. (2019) to encode videos and incorporate data from several modalities. Hori et al. (2019b) use a joint student-teacher learning approach to make up for a missing video description, in which the student network is trained to mimic the teacher's response. VGD-GPT (Le and Hoi, 2020) is based on a pre-trained GPT-2 model and formulates video-grounded dialogue generation as a sequence-to-sequence task. On top of a pre-trained GPT-2 model, RLM (Li et al., 2020) provides a multi-task learning strategy. Additionally, BiST (Le et al., 2020) models the dependencies between text and visual features in two directions: spatial to temporal and temporal to spatial. With visual attention, PDC-GPT (Le et al., 2021) learns to anticipate the reasoning process on turn-level semantic graphs. For further reasoning, SCGA (Kim et al., 2021) constructs a structured graph based on a multi-modal coreference technique, while STSGR (Geng et al., 2021) introduces a shuffled transformer reasoning framework on semantic scene graphs. In contrast to previous approaches, this paper focuses on how to build a multi-modal reasoning approach that can cooperate with PLMs in a way that facilitates the complementary nature of information from various modalities.
The study of reasoning over various types of graph structures for dialogue generation is also related to our work. Moon et al. (2019) create a KG walk path for each retrieved entity in an effort to explain conversational reasoning in a natural way. Jung et al. (2020) develop a dialogue-conditioned path traversal model with attention flows to improve comprehension of the path reasoning process. Xu et al. (2020) propose representing dialogue transitions as graphs. These previous approaches typically concentrate on textual graphs, whereas video-grounded dialogue involves multi-modal contexts, which makes reasoning more difficult.

Overview
Suppose that we have a dataset D = {(V_i, U_i, R_i)}_{i=1}^N, with N denoting the total number of datapoints. For the i-th datapoint, V_i signifies a brief video clip, and U_i = {u_1, ..., u_n} denotes the dialogue context, with u_j = {w_{j,1}, ..., w_{j,m}} denoting the j-th utterance; n and m are the number of utterances in a context and the number of words in an utterance respectively. R_i is a response that is factually consistent with the video while also catching up with the dialogue context. Our goal is to learn a generation model p(R|V, U; θ) (θ denotes the parameters of the model) from D, so that given a new dialogue context U associated with a video V, one can generate a response following p(R|V, U; θ).
To alleviate the heterogeneity of different modalities, we first represent the video as well as the dialogue context as semantic graphs (elaborated in Section 3.2). Figure 1 illustrates the architecture of the proposed model. In a nutshell, the model is composed of a multi-modal reasoning module and a generation module. The multi-modal reasoning module is responsible for extracting crucial signals from multi-modal contexts (Section 3.3). Specifically, it consists of a video agent, a context agent and a central communicator. The video agent and the context agent are responsible for extracting reasoning paths from the video semantic graph and the text semantic graph respectively. Taking the latest context utterance as input, they determine the query entities from which they start traversing the graphs to find the answer-providing paths. To search for answer-providing paths more efficiently, we devise a central communicator to transport the entire path histories between the video and context agents. The reasoning paths, which form interpretable provenances for the prediction, are integrated by the generation module to synthesize a response (Section 3.4).

Multi-Modal Graph Construction
The crucial step in building the semantic graph for video reasoning is gathering the collection of facts from the unstructured video data, which take the form of subject-predicate-object triplets. Although there have been some previous attempts to extract such triplets from videos using relation detection (Liu et al., 2020), the publicly available models struggle to build the proper relations because of the dramatic domain discrepancy between their training corpus and the benchmark datasets for video-grounded dialogue. Therefore, we resort to video action recognition (Zhu et al., 2020) to extract meaningful structural representations from the video. Specifically, we first employ the SlowFast model (Feichtenhofer et al., 2019), pre-trained on the Charades (Sigurdsson et al., 2016) and Kinetics (Kay et al., 2017) datasets, to extract all potential action classes, and only reserve those with a probability greater than 0.5. Given the extracted facts {(e^v_s, r^v, e^v_o)}, with e^v_s, r^v and e^v_o standing for subject, predicate and object respectively, we can construct a video semantic graph G^v = (N^v, E^v) in which the entities e^v_s and e^v_o are represented as nodes (i.e., e^v_s, e^v_o ∈ N^v) and the relation r^v is represented as a labeled edge connecting them (i.e., (e^v_s, r^v, e^v_o) ∈ E^v). The semantic graph for the dialogue context, G^u = (N^u, E^u), is constructed in a similar way, except that we employ open information extraction (OpenIE) technology to extract subject-predicate-object triplets. Specifically, we first apply a co-reference resolution tool (AllenNLP (Gardner et al., 2017) in our experiments) to restore all pronouns to their original named entities. Then we extract all relation triplets in a dialogue context by combining the outputs of OpenIE 5.1 (Saha and Mausam, 2018) and Stanford OpenIE (Angeli et al., 2015). After obtaining all the triplets, we further remove redundant information by merging all entities with high semantic similarity, as determined by the cosine similarity between their word2vec embeddings (Mikolov et al., 2013).
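To make the construction concrete, the graph-building and entity-merging steps above can be sketched in plain Python. The helper names (`cosine`, `merge_entities`, `build_graph`), the toy embeddings, and the 0.9 merge threshold are illustrative assumptions, not the paper's actual implementation.

```python
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def merge_entities(triplets, embed, threshold=0.9):
    """Collapse entities whose embeddings are nearly identical onto one
    canonical node, mimicking the word2vec-based merging step."""
    canonical = {}
    for s, r, o in triplets:
        for e in (s, o):
            if e in canonical:
                continue
            match = next((c for c in set(canonical.values())
                          if cosine(embed[e], embed[c]) >= threshold), None)
            canonical[e] = match if match else e
    return [(canonical[s], r, canonical[o]) for s, r, o in triplets]

def build_graph(triplets):
    """Adjacency map: node -> list of (relation, neighbor) labeled edges."""
    graph = defaultdict(list)
    for s, r, o in triplets:
        graph[s].append((r, o))
    return graph
```

In this sketch, a near-duplicate entity (e.g., "person" vs. "man") is redirected to whichever canonical node it matches first, and the resulting triplets form the adjacency structure that the agents later traverse.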

Multi-Agent Reasoning Process
Inspired by recent advances in graph-grounded generation (Moon et al., 2019; Xu et al., 2020), we decompose the problem of video-grounded generation into two steps: (1) identify answer-providing paths on the graph that might contain crucial signals for catching up with the context; (2) generate a response using the extracted paths as additional information. However, independently extracting the chains of reasoning for each modality results in a sub-optimal solution, since the video provides crucial guidance for text reasoning and vice versa. To this end, we propose a multi-agent reasoning framework, where agents responsible for different modalities work in a collaborative manner.
We formulate the multi-modal reasoning task as a partially observable multi-agent sequential decision process on the semantic graphs G^v and G^u. Intuitively, we want a state s_t at time t to be a summary of previous experiences: s_t = (o_1, a_1, ..., o_{t-1}, a_{t-1}, o_t), where o_t = (e^v_t, e^u_t) and a_t = (r^v_t, r^u_t) stand for the observations/entities of all agents at time t and the actions/relations taken by them respectively. o_1 = (e^v_1, e^u_1) is the pair of query entities from which the agents start traversing the graphs, defined as:

e^v_1 = argmax_{e^v ∈ N^v} cos(E^v(e^v), E(u_n)),   e^u_1 = argmax_{e^u ∈ N^u} cos(E^u(e^u), E(u_n)),   (1)

where E^v(e^v), E^u(e^u) and E(u_n) denote the embeddings of e^v, e^u and the last utterance u_n respectively. The state of the environment is ubiquitous and shared by all agents, but in a multi-agent setting, the observation and the action are both private and only accessible to the individual agent. Take the video agent as an example: when receiving a local observation e^v_t, i.e., its current location on G^v, it selects an outgoing edge r^v_t with its own private policy network p^v(r|s_t), and obtains a reward from the environment. We also design a central communicator to encode historical information and promote collaboration among the agents. The central communicator, the private policy network, and the reward are described as follows.

Central Communicator. To make full use of the information from different modalities and facilitate the reasoning process, we design a central communicator which has access to the local observations and actions of all agents (Feng et al., 2018). The central communicator recursively encodes the historical information (i.e., (o_1, a_1, ..., o_t, a_t)) into a message h_t and transports this message between the agents. Specifically, we implement the central communicator as a recurrent neural network, with the hidden state h_t encoding the past observations and actions. At time t, the central communicator takes the current observation o_t and action a_t as input, and updates the hidden state as:

h_t = LSTM(h_{t-1}, W_c [E^v(r^v_t); E^v(e^v_t); E^u(r^u_t); E^u(e^u_t)]),   (2)

where W_c is a learnable projection matrix, and E^v(r^v_t), E^v(e^v_t), E^u(r^u_t) and E^u(e^u_t) are the embeddings of r^v_t, e^v_t, r^u_t and e^u_t respectively. Consequently, with the help of the message h_{t-1}, the full state can be approximated as s^v_t ≈ (h_{t-1}, e^v_t) for the video agent and s^u_t ≈ (h_{t-1}, e^u_t) for the context agent.

Private Policy Network. Each agent has its own private policy network that chooses an outgoing edge at the current location. Take the video agent as an example. With the guidance of the transported message h_{t-1}, the policy network can be approximated as p^v(r|s_t) ≈ p^v(r|h_{t-1}, e^v_t; ψ^v), which is formally defined as:

p^v(r|h_{t-1}, e^v_t; ψ^v) = softmax_{r ∈ R(e^v_t)} ( E^v(r)^T W_a [h_{t-1}; E^v(e^v_t)] ),   (3)

where W_a is a learnable parameter and R(e^v_t) denotes all outgoing edges of the node e^v_t. The private policy network of the context agent, p^u(r|h_{t-1}, e^u_t; ψ^u), is defined following the same procedure.

Reward. We only obtain a reward once the complete chain of reasoning has been formed. Given the final state s_T = (o_1, a_1, ..., a_{T-1}, o_T) (T is the maximum time constraint), we can read off the reasoning paths p^v and p^u for the video and the dialogue context respectively, and define the reward as:

Re(s_T) = ROUGE(p^v, p_gt) + ROUGE(p^u, p_gt),   (4)

where p_gt is the subject-predicate-object triplet extracted from the ground-truth response, and ROUGE(•, •) is a function that returns the ROUGE-1 score (Lin, 2004) between two sequences.
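The reasoning loop sketched by Eqs. (2)–(4) can be mimicked with plain-Python stand-ins. The learned projections W_c and W_a, the LSTM update, and the ROUGE-1 scorer are replaced here by toy versions (`update_message`, `policy_probs`, `unigram_rouge`); all function names and the exact arithmetic are illustrative assumptions rather than the paper's implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def update_message(h_prev, action_obs_embs):
    """Toy stand-in for the recurrent update of Eq. (2): the message mixes
    the previous hidden state with the embeddings of the latest (a_t, o_t)."""
    mixed = [sum(e[i] for e in action_obs_embs) for i in range(len(h_prev))]
    return [math.tanh(h + x) for h, x in zip(h_prev, mixed)]

def policy_probs(h_prev, node_emb, edge_embs):
    """Toy stand-in for Eq. (3): score each outgoing edge against a query
    built from the message and the current node (the projection W_a is omitted)."""
    query = [h + n for h, n in zip(h_prev, node_emb)]
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in edge_embs]
    return softmax(scores)

def unigram_rouge(hyp, ref):
    """Unigram-overlap recall as a rough stand-in for ROUGE-1 in Eq. (4)."""
    ref_set = set(ref)
    return sum(1 for w in set(hyp) if w in ref_set) / len(ref_set) if ref_set else 0.0

def reward(path_v, path_u, gt_triplet):
    """Shared terminal reward: both agents are scored against the
    ground-truth triplet, as in Eq. (4)."""
    return unigram_rouge(path_v, gt_triplet) + unigram_rouge(path_u, gt_triplet)
```

At each step an agent would sample an edge from `policy_probs`, the communicator would fold the chosen (relation, entity) embeddings into the message via `update_message`, and only the completed paths receive the terminal `reward`.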

Generation Module
We employ the pre-trained GPT-2 (Radford et al., 2019) as the backbone of our generation module, which synthesizes a response conditioned on the reasoning path p^v for the video, the reasoning path p^u for the dialogue context, and the last utterance u_n in the context. Formally, the input of the generation module is defined as:

S = p^v [SEP] p^u [SEP] u_n,   (5)

where [SEP] is a special token separating different types of data. The probability of generating the response R = {w^1_r, w^2_r, ..., w^m_r} is formulated as:

p(R|V, U; θ) = ∏_{t=1}^{m} p(w^t_r | S, w^{<t}_r; θ).   (6)
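Assuming the reasoning paths are linearized as space-joined triplets (an assumption; the paper does not spell out the exact serialization), the input sequence fed to GPT-2 can be assembled as follows:

```python
def build_input(path_v, path_u, last_utterance, sep="[SEP]"):
    """Linearize the video path, the context path and the last utterance
    into one string, separated by the special [SEP] token."""
    def linearize(path):
        # a path is a list of (subject, predicate, object) triplets
        return " ".join(" ".join(triplet) for triplet in path)
    return f" {sep} ".join([linearize(path_v), linearize(path_u), last_utterance])
```

The resulting string is what the tokenizer of the generation module would consume; any token-type or position embeddings the full model may add are omitted from this sketch.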

Learning Details
To estimate θ (i.e., the parameters of the generation module), we directly minimize the negative log-likelihood of the response R through the MLE loss:

L_mle(θ) = - ∑_{t=1}^{m} log p(w^t_r | S, w^{<t}_r; θ).   (7)

The parameters of the private policy networks (i.e., ψ^v and ψ^u), as well as the parameters of the central communicator (i.e., φ), are optimized through the policy-gradient method (Sutton et al., 2000). Specifically, we sample reasoning paths p̃^v and p̃^u according to the private policy networks and the central communicator, and define the loss as:

L_rl(ψ^v, ψ^u, φ) = - Re(s̃_T) ∑_t [ log p^v(r̃^v_t | h̃_{t-1}, ẽ^v_t; ψ^v) + log p^u(r̃^u_t | h̃_{t-1}, ẽ^u_t; ψ^u) ],   (8)

where Re(•) is the reward function defined in Eq. 4 and s̃_T is the final state when sampling the chains of reasoning. Training proceeds by alternately optimizing L_mle and L_rl.

Experiments

Datasets
We evaluate our model on two benchmark datasets for video-grounded dialogue generation:
AVSD@DSTC7. This dataset is constructed by Alamri et al. (2018) through crowd-sourcing and contains conversations about Charades videos (Sigurdsson et al., 2016).
Twitch-FIFA. This dataset is collected by crawling live-broadcast soccer game videos and the accompanying chats from Twitch.tv (Pasunuru and Bansal, 2018).
To facilitate reproducibility, we adopt the datasets shared by the publishers and conduct preprocessing strictly following the official code. Table 3 reports the statistics of AVSD@DSTC7 and Twitch-FIFA.
Human Evaluation. We also conduct a human evaluation to deepen our understanding of the quality of responses produced by different models. We randomly sample 300 examples from the test set of AVSD@DSTC7, and hire 6 well-educated native speakers to conduct qualitative analysis on the results produced by our model and all competitive baselines, which are randomly mixed to obscure their identities. The annotators evaluate the quality of the responses using three criteria: (1) Language Fluency: whether the response is fluent and devoid of grammatical errors; (2) Context Coherence: whether the response is coherent with the dialogue context; and (3) Factual Correctness: whether the response is factually consistent with the events depicted in the video. Each annotator rates each response on each aspect with a score from {0, 1, 2} (representing "bad", "fair" and "good" respectively). Each response thus receives three scores for the aforementioned aspects, and Fleiss' kappa (Fleiss, 1971) is used to gauge the level of agreement among annotators.
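For reference, Fleiss' kappa can be computed from a table of per-item category counts as below; this is the standard textbook formula, not code from the paper, and the small two-item example is purely illustrative.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of per-item category counts,
    where each row sums to the (fixed) number of raters."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # per-item agreement P_i
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in ratings]
    p_bar = sum(p_i) / n_items
    # marginal category proportions p_j and chance agreement P_e
    p_j = [sum(row[j] for row in ratings) / (n_items * n_raters)
           for j in range(n_cats)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

With perfect agreement across raters the statistic equals 1; values above 0.6 are conventionally read as substantial agreement, which is the threshold the evaluation above appeals to.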

Baseline Models
The following models are selected as baselines: (1) Naive Fusion: a model proposed by Hori et al. (2019a) that combines textual and visual features via a projection matrix with multi-modal attention. (2) MTN: a multi-modal transformer network proposed by Le et al. (2019) that encodes videos and incorporates information from several modalities. (3) Student-Teacher: a model proposed by Hori et al. (2019b) that adopts a joint student-teacher learning method. (4) RLM: a model proposed by Li et al. (2020) that is trained with multi-task learning objectives to learn joint representations among different modalities. (5) VGD-GPT2: a model proposed by Le and Hoi (2020) that leverages the power of pre-trained language models for improving video-grounded dialogue. (6) BiST: a model proposed by Le et al. (2020) that exploits both spatial- and temporal-level information to promote video understanding. (7) PDC+GPT2: a model proposed by Le et al. (2021) that conducts reasoning on the dialogue history to model the information flow at turn level.
All the baselines are taken from their open-source implementations or re-implemented strictly following the details in the original papers.

Implementation Details
In our experiments, the maximum time constraint T, which serves as the stopping criterion for reasoning, is set to 3. The embedding sizes for relations and entities are all set to 100. The central communicator is implemented as an LSTM network whose hidden state size is set to 200. The generation module is built on the pre-trained GPT-2 (small) model, which has 117M parameters. All models are learned with the Adam (Kingma and Ba, 2015) optimizer with β1 = 0.9 and β2 = 0.999. We initialize the learning rates to 0.001 and 6.25e-5 for the multi-modal reasoning module and the generation module respectively, and optimize the model with a linear learning rate decay strategy. The batch size is set to 8 in our experiments. In the test phase, we employ beam search in response decoding and set the beam size, maximum decoding length and length penalty to 4, 16 and 0.1 respectively. Early stopping on validation is adopted as a regularization strategy. All models are trained on an 8×RTX 3090 Ti machine.
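The decoding configuration above (beam size 4, maximum length 16, length penalty 0.1) corresponds to a standard length-normalized beam search, sketched below in a toy form. The `step_fn` interface and the exact normalization are illustrative assumptions, not the authors' implementation.

```python
def beam_search(step_fn, start, beam_size=4, max_len=16, length_penalty=0.1):
    """Toy beam search. step_fn(seq) -> list of (token, log_prob) expansions;
    finished hypotheses (ending in "<eos>") are ranked by their summed
    log-probability divided by len(seq) ** length_penalty."""
    beams = [(start, 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_fn(seq):
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda x: x[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_size]:
            (finished if seq[-1] == "<eos>" else beams).append((seq, score))
        if not beams:
            break
    pool = finished or beams
    best = max(pool, key=lambda x: x[1] / len(x[0]) ** length_penalty)
    return best[0]
```

A length penalty close to 0 (such as 0.1) leaves the ranking nearly untouched by sequence length, mildly favoring longer hypotheses among near-ties.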

Evaluation Results
In this section, we compare the performance of various models on AVSD@DSTC7 and Twitch-FIFA. We conduct two experiment settings for AVSD@DSTC7, with caption and without caption, since the video caption is unavailable in most real-world scenarios. Table 2 and Table 4 show the performance of our model on AVSD@DSTC7 and Twitch-FIFA respectively.
From the results, we can observe that: (1) our model achieves a new state-of-the-art on most metrics on both datasets, demonstrating the effectiveness of the proposed multi-agent reasoning framework and the multi-modal semantic graphs. In particular, the proposed model outperforms RLM and PDC-GPT, the two best baselines on AVSD@DSTC7, since they both directly feed video features into the generation procedure, which presents obstacles for the PLMs to conduct multi-modal reasoning. This is also supported by the results of our pilot study (shown in Table 1).
(2) On the AVSD@DSTC7 dataset, the caption has a significant impact on the models: in its absence, all models' performance significantly degrades. Another intriguing finding is that reasoning-based methods (e.g., BiST, PDC-GPT and ours) rely less on the video caption than methods without explicit reasoning (e.g., RLM and VGD-GPT2). This confirms the need for multi-modal reasoning, and our proposed collaborative reasoning method is more effective.
(3) Because there is a lot of noise in the live-broadcast data, which is closer to the real-world scenario than a manually-labeled dataset, the PLM-based methods (e.g., RLM and ours) perform better on Twitch-FIFA than the others. This further emphasizes the value of integrating PLMs with multi-modal reasoning, one of the benefits of our proposed method.
Human Evaluation. Table 5 shows the results of the human evaluation. Although our model achieves a language fluency score that is comparable to the baselines, it attains a significant improvement in context coherence and factual correctness, which is congruent with the results of our pilot experiments and the automatic evaluation. The fact that all kappa values are above 0.6 indicates substantial agreement among the annotators.

Discussions
Ablation Study. In addition to the main experiments, we compare the full model with the following variants to gain a better understanding of how each component affects the overall performance: (1) -G^u: the context graph is removed; (2) -G^v: the video graph is removed; (3) -G^u & G^v: both the context graph and the video graph are removed; here, the model directly generates the response based on the dialogue context and the video features provided by Alamri et al. (2018); and (4) -Communicator: the central communicator is removed; in this case, the context agent and the video agent reason on their graphs independently.
The ablation results are shown in Table 6. We can draw the following conclusions: (1) both multi-modal semantic graphs are significant, as performance is negatively impacted by removing either or both of them. Although they are built using off-the-shelf tools with heuristics, they nonetheless contain significant information that enables the agents to locate chains of reasoning that lead to answers; and (2) the communicator is helpful because it enables crucial signals from different modalities to reinforce each other.
Effect of the Maximum Time Constraint T. We further examine how sensitive the model is to the choice of the maximum time constraint T. To this end, we vary the value of T in {1, 2, 3} and report the evaluation results in Table 7. As can be seen, our model performs best when T = 2, since a larger maximum time constraint introduces more irrelevant entities and relations into generation, whereas a smaller value (i.e., T = 1) limits the reasoning paths to only the entity that is most similar to the last utterance.
Case Study. We further conduct a case study to gain a deeper understanding of the multi-modal reasoning process in our model. Figure 2 shows an example from the test set of AVSD@DSTC7. We can see that our model is able to precisely construct the reasoning paths for the dialogue context and the video respectively, and to produce a response that accurately captures the factual information in the video. For comparison, we also provide the results of a variant in which the central communicator has been removed. We can observe that the communication mechanism effectively assists in retrieving relevant signals from multi-modal data.

Conclusion
We propose a multi-modal reasoning framework that can be used in conjunction with PLMs to enable the complementation of information from various modalities. Specifically, we devise a video agent and a context agent to extract reasoning paths on the video and the dialogue context respectively. A central communicator is also designed to transport information between the two agents and enable their cooperative operation. The overall framework is optimized through multi-agent reinforcement learning. Evaluation results on two benchmarks indicate that our model can significantly outperform state-of-the-art methods.

Limitations
We also recognize that our model has certain limitations: (i) due to the multi-modal semantic graphs, our framework incurs higher computational overheads to extract relation triplets from video and perform reasoning on the dual graphs; nonetheless, the multi-modal reasoning paths, which are compatible with PLMs, keep our model practical and scalable. (ii) The performance of our model may be limited to some extent by the quality of the dual graphs created by off-the-shelf tools.

Ethics Statement
This paper studies video-grounded dialogue generation and proposes a multi-modal reasoning framework based on multi-agent reinforcement learning to facilitate the complementarity of information from different modalities. There are no ethical concerns with this study. The datasets we used are widely adopted by other researchers and are publicly accessible. The proposed approach introduces no ethical or societal bias.

Figure 1 :
Figure 1: Architecture of the proposed model.
appears to be a bedroom. Without Communicator: it looks like a living room.

Figure 2 :
Figure 2: A case from the test set of AVSD@DSTC7.

Table 2 :
Automatic evaluation results on the test set of AVSD@DSTC7. Numbers in bold are the best results. Significant improvements over the best baseline results are marked with ⋆ (t-test with p-value < 0.05).

Table 3 :
Statistics of the two datasets.

Table 4 :
Automatic evaluation results on the test set of Twitch-FIFA. Numbers in bold are the best results. Significant improvements over the best baseline results are marked with ⋆ (t-test with p-value < 0.05).

Table 5 :
Human evaluation results on AVSD@DSTC7. Numbers in bold are the best results.

Table 7 :
Performance of our model under different maximum time constraints.