Information-Theoretic Text Hallucination Reduction for Video-grounded Dialogue

Video-grounded Dialogue (VGD) aims to decode an answer sentence to a question about a given video and dialogue context. Despite the recent success of multi-modal reasoning for answer generation, existing dialogue systems still suffer from a text hallucination problem: indiscriminate text-copying from input texts without understanding the question. This stems from spurious correlations learned from the fact that answer sentences in the dataset usually include words from the input texts, so the VGD system relies excessively on copying words from input texts in the hope that those words overlap with the ground truth. Hence, we design the Text Hallucination Mitigating (THAM) framework, which incorporates a Text Hallucination Regularization (THR) loss derived from the proposed information-theoretic text hallucination measurement. Applying THAM to current dialogue systems validates its effectiveness on VGD benchmarks (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows enhanced interpretability.


Introduction
Achieving a natural conversational agent that can 'look' (i.e., understand what it is seeing) and 'tell' (i.e., converse about what it is thinking) is a desideratum of the vision-language community. Given the broad applicability of conversational agents, such an agent can potentially assist various subsections of our environment, including education, entertainment, security, and support for visual or other impairments. Toward natural conversation between humans and computers, the video-grounded dialogue (VGD) task (Alamri et al., 2019; Hori et al., 2020) has been introduced to generate adequate conversational responses to human queries while following up on the video and dialogue context, which is more challenging than traditional image-grounded or text-grounded dialogue tasks. To be specific, given a video V, a video caption C, a dialogue history of past Q&A pairs H = {(Q_1, A_1), ..., (Q_{r-1}, A_{r-1})}, and the current r-th round question Q_r, the VGD system is expected to produce a free-form answer sentence A_r to the given question. Despite recent advancements in multi-modal interaction, including the Transformer (Vaswani et al., 2017), current VGD systems still suffer from the text hallucination problem, which denotes indiscriminate text-copying from input texts (i.e., question, caption, and dialogue history) when decoding answer tokens, where the generated answer sentences are inadequate and unrelated to the question. This is because current VGD systems learn spurious correlations from the fact that many ground-truth answers in the dataset include parts of the input texts; thus, they perform incorrect text-copying from input texts, namely text hallucination, even for answers where the input texts are unnecessary.
Figure 1 gives two indiscriminate text hallucination cases confounded by spurious correlations in VGD. As shown in Figure 1(a), for the given question 'does he place the towel and clothes anywhere?', we humans identify where the man placed the towel and clothes, and if it cannot be confirmed, we give a sentence meaning 'unknown'. However, in many cases, VGD systems are optimized in situations where they can find clues in the video and dialogue, so when they cannot find clues, they simply pretend to know the answer by copying texts from input sentences without reasoning about why the question is not answerable. Thus, VGD systems depend on indiscriminate text hallucination, copying input sentences (i.e., question, caption, dialogue) and hoping the copied answer words overlap the ground-truth words. Figure 1(b) presents another dependence on text hallucination, this time for an answerable question. Given the question 'does the man wear glasses?', the current VGD system provides an incorrect answer without referring to the video and focuses on pretending to know the answer via copying input texts. This is because the system holds overconfidence in the text hallucination, such that it ignores the meaning of the question and video. Therefore, current VGD systems are prone to rely on a language model tainted with incorrect text hallucination, which hinders them from accurately learning the question-answer association.
Our manual studies in Figure 2 give experimental evidence that the answer sentences predicted by current VGD systems (Le et al., 2019b; Li et al., 2021) depend on indiscriminate text hallucination. Figure 2(a) presents a sentence similarity score, BLEU (Papineni et al., 2002), which computes the word overlap between (1) predicted answers and input texts (i.e., caption, dialogue, and question) and (2) ground-truth answers and input texts on the AVSD validation set. The higher scores between predicted answers and input texts indicate a reliance on input texts for decoding answer tokens. We might take this for granted, but as shown in Figure 2(b), the problem becomes distinguishable when collecting all the 'incorrect' predictions. In many failure cases (i.e., incorrect predictions), the predicted answers are more similar to the input texts, which demonstrates indiscriminate text hallucination without understanding of the given questions and videos.
One straightforward solution to mitigate this indiscriminate text hallucination is to extend the dataset using augmentations or to modulate answer descriptions to be more stereoscopic. However, augmentation has limitations in terms of diversity, and the modulated descriptions can be ad-hoc and unnecessarily extravagant. Intrigued by the current overconfidence of VGD systems in text hallucination, we contrive to build the Text Hallucination Mitigating (THAM) framework, which mitigates feature-level hallucination effects by introducing an information-theoretic regularization. The THAM framework incorporates a Text Hallucination Regularization (THR) loss derived from the mutual information between the response language model and the proposed hallucination language model. Minimizing the THR loss contributes to reducing indiscriminate text copying and boosting dialogue performance. THAM validates its effectiveness with steady performance gains on top of several current runner models (Hori et al., 2019a; Le et al., 2019b; Kim et al., 2021; Li et al., 2021) in a model-agnostic manner. Experimental results show state-of-the-art performance on two VGD benchmarks (i.e., AVSD@DSTC7 and AVSD@DSTC8) and enhanced interpretability.

Video-grounded Dialogues
Visual Question Answering (VQA) (Antol et al., 2015; Li et al., 2022; Xiao et al., 2022) is one of the proxy tasks for evaluating the multi-modal understanding of vision-language systems. The recent success of natural language processing (Devlin et al., 2018; Radford et al., 2019) provides a bridge to advance VQA toward video-grounded dialogue (VGD) systems (Alamri et al., 2019; Hori et al., 2020), which aim to generate open-ended answer sentences grounded on video and human dialogue. For VGD, many recurrent neural networks (Nguyen et al., 2019; Sanabria et al., 2019) have been proposed to hold meaningful semantics along consecutive dialogues, and transformer-based VGD systems (Li et al., 2021) have also been introduced to enhance multi-modal interaction between video and text, including word-embedding attention (Lee et al., 2020), hierarchical attention (Le et al., 2019a), and pointer-augmented decoding (Le and Chen, 2020). Furthermore, graph representations have been considered to connect common semantics among intra-frames and inter-frames (Geng et al., 2021) and to uncover co-references between frames and texts (Kim et al., 2021). However, these systems still suffer from the hallucination problem when generating answer sentences, and for this problem, we propose an information-theoretic text hallucination mitigating framework.

Estimating Mutual Information
To identify feature-level text hallucination, we first introduce the mutual information I(·;·), which measures the co-dependence between two random variables X and Y over the space X × Y:

I(X; Y) = H(X) - H(X|Y),    (1)

where H(·) is the Shannon entropy and H(X|Y) is the conditional entropy of X given Y. This mutual information is also equal to the Kullback-Leibler (KL) divergence D_KL(·||·) between the joint probability distribution P_XY and the product of marginals P_X ⊗ P_Y:

I(X; Y) = D_KL(P_XY || P_X ⊗ P_Y),    (2)

where, given two probability distributions p(x) and q(x) on a variable x, the KL divergence is defined as:

D_KL(p || q) = Σ_x p(x) log (p(x) / q(x)).    (3)

As the KL divergence increases, the co-dependence between X and Y becomes stronger. However, calculating the KL divergence is tractable in only a few cases (e.g., discrete variables), as the exact distributions underlying the training dataset are unavailable. A recent approach (Belghazi et al., 2018) estimates the mutual information of continuous high-dimensional variables using a neural network founded on the Donsker-Varadhan representation (Donsker and Varadhan, 1975):

I(X; Y) ≥ I_φ(X; Y) = E_{P_XY}[T_φ(X, Y)] - log E_{P_X ⊗ P_Y}[e^{T_φ(X, Y)}],    (4)

where T_φ: R^D → R is a neural network parameterized by φ ∈ Φ, and the expectations over P_XY and P_X ⊗ P_Y are approximated by empirical sampling. Thus, maximizing I_φ(X; Y) provides a tight lower bound on the original mutual information I(X; Y).
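As a minimal numerical sketch of the Donsker-Varadhan bound in Equation 4, the snippet below evaluates it with a fixed toy critic on correlated Gaussian samples. The Gaussian data and the quadratic critic are our illustrative assumptions; MINE instead trains a network T_φ by gradient ascent on this same objective.

```python
import numpy as np

def dv_bound(joint_scores, marginal_scores):
    """Donsker-Varadhan lower bound: E_P[T] - log E_{P_X (x) P_Y}[e^T]."""
    return joint_scores.mean() - np.log(np.exp(marginal_scores).mean())

rng = np.random.default_rng(0)
n, rho = 200_000, 0.9

# Correlated standard-normal pair (X, Y): samples from the joint P_XY.
x = rng.standard_normal(n)
y = rho * x + np.sqrt(1 - rho ** 2) * rng.standard_normal(n)

# Shuffling y destroys the pairing, giving samples from P_X (x) P_Y.
y_shuf = rng.permutation(y)

# Fixed toy critic T(x, y) = 0.4 * x * y (an assumption; MINE learns T_phi).
T = lambda a, b: 0.4 * a * b

estimate = dv_bound(T(x, y), T(x, y_shuf))
true_mi = -0.5 * np.log(1 - rho ** 2)  # closed form for a bivariate Gaussian
```

Any critic yields a valid lower bound, so the estimate stays below the true mutual information; training the critic tightens the bound.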

Video-grounded Dialogue Task
Video-grounded Dialogue (VGD) aims to produce a free-form natural-language answer to a given question. In the formal definition of the VGD task (Alamri et al., 2019), the VGD system takes a tuple (v, h, q_r) as input and produces an answer sentence a_r, where v is the video, h is the dialogue history, and q_r is the question asked at the current round r ∈ {1, ..., R}. Here, the dialogue history h = {c, (q_1, a_1), ..., (q_{r-1}, a_{r-1})} is a set of question-answer pairs from previous rounds together with a caption c summarizing the video. To train the VGD system, we perform next-word prediction, where the system is trained to predict the t-th answer word token a_t^r given the input tuple (v, h, q_r) and the partial answer word tokens a_{<t}^r before the t-th token.
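The next-word prediction objective above can be sketched with a toy vocabulary; the token ids and the single-step softmax below are illustrative stand-ins, not the paper's actual T5 decoder.

```python
import numpy as np

def next_word_xent(logits, target_id):
    """Cross-entropy for predicting one answer token from decoder logits."""
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    return -np.log(probs[target_id])

# Toy setup: the model scores each vocabulary word as the next answer token,
# conditioned (implicitly) on [v || h || q_r || a_<t].
vocab = ["he", "grabs", "a", "towel", "<eos>"]
logits = np.array([0.1, 3.0, 0.2, 0.1, 0.0])  # model strongly prefers "grabs"

loss = next_word_xent(logits, target_id=vocab.index("grabs"))
```

Summing this loss over all answer positions (teacher forcing) gives the sentence-level training objective.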

Text Hallucination Mitigating Framework
In Figure 3, to build the Text Hallucination Mitigating (THAM) framework, we prepare three different language models, each composed of an encoder-decoder pair: (1) the Response Language Model (RLM), (2) the Hallucination Language Model (HLM), and (3) the Language Model (LM). The RLM is a naive VGD model: it is given complete samples of v, h, q, and the partial answer a_{<t}^r to predict the next answer token a_t^r. The HLM is designed to generate answer tokens by relying on text hallucination: it is given deficient input texts (i.e., h, a_{<t}^r) without the question, so it cannot reason about the correct answer and inevitably relies on hallucinating sentences that overlap with ground-truth words by copying input texts without knowledge of the question. Using this HLM, our proposed Text Hallucination Regularization (THR) mitigates feature-level hallucination effects in the RLM by minimizing the mutual information between the features of the RLM encoder and the hallucinating features of the HLM encoder. However, not all HLM features are bad, because the HLM, as a language model, is also trained to produce grammatically complete sentences, and this grammatical knowledge should be removed before performing THR. Therefore, we train another language model (LM), which predicts the next answer token a_t^r from only the partial answer a_{<t}^r. We remove the LM's encoder features from those of the HLM in advance and then apply the THR loss.
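The three input constructions can be summarized schematically as follows; the token strings are placeholders (the real system uses WordPiece ids and I3D video features):

```python
# Placeholder token sequences for one training step at round r.
v      = ["<vid_0>", "<vid_1>"]          # video feature slots
h      = ["a", "man", "enters", "."]     # caption + dialogue history
q      = ["does", "he", "sit", "?"]      # current question q_r
a_prev = ["yes", ","]                    # partial answer a_<t

rlm_input = v + h + q + a_prev   # full evidence: can reason about the answer
hlm_input = v + h + a_prev       # no question: forced toward text hallucination
lm_input  = list(a_prev)         # answer prefix only: pure grammatical knowledge
```

The key design choice is that the HLM differs from the RLM only by the missing question, so whatever it learns to copy comes from the remaining input texts.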

Input representations
We give formal feature definitions of v, h, q_r, and a_r, each embedded into a d-dimensional space. Following (Hori et al., 2019b; Li et al., 2021), for the video features, we utilize the I3D model (Carreira and Zisserman, 2017), pre-trained on YouTube videos and the Kinetics dataset (Kay et al., 2017), to obtain 2048-dimensional RGB features v_rgb ∈ R^{L×2048} and optical-flow features v_opt ∈ R^{L×2048}, where L is the number of video frames. Audio is also available in the videos of the AVSD dataset, from which we obtain 128-dimensional features v_aud ∈ R^{L×128} using pre-trained VGGish (Hershey et al., 2017). The three features are concatenated along the feature dimension and embedded into the d-dimensional space as:

v = [v_rgb || v_opt || v_aud] W_v ∈ R^{L×d},    (5)

where || denotes concatenation and W_v is a learnable projection. For the text features, we follow the T5-base Transformer (Raffel et al., 2020) and tokenize all sentences (i.e., q_r, h, a_r) into series of WordPieces (Wu et al., 2016). The final representation of each sub-word token is obtained by summing its token embedding and relative positional embedding, followed by layer normalization (Ba et al., 2016). We denote the resulting features as history h ∈ R^{L_h×d}, question q ∈ R^{L_q×d}, and answer a ∈ R^{L_a×d}, where L_h, L_q, and L_a are the numbers of tokens of each text.
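A shape-level sketch of the video feature pipeline described above (random arrays stand in for real I3D/VGGish outputs, and the projection matrix is a stand-in for the learned embedding layer):

```python
import numpy as np

rng = np.random.default_rng(0)
L, d = 16, 768  # 16 sampled frames, model dimension d

# Per-frame features as described: I3D rgb/flow (2048-d) and VGGish audio (128-d).
v_rgb = rng.standard_normal((L, 2048))
v_opt = rng.standard_normal((L, 2048))
v_aud = rng.standard_normal((L, 128))

# Concatenate along the feature axis, then project to d dimensions.
v_cat = np.concatenate([v_rgb, v_opt, v_aud], axis=1)   # (L, 4224)
W_v = rng.standard_normal((4224, d)) * 0.02             # stand-in for learned W_v
v = v_cat @ W_v                                         # (L, d)
```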

Text Hallucination Regularization
Text Hallucination Regularization (THR) is designed for the VGD model (i.e., RLM) to mitigate indiscriminate text hallucination (i.e., text or word copying) from input texts without understanding of the question. As the methodology of THAM is described in Section 4, here we focus on the mathematical formulation for the reproducibility of THAM with the proposed THR.
Training language models. To prepare the three language models (i.e., RLM, HLM, LM) for their respective purposes, as the first stage, we train them with the inputs defined below. The Response Language Model (RLM) is designed for the original purpose of VGD: it is given a complete input sample X_{<t} = [v||h||q||a_{<t}] and trained to generate the next word tokens of the answer sentence a^r = {a_1^r, ..., a_m^r} with sentence length m using a cross-entropy loss:

L_RLM(θ) = -Σ_{t=1}^{m} log p(a_t^r | X_{<t}; θ).    (6)

The Hallucination Language Model (HLM) is intended to learn a reliance on text hallucination effects for generating an answer. To train the HLM, we utilize the fact that ground-truth answer sentences in VGD are usually similar to partial input texts. Therefore, we give the HLM deficient input texts X̃_{<t} = [v||h||a_{<t}] without the question:

L_HLM(θ̃) = -Σ_{t=1}^{m} log p(a_t^r | X̃_{<t}; θ̃),    (7)

where the deficient input texts make it difficult for the HLM to perform correct answer reasoning. (See more results in the ablation studies of Table 3.) The THR loss is defined by the feature-level mutual information between the RLM and the HLM. To this end, we first define the encoder features of each trained model: (1) the RLM's encoder features as F_{<t} = f_RLM(X_{<t}, θ) ∈ R^d and (2) the HLM's encoder features as F̃_{<t} = f_HLM(X̃_{<t}, θ̃) ∈ R^d, where f denotes the transformer encoder of each model. These two features (i.e., F_{<t}, F̃_{<t}) are outputs taken at the position of a_{t-1} in the transformer.
Here, we refer to F_{<t} as 'factual' features and F̃_{<t} as 'hallucinating' features. Our proposed THR aims to enforce feature-level independence between the factual and hallucinating features by minimizing the mutual information between them. However, the grammatical knowledge in F̃_{<t} used to build a well-formed sentence should remain correlated with F_{<t}, as both language models are trained on grammatically complete ground-truth sentences. Thus, we prepare a pure language model (LM), which predicts the answer token from only the partial answer tokens X†_{<t} = [a_{<t}]:

L_LM(θ†) = -Σ_{t=1}^{m} log p(a_t^r | X†_{<t}; θ†),    (8)

where we obtain pure language features F†_{<t} = f_LM(X†_{<t}, θ†) ∈ R^d from the LM's encoder, which holds only the grammatical knowledge needed to form complete language. We retain pure hallucination effects by subtracting the language features F†_{<t} from the hallucinating features F̃_{<t}:

G_{<t} = F̃_{<t} - F†_{<t},    (9)

where G_{<t} denotes the pure hallucinating (pure-h) features, which hold hallucination effects without grammatical knowledge. Founded on the factual features F_{<t} and the pure-h features G_{<t}, we finally define the THR loss, which computes the feature-level mutual information between F_{<t} and G_{<t}. Thanks to the mutual information neural estimator (MINE) (Belghazi et al., 2018), we can estimate the high-dimensional mutual information between F_{<t} and G_{<t} and utilize it as the THR loss for regularization:

L_THR(θ, φ) = I_φ(F_{<t}; G_{<t}).    (10)

By minimizing L_THR(θ, φ) with respect to the parameter θ, we train the RLM to be independent of the HLM's indiscriminate text hallucination. Following the maximization of the lower bound of the estimated mutual information in Equation 4, the final objective function is formulated as:

min_θ max_φ L_RLM(θ) + α L_THR(θ, φ),    (11)

where α is a hyperparameter. As the objective function is a minimax problem, we alternately train and update the parameters θ and φ in each epoch.

AVSD@DSTC7 (Table 1, partial):
Methods | B1 | B2 | B3 | B4 | M | R | C
Baseline (Hori et al., 2019a) | 0.621 | 0.480 | 0.379 | 0.305 | 0.217 | 0.481 | 0.733
HMA (Le et al., 2019a) | 0.633 | 0.490 | 0.386 | 0.310 | 0.242 | 0.515 | 0.856
RMFF (Yeh et al., 2019) | 0.636 | 0.510 | 0.417 | 0.345 | 0.224 | 0.505 | 0.877
EE-DMN (Lin et al., 2019) | 0.641 | 0.493 | 0.388 | 0.310 | 0.241 | 0.527 | 0.912
JMAN (Chu et al., 2020) | (remaining rows truncated in extraction)
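A toy numpy sketch of the pure-h feature construction (Equation 9) and the THR penalty: the fixed bilinear critic and the synthetic feature mixture below are our illustrative assumptions, whereas MINE trains the critic T_φ by gradient ascent.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 4096, 8  # batch of encoder feature vectors, toy dimension

F_rlm = rng.standard_normal((n, d))      # factual features F_<t
F_lm  = rng.standard_normal((n, d))      # pure language features F†_<t
F_hlm = 0.7 * F_rlm + F_lm               # hallucinating features mix both parts

G = F_hlm - F_lm                         # Eq. (9): pure-h features

def thr_loss(F, G, W):
    """DV estimate of I(F; G) with a bilinear critic T(f, g) = f W g^T."""
    joint = np.einsum("nd,de,ne->n", F, W, G)
    G_shuf = G[np.random.default_rng(1).permutation(len(G))]
    marg = np.einsum("nd,de,ne->n", F, W, G_shuf)
    return joint.mean() - np.log(np.exp(marg).mean())

W = 0.05 * np.eye(d)
mi_with_G = thr_loss(F_rlm, G, W)        # dependence survives in pure-h features
mi_with_LM = thr_loss(F_rlm, F_lm, W)    # F_rlm is independent of the LM part
```

In the real framework this quantity would be minimized with respect to the RLM's parameters θ while the critic maximizes it, alternating each epoch.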

Metrics
We follow the official natural language generation metrics of the AVSD benchmark (i.e., BLEU, METEOR (Banerjee and Lavie, 2005), ROUGE-L (Lin, 2004), and CIDEr (Vedantam et al., 2015)). The metrics are provided by the challenge organizers and compute the word overlap between each generated answer and the reference answer.
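As a rough intuition for these overlap metrics, clipped unigram precision (the core of BLEU-1, here omitting the brevity penalty and higher-order n-grams) can be computed as:

```python
from collections import Counter

def unigram_precision(candidate, reference):
    """Clipped unigram precision: BLEU-1 without the brevity penalty."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum(min(c, ref[w]) for w, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

p = unigram_precision("he grabs a towel", "he grabs a towel and some clothes")
# all four candidate words appear in the reference, so precision is 1.0
```

Clipping prevents a candidate from being rewarded for repeating a reference word more times than it occurs in the reference.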

Results on AVSD benchmark
Table 1 summarizes the experimental results on the AVSD dataset. THAM is compared to several previous VGD systems (please refer to the cited papers for descriptions of these systems), where performances are evaluated against the official six reference answers on AVSD@DSTC7 and AVSD@DSTC8. To validate the effectiveness of the proposed THR loss, we report the performance of our naive VGD model (i.e., RLM) based on the T5 Transformer (Raffel et al., 2020). Here, we use 'T5RLM' as the terminology for our RLM to avoid confusion with the RLM in (Li et al., 2021), which is based on the GPT-2 Transformer (Radford et al., 2019). In this method, we select a Transformer-base encoder for THAM for its simplicity. However, as our framework can be applied to any other VGD system in a model-agnostic manner, we also validate its effectiveness on recent runner VGD models in Table 2. In detail, we reproduce MTN, SCGA, and RLM from their public papers and code. For MTN, we measure predicted answers with a single reference, following the original work. On top of these VGD models, the THR loss shows steady performance gains on both AVSD datasets.

Ablation Study
Table 3 summarizes the THAM results with input variants of the HLM. The HLM is designed to become an excessively text-copying language model by being given inputs from which the correct answer cannot be inferred. During optimization, it simply learns spurious correlations between the inputs X̃_{<t} and the outputs a^r. Introducing the history alone as the input of the HLM is the most effective. We attribute this to the fact that the history (i.e., dialogue history) contains a relatively large amount of text, but without the question, it amounts to captions from which the answer cannot be inferred. Here, the HLM inevitably learns indiscriminate text hallucination because it does not know the question: copying a sentence from the input sentences can lead to greater overlap with the ground-truth answer than simply generating an answer without knowing the question. Conversely, we also devised an HLM with the question as input but without the history, which was not effective for THAM performance. We attribute this to the fact that the AVSD dataset includes samples where the correct answer can be easily inferred from the question alone without any other modality; thus, text-copying from the question can even be beneficial.
Figure 4 shows the THR loss L_THR(θ, φ) and the cross-entropy loss L_RLM(θ) from ablation studies with and without subtracting the encoder features of the LM from the encoder features of the HLM. The THR loss reflects the mutual information I_φ(F_{<t}; G_{<t}) between the RLM and the HLM, and minimizing it regularizes the indiscriminate text hallucination present in the RLM. In the case 'with subtracting LM', both L_THR(θ, φ) and L_RLM(θ) decrease and converge over the epochs. However, in the case 'without subtracting LM' (i.e., L_THR(θ, φ) = I_φ(F_{<t}; F̃_{<t})), minimizing L_THR(θ, φ) hinders the convergence of L_RLM(θ). This is because the encoder features that contribute to forming a sentence exist in both the RLM and the HLM; minimizing L_THR(θ, φ) without removing them from the HLM becomes adversarial to learning from the cross-entropy loss, which degrades the performance of the VGD system.

Qualitative Results
Figure 5 gives joint distributions of the language models' encoder features. Here, the RLM is fully trained with the THAM framework. From 512 samples of the AVSD validation set, we select a single value of the d-dimensional space at the same position of each encoder feature (i.e., F_{<t}, F̃_{<t}, G_{<t}). Figure 5(a) summarizes joint plots between F_{<t} and F̃_{<t}, where correlations are confirmed due to the common grammatical knowledge of the language models. However, Figure 5(b) shows uncorrelated distributions between F_{<t} and G_{<t}, which means the grammatical knowledge is properly removed from G_{<t}. Figure 6 gives the responses of the naive RLM and of THAM (naive RLM + THR loss). For the question 'what are they both wearing', the naive RLM shows a reliance on texts from the history without understanding the question, whereas THAM generates a correct answer sentence pertinent to the given question.

Conclusion
The Text Hallucination Mitigating framework is proposed for Video-grounded Dialogue. THAM addresses the text hallucination problem, in which input texts are copied into answer generation without understanding of the question. The THAM framework incorporates a Text Hallucination Regularization loss derived from the proposed information-theoretic text hallucination measurement. Empirical results on VGD benchmarks show that THAM achieves state-of-the-art performance and effectiveness.

Limitations
The limitations of the Text Hallucination Mitigating framework are as follows. First, our empirical analysis shows that THAM faces failure cases on questions about sound. In Figure 7 of the supplemental material, for the question 'what kind of noise', THAM hallucinates a response without understanding the question. Although the answer 'i can hear some noise' is plausible, it also seems to be hallucinated by copying from the history texts. We speculate this is because the sound features contain less information (128 dimensions) compared to the video features (2048 dimensions), which calls for more specialized attention (e.g., fine-grained audio processing). Second, THAM is based on a two-stage training mechanism: to mitigate text hallucination, pre-training of each language model is required as the first stage. To overcome these limitations, we will perform further studies and work on improvements in video interpretability.

Ethics Statement
As an interactive AI, the Video-grounded Dialogue system is designed to provide assistance to various subsections of our environments, including education, entertainment, and visual impairments. Our proposed Text Hallucination Mitigating framework contributes to improving response quality and alleviating abnormalities in the system. We also consider the potential negative societal impact that those aware of the VGD system could deliberately manipulate it to obtain prohibited information. Furthermore, to apply the VGD system in real environments, fairness and bias issues of dialogue systems should also be addressed.

A Training Details.
Training. THAM is trained on an NVIDIA TITAN V (12GB of memory) GPU with the Adam optimizer (Kingma and Ba, 2014) with β_1 = 0.9, β_2 = 0.99, and ε = 10e-8. We utilized a piecewise linearly decayed learning rate from 6.25e-4 to 0, set the learning-rate warm-up to the first 10,000 training steps, and trained the model for up to 20 epochs. In Section 4.1, interpolation is conducted via the window overlapping method. First-stage training is performed on the three language models (i.e., RLM, HLM, LM) respectively, with a batch size of 8 and a dropout rate of 0.3. For the d-dimensional space, all language models use d = 768. Inference. For inference, answer generation adopts beam search with a beam size of 5 and a length penalty of 1.0, where the maximum sentence length is set to 30. Every THAM result in Tables 1 and 2 of the main paper is averaged over validation runs with 5 random seeds.
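The learning-rate schedule described above (linear warm-up over 10,000 steps, then linear decay from 6.25e-4 to 0) can be sketched as follows; total_steps is an assumed placeholder, not a value reported in the paper:

```python
def lr_at(step, peak=6.25e-4, warmup=10_000, total_steps=100_000):
    """Linear warm-up to `peak`, then piecewise linear decay to 0."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total_steps - step) / (total_steps - warmup))

# lr rises during warm-up, peaks at step 10k, then decays linearly to 0
```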

B Donsker-Varadhan representation.
For probability distributions P and Q, the KL divergence admits the following dual representation:

D_KL(P || Q) = sup_{T: Ω→R} E_P[T] - log(E_Q[e^T]),    (12)

where Ω is the sample space and the supremum is taken over all functions T such that both expectations are finite. The proof of this representation is given as follows.
The positivity of the KL divergence gives Δ ≥ 0, with the gap Δ and the Gibbs distribution G defined in the derivation below. Therefore, for any T, E_P[T] - log(E_Q[e^T]) ≤ D_KL(P||Q), and the inequality is preserved when taking the supremum over the right-hand side. The same identity also shows that the bound is tight whenever G = P, i.e., for optimal functions T of the form T = log(dP/dQ) + C for some constant C ∈ R.
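The tightness claim can be checked numerically for discrete distributions: with the optimal critic T* = log(p/q), the Donsker-Varadhan value equals the exact KL divergence, while any other critic gives a strictly smaller value. The two toy distributions below are our own example.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # distribution P
q = np.array([0.4, 0.4, 0.2])   # distribution Q

kl = float(np.sum(p * np.log(p / q)))   # exact D_KL(P || Q)

T_star = np.log(p / q)                  # optimal critic T* = log dP/dQ
dv = float(np.sum(p * T_star) - np.log(np.sum(q * np.exp(T_star))))

# A suboptimal critic yields only a lower bound.
T_bad = np.array([1.0, 0.0, -1.0])
dv_bad = float(np.sum(p * T_bad) - np.log(np.sum(q * np.exp(T_bad))))
```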

C Failure case
We also confirmed that the proposed THAM is fragile to questions asking about sounds in the video, where it copies the input text 'i can hear some noise' from the history texts in Figure 7. While we admit that the above case can produce a semantically correct answer, we believe VGD systems should be able to generate richer answers in their own words. Furthermore, the sound features contain less information (128 dimensions) compared to the video features (2048 dimensions), which calls for more specialized attention.

Figure 1: Illustration of a video-grounded dialogue system including incorrect answer generation by (a) non-answerable question and (b) non-referring video.

Figure 3: Illustration of the Text Hallucination Mitigating framework (THAM) for video-grounded dialogue. THAM mitigates feature-level hallucination effects in the Response Language Model by introducing the Text Hallucination Regularization (THR) loss, where THR aims to minimize the mutual information between the encoder features of the RLM and the features from the Hallucination Language Model.

Figure 4: Illustration of the THR loss (left) and the cross-entropy loss (right) along the epochs on the valid split of AVSD@DSTC7, with and without subtracting the encoder features of the LM from the encoder features of the HLM.

Figure 5: Joint distributions of encoder features between (a) RLM (F_{<t}) and HLM (F̃_{<t}), and (b) RLM (F_{<t}) and HLM with LM subtracted (G_{<t}). (a) shows correlations with F̃_{<t} due to the grammatical knowledge in the HLM, and (b) shows relatively independent distributions for G_{<t}.
Second-stage training is performed on the RLM with the THR loss, with the same batch size and dropout rate as the first stage. The best model is decided by the lowest validation loss on the validation set with α = 0.01 in Equation (11) of the main paper, under the setting X̃_{<t} = [h||a_{<t}]. Training takes about 5 hours to fully optimize, reaching losses of about 0.184 on training and 0.284 on validation. Inference for generating the answer to a single question takes about 2 seconds. We did not perform hyperparameter search for model fine-tuning.
For a given function T, consider the Gibbs distribution G defined by dG = (1/Z) e^T dQ, where Z = E_Q[e^T]. Since log(dG/dQ) = log((1/Z) e^T) = T - log Z, by construction we can derive:

E_P[T] - log Z = E_P[log(dG/dQ)].    (13)

Let Δ be the gap:

Δ := D_KL(P||Q) - (E_P[T] - log(E_Q[e^T])).    (14)

Using the definition of the KL divergence together with Equation (13), we can write Δ as a KL divergence:

Δ = E_P[log(dP/dQ) - log(dG/dQ)] = E_P[log(dP/dG)] = D_KL(P||G) ≥ 0.

Figure 7: Failure case on a question about sounds.


Table 2: Experimental results on the test split of the AVSD benchmark at the DSTC7 and DSTC8 challenges for applying the THR loss to VGD runner models (B1 = BLEU-1, *: reconstruction-based results, †: single-reference results).

Table 3: Ablation study on variants of the HLM to learn indiscriminate text hallucination from different text inputs on the valid split of AVSD@DSTC7 (single reference).