HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue

Video-grounded Dialogue (VGD) aims to answer questions regarding a given multi-modal input comprising video, audio, and dialogue history. Although there have been numerous efforts to improve the quality of VGD systems' responses, existing systems are competent at incorporating only the information in the video and text, and tend to struggle to extract the necessary information from the audio when generating appropriate responses to the question. The VGD system seems to be deaf, and thus we coin this symptom of current systems ignoring audio data a deaf response. To overcome the deaf response problem, the Hearing Enhanced Audio Response (HEAR) framework is proposed to perform sensible listening by selectively attending to audio whenever the question requires it. The HEAR framework enhances the accuracy and audibility of VGD systems in a model-agnostic manner. HEAR is validated on VGD datasets (i.e., AVSD@DSTC7 and AVSD@DSTC8) and shows effectiveness with various VGD systems.


Introduction
One of the desiderata in our vision-language community is to build conversational agents that can look, listen, think, and speak as humans. These agents can potentially be deployed in various subsections of society, including education, security, entertainment, and assistance for visual or other impairments. In the Video-grounded Dialogue (VGD) task, given a video, audio, dialogue history, and a current round question Q^r, the VGD system is expected to answer in free form A^r to the question Q^r. For this VGD task, multi-modal interaction has been a popular solution, including the transformer (Vaswani et al., 2017), where many studies concerning modality interactions have been performed to boost performance and improve the efficiency of VGD systems.
Unfortunately, these multi-modal interactions focus only on finding joint representations between video and language. As a consequence, current agents tend to ignore audio in generating the response. This symptom of ignoring input audio in responding to the question will be referred to as the 'deaf response'. Figure 1 presents examples of deaf responses of existing VGD systems (Yoon et al., 2022c; Li et al., 2021a). To the question "Does the video have sound?" in Figure 1 (a), the system is not able to recognize the input audio: it responds as though the audio is not present. Furthermore, even when the system does recognize the existence of audio, it lacks the capability to accurately decipher the information within the audio. This is evident in Figure 1 (b), where all the sounds of people talking are disregarded as background noise, resulting in incorrect responses.
Our experimental evidence in Figure 2 shows that current VGD systems cannot incorporate audio information when responding. Figure 2 (a) shows the response performance (i.e., CIDEr (Vedantam et al., 2015), ROUGE-L (Lin, 2004), BLEU (Papineni et al., 2002)) of current VGD systems when trained with and without the audio: there is very little difference in performance between the two cases. Some metrics even show slightly higher performance when trained without audio. Furthermore, as shown in Figure 2 (b), when investigating the responses to audio-related questions¹, their performance is noticeably lower compared to the overall response performance. Therefore, existing VGD systems tend to ignore the audio and suffer from the deaf response, which leads to incorrect answers, especially to questions pertinent to audio.
To overcome the deaf response problem, we propose the Hearing Enhanced Audio Response (HEAR) framework, which allows VGD systems to listen to audio sensibly according to the meaning of the question and to perform enhanced listening to the audio. Thus, HEAR incorporates (1) Sensible Audio Listening (SAL), which selectively attends to audio according to a sensible decision of whether to focus on audio or not, and (2) Reconstructive Listening Enhancement (RLE), which improves audibility by establishing a reconstruction upper bound to connect audio with its surrounding information. For the sensible decision in SAL, we introduce two technical contributions: (1) Keyword-based Audio Sensing and (2) Semantic Neural Estimator. HEAR is applied to current runner models (Hori et al., 2019a; Yoon et al., 2022c; Li et al., 2021b) in a model-agnostic manner, where its effectiveness is validated on VGD datasets (i.e., AVSD@DSTC7 and AVSD@DSTC8) with steady performance gains on natural language generation metrics.

¹We select the questions that contain words related to audio such as 'sound', 'speech', and 'noise'.

Video-grounded Dialogues
Visual Question Answering (VQA) (Antol et al., 2015; Wang et al., 2022) has been one of the proxy tasks to evaluate the multi-modal understanding of vision-language systems. In light of recent advancements in generative models in natural language processing (Devlin et al., 2018; Radford et al., 2018), VQA has evolved into a more general format of answering, video-grounded dialogue (VGD) (Alamri et al., 2019), which aims to generate open-ended answer sentences to a question by referring to several input modalities (i.e., video, audio, and dialogue history). Many multi-modal interactions have been proposed, where various attention mechanisms (Sanabria et al., 2019; Le et al., 2019) have been devised to perform cross-modal interactions. To boost performance, transformer-based VGD systems (Li et al., 2021b) are built on top of large-scale pre-trained language models (Radford et al., 2019; Raffel et al., 2020). Another immense challenge is keeping track of extended dialogue context and video, where memory networks (Lin et al., 2019; Xie and Iacobacci, 2020) and multi-step attention (Chu et al., 2020) were introduced to efficiently store the video and long episodic dialogue. Graph representations (Kim et al., 2021; Pham et al., 2022; Le et al., 2021) were also popular solutions for holding semantic commonalities between the dialogue and video. There have been novel structures (Hori et al., 2019b; Le et al., 2022) to enhance multi-modal representations and frameworks (Le and Chen, 2020; Lee et al., 2020; Yoon et al., 2022c) to improve the quality of responses in terms of bias or word selection. As such, many advances have been made in the multi-modal understanding of VGD systems, but mainly between video and language; the VGD system's understanding of audio is still far from satisfactory. To this end, we are the first to contribute to improving the 'listening' ability of VGD systems.

Task Definition
The Video-grounded Dialogue (VGD) task (Alamri et al., 2019) aims to generate open-ended answer sentences to a question regarding multi-modal inputs composed of video, audio, and dialogue history. Formally, a VGD system takes the tuple (v, u, h, q^r) as input and decodes an answer sentence a^r, where v is video, u is audio, h is dialogue history, and q^r is the question asked at the current round r ∈ {1, ..., R}. The dialogue history h = {c, (q^1, a^1), ..., (q^{r-1}, a^{r-1})} is composed of a caption c summarizing the video and the question-answer pairs of previous rounds. To train the VGD system, next-word prediction is performed, where the system predicts the t-th answer word token a_t^r from the input tuple (v, u, h, q^r) and the partial answer word tokens a_{<t}^r before the t-th.

Hearing Enhanced Audio Response
Figure 3 illustrates the Hearing Enhanced Audio Response (HEAR) framework, designed to enhance the Dialogue Language Model (DLM)² in terms of two audio functionalities: (1) sensibility, which selectively attends to audio according to the meaning of the question, and (2) audibility, which performs enhanced listening to input audio. For sensibility, we propose Sensible Audio Listening (SAL) in Figure 3 (a), which trains the DLM to respond to a question by selectively weighting audio according to the audio-relatedness of the question. For audibility, we devise Reconstructive Listening Enhancement (RLE) in Figure 3 (b), which enhances audio representations by establishing a reconstruction upper bound to connect audio with its surrounding information. We alternately train the DLM with SAL and RLE to fully utilize the video and audio modalities.

Input representations
We formally define input feature representations of v, h, q^r, and a^r by embedding them into d-dimensional space. For the video embedding, we use the I3D model (Carreira and Zisserman, 2017) pre-trained on the Kinetics dataset (Kay et al., 2017) to get 4096-dimensional video features v ∈ R^{L×4096} composed of RGB and optical-flow features, where L is the number of video frames. For the audio embedding, we use the VGGish model (Hershey et al., 2017) pre-trained on the AudioSet dataset (Gemmeke et al., 2017) to get 128-dimensional audio features u ∈ R^{L×128}, where L is the number of audio features. The aforementioned video and audio features are concatenated along the feature dimension axis and embedded into d-dimensional space as audio-visual features u^v as given below:

u^v = LN([v ⊕ u] W_{uv}) ∈ R^{L×d},

where ⊕ denotes concatenation along the feature dimension, W_{uv} ∈ R^{(4096+128)×d} is a learnable projection, and LN is layer normalization. For the text features, we tokenize all the text inputs (i.e., h, q^r, a^r) into a series of WordPieces (Wu et al., 2016) using the T5 Transformer (Raffel et al., 2020), such that word token representations are obtained on top of relative positional embeddings and a layer normalization (Ba et al., 2016). Thus, the formal definitions of the text features are as follows: history h ∈ R^{L_h×d}, question q ∈ R^{L_q×d}, and answer a ∈ R^{L_a×d}, where L_h, L_q, and L_a are the numbers of tokens of each text.

²Here, we refer to the general 'VGD system' as DLM.
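The fusion above can be sketched as follows. This is a minimal numpy sketch: the projection matrix `W`, the layer-norm placement, and the random inputs are illustrative stand-ins for the learned embedding, not the paper's exact implementation.

```python
import numpy as np

def fuse_audio_visual(v, u, W, eps=1e-5):
    """Concatenate per-frame video (L x 4096) and audio (L x 128) features,
    project to d dimensions, and apply layer normalization (illustrative)."""
    assert v.shape[0] == u.shape[0], "video/audio must share the frame axis L"
    uv = np.concatenate([v, u], axis=-1)   # (L, 4224)
    uv = uv @ W                            # (L, d) linear embedding
    mu = uv.mean(-1, keepdims=True)
    sigma = uv.std(-1, keepdims=True)
    return (uv - mu) / (sigma + eps)       # layer-normalized audio-visual features

L, d = 4, 768
rng = np.random.default_rng(0)
v = rng.normal(size=(L, 4096))             # stand-in for I3D features
u = rng.normal(size=(L, 128))              # stand-in for VGGish features
W = rng.normal(size=(4096 + 128, d)) / np.sqrt(4224)
uv = fuse_audio_visual(v, u, W)
print(uv.shape)  # (4, 768)
```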

Dialogue Language Model
Our proposed HEAR framework operates in a model-agnostic manner, so we first define the Dialogue Language Model (DLM) as a general VGD system. For input audio, video, and texts (i.e., history and question), the DLM is trained to generate the next-word tokens of the answer sentence a^r = {a_1^r, ..., a_m^r} under the cross-entropy loss:

L_DLM(θ) = − Σ_{t=1}^{m} log p_θ(a_t^r | v, u, h, q^r, a_{<t}^r),

where θ denotes learnable weights and m is the number of word tokens in the answer sentence. In the following, our proposed SAL and RLE improve the DLM's sensibility and audibility for correctly responding to questions about audio.
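The next-word prediction objective can be illustrated with a toy example. The probability tables and tokens here are hypothetical; a real DLM produces per-step distributions over its full vocabulary from the multi-modal encoder.

```python
import math

def next_word_nll(step_probs, target_tokens):
    """Teacher-forced negative log-likelihood: step_probs[t] maps candidate
    tokens to p(token | v, u, h, q, a_<t); target_tokens[t] is the gold
    answer token a_t. (Toy stand-in for the DLM cross-entropy objective.)"""
    return -sum(math.log(step_probs[t][tok])
                for t, tok in enumerate(target_tokens))

# Hypothetical three-step distributions for the answer "i can hear".
probs = [{"i": 0.7, "he": 0.3},
         {"can": 0.6, "cannot": 0.4},
         {"hear": 0.8, "see": 0.2}]
loss = next_word_nll(probs, ["i", "can", "hear"])
print(round(loss, 4))  # 1.0906
```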

Sensible Audio Listening
Sensible Audio Listening (SAL) is devised to determine whether the VGD system should listen to the audio to answer a given question. If SAL determines that listening is required, the audio is processed to be weighted more in the response. To ensure sensible decision-making within SAL, we introduce two technical decision rules: (1) Keyword-based Audio Sensing and (2) Semantic Neural Estimator.
Keyword-based Audio Sensing. In many cases, unlike general questions, audio-related questions contain specific keywords (e.g., 'sound', 'speech', 'listen') that indicate that the question concerns audio. Therefore, we conduct an empirical investigation of these keywords. If any of these keywords are present in a question, SAL identifies it as an audio-related question. In such cases, we mask the video features in the inputs, directing more attention toward the audio component as given below:

u^v(q) = Emb([m_v ⊕ u]) if ∃ w_q ∈ W_key, else Emb([v ⊕ u]),

where Emb denotes the audio-visual embedding described above, w_q denotes the word tokens in the question q, W_key = {'sound', 'speech', 'listen', ...} is the keyword set used to identify audio-related questions, and m_v ∈ R^{L×4096} is zero padding on the video. We do not perform any masking when the question is not an audio-related question (i.e., ∀w_q ∉ W_key), because questions other than audio-related ones do not show high reliance on a specific modality. Thus, u^v(q) ∈ R^{L×d} is a sensible audio-visual feature that makes the DLM selectively focus on audio when responding to audio-related questions.

Semantic Neural Estimator. While the keyword-based approach effectively identifies questions that directly require information about audio, we were still able to identify outliers that it does not capture, shown in Figure 4 (b): although they contain no keywords, the meanings of these questions are related to audio. Therefore, we further devise a semantic neural estimator f_ϕ that identifies audio-related questions based on the meaning of the sentence. The semantic neural estimator, a BERT-based classifier, takes {w_cls, w_q} as an input instance, where w_cls is a classification token whose prediction target is y ∈ {y_0, y_1}, with y_1 = 1 denoting an audio-related question and y_0 = 0 any other question. We first train f_ϕ with training weights ϕ under an L2 loss E_D(y − ŷ)², where D is the dataset and ŷ = f_ϕ(w_cls, w_q) is the prediction score of the sigmoid (i.e., 0 < ŷ < 1). To label y, we first include the predictions of keyword-based audio sensing as noisy labels. We further include in y_0 (1) word-wise randomly shuffled versions of audio-related questions and (2) versions of non-audio questions with random keyword swaps using W_key. This prevents f_ϕ from simply predicting based on the keywords and directs more focus to the meaning of the audio-related questions in y_1. After training f_ϕ, we calibrate the audio and video features according to the estimation score r = f_ϕ(w_cls, w_q) ∈ R of the question's audio-relatedness, which yields our final sensible audio-visual features u^v(q) ∈ R^{L×d} for SAL.
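The two SAL decision rules can be sketched as follows. This is a simplified sketch: the keyword set is only a subset of the appendix's W_key, `estimator` is a hypothetical stand-in for the trained f_ϕ, and the hard 0.5 threshold stands in for the paper's score-based feature calibration.

```python
AUDIO_KEYWORDS = {"sound", "speech", "listen", "hear", "noise",
                  "music", "voice", "talk", "say", "sing", "audio"}

def keyword_audio_sensing(question_tokens):
    """Rule (1): flag the question as audio-related iff any keyword appears."""
    return any(tok.lower().strip("?.,") in AUDIO_KEYWORDS
               for tok in question_tokens)

def sensible_features(v_feat, u_feat, question_tokens, estimator=None):
    """If the question is audio-related, zero out ('mask') the video features
    so attention shifts to the audio. `estimator` is a hypothetical stand-in
    for f_phi returning a score 0 < r < 1; the paper calibrates features by
    the continuous score r, whereas this sketch uses a hard threshold."""
    if estimator is not None:
        audio_related = estimator(question_tokens) > 0.5
    else:
        audio_related = keyword_audio_sensing(question_tokens)
    if audio_related:
        v_feat = [[0.0] * len(row) for row in v_feat]  # m_v: zero padding
    return v_feat, u_feat

v = [[1.0, 2.0], [3.0, 4.0]]
u = [[0.5], [0.6]]
v_q, u_q = sensible_features(v, u, "does the video have sound ?".split())
print(v_q)  # [[0.0, 0.0], [0.0, 0.0]]
```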
Based on the sensible audio-visual features u^v(q) from SAL, we train the Dialogue Language Model (DLM) with the cross-entropy loss:

L_SAL(θ) = − Σ_{t=1}^{m} log p_θ(a_t^r | u^v(q), h, q^r, a_{<t}^r).

Reconstructive Listening Enhancement
For a better understanding of audio, it is crucial to improve audio representations based on a common understanding of the scene in a video. Therefore, our proposed Reconstructive Listening Enhancement (RLE) performs masked audio reconstruction by referring to surrounding information (i.e., video and audio adjacent to the masked audio), which enhances the common embedding of the audio with its surroundings. In particular, to perform effective enhancement in regions closer to the masked target, we propose the Reconstruction Upper Bound in the following.
Audio Reconstruction. Audio reconstruction aims to predict masked audio based on observations of the surrounding audio and the other modalities (i.e., video and text). We randomly select input audio with a probability of p (e.g., p = 10%) as u_m, where u_m ∈ R^{M×128} is the set of target audio features to be masked, m is the set of indices for masking, and M is the number of indices. We also define u_\m ∈ R^{L×128} as the surrounding audio features, with zero vectors in place of the masked features at the indices m. We introduce an audio reconstruction loss L_ar for the DLM to reconstruct the target audio features u_m from the inputs {u_\m, v, h, q}:

L_ar(θ) = ||û_m − u_m||²,

where û_m is the reconstructed audio at the masked indices m, obtained from a multi-layer perceptron on top of the DLM encoder.
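The masking step can be sketched as follows (assumptions: list-based features and a fixed seed for illustration; the paper operates on VGGish feature tensors):

```python
import random

def mask_audio(u, p=0.10, seed=0):
    """Select each audio frame index with probability p; return the masked
    targets u_m, the surrounding stream with masked slots zeroed, and the
    chosen index set m."""
    rng = random.Random(seed)
    m = [i for i in range(len(u)) if rng.random() < p]
    if not m:                       # guarantee at least one masked target
        m = [rng.randrange(len(u))]
    u_m = [u[i] for i in m]
    u_surround = [([0.0] * len(u[i]) if i in m else u[i])
                  for i in range(len(u))]
    return u_m, u_surround, m

u = [[float(i)] * 4 for i in range(20)]  # 20 frames of 4-dim audio features
u_m, u_surround, m = mask_audio(u, p=0.10)
```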
Reconstruction Upper Bound. The audio reconstruction should be based on an understanding of surrounding semantics rather than simply memorizing audio. To facilitate this, we propose the Reconstruction Upper Bound (RUB), which establishes an inequality condition to ensure that reconstruction is better when the surrounding information is provided than when it is not:

L_ar(θ) ≤ L_ar^n(θ),

where the audio reconstruction loss based on surroundings, L_ar(θ) = h_θ(u_m | u_\m, v, h, q), should be lower than the reconstruction loss without surroundings, L_ar^n(θ) = h_θ(u_m | u_\m^n, v^n, h, q) (i.e., the upper bound). As shown in Figure 5 (a), u_\m^n and v^n are audio and video features that further mask out the surroundings up to a feature distance n (e.g., n = 3) from the target audio features. To train the DLM to conform to this inequality condition, we compute a reconstruction upper bound loss L_rub as a ranking loss with margin δ:

L_rub(θ) = max(0, L_ar(θ) − L_ar^n(θ) + δ).

After that, we iteratively minimize the upper bound L_ar^n by narrowing the surrounding masking distance n, as shown in Figure 5 (b). Here, we construct a masking distance schedule n = g(e) according to the training epoch e, selecting progressively lower n from higher n. The schedule g is drawn from a set of scheduling functions G : R+ → R+, and we select the hyperbolic function g(e) = round(α √(e_max − e)) + 1, as depicted in Figure 6, to give a stable decrease of n, with α = (n_max − 1)/√(e_max − 1), where n_max = 5 and e_max = 15 are the maximum distance and epoch. Therefore, the final RLE loss is the summation of the two losses: L_RLE = L_ar + L_rub.
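The hyperbolic schedule and the margin-based RUB loss above can be written directly; the loss values passed to `rub_loss` would come from the two reconstruction passes (with and without surroundings).

```python
import math

def masking_distance(e, e_max=15, n_max=5):
    """Hyperbolic schedule g(e) = round(alpha * sqrt(e_max - e)) + 1 with
    alpha = (n_max - 1) / sqrt(e_max - 1): starts at n_max, decays to 1."""
    alpha = (n_max - 1) / math.sqrt(e_max - 1)
    return round(alpha * math.sqrt(e_max - e)) + 1

def rub_loss(l_ar, l_ar_n, delta=0.05):
    """Margin ranking loss enforcing L_ar + delta <= L_ar^n, i.e.
    reconstruction with surroundings should beat reconstruction without."""
    return max(0.0, l_ar - l_ar_n + delta)

schedule = [masking_distance(e) for e in range(1, 16)]
print(schedule)  # [5, 5, 5, 5, 4, 4, 4, 4, 4, 3, 3, 3, 3, 2, 1]
```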

Results on AVSD benchmark
Table 1 summarizes the experimental results of HEAR on AVSD@DSTC7 and AVSD@DSTC8. HEAR shows state-of-the-art performance on all metrics compared to previous works (please refer to Related Work for their detailed citations). Our baseline DLM is the T5 Transformer (Raffel et al., 2020), the same baseline (i.e., T5RLM) as THAM (Yoon et al., 2022c), but here our proposed SAL shows larger gains, and further improvements are obtained by applying RLE. As our proposed HEAR operates in a model-agnostic manner, we also validate other VGD models with HEAR in Table 2. We utilize the public code and papers for MTN, RLM, and T5RLM, where steady gains on all metrics are shown for all models.

Ablation Study
Table 3 summarizes the ablative results of the proposed modules in the HEAR framework. The first section of Table 3 reports our base DLM performance. In the second section, among the variants of SAL, smaller gains are obtained when using only keyword-based audio sensing. We think that using it alone was not beneficial for audio-related questions that also require referencing the video, as this method unconditionally screens out video features whenever the given question is related to audio. In the case of RLE, the audio reconstruction loss L_ar generally has a positive effect on system performance. However, the reconstruction upper bound loss L_rub becomes effective only when L_ar and L_rub are used together. This indicates that the ranking loss in L_rub relies on well-reconstructed audio features.
One may wonder whether HEAR really helps answer audio-related questions. Figure 7 shows the response performance of VGD systems including HEAR according to question type. HEAR achieves new state-of-the-art performance on top of previous runner models, and it is also meaningful that the gains mainly come from improved responses to audio-related questions. From the results in Figure 7 (b), (c) and Table 2, we conclude that the HEAR framework can be applied to any VGD system, where it exclusively contributes to improving the system's audibility and generating correct responses to audio-related questions.
Furthermore, we found that our proposed RLE also contributes to the efficient learning of sensible audio listening (i.e., optimizing L_SAL). Figure 8 (a) shows the validation loss L_SAL according to training epochs, where it can be seen that L_SAL is further reduced when the RLE loss L_RLE is applied. Figure 8 (b) shows an ablation of the audio reconstruction loss L_ar with and without the reconstruction upper bound L_rub. The yellow curve denotes the upper bound loss L_ar^{g(e)} for L_rub, which decreases according to our masking distance schedule g(e). The green curve denotes L_ar with L_rub, and it shows further optimization compared to the variant without L_rub as L_ar^{g(e)} decreases. This indicates that neural networks can be further optimized over the training epochs by calibrating their training objectives, which has also been validated in other multi-modal systems (Yoon et al., 2023; Zheng et al., 2022) in other ways.

Qualitative Results
Figure 9 illustrates HEAR's response to an audio-related question. HEAR is given the question "What genre of music is playing?" as the audio-related question, and it generates the answer sentence "I can't be sure, but I think it's punk music." Here, HEAR precisely predicted that the given audio was punk music despite the challenge of discerning what the sound in the video is. More interestingly, HEAR expresses its uncertainty with "I can't be sure" when predicting the music. When we listened to the audio in Figure 9 (a), it was indeed quiet and difficult to identify. We surmise that HEAR also learned which sounds are difficult for humans to distinguish. Table 4 shows predictions on questions by Keyword-based Audio Sensing and the Semantic Neural Estimator. As the keyword-based approach cannot understand the meaning of the question, it is incorrect on some questions (i.e., (c, d, e)). The neural estimator presents a score 0 < r < 1 denoting whether the question is related to the audio, which provides proper distinctions between audio-related questions (i.e., (a, b, c, d)) and the others (i.e., (e, f)) based on the meaning of the question.

Conclusion
We

Limitations
Our research aims to enhance the audibility of Video-grounded Dialogue (VGD) systems, for which we devise the Hearing Enhanced Audio Response (HEAR) framework. As a limitation, our work focuses primarily on methodology; a better understanding of the limitations of the proposed method is needed to address its failure cases.
As shown in the failure case in the Appendix, our proposed method has limitations, and we would need to consider other architectures and training methods to incorporate all the necessary information that speech holds. To understand human speech, current audio features seem to require further training on large-scale automatic speech recognition datasets (Panayotov et al., 2015; Garofolo, 1993). Furthermore, although our proposed HEAR mitigates this problem methodologically, we also think the deaf response problem can be alleviated by expanding the audio feature dimension (i.e., 128) to a scale comparable with video (i.e., 4096), such that the features include more detailed information. Specifically, we are currently replacing the current audio feature extractor (i.e., VGGish) with wav2vec 2.0 (Baevski et al., 2020), which provides larger-dimensional audio features (i.e., 768 dimensions). We will make these audio features (i.e., wav2vec 2.0 features) publicly available and pursue further study of this as future work.

Ethics Statement
A Video-grounded Dialogue system is a conversational AI agent designed to provide assistance in various subsections of our environments, including security, entertainment, education, and assistance for visual impairments. Our proposed Hearing Enhanced Audio Response framework contributes to improving responses to queries about audio by enhancing the audibility of VGD systems. Recently, chatbot systems (e.g., ChatGPT) have shown overwhelming performance; as such, we should also think about the potential negative societal impact of these systems. In this respect, we identify two negative impacts: (1) unreliable, vague information given by conversational agents and (2) fairness issues in agents' responses. Therefore, word sense disambiguation techniques (Yoon et al., 2022a) and multi-modal debiasing solutions (Yoon et al., 2022b; Niu et al., 2021) should also be applied to dialogue systems.

A Experimental Details
Training. The proposed HEAR is trained on an NVIDIA Quadro RTX 8000 GPU (48GB of memory). The optimization details are as follows. The AdamW optimizer (Loshchilov and Hutter, 2017) is used with parameters β_1 = 0.9, β_2 = 0.999, and ϵ = 10⁻⁸. For efficient usage of resources, the caption and the previous 3 question-answer pairs are introduced into HEAR as the dialogue history for answering the current question Q^r. The learning rate is initially set to lr = 6.24e-5 and linearly decreases following a piecewise linear curve down to lr = 3.63e-10, and the model is trained for 15 epochs. The margin hyperparameter in L_RLE is δ = 0.05. For the joint d-dimensional space, all modalities (i.e., video, audio, words) are embedded into d = 768 dimensional space. The best model is selected by the lowest validation loss on AVSD@DSTC7 (AVSD@DSTC8 contains only a test set for evaluation). Training takes 14 hours in total, and the model is fully optimized in about 10 hours.
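The learning-rate decay can be sketched as a linear interpolation between the two stated endpoints; the breakpoints of the actual piecewise curve are not given, so a single linear segment is assumed here.

```python
def linear_lr(step, total_steps, lr_start=6.24e-5, lr_end=3.63e-10):
    """Linearly interpolate the learning rate from lr_start down to lr_end."""
    frac = min(step / total_steps, 1.0)
    return lr_start + frac * (lr_end - lr_start)

# Learning rate at each of 100 scheduler steps (step count is illustrative).
lrs = [linear_lr(s, 100) for s in range(101)]
```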
Inference. At inference, answer word tokens are generated sequentially by probability, where beam search with a beam size of 5 is applied to avoid premature word selection. The maximum length of the answer word tokens is set to 20 with a length penalty of 0.3. In the Experiments of the main paper, every reported performance of HEAR is averaged over 15 runs with random seed numbers.
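The decoding setup (beam size 5, maximum length 20, length penalty 0.3) can be sketched as follows. Here `toy_step` is a hypothetical stand-in for the DLM's next-token distribution, and the length-normalization form shown is one common choice rather than necessarily the paper's exact formula.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=5, max_len=20, length_penalty=0.3):
    """Length-penalized beam search: step_fn(seq) yields (token, log_prob)
    continuations; finished beams compete on score / len**length_penalty."""
    beams, finished = [([bos], 0.0)], []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos:
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq):
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    best_seq, _ = max(finished, key=lambda c: c[1] / len(c[0]) ** length_penalty)
    return best_seq

def toy_step(seq):
    # Hypothetical next-token log-probabilities keyed by the last token.
    table = {"<s>": [("yes", math.log(0.6)), ("no", math.log(0.4))],
             "yes": [(",", 0.0)], ",": [("i", 0.0)], "i": [("can", 0.0)],
             "can": [("hear", 0.0)], "hear": [("it", 0.0)],
             "it": [("</s>", 0.0)], "no": [("</s>", 0.0)]}
    return table[seq[-1]]

best = beam_search(toy_step, "<s>", "</s>")
print(" ".join(best))  # <s> yes , i can hear it </s>
```

Note that without the length penalty, the short answer "no" would win on raw log-probability; the penalty lets the longer, higher-quality continuation prevail.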

B Additional Experiments
To improve the reproducibility of the proposed modules (i.e., SAL, RLE) in HEAR, we give detailed explanations with additional results and illustrations of the methods.

B.1 Keyword lists for keyword-based audio sensing
For details of Section 4.3, we list the keywords used for keyword-based audio sensing:

W_key = {sound, voice, speech, speak, talk, listen, hear, say, sing, music, audio, call, hum, loud, tones, utter, volume, song},    (11)

where we also consider all the keywords' plural forms in W_key. Figure 10 summarizes the proportion of audio-related questions according to keywords.

B.2 Distance scheduling for surrounding masking

The reconstruction upper bound loss L_rub ensures that the audio reconstruction L_ar based on the surrounding information has improved quality over the audio reconstruction L_ar^n without the surrounding information. To make L_rub more effective, L_ar^n can be designed to minimize its loss by reducing the distance n of the surrounding mask. To this end, we vary the extent of the surrounding mask under various modelings of the schedule n = g(e), shown in Figure 11. We narrowed down the extent of surrounding masking based on three different curves: a linear curve, a logistic curve, and a hyperbolic curve. The hyperbolic and linear curves are effective in optimizing the validation loss. However, the logistic curve shows some deterioration in the optimization. We think this is because the logistic curve applies the surrounding masking to a very narrow area from the beginning of training, which acts like a hard negative that is almost identical to the positive reconstruction (i.e., reconstruction from only the masked target audio). Providing hard negatives in early training is considered to hinder optimization when the learnable weights of the model are not yet properly trained. Therefore, surrounding masking with hyperbolic scheduling is the most effective.

Under distance scheduling with the hyperbolic curve, when n_max is small (e.g., n_max = 2 or 3), L_ar^n performs audio reconstruction based on the video and audio nearest to the target audio. This can be effective for improving the connectivity of neighboring video and audio at a narrow distance; however, since the distance is too close, there is no significant difference from the positive audio reconstruction, which hinders reconstruction. Based on the results in Table 6, negative masking was effective when n_max = 4 or 5. In this case, the surrounding modalities' features for audio reconstruction were properly removed by masking, while the surroundings were not masked excessively; thus the masking contributes to hard negative audio reconstruction. When n_max is large (e.g., n_max = 6), the performance again degrades. In this case, we surmise that the negative reconstruction is too different from the positive one, so it is not an effective negative. These studies were also conducted for larger n_max, but no further improvement was observed.

Table 7: Human evaluation on responses to 50 audio-related questions with respect to semantic adequacy, grammatical correctness, and fluency. The scores are assigned as ("1: not at all", "2: neutral", "3: correct"). GT: Ground-Truth response.

C Human Evaluation
To assess the improvement of the HEAR framework in responses to audio-related questions, we conduct a human evaluation. We selected a total of 50 audio-related questions and generated their responses with 3 models (i.e., HEAR, THAM, RLM). We evaluate all responses on a scale of 1 to 3, considering three key perspectives: semantic adequacy, grammatical correctness, and fluency. Each score denotes "1: not at all", "2: neutral", or "3: correct". Table 7 shows the human evaluation scores based on 11 evaluators. HEAR outperformed recent runner models in all three categories, receiving higher ratings. Furthermore, through empirical validation, we confirmed the presence of inappropriate answers within the Ground-Truth responses. As a result, the human evaluation scores appropriately reflect this by not consistently reaching a score of 3.0 or a value close to it.

D Additional results and Failure case
As shown in Figure 13, the HEAR framework demonstrates limitations in accurately comprehending the language within audio inputs. Our empirical studies found that this challenge is prevalent among various dialogue language models, including ours. Note that the current audio features used in the HEAR framework are pre-trained on an environmental-sound dataset, AudioSet (Gemmeke et al., 2017); further training appears necessary to enhance the system's ability to understand speech effectively. To address this challenge, our future work will focus on incorporating audio features, such as wav2vec 2.0 (Baevski et al., 2020), that have been trained specifically for speech recognition tasks. By leveraging these advanced audio features, we aim to enhance the model's ability to accurately understand and process speech-related information.

Figure 1: Current VGD system's deaf responses on questions about audio: (a) Audio is considered not present and (b) Audio is disregarded as background noise.

Figure 2: Current VGD systems' performances on AVSD dataset (validation): (a) Response performances according to training with and without audio, (b) Average performance drops on the questions about audio.

Figure 3: Illustration of the Hearing Enhanced Audio Response (HEAR) framework for video-grounded dialogue. HEAR performs sensible listening via (a) Sensible Audio Listening, which selectively attends to audio corresponding to a given question, and improves audibility via (b) Reconstructive Listening Enhancement, which enhances audio representations by establishing a reconstruction upper bound to connect audio with its surrounding information.
Figure 4 (a) illustrates the audio-related questions identified by keyword-based audio sensing.

Figure 4: Examples of questions: (a) audio-related questions predicted by keyword-based audio sensing and (b) outliers of keyword-based audio sensing.

Figure 5: Illustrations of surrounding masking. The distance n decides the extent of surrounding masking.

Figure 7: VGD systems' response performances on audio-related questions of AVSD@DSTC7 (validation): (a) Total questions, (b) Audio-related questions predicted by Semantic Neural Estimator (questions with estimation score r > 0.7), (c) Audio-related questions predicted by Keyword-based Audio Sensing.

Figure 8: Ablative results on validation losses according to epochs: (a) SAL loss L_SAL with and without RLE loss L_RLE, (b) audio reconstruction loss L_ar with and without reconstruction upper bound loss L_rub.

Figure 9: HEAR's response to an audio-related question.

Figure 11: Illustrations of distance scheduling for surrounding masking. The distance n decides how much of the video and audio features to mask for establishing the upper bound L_ar^n: (a) linear curve, (b) logistic curve, (c) hyperbolic curve. Audio reconstruction losses with the reconstruction upper bound using (d) linear curve scheduling, (e) logistic curve scheduling, (f) hyperbolic curve scheduling.

Figure 12 illustrates additional results of the HEAR framework. When presented with the question "Can you hear the TV?", HEAR provides an affirmative response, "Yes, I can hear the TV." This response is generated by considering both the audio input of the TV and the visual context of the TV image. Although our proposed HEAR improves the understanding of audio, there were instances of failure on questions pertaining to speech recognition (see Figure 13).

Figure 12: Illustration of additional results of HEAR.

Figure 13: Illustration of a failure case of HEAR on a speech-related question.

Table 3: Ablative results of the proposed modules in HEAR. L_ar: Audio Reconstruction, L_rub: Reconstruction Upper Bound.

Table 4: Predictions on questions about audio-relatedness. K: Keyword-based Audio Sensing, S: Semantic Neural Estimator, H: human rating, T: true, F: false.

Figure 10: Chart representing the proportion of audio-related questions according to keywords in W_key.

Table 6 summarizes the results of HEAR according to the distance n on the validation split of AVSD@DSTC7.

Table 6: Ablation study on the distance of negative masking of HEAR on the validation split of AVSD@DSTC7.