Target-Aware Spatio-Temporal Reasoning via Answering Questions in Dynamic Audio-Visual Scenarios

Audio-visual question answering (AVQA) is a challenging task that requires multi-step spatio-temporal reasoning over multimodal contexts. Recent works rely on elaborate target-agnostic parsing of audio-visual scenes for spatial grounding while treating audio and video as separate entities for temporal grounding. This paper proposes a new target-aware joint spatio-temporal grounding network for AVQA. It consists of two key components: the target-aware spatial grounding module (TSG) and the single-stream joint audio-visual temporal grounding module (JTG). The TSG focuses on audio-visual cues relevant to the query subject by utilizing explicit semantics from the question. Unlike previous two-stream temporal grounding modules that require an additional audio-visual fusion module, the JTG incorporates audio-visual fusion and question-aware temporal grounding into one module with a simpler single-stream architecture. The temporal synchronization between audio and video in the JTG is facilitated by our proposed cross-modal synchrony loss (CSL). Extensive experiments verify the effectiveness of our proposed method over existing state-of-the-art methods.


Introduction
Audio-visual question answering (AVQA) has received considerable attention due to its potential applications in many real-world scenarios. It provides avenues to integrate multimodal information to achieve human-like scene understanding.
As shown in Figure 1, the AVQA model aims to answer questions regarding visual objects, sound patterns, and their spatio-temporal associations. Compared to traditional video question answering, the AVQA task presents specific challenges in the following areas. Firstly, it involves effectively fusing audio and visual information to obtain the correlation of the two modalities, especially when there are multiple sounding sources, such as ambient noise, or similar categories in either the audio or visual feature space, such as guitars and ukuleles. Secondly, it requires capturing the question-relevant audio-visual features while maintaining their temporal synchronization in a multimedia video.

Figure 1: An illustration of the Audio-Visual Question Answering task. Concretely, the question is centered around the "instruments" (i.e., the target) and is broken down into "how many", "did not sound", and "from beginning to end" in terms of visual space, audio, and temporality, respectively. Identifying the three instruments that did not produce sound throughout the video may entail a significant time investment for a human viewer. Nonetheless, for an AI system with effective spatio-temporal reasoning capabilities, the task can be accomplished much more efficiently.
Although there have been a number of promising works (Zhou et al., 2021; Tian et al., 2020; Lin and Wang, 2020) in the audio-visual scene understanding community that attempt to solve the first challenge, they primarily perform targetless parsing of the entire audio-visual scene. Most of them (Xuan et al., 2020; Wu et al., 2019; Mercea et al., 2022) obtain untargeted sound-related visual regions by designing audio-to-visual attention schemes while ignoring the question-oriented information from the text modality. However, the understanding of audio-visual scenes in AVQA tasks is often target-oriented. For example, as illustrated in Figure 1, our focus lies solely on the subject of inquiry, i.e., the instruments, disregarding the singing person or ambient sound. Traditional AVQA approaches (Li et al., 2022; Yun et al., 2021; Yang et al., 2022), inherited from the audio-visual scene understanding community, rely on aligning all audio-visual elements in the video to answer a question. This introduces much irrelevant information and makes it difficult to identify the relevant objects in complex scenes. As for the second challenge, most existing methods (Yun et al., 2021; Yang et al., 2022; Lin et al., 2023) employ a typical attention-based two-stream framework. As shown in Figure 2(a), such a two-stream architecture processes audio and video in separate streams while overlooking the unity of the audio and visual modalities. In particular, temporal grounding and audio-visual fusion are isolated, with fusion occurring through an additional module.

Figure 2: Comparison of different question-aware temporal grounding approaches. (a) The traditional approach usually adopts a dual-stream network that treats audio and video as separate entities. (b) Our proposed cross-modal synchrony loss ensures interaction between the audio and visual modalities. (c) Our proposed single-stream architecture treats audio and video as a whole, thus incorporating temporal grounding and fusion.
To effectively address these two challenges, we propose a target-aware joint spatio-temporal grounding (TJSTG) network for AVQA.Our proposed approach has two key components.
Firstly, we introduce the target-aware spatial grounding (TSG) module, which enables the model to focus on audio-visual cues relevant to the query subject, i.e., the target, instead of all audio-visual elements. We exploit the explicit semantics of the text modality in the question and introduce them during audio-visual alignment. In this way, there will be a noticeable distinction between concepts such as the ukulele and the guitar. Accordingly, we propose an attention-based target-aware (TA) module that first recognizes the query subject in the question sentence and then focuses on the interesting sounding area through spatial grounding.
Secondly, we propose a cross-modal synchrony loss (CSL) and the corresponding joint audio-visual temporal grounding (JTG) module. In contrast to the existing prevalent two-stream frameworks that treat audio and video as separate entities (Figure 2(a)), the CSL enforces the question to have synchronized attention weights on the visual and audio modalities during question-aware temporal grounding (Figure 2(b)) via the JS divergence. Furthermore, it presents avenues to incorporate question-aware temporal grounding and audio-visual fusion into a more straightforward single-stream architecture (Figure 2(c)), instead of the conventional approach of performing temporal grounding first and fusion later. In this way, the network is forced to jointly capture and fuse audio and visual features that are supposed to be united and temporally synchronized. This simpler architecture achieves comparable or even better performance.
The main contributions of this paper are summarized as follows:

• We propose a novel single-stream framework, the joint audio-visual temporal grounding (JTG) module, which treats audio and video as a unified entity and seamlessly integrates fusion and temporal grounding within a single module.
• We propose a novel target-aware spatial grounding (TSG) module to introduce the explicit semantics of the question during audio-visual spatial grounding for capturing the visual features of interesting sounding areas. An attention-based target-aware (TA) module is proposed to recognize the target of interest from the question.
• We propose a cross-modal synchrony loss (CSL) to facilitate the temporal synchronization between audio and video during question-aware temporal grounding.

Audio-Visual-Language Learning
By integrating information from multiple modalities, it is expected to achieve a sufficient understanding of the scene and reciprocally nurture the development of specific tasks within a single modality. AVLNet (Rouditchenko et al., 2020) and MCN (Chen et al., 2021a) utilize audio to enhance text-to-video retrieval. AVCA (Mercea et al., 2022) proposes to learn multi-modal representations from audio-visual data and exploit textual label embeddings for transferring knowledge from seen classes of videos to unseen classes. Compared to previous works in audio-visual learning, such as sounding object localization (Afouras et al., 2020; Hu et al., 2020, 2022) and audio-visual event localization (Liu et al., 2022; Lin et al., 2023), these works (Mercea et al., 2022; Zhu et al., 2020; Tan et al., 2023) have made great progress in integrating the naturally aligned visual and auditory properties of objects and enriching scenes with explicit semantic information by further introducing the textual modality. Besides, many works (Akbari et al., 2021; Zellers et al., 2022; Gabeur et al., 2020) propose to learn multimodal representations from the audio, visual, and text modalities that can be directly exploited for multiple downstream tasks. Unlike previous works focused on learning single- or multi-modal representations, this work delves into the fundamental yet challenging task of spatio-temporal reasoning in scene understanding. Building upon MUSIC-AVQA (Li et al., 2022), our approach leverages explicit textual semantics to integrate audio-visual cues and enhance the study of dynamic and long-term audio-visual scenes.

Audio-Visual Question Answering
The demand for multimodal cognitive abilities in AI has grown alongside the advancements in deep learning techniques. Audio-Visual Question Answering (AVQA), which exploits the natural multimodal medium of video, unlike previous question answering (Lei et al., 2018; You et al., 2021; Chen et al., 2021b; Wang et al., 2021), is attracting increasing attention from researchers (Zhuang et al., 2020; Miyanishi and Kawanabe, 2021; Schwartz et al., 2019; Zhu et al., 2020). Pano-AVQA (Yun et al., 2021) introduces audio-visual question answering in panoramic video and a corresponding Transformer-based encoder-decoder approach. MUSIC-AVQA (Li et al., 2022) offers a strong baseline by decomposing AVQA into audio-visual fusion through spatial correlation of audio-visual elements and question-aware temporal grounding through text-audio and text-visual cross-attention. AVQA (Yang et al., 2022) proposed a hierarchical audio-visual fusing module to explore the impact of different fusion orders of the three modalities on performance. LAVISH (Lin et al., 2023) introduced a novel parameter-efficient framework to encode audio-visual scenes, which fuses the audio and visual modalities in the shallow layers of the feature extraction stage and thus achieves SOTA. Although LAVISH proposes a robust audio-visual backbone network, it still necessitates the spatio-temporal grounding network proposed in (Li et al., 2022), as MUSIC-AVQA contains dynamic and long-duration scenarios requiring significant spatio-temporal reasoning capability. Unlike previous works, we propose the TSG module to leverage the explicit semantics of the inquiry target and the JTG module to leverage the temporal synchronization between audio and video in a novel single-stream framework, thus improving the multimodal learning of audio, vision, and language.

Methodology
To solve the AVQA problem, we propose a target-aware joint spatio-temporal grounding network and ensure the integration between the audio-visual modalities by observing the natural integrity of the audio-visual cues. The aim is to achieve better audio-visual scene understanding and intentional spatio-temporal reasoning. An overview of the proposed framework is illustrated in Figure 3.

Audio-visual-language Input Embeddings
Given an input video sequence containing both visual and audio tracks, it is first divided into T non-overlapping visual and audio segment pairs {V_t, A_t}_{t=1}^T. The question sentence Q consists of at most N words. To demonstrate the effectiveness of our proposed method, we follow the MUSIC-AVQA (Li et al., 2022) approach and use the same feature extraction backbone networks.
Audio Embedding. Each audio segment A_t is encoded into f_a^t ∈ R^{d_a} by a VGGish (Gemmeke et al., 2017b) model pretrained on AudioSet (Gemmeke et al., 2017a); VGGish is a VGG-like 2D CNN operating on transformed audio spectrograms.
Visual Embedding. A fixed number of frames are sampled from all video segments. For each segment V_t, the sampled frame is encoded into a visual feature map f_{v,m}^t ∈ R^{h×w×d_v} by ResNet-18 (He et al., 2016) pretrained on ImageNet (Russakovsky et al., 2015), where h and w are the height and width of the feature map, respectively.
Question Embedding. The question sentence Q is tokenized into N individual words {q_n}_{n=1}^N by word2vec (Mikolov et al., 2013). Next, a learnable LSTM processes the word embeddings, yielding the word-level output f_q ∈ R^{N×d_q} and the last state vector h_q as the encoded sentence-level question feature.
Note that all the pretrained models are frozen.

Target-aware Spatial Grounding Module
While sound source localization in visual scenes reflects the spatial association between the audio and visual modalities, it is cumbersome to elaborately align all audio-visual elements during question answering due to the high complexity of audio-visual scenes. Therefore, the target-aware spatial grounding module (TSG) is proposed to encourage the model to focus on the query object of genuine interest by introducing the text modality from the question.
Target-aware (TA) module. For the word-level question feature f_q ∈ R^{N×d_q}, we aim to locate the target subject, represented as f_tgt ∈ R^{d_q}, which carries the explicit semantics associated with the audio-visual scenes. Specifically, we index the target according to question-contributed scores. To compute these scores, we use the sentence-level question feature h_q as the query vector and the word-level question feature f_q as the key and value vectors to perform multi-head attention (MHA):

s = σ( h_q f_q^⊤ / √d ),   (1)

where f_q = [f_q^1; ...; f_q^N], σ is the softmax function, and d is a scaling factor with the same size as the feature dimension. s ∈ R^{1×N} represents the weight of each word's contribution to the final question feature. Next, we index the feature of the target, which will be enhanced in the subsequent spatial grounding, as:

f_tgt = f_q[ argmax_n (s_n) ],   (2)

where f_tgt has the highest contribution weight to the question feature.
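As an illustrative sketch of the target-selection step (not the authors' released code, and with single-head attention standing in for full MHA), the scoring and indexing can be written in plain NumPy:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_target(h_q, f_q):
    """Score each word against the sentence feature and return the top word.

    h_q : (d,)   sentence-level question feature (query)
    f_q : (N, d) word-level question features (keys/values)
    Returns the target word feature and the contribution scores.
    """
    d = h_q.shape[-1]
    s = softmax(h_q @ f_q.T / np.sqrt(d))  # (N,) contribution score per word
    f_tgt = f_q[int(np.argmax(s))]         # word with the highest contribution
    return f_tgt, s
```

The scores form a distribution over the N words, and the argmax picks the single word treated as the query subject for spatial grounding.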
Interesting spatial grounding module. One way of mapping man-made concepts to the natural environment is to incorporate explicit semantics into the understanding of audio-visual scenarios.
For each video segment, the visual feature map f_{v,m}^t, the audio feature f_a^t, and the interesting target feature f_tgt compose a matched triplet. We first reshape the visual feature map into f_{v,m}^t ∈ R^{hw×d_v}. For each triplet, we can then compute the interesting sound-related visual features f_{v,i}^t as:

M_av = σ( f_a^t (f_{v,m}^t)^⊤ / √d ),   (3)
M_tv = I( σ( f_tgt (f_{v,m}^t)^⊤ / √d ) − τ ),   (4)
f_{v,i}^t = ( M_av ⊙ M_tv ) f_{v,m}^t,   (5)

where σ is the softmax function, (·)^⊤ represents the transpose operator, and ⊙ denotes the Hadamard product. In particular, we adopt a simple thresholding operation to better integrate the text modality. Specifically, τ is a hyper-parameter selecting the visual areas that are highly relevant to the query subject, and I(·) is an indicator function that outputs 1 when its input is greater than or equal to 0, and 0 otherwise. By computing text-visual attention maps, we encourage the preceding TA module to capture the visual-related entity in the question. Next, we perform the Hadamard product on the audio-visual attention map and the text-visual attention map to obtain the target-aware visual attention map. In this way, the TSG module focuses on the interesting sounding area instead of all sounding areas. To prevent possible loss of visual information, we average-pool the visual feature map f_{v,m}^t to obtain the global visual feature f_{v,g}^t. The two visual features are fused as the visual representation:

f_v^t = FC( [f_{v,i}^t ; f_{v,g}^t] ),   (6)

where FC represents fully-connected layers, [·;·] denotes concatenation, and f_v^t ∈ R^{1×d_v}.
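A minimal single-segment sketch of this target-aware attention, under assumed shapes and with a simple renormalization choice of our own (the paper does not specify one), could look like:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def target_aware_pool(f_a, f_tgt, f_vm, tau=0.005):
    """Pool the visual map with a target-masked audio-visual attention map.

    f_a   : (d,)    audio feature of one segment
    f_tgt : (d,)    target word feature from the TA module
    f_vm  : (hw, d) flattened visual feature map
    tau   : threshold on the text-visual attention map
    """
    d = f_a.shape[-1]
    av = softmax(f_a @ f_vm.T / np.sqrt(d))    # audio-visual attention map, (hw,)
    tv = softmax(f_tgt @ f_vm.T / np.sqrt(d))  # text-visual attention map, (hw,)
    mask = (tv >= tau).astype(av.dtype)        # indicator: regions relevant to the target
    w = av * mask                              # Hadamard product -> target-aware map
    w = w / max(float(w.sum()), 1e-8)          # renormalize (one plausible choice)
    return w @ f_vm                            # (d,) interesting sound-related feature
```

With tau = 0 the mask is all ones and the pooling degenerates to plain audio-visual attention, matching the ablation where removing the threshold hurts accuracy.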

Joint Audio-visual Temporal Grounding
In the natural environment, visual and audio information are different attributes of the same thing, i.e., the two are inseparable. Therefore, we propose the joint audio-visual temporal grounding (JTG) module and the cross-modal synchrony loss (CSL) to treat the visual and audio modalities as a whole instead of as separate entities, as in previous work.
Cross-modal synchrony loss (CSL). Temporal synchronization is a characteristic of the united audio and visual modalities, but in multimedia videos they do not strictly adhere to simple synchronization. We use the question feature as an intermediary to constrain the temporal distribution consistency of the audio and visual modalities, thus implicitly modeling the synchronization between audio and video. Concretely, given h_q and the audio-visual features {f_a^t, f_v^t}_{t=1}^T, we first compute the weight of association between the given question and the input sequence, based on how closely each timestamp is related to the question:

A_q = σ( h_q f_a^⊤ / √d ),  V_q = σ( h_q f_v^⊤ / √d ),   (7)

where σ is the softmax function and d is a scaling factor with the same size as the feature dimension. In this way, we obtain the question-aware weights A_q, V_q ∈ R^{1×T} of the audio and video sequences, respectively.
Next, we employ the Jensen-Shannon (JS) divergence as a constraint. Specifically, the JS divergence measures the similarity between the probability distributions of the two temporal weight vectors, corresponding to the audio and visual question-aware weights, respectively. By minimizing the JS divergence, we encourage the temporal distributions of the two modalities to be as close as possible, thus promoting their question-contributed consistency in the JTG process. The CSL can be formulated as:

L_csl = JS(A_q ∥ V_q) = ½ D_KL(A_q ∥ M) + ½ D_KL(V_q ∥ M),  M = ½ (A_q + V_q),   (8)

D_KL(P ∥ Q) = Σ_t P(t) log( P(t) / Q(t) ).   (9)

Note that the JS divergence is symmetric, i.e., JS(P ∥ Q) = JS(Q ∥ P).
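The loss reduces to a few lines of NumPy; this sketch (our illustration, with a small epsilon added for numerical safety) computes the JS divergence between two temporal weight vectors:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(P || Q) over timestamps."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    return float(np.sum(p * np.log(p / q)))

def js_div(p, q):
    """Symmetric Jensen-Shannon divergence between two weight distributions."""
    m = 0.5 * (np.asarray(p, dtype=float) + np.asarray(q, dtype=float))
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
```

JS divergence is zero when the audio and visual weights agree, bounded by log 2 for disjoint distributions, and symmetric, which is why it is preferred here over the asymmetric KL divergence alone.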
Joint audio-visual temporal grounding (JTG) module. Previous approaches to joint audio-visual learning have typically used a dual-stream structure with a decoupled cross-modal fusion module. The proposed CSL, however, makes single-stream networks for audio-visual learning possible and can naturally integrate audio-visual fusion and temporal grounding into one module. Specifically, we first interleave the LSTM-encoded video and audio feature tensors along the rows, i.e., the temporal dimension:

f_av = IL(f_v ; f_a),   (10)

where IL denotes that the features of the two modalities are InterLeaved in segments, f_av ∈ R^{2T×d} represents the multimedia video features, and d = d_v = d_a. Next, we perform MHA to aggregate critical question-aware audio-visual features among the dynamic audio-visual scenes:

f_av^q = Σ_{t=1}^{2T} w_av^t f_av^t,   (11)

where w_av are the attention weights of the MHA with h_q as the query, and f_av^q ∈ R^{1×d_c} represents the question-grounded audio-visual contextual embedding, which is more capable of predicting correct answers. The model assigns higher weights to segments that are more relevant to the asked question. Then, we can retrieve the temporal distribution weights specific to each modality from the MHA output and apply our proposed CSL:

w_v = {w_av^{2t−1}}_{t=1}^T,  w_a = {w_av^{2t}}_{t=1}^T,   (12)

L_csl = JS(w_a ∥ w_v),   (13)

where w_a, w_v ∈ R^{1×T} are the question-aware temporal distribution weights of audio and video, respectively. By leveraging the CSL, the proposed JTG module can effectively perform both temporal grounding and audio-visual fusion while considering the synchronization between the audio and visual modalities. The resulting single-stream architecture simplifies the overall system and treats audio and video as a whole.
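The interleave-by-segments pattern and the recovery of per-modality weights can be sketched as follows (an illustration of the tensor layout only, not the authors' implementation):

```python
import numpy as np

def interleave(f_v, f_a):
    """IL(V; A): alternate T visual and T audio rows into a (2T, d) tensor."""
    T, d = f_v.shape
    assert f_a.shape == (T, d)
    f_av = np.empty((2 * T, d), dtype=f_v.dtype)
    f_av[0::2] = f_v  # video occupies even rows, segment order preserved
    f_av[1::2] = f_a  # audio occupies odd rows
    return f_av

def split_weights(w_av):
    """Recover per-modality temporal weights from attention over interleaved rows."""
    return w_av[0::2], w_av[1::2]  # (w_v, w_a), each of length T
```

Because each segment's video row is immediately followed by its audio row, attention over the 2T rows grounds both modalities jointly, and slicing the weight vector restores the two T-length distributions the CSL compares.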

Answer Prediction
In order to verify the audio-visual fusion of our proposed joint audio-visual temporal grounding module, we employ a simple element-wise multiplication operation to integrate the question feature h_q and the previously obtained audio-visual feature f_av^q:

e = h_q ⊙ f_av^q.   (14)

Next, we aim to choose one correct answer from a pre-defined answer vocabulary. We utilize a linear layer and the softmax function to output probabilities p ∈ R^C for the candidate answers. With the predicted probability vector and the corresponding ground-truth label y, we optimize using a cross-entropy loss:

L_qa = − Σ_{c=1}^C y_c log(p_c).   (15)

During testing, the predicted answer is ĉ = argmax_c(p).
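The prediction head is deliberately simple; a NumPy sketch (with hypothetical classifier parameters W and b standing in for the learned linear layer) makes the flow concrete:

```python
import numpy as np

def predict_answer(h_q, f_q_av, W, b):
    """Element-wise fusion, then a linear layer + softmax over C answers.

    h_q, f_q_av : (d,)   question / question-grounded audio-visual features
    W : (d, C), b : (C,) hypothetical classifier parameters
    Returns the predicted answer index and the probability vector.
    """
    e = h_q * f_q_av                  # element-wise multiplication fusion
    logits = e @ W + b                # (C,) scores over the answer vocabulary
    z = np.exp(logits - logits.max())
    p = z / z.sum()                   # softmax probabilities
    return int(np.argmax(p)), p
```

At training time the cross-entropy loss is taken between p and the one-hot ground-truth label; at test time only the argmax is used.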

Experiments
This section presents the setup details and experimental results on the MUSIC-AVQA dataset. We also discuss the model's performance and verify the effectiveness of each sub-module through ablation studies and qualitative results.

Experiments Setting
Dataset. We conduct experiments on the MUSIC-AVQA dataset (Li et al., 2022), which contains 45,867 question-answer pairs distributed over 9,288 videos totaling more than 150 hours. It is divided into training/validation/testing sets with 32,087/4,595/9,185 QA pairs, respectively. The MUSIC-AVQA dataset is well-suited for studying spatio-temporal reasoning in dynamic and long-term audio-visual scenes.
Metric. Answer prediction accuracy. We also evaluate the model's performance in answering different question types.
Implementation details. The sampling rates of sound and frames are 16 kHz and 1 fps, respectively. We divide each video into non-overlapping 1s-long segments. For each video segment, we use 1 frame to generate visual features of size 14×14×512. For each audio segment, we use a linear layer to project the extracted 128-D VGGish feature into a 512-D feature vector. The dimension of the word embedding is 512. In experiments, we use the same settings as in (Li et al., 2022) and sample the videos by taking 1s every 6s. The batch size and number of epochs are 64 and 30, respectively. The initial learning rate is 2e-4 and drops by a factor of 0.1 every 10 epochs. Our network is trained with the Adam optimizer. We use the torchsummary library in PyTorch to calculate the model's parameter count. Our model is implemented in PyTorch and trained on an NVIDIA GeForce GTX 1080.
Training Strategy. Previous methods (Li et al., 2022) use a two-stage training strategy, first training the spatial grounding module via a coarse-grained audio-visual pair matching task with loss L_s. We utilize the pretrained stage-I module from (Li et al., 2022) directly, without retraining the layers that overlap with our approach. We then train for the AVQA task with the total loss L = L_qa + L_csl + λL_s, where λ is set to 0.5 following the previous setting (Li et al., 2022).

Comparisons with SOTA Methods
We compare our method against current SOTA methods on the AVQA task. For a fair comparison, we choose the same audio and visual features as the current methods. As shown in Table 1, we compare our TJSTG approach with the AVSD (Schwartz et al., 2019), Pano-AVQA (Yun et al., 2021), and AVST (Li et al., 2022) methods. Our method achieves significant improvement on all audio and visual questions compared to the second-best method, AVST (averages of 2.60%↑ and 2.48%↑, respectively). In particular, our method shows clear superiority when answering counting (average of 2.31%↑) and comparative (average of 2.12%↑) questions. These two types of questions require a high degree of conceptual understanding and reasoning ability. The considerable improvement achieved by TJSTG can be attributed to our proposed TSG module, which introduces the textual modality with explicit semantics into the audio-visual spatial grounding process. Although we fall slightly behind AVST on the audio-visual temporal question type, we still achieve the highest accuracy of 70.13% on all audio-visual questions with a simpler single-stream architecture, outperforming AVST by 0.6%↑. Benefiting from our proposed JTG leveraging the natural audio-visual integration, our full model achieves the highest overall accuracy of 73.04% with a more straightforward architecture, outperforming AVST by 1.45%↑.

Ablation studies
The effectiveness of the different modules in our model. To verify the effectiveness of the proposed components, we remove them from the primary model and re-evaluate the new model on the MUSIC-AVQA dataset. Table 2 shows that after removing a single component, the overall model's performance decreases, and different modules have different performance effects. Firstly, when we remove the TA module and the target-aware process during spatial grounding (denoted as "w/o TA") and use traditional audio-visual spatial grounding, the accuracy decreases by 1.24%, 0.83%, and 0.27% on audio, visual, and audio-visual questions, respectively. This shows that it is essential to have a targeting process before feature aggregation instead of attending to all the audio-visual cues. Secondly, when we remove the proposed CSL (denoted as "w/o L_csl"), the overall accuracy drops to 72.28% (0.76% below our full model). Lastly, when we remove both modules and employ a vanilla single-stream network (denoted as "w/o TA+L_csl"), the overall accuracy severely drops by 1.26%, from 73.04% to 71.78%. These results show that every component in our system plays an essential role in AVQA.
Effect of introducing text during audio-visual learning. As Table 2 shows, removing the TA module and target-aware process results in a lower accuracy (75.17%) on audio questions, which consist of counting and comparative questions, compared to "w/o L_csl" (75.73%) and our full model (76.47%). In Table 3, we utilize AVST (Li et al., 2022) as a baseline model to further validate the robustness and effectiveness of our proposed target-aware approach. We implement AVST with our proposed TA module and the corresponding target-aware process, denoted as "AVST w/ TA", which surpasses AVST by 0.84% in overall accuracy (from 71.59% to 72.43%). These results demonstrate that explicit semantics in the audio-visual spatial grounding can facilitate audio-visual question answering.
Effect of the Target-aware module. As shown in Table 4, we adopt different ways of introducing question information into the spatial grounding module, thus verifying the effectiveness of our proposed Target-aware module during the target-aware process. Specifically, we conduct average-pooling (denoted as "TSG w/ Avg") and max-pooling (denoted as "TSG w/ Max") on the LSTM-encoded question embedding f_q to represent the target feature, respectively. We also adopt the question feature vector h_q as the target feature during spatial grounding. Compared to these methods, our approach (denoted as "TSG w/ TA") achieves the highest accuracy of 73.04%. The experimental results not only prove the superiority of our proposed target-aware module but also further demonstrate the effectiveness of introducing the textual modality, which carries explicit semantics, into the audio-visual learning stage.
In addition, we explore the impact of the hyper-parameter τ on model performance. As shown in Table 5, while τ plays a role in selecting relevant visual areas, our experiments reveal that it does not significantly impact performance within the context of the MUSIC-AVQA dataset (Li et al., 2022). The highest accuracy of 73.04% is achieved when τ = 0.005. However, removing the thresholding operation (τ = 0.000) causes a decrease of 0.81% in accuracy. This may be caused by information redundancy, and we believe this phenomenon can be mitigated by utilizing a pretrained model with prior knowledge of image-text pairs in future work.
Effect of the single-stream structure. We validate the effectiveness of our designed audio-visual interleaved pattern, i.e., IL(V;A), which maintains both the integrity of the audio-visual content at the segment level and the relative independence between the audio and visual content at the video level. As shown in Table 6, we explore different ways of arranging visual and audio features, and our interleaved-by-segments pattern is 0.41% higher on average than the concatenated-by-modality pattern. We also conduct a comprehensive comparison between single-stream and dual-stream networks. During temporal grounding, we switch to the prevalent two-stream network structure as in (Li et al., 2022), but still with our proposed TSG module and cross-modal synchrony loss, denoted as "Dual-stream Net" in Table 7. As shown in Table 7, the "Single-stream Net", which omits the additional fusion module, yields 0.15% higher accuracy with 3.5M fewer parameters than the "Dual-stream Net". This indicates the superiority of single-stream networks over two-stream networks, as they utilize the integration of the audio and visual modalities to simultaneously accomplish question-aware temporal grounding and audio-visual fusion.

Effect of cross-modal synchrony loss. Similarly, as shown in Table 3, we verify the validity of our proposed L_csl on AVST (denoted as "AVST w/ L_csl"). "AVST w/ L_csl" achieves an accuracy of 72.47%, exceeding the baseline model by 0.88%. We further consider the impact of the combination order of the multimedia video features f_av on performance, as shown in Table 6. Specifically, we compose f_av by interleaving the audio and visual features but putting the audio modality first (denoted as "IL(A;V) w/ L_csl"). Compared to our full model (denoted as "IL(V;A) w/ L_csl"), the overall accuracy is the same (both are 73.04%). The concatenate operation yields similar results. That is, the order in which the audio-visual features are combined does not have a significant impact on the performance of our entire system. These results validate the robustness and effectiveness of our proposed CSL.

Qualitative analysis
In Figure 4, we provide several visualized target-aware spatial grounding results. The heatmap indicates the location of the interesting sounding source. Through these results, the sounding targets are visually captured, which facilitates spatial reasoning. For example, in the case of Figure 4(a), compared to AVST (Li et al., 2022), our proposed TJSTG method can focus on the target, i.e., the flute, during spatial grounding. The TSG module offers information about the interesting sounding object at each timestamp. In the case of Figure 4(b), with multiple sound sources related to the target, i.e., instruments, our method also yields more accurate spatial grounding than AVST (Li et al., 2022). When there is no target of interest in the video, i.e., the ukulele, as shown in Figure 4(c), our method presents an irregular distribution of spatial grounding in the background region instead of the undifferentiated sounding areas of the guitar and bass presented by AVST (Li et al., 2022). Furthermore, the JTG module aggregates the information of all timestamps based on the question. These results demonstrate that our proposed method can focus on the most question-relevant audio and visual elements, leading to more accurate answers.

Conclusions
This paper proposes a target-aware spatial grounding module and a joint audio-visual temporal grounding module to better address target-oriented audio-visual scene understanding within the AVQA task. The target-aware spatial grounding module exploits the explicit semantics of the question, enabling the model to focus on the query subjects when parsing the audio-visual scenes. Also, the joint audio-visual temporal grounding module treats audio and video as a whole through a single-stream structure and encourages the temporal association between audio and video with the proposed cross-modal synchrony loss. Extensive experiments have verified the superiority and robustness of the proposed modules. Our work offers an inspiring new direction for audio-visual scene understanding and spatio-temporal reasoning in question answering.

Limitations
The inadequate modeling of audio-visual dynamic scenes potentially impacts question answering performance. Specifically, although our proposed TSG module enables the model to focus on question-related scene information, information not directly related to the question can sometimes also contribute to answering it. Experimental results demonstrate that adding the target-aware spatial grounding module to the basic model yields only a marginal improvement in the accuracy of answering audio-visual questions compared to incorporating the cross-modal synchrony loss into the basic model. We believe this limits the overall performance of our approach, showing an incremental improvement for the audio-visual question type (0.6% on average) compared to a significant improvement for the uni-modal question types (2.5% on average). In the future, we will explore better ways to integrate natural language into audio-visual scene parsing and to mine scene information that is not only explicitly but also implicitly related to the question.

Figure 3 :
Figure 3: The proposed target-aware joint spatio-temporal grounding network. We introduce the text modality with explicit semantics into the audio-visual spatial grounding to associate specific sound-related visual features with the subject of interest, i.e., the target. We exploit the proposed cross-modal synchrony loss to incorporate audio-visual fusion and question-aware temporal grounding within a single-stream architecture. Finally, simple fusion is employed to integrate the audio-visual and question information for predicting the answer.

Figure 4 :
Figure 4: Visualized target-aware spatio-temporal grounding results. Based on the grounding results of our method, the sounding areas of interest are accordingly highlighted in the spatial perspective in the different cases (a)-(c), respectively, which indicates that our method can focus on the query subject, facilitating target-oriented scene understanding and reasoning.

Table 1 :
Comparisons with state-of-the-art methods on the MUSIC-AVQA dataset. The top-2 results are highlighted.

Table 2 :
Ablation studies of different modules on the MUSIC-AVQA dataset. The top-2 results are highlighted.

Table 3 :
Ablation studies of different modules against the baseline model. The top-2 results are highlighted.

Table 4 :
Effect of the Target-aware module on accuracy (%). The top-2 results are highlighted.

Table 6 :
Effect of the cross-modal synchrony loss on accuracy (%). "IL" denotes that audio and visual features are interleaved in segments. "Cat" denotes that audio and visual features are concatenated. The top-2 results are highlighted.

Table 7 :
Comparison of the dual-stream structure and the single-stream structure.