Scene-robust Natural Language Video Localization via Learning Domain-invariant Representations



Introduction
Natural language video localization (NLVL) (Wang et al., 2020; Zhao et al., 2021; Zhang et al., 2021) aims to retrieve from an untrimmed video the temporal moment that semantically matches a language query. As a fundamental vision-language problem, NLVL has attracted increasing attention, and recent works (Chen et al., 2018; Gao et al., 2017; Xu et al., 2018, 2019a; Zhang et al., 2020a,b; Chen et al., 2019; Ghosh et al., 2019; Yuan et al., 2019b) have achieved impressive results. However, their performance is evaluated under the assumption that training and test data are independent and identically distributed. Given the variety of video content, it is impossible to cover all scenes when building training datasets. These methods are therefore impractical in real-world scenarios, since they ignore the ability to generalize to novel scenes. In this paper, we propose, for the first time, the scene-robust NLVL problem.
Scene-robust NLVL aims to exploit data from available scenes to learn a robust model that generalizes well to novel scenes. As shown in Figure 1, there are obvious semantic differences across scenes in both the video and text inputs, such as the video background and the specific actions and objects in a scene. In more detail, we analyze the semantic distribution gaps across scenes in three NLVL datasets. The detailed statistics and discussions in the appendix show that the domain gaps between scenes are prevalent and diverse. Traditional NLVL models experience dramatic performance degradation when dealing with such semantic distribution shifts.
The proposed scene-robust NLVL problem is more challenging than the related traditional domain generalization (DG) problem. On the one hand, the inputs of NLVL are multi-modal: the features used for prediction are a fusion of visual and textual features, so the final feature space is much more complex. Moreover, the different information characteristics of video and text bring additional difficulty. Recent DG methods mainly focus on learning domain-invariant representations for single-modal tasks, which is neither comprehensive nor appropriate for multi-modal features. On the other hand, NLVL requires more reasoning steps with diverse modules, and some samples may rely mainly on domain-specific information for reasoning. The knowledge in such data is hard to transfer and may hinder the learning of both the domain-invariant representations and the predictor.
To address these challenges, we propose a novel generalizable NLVL framework that learns stable and high-quality multi-modal domain-invariant representations from both alignment and decomposition perspectives. Specifically, we design a multi-modal domain alignment module that contains: an intra- and inter-sample distance metric to align the complex multi-modal feature spaces comprehensively, an asymmetric multi-modal alignment strategy to account for the different information densities of textual and visual features, and a Dirichlet-Mixup augmentation to compensate for missing information and increase domain diversity. In addition, we introduce a domain-specific predictor for each domain to refine the learned domain-invariant representations by decomposition, and we dynamically suppress the weights of hard-to-transfer samples by simply summing the outputs of the domain-specific and domain-agnostic predictors.
Our main contributions can be summarized as follows: • To the best of our knowledge, we are the first to formulate the scene-robust NLVL problem, which is essential for real-world scenarios and fundamentally more challenging.
• We propose a novel generalizable NLVL framework to address the unique challenges of the scene-robust NLVL problem. It learns and refines domain-invariant representations from both alignment and decomposition perspectives.
• Extensive experiments conducted on three NLVL datasets, Charades-STA, ActivityNet-Caption, and YouCook2, demonstrate the effectiveness of our proposed methods.

Related work
Natural Language Video Localization. The task of retrieving video segments using language queries was first introduced by (Gao et al., 2017). Previous methods in the field can be categorized into proposal-based (Gao et al., 2017; Xu et al., 2018, 2019a; Chen et al., 2018; Zhang et al., 2020b; Wang et al., 2020; Zhang et al., 2019; Yuan et al., 2019a) and proposal-free (Yuan et al., 2019b; Chen et al., 2019; Ghosh et al., 2019; Zhang et al., 2020a; Mun et al., 2020; Rodriguez et al., 2020; Wang et al., 2019) methods. Specifically, proposal-based methods mainly rely on sliding windows or segment proposal networks to generate proposal candidates from the video and retrieve the one that best matches the given semantic information. (Xu et al., 2018) apply a pre-trained segment proposal network (Xu et al., 2017) to generate proposal candidates, and (Xu et al., 2019a) further introduce query information to adjust the proposal generation process. (Zhang et al., 2020b) first utilize a 2D temporal map to generate proposals. Considering the redundant computation and low efficiency of the two-stage propose-and-rank paradigm, (Yuan et al., 2019b) build a proposal-free method that uses attention-based structures to directly regress the temporal location of the target moment.
Inspired by the concept of question answering (Seo et al., 2016; Yu et al., 2018), (Chen et al., 2019; Ghosh et al., 2019) formulate NLVL as a span prediction task. (Zhang et al., 2020a) further explain the similarities and differences between NLVL and question answering, and their proposed VSLNet achieves superior performance with a simple network architecture.
Domain Generalization. Domain generalization aims to learn a generalizable model from seen source domains that achieves good performance on unseen target domains. Most existing domain generalization methods can be roughly divided into three categories: representation learning-based (Li et al., 2018b, 2017; Ghifary et al., 2016; Rahman et al., 2020), data augmentation-based (Somavarapu et al., 2020; Zhou et al., 2021; Shu et al., 2021; Mancini et al., 2020; Cheng et al., 2023), and learning strategy-based (Mancini et al., 2018; Cha et al., 2021; Finn et al., 2017; Li et al., 2018a). One common approach to domain generalization is learning domain-invariant feature representations, which can be realized by aligning source domains or by disentangling features into domain-specific and domain-invariant components. (Li et al., 2018b) employ an Adversarial Auto-encoder (Makhzani et al., 2015) and the Maximum Mean Discrepancy metric (Gretton et al., 2012) to align the distributions of every pair of source domains. (Li et al., 2017) decompose the model into domain-specific and domain-agnostic parts and only use the domain-agnostic parts for prediction at inference time. Data augmentation-based methods focus on generating new data to increase the diversity of the training set. (Somavarapu et al., 2020) use a style transfer model (Huang and Belongie, 2017) to explore cross-source variability for better generalization. (Zhou et al., 2021; Shu et al., 2021; Mancini et al., 2020) mix features between instances from different domains to further explore inter-domain feature spaces and compensate for missing information. Learning strategy-based methods employ general learning strategies to enhance generalization ability. (Mancini et al., 2018; Cha et al., 2021) exploit ensemble learning to flatten the loss surface and learn a unified and generalized model. Meanwhile, (Li et al., 2018a) propose meta-learning for domain generalization (MLDG), which uses a meta-learning strategy to achieve domain generalization.
However, most well-studied domain generalization strategies are designed for single-modal tasks or simple multi-modal tasks requiring little reasoning, such as image classification and image-text retrieval. The input modalities and model structures of these tasks are much simpler than those of NLVL, and this extra complexity makes the scene-robust NLVL problem more challenging.

Approach
We first formulate the scene-robust NLVL problem and describe our basic model architecture. Then we illustrate our method for learning domain-invariant representations for NLVL from two aspects: multi-modal domain alignment and feature refinement by decomposition.

Overview
Problem Formulation. Given an untrimmed video v and a related natural language query q, NLVL aims to retrieve the specific moment {τs, τe} that is most semantically relevant to the given query. Assuming there are K domains in total, the input video and query for domain l are denoted as v^l and q^l. In scene-robust NLVL, the model can only access several source domains during training, while the target domain's data is unavailable. Generalization ability is tested on the unseen target domain.
Model Architecture. The overall architecture of our model is illustrated in Figure 2. Concretely, our basic NLVL model consists of a video encoder e_v : v^l → v^l ∈ R^(Nv×D), a query encoder e_q : q^l → q^l ∈ R^(Nq×D), a cross-modal representation fusion module m : v^l × q^l → f^l ∈ R^(Nv×D), a domain-agnostic boundary predictor p_b : f^l → {Ts, Te}, and a set of domain-specific boundary predictors {p^l_b}_{l=1}^{K} : f^l → {T^l_s, T^l_e}, where Ts, Te ∈ R^(Nv) denote the scores of the start and end boundaries at each frame. Meanwhile, we use a confidence coefficient predictor p_c : f^l → C ∈ R^(Nv) to assist training, where C represents the confidence that a frame lies within the matching span.

Multi-modal Domain Alignment
In this section, we describe our proposed multi-modal domain alignment method. As discussed in (Arora et al., 2017; Long et al., 2018), when the feature distribution is multi-modal, adversarial learning is insufficient to guarantee that the confounded feature distributions are identical, and a simple distance metric would only partially align different domains. Therefore, we devise an intra- and inter-sample distance metric to measure domain differences comprehensively. Considering the multi-modal input in NLVL and the different information densities of each modality, we apply the metric asymmetrically to the visual and fused feature representations to facilitate the learning of multi-modal domain-invariant representations.
Intra- and Inter-sample Distance Metric. For brevity, the latent representations from different domains are denoted as {H^l}_{l=1}^{K}, where H^l ∈ R^(b×n×d) and b, n, d represent the number of samples in the batch, the feature size, and the feature dimension of each sample, respectively. First, we calculate the intra- and inter-sample feature distributions:

H^l_intra = mean_n(H^l) ∈ R^(b×d), (1)
H^l_inter = std_b(H^l) ∈ R^(n×d), (2)

where H^l_intra is the mean computed across the temporal dimension of each sample, representing the intra-sample feature distribution of source domain l, and H^l_inter is the standard deviation across all samples in a batch, indicating the inter-sample feature distribution of source domain l.
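As a concrete illustration, the two statistics can be sketched as follows. This is a minimal NumPy sketch; the function name and the exact reduction axes are assumptions based on the description above, not the paper's released code.

```python
import numpy as np

def intra_inter_stats(H):
    """Sketch of the intra-/inter-sample feature distributions (assumed forms).

    H: array of shape (b, n, d) -- latent representations of one domain's batch.
    Returns:
      H_intra: (b, d) mean over the temporal dimension of each sample.
      H_inter: (n, d) standard deviation across the samples in the batch.
    """
    H_intra = H.mean(axis=1)  # intra-sample: temporal mean per sample
    H_inter = H.std(axis=0)   # inter-sample: deviation across the batch
    return H_intra, H_inter
```

Reducing over the temporal axis summarizes each sample, while reducing over the batch axis captures how features spread across samples of the same domain.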
Based on these two complementary feature distributions of the latent representations, we can comprehensively measure the distance between the data distributions of different domains. The proposed intra- and inter-sample distance metric IIDM(·) is defined as:

IIDM(H^i, H^j) = MMD(H^i_intra, H^j_intra) + MMD(H^i_inter, H^j_inter), (3)

where Maximum Mean Discrepancy (MMD) (Gretton et al., 2006) is an effective metric for comparing two distributions. Given feature distributions H^i and H^j from domains i and j, it can be written as:

MMD(H^i, H^j) = || E[ϕ(H^i)] − E[ϕ(H^j)] ||², (4)

where ϕ is a simple Gaussian or linear kernel function that maps representations to a high-dimensional feature space.
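The MMD term can be sketched with the standard empirical estimators below. The linear-kernel form and the RBF bandwidth `gamma` are illustrative assumptions, not necessarily the paper's exact settings.

```python
import numpy as np

def mmd_linear(X, Y):
    """Linear-kernel MMD^2 between two feature sets X: (m, d), Y: (n, d).

    With a linear kernel, MMD^2 reduces to the squared distance between
    the empirical feature means of the two sets.
    """
    return float(np.sum((X.mean(axis=0) - Y.mean(axis=0)) ** 2))

def mmd_rbf(X, Y, gamma=1.0):
    """Gaussian-kernel MMD^2 = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def k(A, B):
        # pairwise squared distances, then RBF kernel
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * sq)
    return float(k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean())
```

Identical sets give a distance of zero, and a mean shift between the two sets drives the value up, which is the behavior the alignment loss penalizes.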
Multi-modal Asymmetric Alignment. In the NLVL task, the query sentences are semantically related to the video. Across different scenes, there are significant gaps in both visual and textual features. A straightforward approach is to align textual and visual features separately with the proposed distance metric. However, as discussed in (He et al., 2022), the information density differs between language and vision: textual features are highly semantic and information-dense, while visual features are heavily redundant. Using the same domain alignment method for both modalities would result in either inadequate alignment of visual features or loss of textual semantic information.
To address this dilemma, we devise an asymmetric method to align the different modalities. Specifically, the intra- and inter-sample distance metric is applied to the visual features {v^l}_{l=1}^{K} and the fused features {f^l}_{l=1}^{K}, and the total multi-modal asymmetric alignment loss L_maa can be formulated as:

L_maa = Σ_{i<j} [ IIDM(v^i, v^j) + λ_f · IIDM(f^i, f^j) ], (5)

where λ_f is a hyper-parameter that balances the two parts. By aligning the fused features, we indirectly match the textual feature distributions and minimize the loss of semantic information, while the redundant visual features are aligned twice, before and after incorporating textual information, to achieve sufficient alignment.
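Putting the pieces together, a hedged sketch of the asymmetric alignment loss might look like the following. The pairwise summation over domains, the linear-kernel IIDM, and the function names are assumptions about Eq. 5's implementation.

```python
import numpy as np
from itertools import combinations

def iidm(H_i, H_j):
    """Intra- and inter-sample distance between two domains' batches (b, n, d).
    Sketched with a linear-kernel MMD between the intra-sample means and a
    squared distance between the inter-sample deviations (assumed forms)."""
    intra_i, intra_j = H_i.mean(axis=1), H_j.mean(axis=1)  # (b, d) each
    inter_i, inter_j = H_i.std(axis=0), H_j.std(axis=0)    # (n, d) each
    d_intra = float(np.sum((intra_i.mean(axis=0) - intra_j.mean(axis=0)) ** 2))
    d_inter = float(np.sum((inter_i - inter_j) ** 2)) / inter_i.size
    return d_intra + d_inter

def l_maa(visual, fused, lam_f=0.2):
    """Asymmetric multi-modal alignment loss over K source domains.
    visual, fused: lists of (b, n, d) arrays, one per domain; lam_f is the
    balance hyper-parameter of Eq. 5."""
    loss = 0.0
    for i, j in combinations(range(len(visual)), 2):
        loss += iidm(visual[i], visual[j]) + lam_f * iidm(fused[i], fused[j])
    return loss
```

Note the asymmetry: visual features enter the loss directly, while text enters only through the fused features, weighted by `lam_f`.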

Domain Augmentation by Dirichlet-Mixup. To compensate for missing information and increase data diversity, we adopt Dirichlet-Mixup (Shu et al., 2021) to mix multiple source domains and generate inter-domain data. This method samples the mixing weights from a Dirichlet distribution rather than the beta distribution used in the original mixup (Zhang et al., 2017). Given latent representations {h^i}_{i=1}^{K} of samples from different domains, where h^i ∈ R^(n×d), n denotes the feature size, and d is the feature dimension, the mixed feature representation h_m can be calculated as:

δ ~ Dir(β_1, ..., β_K), (6)
h_m = Σ_{i=1}^{K} δ^(i) ⊙ h^i, (7)

where δ ∈ R^(K×n). Since the similarity between the target domain and each source domain is unpredictable, we set the weight β_i of each domain to be equal, which makes the sampled values δ^(i) of each source domain statistically equal.
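A minimal sketch of Dirichlet-Mixup under these assumptions (per-position mixing weights, equal concentration parameters) could be:

```python
import numpy as np

def dirichlet_mixup(feats, beta=1.0, rng=None):
    """Mix K domain features with Dirichlet-sampled convex weights.

    feats: list of K arrays of shape (n, d), one sample per source domain.
    Samples per-position weights delta ~ Dir(beta, ..., beta) and returns
    h_m = sum_i delta_i * h_i (a convex combination at every position).
    """
    rng = rng or np.random.default_rng(0)
    K = len(feats)
    n, d = feats[0].shape
    delta = rng.dirichlet([beta] * K, size=n)  # (n, K), each row sums to 1
    stacked = np.stack(feats, axis=0)          # (K, n, d)
    return np.einsum('nk,knd->nd', delta, stacked)
```

Equal concentrations `beta` make all source domains statistically interchangeable, matching the paper's choice when the target domain's similarity to each source is unknown.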

Features Refinement by Decomposition
Our multi-modal domain alignment method is designed to help the model acquire domain-invariant representations. However, the NLVL task is characterized by complex structures and multi-step reasoning, with some samples relying heavily on domain-specific information. In such cases, learning domain-invariant representations is at odds with learning NLVL reasoning, leading to unstable domain-invariant representations and an unreliable boundary predictor.
To mitigate this conflict between the task loss and the alignment loss, we introduce domain-specific boundary predictors {p^l_b}_{l=1}^{K} for each source domain, as well as a domain-agnostic boundary predictor p_b shared by all domains. This allows the weights of these samples to be adjusted dynamically and the features used for domain-agnostic prediction to be refined through decomposition.
During training, the boundary prediction used to compute the NLVL task loss combines the two kinds of predictors:

{T̂s, T̂e} = p_b(f^l) + γ · p^l_b(f^l), (8)

where γ weights the domain-specific branch. Intuitively, the domain-agnostic predictor, which must fit all domains, tends to use invariant features, while the domain-specific predictors prefer to use both invariant and specific information for more accurate predictions. Therefore, as shown in Figure 3, the former's loss against the ground truth, L1, is higher than the latter's, L2, and the loss L_bd of the combination naturally lies between L1 and L2.
The key to Eq. 8 is that it dynamically alters the weight of each sample according to the similarity of the two kinds of predictions. For hard-to-transfer samples, which rely mainly on domain-specific information for reasoning, L1 is much higher than L2, and L_bd is typically well below L1. Since gradients are generally proportional to losses, the importance of hard-to-transfer samples, and the instability they bring, is accordingly suppressed. Conversely, the more a sample's reasoning depends on invariant features, the closer L1, L_bd, and L2 are, and the less its weight is suppressed.
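The suppression effect can be illustrated with a small sketch. The additive combination with weight `gamma` and the normalized-score loss are assumptions about Eq. 8's exact form; the point is only that the combined loss lies between the agnostic-only and specific-only losses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def combined_boundary_scores(agnostic_logits, specific_logits, gamma=0.1):
    """Assumed form of Eq. 8: sum the domain-agnostic scores with the
    down-weighted domain-specific scores from the sample's source domain."""
    return softmax(agnostic_logits) + gamma * softmax(specific_logits)

def span_nll(scores, target):
    """Negative log-likelihood of the ground-truth boundary index under
    (possibly unnormalized) scores; used to compare L1, L2 and L_bd."""
    p = scores / scores.sum(axis=-1, keepdims=True)
    return float(-np.log(p[..., target]))
```

When the specific predictor is more confident on the ground truth than the agnostic one, the combined loss lands strictly between the two, so hard-to-transfer samples contribute a smaller gradient than they would under the agnostic predictor alone.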
By reducing the importance of hard-to-transfer samples, the domain-agnostic predictor can learn more stable predictions from invariant features without needing to fit domain-specific ones. Thus, for the features used in domain-agnostic prediction, the remaining domain-specific components of the learned representations are further factored out by the domain-specific predictors.

Training and Inference
To obtain the scene-robust NLVL model, the final loss function contains two components: the NLVL task loss L_task and the multi-modal alignment loss L_maa. For the NLVL task loss, we apply a loss function similar to that of (Zhang et al., 2020a), which consists of the boundary loss L_bd for training a reliable boundary predictor and the confidence coefficient loss L_cc to assist in learning discriminative features. The NLVL task loss is a linear combination of these two sub-losses, i.e., L_task = L_cc + 0.2 L_bd, where the weight is set empirically to balance the two terms.
Finally, the overall loss function in the training process can be summarized as

L = L_task + λ_maa · L_maa, (9)

where λ_maa is a hyper-parameter that balances the alignment and task losses and depends on the distribution gaps between domains.
During inference, we only use the domain-agnostic boundary predictor, and the predicted timestamp {τs, τe} is determined by the boundary prediction {Ts, Te} as {τs, τe} = argmax_(τs,τe) Ts(τs) × Te(τe), subject to τs ≤ τe.

Experiments

We reconstruct three public NLVL datasets for the scene-robust NLVL task based on their original video tags and perform experiments to evaluate the effectiveness of our framework.
Charades-STA. This dataset is built by (Gao et al., 2017). Based on the video tags provided by the original dataset annotators, each NLVL dataset has a different-grained scene split. Charades-STA can be split by activity location, i.e., Living room (Liv), Bathroom (Bath), Kitchen (Kit), and Bedroom (Bed). ActivityNet-Caption is divided by activity event, i.e., Eat/Drink (E/D), Personal care (Pc), Household (Ho), Social (So), and Sport (Sp). YouCook2 is split according to the origin of the recipes used, i.e., America (Am), East Asia (EA), South Asia (SA), and Europe (Eu).
With these three restructured NLVL datasets, we iteratively use each domain as the target domain and the remaining domains as the source domains to construct the scene-robust NLVL task. The train/val/test splits follow previous works (Zhang et al., 2020a; Zhou et al., 2018), and scene-robust performance is evaluated on the test set of the target domain.

Experimental Settings
Implementation Details. We use the AdamW (Loshchilov and Hutter, 2017) optimizer and a cosine-annealing scheduler (Loshchilov and Hutter, 2016) with weight decay 1e-6, and a learning rate of 5e-4 for ActivityNet-Caption and 2e-4 for Charades-STA and YouCook2. During training, λ_f in Eq. 5 is set to 0.2, γ in Eq. 8 is set to 0.1, and λ_maa in Eq. 9 is set to 4, 1, and 0.1 for ActivityNet-Caption, Charades-STA, and YouCook2, respectively. Due to the simple architecture of VSLBase (Zhang et al., 2020a), it can be viewed as a standard proposal-free model; our basic network structure is therefore the same as VSLBase to minimize the impact of architecture bias. Please refer to the appendix for more details.

Evaluation Metrics. We adopt "R@n, IoU = m" as the evaluation metric, following (Zhang et al., 2020a). This metric denotes the percentage of language queries having at least one result whose Intersection over Union (IoU) with the ground truth is larger than m among the top-n grounding results. In our experiments, we use n = 1 and m = 0.5.
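The "R@1, IoU = 0.5" metric can be computed with a short sketch like the following (function names are ours):

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) moments, e.g. in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(preds, gts, m=0.5):
    """'R@1, IoU=m': percentage of queries whose top-1 result overlaps
    the ground truth with IoU larger than m."""
    hits = sum(temporal_iou(p, g) > m for p, g in zip(preds, gts))
    return 100.0 * hits / len(gts)
```

For n > 1, the same check would be applied to the top-n candidate moments, counting a query as a hit if any candidate passes the threshold.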
Baselines. For a comprehensive comparison, we consider the following methods as baselines: 1) variants of our model, including Base, which does not use any alignment or decomposition methods during training, and Supervised, which is trained on all domains without alignment or decomposition; 2) domain alignment methods for single-modal tasks, including DANN (Matsuura and Harada, 2020) and MMD (Li et al., 2018b); and 3) domain alignment methods for other multi-modal tasks, including JSD (Chao et al., 2018) and MMA (Xu et al., 2019b).

Performance Comparison
The quantitative evaluation results of our proposed method on Charades-STA, ActivityNet-Caption, and YouCook2 are reported in Tables 1, 2, and 3, respectively. The best results are in bold and the second best underlined. According to the results, we make the following observations: • On all benchmark datasets, our method gains noticeable performance improvements over the base model, which demonstrates that the proposed methods can effectively help the NLVL model learn generalizable domain-invariant representations.
Moreover, stable improvement under different-grained scene shifts is a significant and practical merit, since the scene shift from the source domains in real-world applications is diverse and unpredictable.
• Remarkably, our method boosts the average performance of the Base model from 31.20/34.60 to 38.62/38.79 on the Charades-STA and ActivityNet-Caption datasets, which far exceeds all compared methods and even achieves performance comparable to the supervised setting. This further reveals the superiority of our methods on the scene-robust NLVL problem.
• Looking at the evaluation results on YouCook2, the improvement brought by our methods is smaller than on the other two datasets, which may stem from the intrinsic characteristics of this dataset. NLVL on YouCook2 is more challenging than on Charades-STA and ActivityNet-Caption: its annotations are more detailed, and the differences between adjacent frames are slight. The more complex NLVL reasoning on YouCook2 makes it harder to capture discriminative, domain-invariant representations.

Ablation Study
Component Ablation. We ablate the major components of our framework on ActivityNet-Caption: fused feature alignment (IIDM_F), visual feature alignment (IIDM_V), Dirichlet-Mixup (D-Mixup), and feature refinement by decomposition (FRD). Results are reported in Table 4, which shows that every component contributes to learning domain-invariant representations. To verify the complementarity of the intra-sample and inter-sample distributions in our proposed distance metric, we individually remove the two related terms in Eq. 3. As shown in Table 5, the joint use of the intra-sample and inter-sample distributions leads to the best performance. In addition, the inter-sample part is more critical than the intra-sample part, since multiple samples better reflect the overall distribution of the domain.
Design of Multi-modal Alignment. As discussed in Sec. 3.2, the different information densities of the modalities require a modality-specific alignment strategy. We align the features of each modality separately to demonstrate the necessity and effectiveness of our asymmetric multi-modal alignment strategy. From Table 6, we conclude: 1) directly aligning the information-dense textual features results in performance degradation and loss of semantics; 2) aligning only the heavily redundant visual features achieves the best results; 3) aligning the fused features, which can be viewed as symmetric indirect multi-modal alignment, leads to sub-optimal performance due to insufficient alignment of visual information.
Hyper-Parameter Analysis. In our framework, the hyper-parameters γ in Eq. 8 and λ_maa in Eq. 9 are important for generalization, so we further explore their impact on Charades-STA and ActivityNet-Caption. As shown in Figure 4, for γ, a value of around 0.1 yields a considerable gain, while too large a γ hampers the training of the domain-agnostic predictor. As for λ_maa, which scales the alignment loss, the optimal value depends on the distribution gaps between the source domains: the bigger the domain gaps, the larger λ_maa should be set to reinforce alignment, and vice versa.
Query: The man continues scuffing the shoes and shining the shoe and ends with tying them and presenting them.
Query: The child continues using the toy on the clothes while looking up and smiling to the camera.
Figure 5: Qualitative results sampled from the household scene in ActivityNet-Caption.

Qualitative Analysis
To qualitatively evaluate our alignment and decomposition strategy, we show two representative examples from the household scene in ActivityNet-Caption in Figure 5 (the analysis of failure cases can be found in the appendix). In both cases, the base model only learns to localize simple and general actions in the novel scene, such as "smiling at the camera" and "presenting them." By introducing our multi-modal domain alignment method, the representations are forced to be domain-invariant, and the model learns to capture high-level semantic similarities across scenes instead of common overlapping actions. Further, the decomposition approach refines the domain-invariant representations and stabilizes the learning process. Our scene-robust NLVL method thus effectively improves localization accuracy on unseen scenes.

Conclusion
In this paper, we propose the scene-robust NLVL problem for the first time. Our main idea is to learn high-quality domain-invariant feature representations from multiple source domains. By analyzing the extra generalization challenges posed by the NLVL task, we propose a novel NLVL framework that tackles the scene-robust problem from the alignment and decomposition perspectives. With the help of these two branches, we significantly enhance the generalization ability to new scenes. Extensive experiments and detailed ablation studies on three widely used benchmark datasets demonstrate the effectiveness and robustness of our methods.

Limitations
In this work, we formulate the scene-robust NLVL problem for the first time and propose our solution. However, our generalizable NLVL model is still tested on existing closed-world datasets, and its actual performance in real-world scenarios remains to be explored. A real-world, large-scale dataset is required to develop a practical, generalized, open-world query-based video retrieval model.
Luowei Zhou, Chenliang Xu, and Jason J. Corso. 2018. Towards automatic learning of procedures from web instructional videos. In Thirty-Second AAAI Conference on Artificial Intelligence.

A Analysis on Failure Cases
To better understand the limitations of our framework, we elaborate on the failure cases and discuss them in detail. The causes of our prediction errors can be roughly summarized into three categories: • Ambiguity of the ground truth. Due to the complexity of video content, some queries in NLVL datasets correspond to multiple video clips, but the ground truth only labels one of them. As shown in Cases (1, 2) in Figure 6, our predicted segment and the ground-truth annotation both semantically match the given query sentence; however, our predictions are considered completely wrong.
• Inability to distinguish subtle actions in video. In some scenarios, tiny, imperceptible action differences in the video are critical for distinguishing similar clips. As shown in Case (3), in our predicted segment the man is actually refining the drawing with paper and a spatula, rather than painting. A better vision encoder or hierarchical visual feature maps might be essential to distinguish such subtle actions in video.
• Inability to understand detailed words. In the scene-robust NLVL problem, there are inevitable distribution gaps in the vocabularies of descriptions in different scenes, which results in misunderstanding some keywords. In Case (4), our model fails to discern the detailed semantic difference between "solution" and "water" without any information or knowledge about the target domain's vocabulary.

B Distribution Gaps between Scenes
In this section, we explore the differences in text distributions across scenes. Since the videos and sentences in NLVL are semantically matched, the word distribution gaps can also be viewed as semantic gaps between videos in different scenes.
We analyze the word distribution gaps from two perspectives: Word ratio and Vocabulary IoU.
Word ratio. For each scene, we count the ratio of occurrences of each word over all sentences. In the upper parts of Figures 9, 10, and 11, we visualize the word ratios of the same words in different scenes of Charades-STA, ActivityNet-Caption, and YouCook2. The further a word is from the diagonal, the more scene-specific it is. Except for several high-ratio words that are common to all scenes, most words vary widely in ratio across scenes, as shown by the points off the diagonal.
Vocabulary IoU. In the lower parts of Figures 9, 10, and 11, we show the Intersection over Union (IoU) of the words within the top-k% word ratio. In sentences, common words (such as "the," "and," and "is") appear most frequently; therefore, when k% is very small, the vocabulary overlap between two domains is relatively high. However, the IoU decreases sharply as k% increases slightly, which indicates that each scene may contain unique high-frequency words. Note that for Charades-STA, ActivityNet-Caption, and YouCook2, the average IoUs over the entire vocabulary across scenes are only 0.29, 0.45, and 0.40, respectively. The low IoU values across scenes in all three figures illustrate that the distribution gap brought by scene shift is prevalent and significant. This observation further demonstrates the necessity and practicality of our scene-robust NLVL problem formulation.
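The word-ratio and vocabulary-IoU statistics can be reproduced with a sketch along these lines (whitespace tokenization is a simplifying assumption; the paper does not specify its tokenizer):

```python
from collections import Counter

def word_ratios(sentences):
    """Per-scene word occurrence ratios over all sentences of that scene."""
    counts = Counter(w for s in sentences for w in s.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def vocab_iou(ratios_a, ratios_b, top_k_percent=100.0):
    """IoU of the top-k% most frequent words of two scenes' vocabularies."""
    def top(ratios):
        n = max(1, int(len(ratios) * top_k_percent / 100.0))
        ranked = sorted(ratios.items(), key=lambda kv: -kv[1])
        return {w for w, _ in ranked[:n]}
    A, B = top(ratios_a), top(ratios_b)
    return len(A & B) / len(A | B)
```

Sweeping `top_k_percent` from small to large reproduces the trend discussed above: high overlap on the few universal stop words, then a sharp drop as scene-specific high-frequency words enter the top-k set.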

C Implementation Details
Data Processing. For the language query, we use pre-trained GloVe (Pennington et al., 2014) embeddings to initialize each lowercase word, and the visual embeddings are extracted via a 3D ConvNet pre-trained on the Kinetics dataset (Carreira and Zisserman, 2017), following the previous method (Zhang et al., 2020a). Note that all pre-trained feature extractors are fixed during training. All experiments were carried out on a single RTX 2080 Ti GPU.
Network Architecture. Due to its simplicity and effectiveness, VSLBase (Zhang et al., 2020a) can be viewed as a standard proposal-free NLVL model; our network structure is therefore similar to VSLBase to minimize the impact of architectural bias. The video and query feature encoders consist of four convolution layers and a multi-head attention layer. After feature encoding, we use context-query attention (CQA) as our cross-modal representation fusion module. The boundary and confidence coefficient predictors are essentially multi-layer perceptrons (MLPs). We set the kernel size of the convolution layers to 7 and the number of attention heads to 8. The number of video frames Nv is set to 32 for all three datasets, and the hidden dimension is set to 128 for Charades-STA and 256 for YouCook2 and ActivityNet-Caption.

D Ethical Discussion
Natural language video localization has many applications, including video corpus retrieval and video-based dialogue. Our scene-robust NLVL problem formulation makes it more practical and reliable in real-world scenarios. Due to the generalization ability of our method, there may be concerns about the misuse of offensive data. However, our method focuses on scene shift while the general activities remain consistent, such as Charades-STA for indoor activities, ActivityNet-Caption for outdoor activities, and YouCook2 for cooking. Therefore, the model may not achieve reliable generalization performance on entirely new activities.

Case 1 - Query: She then combs the tail.
Case 2 - Query: The man sprays the car with water.
Case 3 - Query: The man paints the draw while talking.
Case 4 - Query: The man sprays the car with a solution.

Figure 8: Failure cases - inability to understand detailed words.
In the reference B3.Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified?For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)?Not applicable.Left blank.
B4. Did you discuss the steps taken to check whether the data that was collected / used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect / anonymize it?Not applicable.Left blank.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.?Not applicable.Left blank.
B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits, etc. for the data that you used / created?Even for commonly-used benchmark datasets, include the number of examples in train / validation / test splits, as these provide necessary context for a reader to understand experimental results.For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be.In the 4.1 Dataset C Did you run computational experiments?
In the sec 4 Experiments C1.Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used?In the sec 4.2 and Appendix.cD1.Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.?Not applicable.Left blank.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)?Not applicable.Left blank.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating?For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used?Not applicable.Left blank.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?Not applicable.Left blank.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data?Not applicable.Left blank.

Figure 1: An illustration of different scenes in an NLVL dataset. These examples are selected from Charades-STA; the videos are in the Bathroom, Living room, and Kitchen, respectively.

Figure 2: An overview of our generalizable NLVL framework for the scene-robust NLVL problem.

Figure 3: Illustration of our decomposition strategy.

Figure 7: Failure cases - inability to distinguish subtle actions in video.


Table 1: Performance on Charades-STA for the scene-robust NLVL task.

Table 2: Performance on ActivityNet-Caption for the scene-robust NLVL problem.

Table 3: Performance on YouCook2 for the scene-robust NLVL problem.

Table 4: Component ablation results on ActivityNet-Caption.

Table 7: Statistics of the three NLVL datasets for the scene-robust problem.