End-to-End Modeling via Information Tree for One-Shot Natural Language Spatial Video Grounding

Natural language spatial video grounding aims to detect the relevant objects in video frames with descriptive sentences as the query. In spite of the great advances, most existing methods rely on dense video frame annotations, which require a tremendous amount of human effort. To achieve effective grounding under a limited annotation budget, we investigate one-shot video grounding and learn to ground natural language in all video frames with solely one frame labeled, in an end-to-end manner. One major challenge of end-to-end one-shot video grounding is the existence of video frames that are either irrelevant to the language query or the labeled frame. Another challenge relates to the limited supervision, which might result in ineffective representation learning. To address these challenges, we design an end-to-end model via Information Tree for One-Shot video grounding (IT-OS). Its key module, the information tree, can eliminate the interference of irrelevant frames based on branch search and branch cropping techniques. In addition, several self-supervised tasks are proposed based on the information tree to improve representation learning under insufficient labeling. Experiments on benchmark datasets demonstrate the effectiveness of our model.


Introduction
Natural language spatial video grounding is a vital task for video-text understanding (Luo and Shakhnarovich, 2017; Hu et al., 2019), which aims to detect the objects described by the natural language query from each video frame, as shown in Figure 1. There is a substantial and rapidly-growing research literature studying this problem with dense annotations (Li et al., 2017; Yamaguchi et al., 2017; Sadhu et al., 2020), where each frame that contains objects relevant to the language query is manually labeled with bounding boxes. Obviously, such annotations require tremendous human effort and can hardly be obtained in real-world scenarios. Recently, some works have investigated weakly-supervised video grounding with solely the video-text correspondence rather than object-text annotations (Huang et al., 2018; Chen et al., 2019a; Shi et al., 2019; Chen et al., 2019b; Zhou et al., 2018). However, the performance is less satisfactory with such weak supervision. In practice, we are more likely to have a limited annotation budget rather than full annotation or no annotation. In addition, as humans, after seeing the language query and one frame's object paired together for the first time, we are able to generalize this finding and identify the object in more frames. Towards this end, we investigate another practical problem setting, i.e., one-shot spatial video grounding, where solely one relevant frame per video is labeled with bounding boxes.
Existing methods that are devised for supervised video grounding are not directly applicable to this novel setting. We summarize several critical challenges: • On the one hand, most of them incorporate a multi-stage training process, i.e., first training a clip localization module and then training an object localization module in the second stage. However, in one-shot spatial video grounding, there are no temporal annotations, which indicate the start/end time of the relevant clip, to train the clip localization module. Moreover, many of them extract video features in a pre-processing step using feature extractors or object detectors pretrained on large-scale datasets. Such independent modeling limits the cooperation of different modules, especially when labels are few. Therefore, there is an urgent need for an end-to-end training framework for one-shot spatial video grounding.
• On the other hand, there are video frames that are either irrelevant to the natural language query or the labeled frame. These irrelevant frames might increase the computation complexity of end-to-end training, and bring confounding between the frame label and (irrelevant) visual features.
• Lastly, with fewer supervision signals, deep representation learning might become error-prone or prone to under-fitting, especially in end-to-end training.
To address these challenges, we devise an end-to-end model via the Information Tree for One-Shot natural language spatial video grounding (IT-OS). Different from previous works, we design a novel tree structure to shield the one-shot learning from frames that are irrelevant to either the language query or the labeled frame. We devise several self-supervised tasks based on the tree structure to strengthen representation learning under limited supervision signals. Specifically, the calculation process of the key module, the information tree, contains four steps: (1) To construct the information tree, we view video frame features as leaf nodes, and then compress adjacent nodes into non-leaf nodes based on their visual similarity to each other and their semantic similarity to the language query; (2) We search the information tree and select branch paths that are consistently relevant to the language query both at the abstractive non-leaf node level and at the fine-grained leaf node level; (3) We drop I) the leaf nodes that do not belong to the same semantic unit as the labeled node, and II) the non-leaf nodes on low-relevance branch paths. We also down-weight the importance of the leaf nodes that belong to the same semantic unit as the labeled node but are on low-relevance paths; (4) Finally, we input the extracted and weighted information into the transformer, and conduct training with the one-shot label and self-supervised tasks, including masked feature prediction and video-text matching. We note that both the information tree and the transformer are jointly trained in an end-to-end manner.
We conduct experiments on two benchmark datasets, which demonstrate the effectiveness of IT-OS over state-of-the-art methods. Extensive analysis, including ablation studies and case studies, jointly demonstrates the merits of IT-OS on one-shot video grounding. Our contributions can be summarized as follows: • To the best of our knowledge, we take the initiative to investigate one-shot natural language spatial video grounding. We design an end-to-end model named IT-OS via an information tree to address the challenges brought by limited labels.
• By leveraging the language query, several novel modules on the information tree, such as tree construction, branch search, and branch cropping, are proposed. Moreover, to strengthen deep representation learning under limited supervision signals, we introduce several self-supervised tasks based on the information tree.
• We experiment with our IT-OS model on two benchmark datasets. Comparisons with the state-of-the-art and extensive model analysis jointly demonstrate the effectiveness of IT-OS.
Related Work

Deep neural networks have convincingly demonstrated high capability in many domains (Gan and Zhang, 2020; Gan et al., 2021; Wu et al., 2020; Guo et al., 2021), especially for video-related tasks (Miao et al., 2021; Xiao et al., 2020) like video grounding. For example, (Li et al., 2017) use a neural network to detect the objects related to the language query in the first frame and track the detected objects through the whole video. Going a step further, (Yamaguchi et al., 2017) and (Vasudevan et al., 2018) extract all object proposals with a pretrained detector and choose the proposal described by the text.
Supervised training for natural language video object detection requires high labeling cost. To reduce it, some researchers turn to weakly-supervised learning with the multiple instance learning (MIL) method (Huang et al., 2018; Chen et al., 2019a; Shi et al., 2019; Chen et al., 2019b; Zhou et al., 2018). (Wang et al., 2021a) transfers contextualized knowledge in cross-modal alignment to relieve the unstable training problem in MIL. Based on contrastive learning (Zhang et al., 2022), (Da et al., 2021) proposes an AsyNCE loss to disentangle false-positive frames in MIL, which allows mitigating the uncertainty from negative instance-sentence pairs. Weakly supervised false-positive identification based on contrastive learning has also witnessed success in other domains (Zhang et al., 2021b; Yao et al., 2022).

One-shot Learning for Videos. One-shot learning has been applied to some other video tasks. A meta-learning-based approach has been proposed to perform one-shot action localization by capturing task-specific prior knowledge. (Wu et al., 2018) investigates the one-shot video person re-identification task by progressively improving the discriminative capability of a CNN via stepwise learning. Different from these works, (Caelles et al., 2017) and (Meinhardt and Leal-Taixé, 2020) define one-shot learning as only one frame being labeled per video. Specifically, (Caelles et al., 2017) use a fully convolutional neural network architecture to solve the one-shot video segmentation task, while (Meinhardt and Leal-Taixé, 2020) decouple the detection task and use a modified Mask R-CNN to predict local segmentation masks. Following this setting, we investigate one-shot natural language spatial video grounding, and devise a novel information-tree-based end-to-end framework for the task.

Model Overview
Problem Formulation.
Given a video $V = \{v_i\}_{i=1}^{I}$ and a natural language query C, spatial video grounding aims to localize the query-described object among all the objects $O_i = \{o_i^j\}_{j=1}^{J}$ for each frame. $I$ denotes the number of frames in the video, and $J$ denotes the number of candidate objects in each frame. In one-shot spatial video grounding, solely one frame $v_i$ in video $V$ is labeled with the bounding boxes of the target objects $O_i$.
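To make the one-shot setting concrete, the following is a minimal sketch of how a training sample could be represented in PyTorch; the field names and the box format are illustrative assumptions rather than part of the paper.

```python
from dataclasses import dataclass
import torch


@dataclass
class OneShotGroundingSample:
    """One training sample for one-shot spatial video grounding (illustrative)."""
    frames: torch.Tensor         # (I, 3, H, W) RGB frames of the video V
    query: str                   # natural language query C
    labeled_frame_idx: int       # index of the single annotated frame v_i
    labeled_boxes: torch.Tensor  # (K, 4) target boxes in that frame, (x1, y1, x2, y2)
```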
Pipeline of IT-OS. As shown in Figure 2, there are mainly four steps involved in the end-to-end modeling of IT-OS: • Firstly, we extract features from the input video and the input caption. Specifically, for the video, we use ResNet-101 (He et al., 2016) as the image encoder to extract the frame feature maps; for the language query, we employ the language model RoBERTa (Liu et al., 2019) (a minimal sketch of the two encoders is given after this list). Both the vision encoder and the language encoder are jointly optimized with the whole network.
• Secondly, we build the information tree to obtain the representation of the video. The information tree is built upon the frame feature maps, which serve as the leaf nodes. Leaf nodes are further merged into non-leaf and root nodes based on the node-node and node-query relevance. Nodes on unnecessary branches are deleted conditioned on the language query.
• Thirdly, we utilize the transformer encoder to reason over the remaining nodes and the language features. Upon the transformer, we devise two self-supervised tasks, i.e., masked feature modeling and video-text matching, which enhance representation learning under limited labels.
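As a reference for the first step, the sketch below shows how the two encoders named above (ResNet-101 and RoBERTa) could be instantiated with torchvision and HuggingFace Transformers while staying trainable; the toy input sizes and the exact way feature maps are taken from the backbone are assumptions made for illustration.

```python
import torch
import torchvision
from transformers import RobertaModel, RobertaTokenizerFast

# Image encoder: ResNet-101 without the average-pooling and classification layers,
# so it outputs spatial feature maps and can be optimized end to end.
resnet = torchvision.models.resnet101(weights=torchvision.models.ResNet101_Weights.IMAGENET1K_V1)
image_encoder = torch.nn.Sequential(*list(resnet.children())[:-2])

# Language encoder: RoBERTa, also fine-tuned jointly with the rest of the network.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base")

frames = torch.randn(6, 3, 224, 224)                          # toy video with I = 6 frames
frame_maps = image_encoder(frames)                            # (6, 2048, 7, 7) frame feature maps
tokens = tokenizer("the brown dog on the sofa", return_tensors="pt")
text_feats = text_encoder(**tokens).last_hidden_state         # (1, L, 768) query token features
```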
Prediction and Training. We follow the common prediction and training protocol of visual transformers used in other object detection models (Wang et al., 2021b). We input the embedding parameters $E_{de}$ and the multi-modal features $F_{de}$ generated by the transformer encoder into the transformer decoder D. Then, the decoder D outputs candidate region features for each frame. For each candidate region, a probability P and a bounding box B are generated.

Figure 2: The overall schema of the proposed end-to-end one-shot video grounding via information tree (IT-OS), which contains query-guided tree construction, query-based branch search & cropping, and a transformer encoder enhanced by self-supervised tasks.
We choose the box B with the highest probability P for each frame as the target box.
During the training process, we first compute the candidate prediction regions. Then, we match the candidate regions with the target boxes and choose the best match for each frame. Finally, we use the matched pairs to train our IT-OS model.
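A minimal sketch of the inference-time selection described above: for each frame, the decoder's candidate boxes are reduced to the single box with the highest predicted score. The tensor shapes and names are illustrative.

```python
import torch


def select_boxes_per_frame(pred_boxes: torch.Tensor, pred_scores: torch.Tensor) -> torch.Tensor:
    """Pick the highest-scoring box for every frame.

    pred_boxes:  (I, Q, 4) candidate boxes per frame, Q decoder queries
    pred_scores: (I, Q)    predicted probability P of each candidate
    returns:     (I, 4)    one target box B per frame
    """
    best = pred_scores.argmax(dim=1)                          # index of the best query per frame
    return pred_boxes[torch.arange(pred_boxes.size(0)), best]
```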

Information Tree Module
In this section, we elaborate on the information tree module in detail. We illustrate how to construct the information tree, how to extract critical information from it, and how to design the self-supervised learning tasks based on the tree. To ease the illustration, we take 6 frames as an example and show the process in Figure 2.

Tree Construction
Given the frame features generated by the CNN, we build the information tree by merging adjacent frame features in a specified order. Specifically, the frame features output by the image encoder are the leaf nodes $N = \{n_i\}_{i=1}^{2M}$. A sliding window of size 2 and step 2 is applied on these nodes, and the nodes in each window are evaluated to decide whether they should be merged.
We calculate the semantic relevance difference between the nodes of each pair with respect to the language query, and the visual relevance between the nodes in each pair. For the visual relevance, we max-pool the feature maps of the $i$-th node pair to obtain the feature vectors $f_v^{2i-1}$ and $f_v^{2i}$, and compute their cosine similarity as the visual relevance:
$r_{vv}^{i} = \cos\left(f_v^{2i-1}, f_v^{2i}\right)$. (1)
Next, we calculate the semantic relevance $r_{tv}^{2i-1}$ and $r_{tv}^{2i}$ between the text feature $f_t$ and the two nodes of the $i$-th pair:
$r_{tv}^{2i-1} = \sigma\left((w_t f_t)^{\top}\, w_v f_v^{2i-1}\right)$, (2)
$r_{tv}^{2i} = \sigma\left((w_t f_t)^{\top}\, w_v f_v^{2i}\right)$, (3)
where $w_t$ and $w_v$ are learnable parameters, and $\sigma$ is the sigmoid activation function. The semantic relevance difference $d_{tv}^{i}$ between the $i$-th paired nodes combines the two relevance terms:
$d_{tv}^{i} = \gamma\, r_{vv}^{i} - \left|r_{tv}^{2i-1} - r_{tv}^{2i}\right|$, (4)
where $\gamma$ is a hyperparameter.
With the relevance difference values, we rank the node pairs and pick out the top λ, where λ is a hyperparameter that can be set as a constant or a percentage. We merge each selected node pair into a new node:
$n_{new} = w_{mg}\,[n_{2i-1}; n_{2i}] + b_{mg}$, (5)
where $w_{mg}$ and $b_{mg}$ are trainable parameters and $[\cdot;\cdot]$ denotes concatenation. The new node $n_{new}$ replaces the old nodes $n_{2i-1}$ and $n_{2i}$ in the queue. We repeat this process until only one node remains in the queue. By saving all nodes produced in this process, together with the composition relationships generated during merging, we obtain the information tree.
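The sketch below illustrates one merge round of the tree construction under the formulation above; since the exact scoring and merge functions are reconstructed, the bilinear relevance, the γ-weighted score, and the linear merge of concatenated features should be read as assumptions rather than the authors' exact design.

```python
import torch
import torch.nn.functional as F


class PairMergeRound(torch.nn.Module):
    """One round of pairwise node merging for the information tree (illustrative sketch)."""

    def __init__(self, dim: int, gamma: float = 0.5, top_ratio: float = 0.6):
        super().__init__()
        self.w_t = torch.nn.Linear(dim, dim, bias=False)  # projection of the text feature (w_t)
        self.w_v = torch.nn.Linear(dim, dim, bias=False)  # projection of node features (w_v)
        self.merge = torch.nn.Linear(2 * dim, dim)        # w_mg, b_mg: merges a node pair
        self.gamma = gamma                                # weight of the visual relevance term
        self.top_ratio = top_ratio                        # lambda: fraction of pairs to merge

    def forward(self, nodes: torch.Tensor, f_t: torch.Tensor) -> torch.Tensor:
        # nodes: (2M, dim) current node queue; f_t: (dim,) pooled text feature.
        left, right = nodes[0::2], nodes[1::2]            # sliding window of size 2, step 2
        r_vv = F.cosine_similarity(left, right, dim=-1)   # visual relevance of each pair, Eq. (1)
        r_tv = torch.sigmoid(self.w_v(nodes) @ self.w_t(f_t))       # semantic relevance, Eqs. (2)-(3)
        d = self.gamma * r_vv - (r_tv[0::2] - r_tv[1::2]).abs()     # mergeability score, Eq. (4)

        num_merge = max(1, int(self.top_ratio * d.numel()))
        to_merge = set(d.topk(num_merge).indices.tolist())

        new_queue = []
        for i in range(left.size(0)):
            if i in to_merge:
                # Eq. (5): the merged node replaces the two children in the queue.
                new_queue.append(self.merge(torch.cat([left[i], right[i]], dim=-1)))
            else:
                new_queue.extend([left[i], right[i]])
        return torch.stack(new_queue)
```

Repeating such rounds until a single node remains, and recording the parent-child relations created along the way, yields the information tree; for brevity the sketch only returns the new node queue.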

Branch Search
We use a branch to denote a subtree. To filter critical local and global information, we perform branch search and selection. We first select branches that contain fewer than $\delta_{max}$ and more than $\delta_{min}$ leaf nodes, where $\delta_{max}$ and $\delta_{min}$ are hyperparameters. We then calculate the semantic relevance between each branch's root node and the language query based on Equation 2.
Training. During training, we directly select the branch that contains the labeled leaf node and whose root node has the highest semantic relevance. This selection improves training efficiency.

Inference.
During inference, all frames should be processed. We therefore conduct an iterative search with multiple steps. In each step, we select the branch with the highest semantic relevance and remove it from the information tree. After the search, we have multiple selected branches, and each branch is forwarded to the following processes.
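An illustrative sketch of the branch search: branches whose leaf count lies within (δ_min, δ_max) are candidates, and at inference they are selected greedily by root-node relevance while already-covered frames are treated as removed from the tree. The simple tree data structure and the precomputed relevance scores are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeNode:
    relevance: float                                  # semantic relevance r_tv of this node
    children: List["TreeNode"] = field(default_factory=list)
    frame_idx: Optional[int] = None                   # set for leaf (frame) nodes only

    def leaves(self) -> List["TreeNode"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def search_branches(root: TreeNode, d_min: int, d_max: int) -> List[TreeNode]:
    """Iteratively pick relevant branches whose leaf count is within (d_min, d_max)."""
    candidates: List[TreeNode] = []

    def collect(node: TreeNode) -> None:
        if d_min < len(node.leaves()) < d_max:
            candidates.append(node)
        for child in node.children:
            collect(child)

    collect(root)
    selected, covered = [], set()
    # Greedy approximation of the iterative search: take the most relevant branch first,
    # then skip any candidate whose frames were already covered (i.e. removed from the tree).
    for node in sorted(candidates, key=lambda n: n.relevance, reverse=True):
        frames = {leaf.frame_idx for leaf in node.leaves()}
        if frames.isdisjoint(covered):
            selected.append(node)
            covered |= frames
    return selected
```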

Branch Cropping
Note that not all the non-leaf nodes in the selected branches are closely related to the input caption. We remove non-leaf nodes whose semantic relevance is less than ∆, a hyperparameter; their descendant non-leaf nodes are also removed. To reserve enough frame nodes for training, we do not remove the descendant leaf nodes. Instead, we down-weight them with λ = 0.5; for other leaf nodes, λ = 1. The remaining leaf nodes and non-leaf nodes represent the critical local information and the global information, respectively. We multiply the feature of node $i$ by its weight and its semantic relevance $r_{tv}^{i}$:
$f_{vnew}^{i} = \lambda\, r_{tv}^{i}\, f_v^{i}$, (6)
where $f_{vnew}^{i}$ is the feature vector input into the transformer. As such, Equation 6 considers both the local relevance $r_{tv}$ and the global relevance λ with respect to the language query.
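A sketch of the cropping and re-weighting step over the nodes of a selected branch: non-leaf nodes below the threshold ∆ (together with their non-leaf descendants, here flagged by the caller) are dropped, leaf nodes under cropped nodes keep weight λ = 0.5, and every surviving node feature is scaled as in Equation 6. All flags and shapes are illustrative assumptions.

```python
import torch


def crop_and_reweight(node_feats: torch.Tensor,
                      r_tv: torch.Tensor,
                      is_leaf: torch.Tensor,
                      under_cropped: torch.Tensor,
                      delta: float = 0.7) -> torch.Tensor:
    """Crop low-relevance non-leaf nodes and re-weight the remaining ones (illustrative).

    node_feats:    (N, dim) features of the nodes in a selected branch
    r_tv:          (N,)     semantic relevance of each node to the query
    is_leaf:       (N,)     bool, True for leaf (frame) nodes
    under_cropped: (N,)     bool, True if the node hangs below a cropped non-leaf node
    """
    # Non-leaf nodes are kept only if their relevance reaches delta and no ancestor was cropped.
    keep = is_leaf | ((r_tv >= delta) & ~under_cropped)
    # Leaf nodes below cropped nodes are down-weighted (lambda = 0.5); other leaves use 1.0.
    lam = torch.where(is_leaf & under_cropped, torch.tensor(0.5), torch.tensor(1.0))
    f_new = lam.unsqueeze(-1) * r_tv.unsqueeze(-1) * node_feats   # Eq. (6): lambda * r_tv * f_v
    return f_new[keep]
```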

Self-supervised Tasks
We leverage a transformer encoder to fuse the extracted information and the language query. As shown in Figure 2, we design two self-supervised tasks: 1) predicting the masked text features and the masked local/global video information; 2) judging whether the text and the video match. For the transformer, the input tokens $F_{in}$ consist of the local information, the global information, and the text features, i.e., three types of tokens. We further introduce a 2-D position embedding for video tokens and a type embedding for all tokens, which are added to the token features.
Then, the features $F_{in}$ are fed into the transformer encoder $E$, which outputs the fused features $F_{out}$:
$F_{out} = E(F_{in})$. (7)
We predict the original features of the masked language tokens and the masked video tokens (leaf/non-leaf nodes in the selected branch) using multilayer perceptrons:
$\hat{f}_t = MLP_t(F_{out})$, $\quad \hat{f}_v = MLP_v(F_{out})$, (8)
where $MLP_t$ and $MLP_v$ are the multilayer perceptrons for text and video features, respectively. We view masked token modeling as feature regression and adopt the L2 distance as the loss function. In addition, the language query is replaced with a mismatched one at a rate of 50%. We propose to predict whether the video and the language match, i.e., whether the video contains the event described by the language query, based on the output representation of the [CLS] token. When the video and the language are not matched, we do not train the model with the one-shot label.
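The two auxiliary objectives can be summarized as below: an L2 feature-regression loss on masked video and text tokens, and a binary video-text matching loss on the [CLS] representation. The head architectures and the way masked targets are gathered are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


class SelfSupervisedHeads(torch.nn.Module):
    """Masked feature modeling and video-text matching heads (illustrative sketch)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mlp_t = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim))
        self.mlp_v = torch.nn.Sequential(torch.nn.Linear(dim, dim), torch.nn.ReLU(), torch.nn.Linear(dim, dim))
        self.match_head = torch.nn.Linear(dim, 1)          # binary matching from the [CLS] token

    def forward(self, f_out, text_mask_idx, video_mask_idx, text_targets, video_targets, match_label):
        # f_out: (N, dim) transformer outputs; position 0 is assumed to be [CLS].
        loss_mfm_t = F.mse_loss(self.mlp_t(f_out[text_mask_idx]), text_targets)    # masked text tokens
        loss_mfm_v = F.mse_loss(self.mlp_v(f_out[video_mask_idx]), video_targets)  # masked video nodes
        loss_match = F.binary_cross_entropy_with_logits(self.match_head(f_out[0]), match_label)
        return loss_mfm_t + loss_mfm_v + loss_match
```

When the matching label indicates a mismatched query, the grounding (one-shot) loss would be skipped for that sample, as described above.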
Table 2: Comparison with baselines on VID-sentence. All methods are trained in the one-shot setting. * indicates that MDETR is applied to these baselines as the object detection backbone.

Datasets. (1) VidSTG (Zhang et al., 2020g) contains 55,135 interrogative sentences and 44,808 declarative sentences. These sentences describe 79 types of objects appearing in the videos. We follow the official dataset split of (Zhang et al., 2020g).
(2) VID-sentence (Chen et al., 2019b) is another widely used video grounding benchmark constructed from the VID (Russakovsky et al., 2015) dataset. There are 30 categories and 7,654 video clips in this dataset. We follow the official dataset split for evaluation.
Implementation Detail. For video preprocessing, we randomly resize the frames with a maximum size of 640 × 640. Other data augmentation methods, such as random horizontal flipping and random size cropping, are used at the same time. During training, the learning rate is 0.00005 by default and decays by a factor of 10 every 35 epochs. The batch size is 1 and the maximum number of training epochs is 100. We implement IT-OS in PyTorch and train it on a Linux server. For model hyperparameters, we set λ = 60% and ∆ = 0.7. Most natural language spatial video grounding models use a pretrained detection model as the backbone. Thus, like them, we choose the official pretrained MDETR (Kamath et al., 2021) as the parameter basis for the target detection of our IT-OS.
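For reference, the schedule described above (initial learning rate 5e-5, decayed by a factor of 10 every 35 epochs, batch size 1, up to 100 epochs) corresponds to the following PyTorch setup; the optimizer type is not stated in the paper, so AdamW and the placeholder model/data are assumptions.

```python
import torch

model = torch.nn.Linear(10, 4)                                   # placeholder for the IT-OS network
train_loader = [(torch.randn(1, 10), torch.randn(1, 4))]         # placeholder loader, batch size 1

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=35, gamma=0.1)  # lr / 10 every 35 epochs

for epoch in range(100):                                         # maximum of 100 training epochs
    for frames, target in train_loader:
        loss = torch.nn.functional.mse_loss(model(frames), target)  # stands in for grounding + SSL losses
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```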

Evaluation Metrics
We follow the evaluation protocol of (Chen et al., 2019b). Specifically, we compute the Intersection over Union (IoU) between the predicted spatial bounding box and the ground-truth box for each frame. The prediction for a video is considered "accurate" if the average IoU over all frames exceeds a threshold α, which is set to 0.4, 0.5, and 0.6 during testing.
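This protocol can be implemented directly as below: the per-frame IoU between the predicted and ground-truth boxes is averaged over the video and compared against α. The (x1, y1, x2, y2) box format is an assumption.

```python
import torch
from torchvision.ops import box_iou


def video_is_accurate(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor, alpha: float = 0.5) -> bool:
    """pred_boxes, gt_boxes: (I, 4) one box per frame in (x1, y1, x2, y2) format."""
    per_frame_iou = box_iou(pred_boxes, gt_boxes).diagonal()     # IoU of each predicted/GT pair
    return per_frame_iou.mean().item() > alpha                   # accurate if the average IoU exceeds alpha
```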
Baselines. Since existing video grounding methods are not directly applicable to the one-shot setting, we extend several state-of-the-art methods as baselines. Specifically, to have a comprehensive comparison, we consider 1) fully supervised models, including VOGnet (Sadhu et al., 2020), OMRN (Zhang et al., 2020f) and STGRN (Zhang et al., 2020g); and 2) other widely known methods, including the video person grounding method STPR (Yamaguchi et al., 2017) and the visual grounding method GroundeR (Rohrbach et al., 2016).

Performance Comparison
The experimental results for one-shot video grounding on the VidSTG and VID-sentence datasets are shown in Tables 1 and 2, respectively. According to the results, we have the following observations: • Not surprisingly, although extended to the video grounding setting, baselines that originate from other domains, including the video person grounding method STPR and the visual grounding method GroundeR, achieve inferior results on video grounding benchmarks, as they lack domain-specific knowledge. • The baselines are implemented with the backbones used in their original papers, which are different from ours. To further disentangle the sources of performance improvement, we re-implement the best-performing baselines (VOGnet* and OMRN*) with the same object detection backbone, MDETR, as IT-OS. Although there is performance improvement with the new backbone, the best-performing baseline, OMRN*, still underperforms IT-OS by over 4 points in average accuracy on all datasets. This further reveals the effectiveness of our novel model designs, eliminating the interference of different pre-training parameters. We attribute the improvement to the end-to-end modeling, where different modules can simultaneously benefit from each other. In addition, the proposed information tree alleviates the negative effects of irrelevant frames, and effectively models the interactions between the video global/local information and the language query. Several self-supervised learning tasks based on the information tree enhance the representation learning under limited one-shot labels.

Comparison with Fully Supervised Methods
We are interested in 1) how different baselines and IT-OS perform under the fully supervised setting; and 2) how one-shot IT-OS performs compared to these fully supervised baselines. Towards this end, we train multiple baselines and IT-OS with all labels on the VID-sentence dataset. The experimental results are shown in Table 3. From the table, we have the following findings: • Remarkably, the performance gap between one-shot IT-OS and the fully supervised version is less than 4%. Such a minor gap demonstrates the effectiveness of IT-OS in learning with limited annotations. This is a significant and practical merit, since we are more likely to have a limited annotation budget in real-world applications.
• Surprisingly, one-shot IT-OS can still outperform some weak baselines such as GroundeR and STPR. These results reveal the necessity of end-to-end modeling for video grounding.
• OMRN under the fully supervised setting achieves comparable performance with IT-OS (supervised). OMRN incorporates many advanced techniques, such as fine-grained object relation reasoning, that require sufficient labeling. Recall that, in the one-shot setting, IT-OS outperforms OMRN by nearly 7 points. These results further demonstrate the merits of IT-OS on one-shot video grounding.

Ablation Study
We are interested in how different building blocks contribute to the effectiveness of IT-OS. To this end, we surgically remove several components from IT-OS and construct different architectures. The investigated components include the information tree (Γ_tree), the branch cropping (Γ_crop), and the self-supervised training (Γ_self). It is worth noting that, except for the branch cropping, the other components of IT-OS cannot be removed independently; thus, we do not conduct an ablation study for them. Results on the VidSTG and VID-sentence datasets are shown in Tables 4 and 5. • Stacking multiple components outperforms the architecture with a single component. This result reveals that the proposed components can benefit from each other in end-to-end training and jointly boost one-shot video grounding.

Case Study
We conduct a case study to qualitatively illustrate the ability of IT-OS in detail. Specifically, we randomly sample three videos from the datasets, and sample 6 frames from each video for visualization. We compare our IT-OS model with the baseline method OMRN and with the basic ablation model of IT-OS, from which the self-supervised module and the information tree are removed. As shown in Figure 3, we have the following key findings: (1) IT-OS picks the correct object out of all objects in the video more accurately than the best-performing previous method, which demonstrates the stronger representation extraction and analysis capabilities of our model. (2) Even when the target object is selected correctly, IT-OS localizes a more precise region than the previous two-stage method. The results reflect that the end-to-end model IT-OS acquires more accurate domain knowledge through fine-tuning on the target dataset.
(3) After adding the information tree and the self-supervised module, IT-OS outputs more precise bounding boxes. This reveals that combining the two modules introduces stronger supervision signals for model training, so that the model has stronger detection ability. In Figure 3, IT-OS (Base) represents the IT-OS model without the self-supervised module and the information tree, and GT represents the ground-truth labels.

Conclusion
In this paper, we introduce one-shot learning into the natural language spatial video grounding task to reduce the labeling cost. To achieve this goal, the key is to make full use of the single labeled frame of each video. Frames that are irrelevant to the input text and the target objects bring confounding into the one-shot training process. We design an end-to-end model (IT-OS) via the information tree to avoid this. Specifically, the information tree module merges frames with similar semantics into one node. Then, by searching the tree and cropping invalid nodes, we obtain the complete and valid semantic units of the video. Finally, two self-supervised tasks are used to compensate for the insufficient supervision.