Weakly-Supervised Temporal Article Grounding

Given a long untrimmed video and natural language queries, video grounding (VG) aims to temporally localize the semantically-aligned video segments. Almost all existing VG work holds two simple but unrealistic assumptions: 1) all query sentences can be grounded in the corresponding video, and 2) all query sentences for the same video are always at the same semantic scale. Unfortunately, both assumptions make today's VG models fail to work in practice. For example, in real-world multimodal assets (e.g., news articles), most of the sentences in an article cannot be grounded in the affiliated videos, and they typically have rich hierarchical relations (i.e., they are at different semantic scales). To this end, we propose a new and challenging grounding task: Weakly-Supervised temporal Article Grounding (WSAG). Specifically, given an article and a relevant video, WSAG aims to localize all "groundable" sentences in the video, where these sentences are possibly at different semantic scales. Accordingly, we collect the first WSAG dataset, YouwikiHow, which exploits the inherent multi-scale descriptions of wikiHow articles together with plentiful YouTube videos. In addition, we propose a simple but effective method, DualMIL, which consists of a two-level MIL loss and single-/cross-sentence constraint losses. These training objectives are carefully designed for the relaxed assumptions. Extensive ablations verify the effectiveness of DualMIL.


Introduction
Video Grounding (VG), i.e., localizing video segments that semantically correspond to (are in a coreference relation with) query sentences, is one of the fundamental tasks in multimodal understanding. Further, video grounding can serve as an indispensable technique for many downstream applications, such as text-oriented highlight detection (Lei et al., 2021).

Figure 1: The query is an article, which consists of multiple sentences at different scales (e.g., How to Make Pancakes). High-level and low-level sentences are denoted with corresponding formats. ✓ and ✗ denote that a sentence can or cannot be grounded in the video, respectively.
Early VG efforts mainly focus on single sentence grounding (Gao et al., 2017; Hendricks et al., 2017) (cf. Figure 1(a)). Thanks to advanced representation learning and multimodal fusion techniques, single sentence VG has achieved unprecedented progress over the recent years (Cao et al., 2021). The next step towards general VG is to ground multiple sentences to the same video (cf. Figure 1(b)). A straightforward solution for multi-sentence VG is applying a single sentence VG model to each sentence individually. Since the query sentences associated with the same video are always semantically related, recent multi-sentence VG methods instead ground all queries simultaneously by considering their temporal order or semantic relations (Bao et al., 2021; Shi et al., 2021).
Figure 2: The only supervision for WSAG is a wikiHow article (e.g., How to Make Lemonade) and some corresponding YouTube videos about the same task.

Unfortunately, all existing VG attempts hold two simple but unrealistic assumptions: 1) All query sentences can be grounded in the corresponding video. Although this assumption is acceptable for the VG task itself, it greatly limits the usage of VG models on real-world multimodal assets. For example, in news articles, most of the sentences in an article cannot be grounded in their affiliated videos. 2) All query sentences for the same video are always at the same semantic scale. By "same scale", we mean that all VG models overlook the hierarchical (or subevent) relations (Aldawsari and Finlayson, 2019; Yao et al., 2020) between these query sentences. For example, in Figure 1(c), the sentence "Stir gently, leaving some small clumps of dry ingredients in the batter" (S2) is one of the subevents of "Add the butter and milk to the mix" (S1), i.e., S1 and S2 are at different semantic scales. Thus, the second assumption makes current VG models fail to perceive semantic scales and achieve unsatisfactory performance with multi-scale queries.
To this end, we propose a more realistic but challenging grounding task: Article Grounding (AG), which relaxes both above-mentioned assumptions. Specifically, given a video and a relevant article (i.e., a sequence of sentences), AG requires the model to localize only "groundable" sentences as video segments, and these sentences are possibly at different semantic scales. To further avoid manual annotations for a large-scale training set, in this paper we consider a more meaningful setting: weakly-supervised AG (WSAG). As shown in Figure 2, the only supervision for WSAG is that the given video and article are about the same task.
Since there is no prior work on WSAG, we collect a new dataset, YouwikiHow, to benchmark the research. YouwikiHow is built on top of wikiHow articles and YouTube videos. In particular, we group a wikiHow article and an arbitrary video about the same task as a document-level pair (cf. Figure 2). For the training set, we conduct a set of carefully designed operations to control the quality of training samples, e.g., task filtering and sentence simplification. For the test set, we directly borrow the manual step grounding annotations of the existing CrossTask (Zhukov et al., 2019) dataset and propagate them to wikiHow article sentences.
In addition, we propose a simple but effective dual-loss-constrained MIL-based method for WSAG, dubbed DualMIL. Specifically, for the first assumption, we relax the widely-used Multiple Instance Learning (MIL) loss into a two-level MIL loss. By "two-level", we mean that we regard all sentences of each article (sentence-level) and all proposals of each sentence (segment-level) as "bags" at two different levels. Then, we obtain the global video-article matching score by aggregating all matching scores over the two-level bags. This two-level MIL inherently allows some queries to remain ungrounded in the video. Meanwhile, to avoid obtaining many highly-overlapping segments, we propose a single-sentence constraint to suppress proposals whose neighbor proposals have higher matching scores with the query. For the second assumption, we enhance models' abilities to perceive queries at different semantic scales by considering the hierarchical relations across sentences. In particular, we assume that for highly matched proposals, high-level sentences should be more likely to be grounded than their low-level sentences, and propose a cross-sentence constraint loss. We show the effectiveness of DualMIL over state-of-the-art methods through extensive ablations.
In summary, we make three contributions: 1. To the best of our knowledge, we are the first to discuss the two unrealistic assumptions: that all query sentences are groundable and that all query sentences are at the same semantic scale. Meanwhile, we propose the meaningful WSAG task. 2. To benchmark the research, we collect the first WSAG dataset: YouwikiHow. 3. We further propose a simple but effective method, DualMIL, which consists of three different model-agnostic training objectives.

Related Work
Single Sentence & Multi-Sentence VG. Mainstream solutions for single sentence VG can be coarsely categorized into two groups: 1) Top-down methods (Hendricks et al., 2017; Gao et al., 2017; Zhang et al., 2019, 2020b; Chen et al., 2018; Yuan et al., 2019a, 2021; Wang et al., 2020; Xiao et al., 2021b,a; Liu et al., 2021b,a; Lan et al., 2022): They first cut the given video into a set of segment proposals with different durations, and then calculate matching scores between the query and all segment proposals. Their performance heavily relies on predefined rules for proposal settings (e.g., temporal sizes). 2) Bottom-up methods (Yuan et al., 2019b; Lu et al., 2019; Zeng et al., 2020; Chen et al., 2020a, 2018; Zhang et al., 2020a): They directly predict the two temporal boundaries of the target segment by regarding the query as a conditional input. Compared to their top-down counterparts, bottom-up methods often fail to consider the global context between the two boundaries (i.e., inside the segment). In this paper, we follow the top-down framework, and our DualMIL is model-agnostic.
Existing multi-sentence VG work all takes one assumption: the query sentences are ordered consistently with their corresponding segments. This is an unrealistic and artificial setting. In contrast, real-world articles usually do not meet this strict requirement, and most of their sentences are not even groundable in the affiliated videos. In this paper, we adopt more realistic assumptions for the multi-sentence VG problem.
Weakly-Supervised VG. Since the agreement on manually annotated target segments tends to be low (Otani et al., 2020), a surge of efforts aims to solve this challenging task in a weakly-supervised manner, i.e., with only video-level supervision at the training stage. Currently, there are two typical frameworks: 1) MIL-based (Gao et al., 2019; Mithun et al., 2019; Chen et al., 2020b; Ma et al., 2020; Zhang et al., 2020c,d; Tan et al., 2021): They first calculate the matching scores between the query sentence and all segment proposals, and then aggregate the scores of multiple proposals as the score of the whole "bag". State-of-the-art MIL-based methods usually focus on designing better positive/negative bag selections. 2) Reconstruction-based (Duan et al., 2018; Lin et al., 2020): They utilize the consistency between the dual tasks of sentence localization and caption generation, and infer the final grounding results from intermediate attention weights. Among them, the most related work to ours is CRM (Huang et al., 2021), which considers both the multi-sentence and weakly-supervised settings. Compared to CRM, our setting is more challenging: a) sentences are at different scales; b) not all sentences are groundable; and c) sentence order is not consistent with the ground-truth segments.

Multi-Scale VL Benchmarks. With the development of large-scale annotation tools, hundreds of video-language (VL) datasets have been proposed. To the best of our knowledge, three (types of) VL datasets have also considered the multiple semantic scale issue: 1) TACoS Multi-Level (Rohrbach et al., 2014): It provides three-level summaries for videos. However, its middle-level sentences read more like extractive (rather than abstractive) summarization. Thus, the grounding results for different-scale sentences may be the same. 2) Movie-related (Xiong et al., 2019; Huang et al., 2020; Bain et al., 2020): They usually have multiple levels of sentences describing videos, such as overviews, storylines, plots, and synopses. They have two characteristics: a) numerous sentences are abstract descriptions, i.e., they do not have exact grounding temporal boundaries; and b) the high-level summaries are more like highlights or salient events. 3) COIN (Tang et al., 2019): It defines multi-level predefined steps. Thus, it sacrifices the ability to ground open-ended queries.

Dataset: YouwikiHow
We built the YouwikiHow dataset from wikiHow articles and YouTube videos. As shown in Figure 2, we group a wikiHow article and any video about the same task as a pair. Thanks to the inherent hierarchical structure of wikiHow articles, we can easily obtain sentences at different scales: high-level summaries and low-level details. As in Figure 2, "Pour the lemon juice into the pitcher." is a high-level sentence summary, and "You may add the pulp if ... along with the seeds." is a low-level sentence detail of this summary. In this section, we first introduce the details of dataset construction, and then compare YouwikiHow to existing VG benchmarks.

Training Set
Initial Visual Tasks. Each wikiHow article describes a sequence of steps instructing humans to perform a certain "task", and these tasks range from physical-world interactions to abstract mental well-being improvement. In YouwikiHow, we follow (Miech et al., 2019) and only focus on "visual tasks". This gives us 25K tasks to begin with.

Task-Related Videos. We also follow (Miech et al., 2019) and use the same preprocessing steps (e.g., removing videos with few views or too-short durations) to obtain initial task-related videos for each task. To further control the quality and ensure sufficient training videos for each task, we restrict the videos to the top 50 search results and require at least 30 training videos per task. This step prunes the number of tasks from 25K to 2.3K.

Sentence Quality Control. First, to avoid over-long articles, we filter out all tasks with verbose articles. Specifically, we set the maximum number of sentence summaries and details to 10 and 30, respectively. This filtering step decreases the task number to 1.4K. Meanwhile, since original wikiHow articles usually contain unimportant modifiers or quantifiers, we further conduct rule-based sentence simplification (Al-Thanyyan and Azmi, 2021) based on POS and dependency parse tags.

Test Set
For the test set, we directly build on top of the existing CrossTask (Zhukov et al., 2019) dataset and reuse its manual temporal grounding annotations. Specifically, CrossTask was originally proposed for step segmentation and consists of 18 primary wikiHow tasks. For each task, it collects corresponding YouTube videos and annotates, for each video, the temporal grounding boundaries of the predefined task-specific steps. Then, we manually link the steps to the wikiHow articles and propagate these annotations as the ground truth for wikiHow sentences. We conduct the same sentence simplification steps on all wikiHow articles in the test set, and remove the tasks with over-long articles.
Unfortunately, when performing the manual linking between CrossTask steps and wikiHow articles, we found it difficult to link these steps to low-level details: almost all steps are linked to high-level summaries. To this end, we design different evaluation metrics for high-/low-level sentences to bypass this limitation. (Details are in Sec. 5.1.)

Comparison with Existing VG Datasets
We compare our collected YouwikiHow with other prevalent VG and step segmentation datasets in Table 1. (For the test set, we removed three tasks whose wikiHow articles all have over 60 sentences: Make Kimchi Fried Rice, Add Oil to Your Car, and Make French Strawberry Cake.)

Proposed Approach for WSAG

Problem Formulation. WSAG is defined as follows: Given an untrimmed video V and a relevant article A with multi-scale sentences, WSAG needs to predict all possible temporal locations for all groundable sentences, i.e., one sentence may refer to either multiple segments or even none.
In this paper, we consider sentences at two scales. Specifically, as shown in Figure 3, article A is organized as $A = \{s^h_1, s^{l_1}_1, \dots, s^{l_1}_{n_1}; s^h_2, \dots; s^h_m, \dots, s^{l_m}_{n_m}\}$, where $s^h_k$ is the k-th high-level summary, and $s^{l_k}_i$ is the i-th low-level detail of $s^h_k$. There are m high-level summaries in total, and each high-level summary $s^h_k$ has $n_k$ low-level details. To show more generalized abilities, we assume that the scale prior of each sentence is unknown at test stage.
In this section, we first go through the grounding architecture in Sec. 4.1. Then, we detail each component of DualMIL in Sec. 4.2.

Basic Visual Grounding Architecture
Since DualMIL is a model-agnostic training strategy, we follow the SOTA proposal-based model 2D-TAN (Zhang et al., 2020b) and use it as our baseline. As shown in Figure 4, it consists of three parts:

Video Feature Encoding. Given video V, we first use a pretrained video feature extractor to extract clip features, and sample the video features evenly into N clips. Then, we utilize the 2D-map proposal strategy: all segment proposals are organized into a 2D temporal map M, where each element $m_{ij} \in M$ represents the candidate segment starting at clip i and ending at clip j. We extract each proposal feature by averaging all its inside clip features, and then stack a few conv-layers to further encode the context. Finally, we obtain the 2D feature map $F^M \in \mathbb{R}^{N \times N \times d_v}$, where each element $F^M_{ij}$ denotes the feature of segment proposal $m_{ij}$.

Text Feature Encoding. For each sentence $s_i = \{w^i_j\}$ in article A, we first use GloVe embeddings (Pennington et al., 2014) to encode each word w, and then feed all word embeddings into a Bi-LSTM. The final hidden state of the Bi-LSTM is taken as the sentence feature, denoted as $F^S_i \in \mathbb{R}^{d_s}$.

Multimodal Matching. After obtaining the video feature $F^M$ and all sentence features $\{F^S_i\}$, we fuse the two features by Hadamard product: $F_{ij,k} = (\mathbf{w}_v F^M_{ij}) \odot (\mathbf{w}_s F^S_k)$, where $\mathbf{w}_s \in \mathbb{R}^{d_h \times d_s}$ and $\mathbf{w}_v \in \mathbb{R}^{d_h \times d_v}$ are two learnable MLPs, which map the two modality features into a common space. Reorganizing the $F_{ij,k}$ into the 2D-map format, we obtain $\bar{F}_k \in \mathbb{R}^{N \times N \times d_h}$, which denotes the fused features between sentence $s_k$ and all segment proposals M. We then adopt several conv-layers to obtain context-aware multimodal 2D feature maps. Finally, these feature maps are fed into a classifier to predict the matching score maps $\{P_k\}$, where $P_k \in \mathbb{R}^{N \times N}$ denotes the matching scores between all segment proposals M and sentence $s_k$.
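The 2D-map proposal construction and Hadamard-product fusion described above can be sketched as follows. This is a minimal NumPy toy, not the actual model: dimensions are made up, the learned conv-layers are omitted, and the "MLPs" and classifier are random matrices used only to show the tensor shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_v, d_s, d_h = 8, 16, 16, 32        # clips, video/text/hidden dims (toy sizes)

clip_feats = rng.normal(size=(N, d_v))  # per-clip video features

# 2D-map proposal features: entry (i, j) averages clips i..j (only j >= i is valid).
F_M = np.zeros((N, N, d_v))
valid = np.zeros((N, N), dtype=bool)
for i in range(N):
    for j in range(i, N):
        F_M[i, j] = clip_feats[i:j + 1].mean(axis=0)
        valid[i, j] = True

# Fuse one sentence feature with every proposal via Hadamard product.
w_v = rng.normal(size=(d_v, d_h))       # stands in for the learnable projection
w_s = rng.normal(size=(d_s, d_h))
F_S = rng.normal(size=(d_s,))           # one sentence feature
fused = (F_M @ w_v) * (F_S @ w_s)       # shape (N, N, d_h)

# A linear "classifier" + sigmoid yields the matching-score map P_k.
w_cls = rng.normal(size=(d_h,))
logits = np.clip(fused @ w_cls, -50, 50)        # clip for numerical stability
P_k = 1.0 / (1.0 + np.exp(-logits))
P_k = np.where(valid, P_k, 0.0)                 # mask invalid (j < i) entries
print(P_k.shape)
```

Note the upper-triangular mask: a proposal that ends before it starts is meaningless, so only entries with j ≥ i carry scores.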

Two-level MIL Training Objective
Since not all sentences in article A are groundable to the given video V, we only select the top-$k_1$ sentences with the highest matching scores to represent the whole article. As for the matching score between each sentence and the video, we average the similarity scores among the top-$k_2$ proposals:

$\mathrm{sim}(V, s_k) = \frac{1}{k_2} \sum_{(i,j) \in \mathrm{top}\text{-}k_2(P_k)} P^{ij}_k, \quad (2)$

where $\mathrm{sim}(V, s_k)$ denotes the similarity score between video V and sentence $s_k$. Similarly, we use $\mathrm{sim}(V, A)_i$ to denote the similarity score between the top-i sentence in A and video V (i.e., $i \le k_1$).
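The two-level aggregation (top-$k_2$ proposals per sentence, top-$k_1$ sentences per article) can be sketched as below. The function names `sim_video_sentence` and `sim_video_article` are illustrative, and the score maps are random toys.

```python
import numpy as np

def sim_video_sentence(P_k, k2):
    """sim(V, s_k): average of the k2 highest proposal scores for sentence k."""
    flat = np.sort(P_k.ravel())[::-1]
    return flat[:k2].mean()

def sim_video_article(score_maps, k1, k2):
    """Aggregate sentence-level scores: average over the top-k1 sentences."""
    sent_scores = np.array([sim_video_sentence(P, k2) for P in score_maps])
    top = np.sort(sent_scores)[::-1][:k1]
    return top.mean()

rng = np.random.default_rng(1)
maps = [rng.random((8, 8)) for _ in range(5)]   # 5 sentences, toy score maps
print(sim_video_article(maps, k1=3, k2=4))
```

Because only the top-$k_1$ sentences contribute, a sentence with uniformly low scores (i.e., one that is not groundable) simply drops out of the article-level score.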
We train the whole model with a ranking loss. Specifically, we treat video V and its same-task article A as a positive pair (V, A). Then we randomly replace the video or article with other-task videos or articles to obtain negative pairs, denoted as $(V^-, A)$ and $(V, A^-)$ respectively. The two-level MIL loss is then written as $L_{\mathrm{MIL}} = \sum_i \sum_j L^{ij}_{\mathrm{MIL}}$, and

$L^{ij}_{\mathrm{MIL}} = \max(0, \Delta - \mathrm{sim}(V, A)_i + \mathrm{sim}(V^-_j, A)_i) + \max(0, \Delta - \mathrm{sim}(V, A)_i + \mathrm{sim}(V, A^-_j)_i),$

where $\Delta$ is a predefined margin, i indexes the top-$k_1$ sentences, and j indexes the sampled negative pairs.
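A minimal sketch of the margin-based ranking objective for one positive pair and one negative of each kind. The exact aggregation over sentences and negatives in the paper was lost in extraction, so `mil_ranking_loss` is an illustrative per-pair term, not the authors' exact formulation.

```python
def mil_ranking_loss(pos, neg_v, neg_a, delta=0.3):
    """Hinge ranking loss for one video-article pair.

    pos, neg_v, neg_a are sim(V, A) scores for the positive pair and the
    two kinds of negative pairs, (V-, A) and (V, A-)."""
    return (max(0.0, delta - pos + neg_v) +
            max(0.0, delta - pos + neg_a))

# A well-matched positive pair with easy negatives incurs no loss...
print(mil_ranking_loss(pos=0.9, neg_v=0.2, neg_a=0.3))        # 0.0
# ...while a hard negative within the margin is penalized.
print(mil_ranking_loss(pos=0.5, neg_v=0.45, neg_a=0.2) > 0)   # True
```

The margin delta=0.3 matches the value reported in the implementation details; everything else is a toy.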

Single-Sentence Constraint
Since each sentence may be grounded in multiple segments, we need to predict the similarity scores between each query sentence and all segment proposals. To force WSAG models to make sparse predictions, we propose the single-sentence constraint to enhance the two-level MIL training. By "sparse", we mean that only a few proposals are selected as results for each groundable sentence. Specifically, before selecting the top-$k_2$ segment proposals as in Eq. (2), we conduct a sparse filtering step to suppress (or filter out) proposals by two rules: 1) there are other proposals with higher video-sentence matching scores among their local highly-overlapping neighbors; or 2) their matching scores are much lower than that of the highest-scoring proposal.
From an implementation perspective, we can use a simple max-pooling layer with kernel size K and a threshold δ to realize the single-sentence constraint. We thereby obtain a new filtered $\bar{P}_k$, and calculate a similar MIL loss with $\bar{P}_k$ following Eq. (2) and Eq. (3). (Ablations on K and δ are in Sec. 5.)

Highlights. Compared to the existing constraint strategy of selecting the highest-scoring proposal as extra pseudo ground truth (Wang et al., 2021), our solution avoids selecting unstable pseudo GT (i.e., it is more robust), and it is better suited to the setting where each query may have any number of GT segments.
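The max-pooling realization of the two suppression rules can be sketched as follows (a NumPy toy: the real model would use a framework max-pooling layer on the score map; `sparse_filter` is an illustrative name, with K = 7 and δ = 0.5 taken from the best ablation setting):

```python
import numpy as np

def sparse_filter(P, K=7, delta=0.5):
    """Suppress proposals that (1) are not the maximum within a KxK
    neighborhood of the 2D score map, or (2) score below delta * max(P)."""
    N = P.shape[0]
    pad = K // 2
    padded = np.pad(P, pad, constant_values=-np.inf)
    local_max = np.zeros_like(P)
    for i in range(N):
        for j in range(N):
            # Max over the KxK window centered at (i, j) (max-pooling).
            local_max[i, j] = padded[i:i + K, j:j + K].max()
    keep = (P >= local_max) & (P >= delta * P.max())
    return np.where(keep, P, 0.0)

rng = np.random.default_rng(2)
P = rng.random((8, 8))
P_filtered = sparse_filter(P)
print(int((P_filtered > 0).sum()), "proposals survive out of", P.size)
```

Only local peaks above the global-score threshold survive, which is exactly the "sparse predictions" behavior the constraint is meant to enforce.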

Cross-Sentence Constraint
To force WSAG models to perceive multi-scale queries, we propose the cross-sentence constraint. Specifically, we assume that for highly matched proposals, high-level sentences should be more likely to be grounded than their low-level sentences. The reason is that the multimodal coreference relations between a query sentence and a GT video segment discussed in today's grounding works contain both "identical" and "hierarchical" relations. Take two extreme cases as examples: 1) If a low-level sentence is identical to a video segment proposal, then its high-level sentence is also coreferent with the proposal (a hierarchical relation). 2) If a high-level sentence is identical to the proposal, then its low-level sentence is only partially coreferent with the proposal. Thus, we propose the cross-sentence constraint by limiting the proposal matching scores between each high-level and low-level sentence pair.
Obviously, if a proposal itself is unrelated to the low-level sentence, this constraint is meaningless. Thus, we use the low-level sentence matching score as the loss weight, and the constraint loss is:

$L_{\mathrm{cross}} = \sum_{h} \sum_{k} \sum_{i,j} P^{l_h,k}_{ij} \cdot \max(0,\, P^{l_h,k}_{ij} + \alpha - P^{h}_{ij}),$

where $P^h_{ij}$ is the matching score between the h-th high-level sentence and proposal $m_{ij}$, and $P^{l_h,k}_{ij}$ is the matching score between the k-th low-level sentence of $s^h_h$ and proposal $m_{ij}$. $\alpha$ is a predefined margin, and the impact of $\alpha$ is discussed in Table 3.

Highlights. Since multimodal hierarchical relations are usually difficult to predict, the main effect of the cross-sentence constraint is to avoid the case where a low-level sentence has a high matching score with a proposal while its high-level summary does not.
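A toy sketch of this weighted margin penalty for one high-/low-level sentence pair (the exact loss form was garbled in extraction, so this hinge shape is a reconstruction consistent with the surrounding description; `cross_sentence_loss` is an illustrative name):

```python
import numpy as np

def cross_sentence_loss(P_high, P_low, alpha=0.0):
    """Penalize proposals where a low-level detail outscores its high-level
    summary by more than alpha, weighted by the low-level score itself."""
    violation = np.maximum(0.0, P_low + alpha - P_high)
    return float((P_low * violation).sum())

P_h = np.array([[0.9, 0.2], [0.1, 0.8]])   # high-level summary scores
P_l = np.array([[0.7, 0.6], [0.1, 0.3]])   # low-level detail scores
# Only the (0, 1) proposal violates the constraint (0.6 > 0.2).
print(round(cross_sentence_loss(P_h, P_l), 2))  # 0.24
```

The weighting by `P_low` implements the "if the proposal is unrelated to the low-level sentence, this constraint is meaningless" intuition: near-zero low-level scores contribute almost nothing to the loss.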

Inference
At the test stage, given a video and a relevant article, we first predict the matching scores between each sentence and all proposals, and then conduct non-maximum suppression (NMS) to filter out highly-overlapping proposals with smaller scores. We can then simply combine all predictions from the different sentences based on their matching scores.
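The per-sentence NMS step is standard greedy suppression over the predicted segments, sketched here over 1D temporal intervals (segment boundaries, scores, and the IoU threshold are illustrative):

```python
import numpy as np

def temporal_nms(segments, scores, iou_thresh=0.5):
    """Greedy NMS over temporal segments [(start, end), ...]; returns the
    indices of kept segments in descending score order."""
    order = np.argsort(scores)[::-1]
    keep = []
    for idx in order:
        s, e = segments[idx]
        suppressed = False
        for k in keep:
            ks, ke = segments[k]
            inter = max(0.0, min(e, ke) - max(s, ks))
            union = max(e, ke) - min(s, ks)
            if union > 0 and inter / union > iou_thresh:
                suppressed = True
                break
        if not suppressed:
            keep.append(int(idx))
    return keep

segs = [(0, 10), (1, 11), (20, 30)]
print(temporal_nms(segs, np.array([0.9, 0.8, 0.7])))  # [0, 2]
```

The second segment overlaps the first with IoU ≈ 0.82 and a lower score, so it is suppressed, while the disjoint third segment survives.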
To further consider the semantic relations between sentences at the test stage, we use a Structure-NMS, inspired by Soft-NMS (Bodla et al., 2017), to suppress segments that violate structural constraints. More details are left in the appendix.

Since we only have GT annotations for high-level summaries, we also propose Recall@K meets Constraint (RC@K) as a supplementary metric for low-level sentences. As we assume the temporal grounding results of low-level sentences should lie inside their high-level manual annotations, we calculate the percentage of low-level sentence predictions that meet this constraint. Note that RC@K is not strictly accurate.

Implementation Details. Given a reference video V, we used a pretrained S3D extractor (Miech et al., 2020) to extract initial clip features. The number of initial clips was set to 256. For text sentences, following prior VG works, we truncated or padded each sentence to a maximum length of 25 words. At the training stage, to save GPU memory, we randomly sample 20 sentences if an article has more than 20 sentences. All hidden feature dimensions were set to 512. In the multimodal matching, we used a three-layer convolutional network to encode context; its kernel sizes and strides were set to 3 and 1, respectively. We trained the whole network with the Adam optimizer for 100 epochs. The initial learning rate was set to 0.0001, and the batch size to 32. The loss weights of the two-level MIL loss (for models both with and without the single-sentence constraint) and the cross-sentence constraint loss were set to 1.0 and 0.1, respectively. The predefined margin ∆ for MIL training was set to 0.3. For the model with the cross-sentence constraint, to ensure the predicted low-level sentence matching scores are reliable, we first train the model with the MIL loss alone during a warm-up stage, and then add the cross-sentence constraint loss for further training.

Ablation Studies
We run a number of ablations to analyze the impact of different hyperparameters of each component, and the effectiveness of each component.
Ablation on the Single-Sentence Constraint. The impacts of the two hyperparameters in the single-sentence constraint (i.e., kernel size K and threshold δ) are reported in Table 2. From the results, we observe that: 1) for most hyperparameter settings, the single-sentence constraint consistently improves model performance; 2) the model with K = 7 and δ = 0.5 achieves the best results.

Ablation on the Cross-Sentence Constraint. The impact of different margins α in the cross-sentence constraint is reported in Table 3. From the results, we observe that the performance gains are robust to different α, and the model with α = 0 achieves the best performance. It is worth noting that a negative α (a relaxed constraint) is still effective, which supports the main effect claimed in Sec. 4.2.3.

Ablation on Structure-NMS. The results of the models with and without Structure-NMS are illustrated in Figure 5. From the results, we observe that Structure-NMS can significantly improve performance on tasks with high agreement (e.g., Change a Tire or Grill Steak). In contrast, it may hurt performance on tasks with low agreement.
Effectiveness of Each Strategy. The ablation studies on each strategy are reported in Table 4, from which we have the following observations: 1) compared to the baseline, each strategy consistently improves performance on both the R@50 and R@100 metrics; 2) the full model achieves the best R@50 and R@100 over different IoUs.

Comparisons with State-of-the-Art
Baselines. We compared our proposed DualMIL with a set of state-of-the-art baselines. Specifically, we investigated three types of baselines:

Type1: State-of-the-art WSVG models. We compared with WSTAN (Wang et al., 2021), which builds on top of a cross-entropy (XE) based MIL backbone. For completeness, we also report the results of the WSTAN backbone (dubbed MIL-XE) and a random guess (RandomGuess) baseline.
Type2: Pretrained multimodal video-text retrieval models (e.g., MIL-NCE (Miech et al., 2020)). We show the zero-shot results of two variants, obtained by max-pooling or average-pooling the clip features inside the boundaries of the video segment proposals.
Type3: Two-stage models. Since today's WSVG models assume all sentences can be grounded in the video, a straightforward two-stage solution is: first use a pretrained video-text retrieval model to select all groundable sentences, and then train a WSVG model with the selected sentences. Since we need to manually set a threshold to filter out sentences at the first stage, we report the results of three variants with different thresholds.
Results. All results are reported in Table 5, from which we have the following observations: 1) For Type1 methods, the simple baseline MIL-XE achieves good performance. However, the SOTA model WSTAN, despite its more advanced designs, only performs on par with RandomGuess, which shows that existing SOTA WSVG models fail to work in these more realistic settings. 2) For Type2 methods, the performance gaps between the different pooling operations are large. Although these large-scale pretrained models can achieve exemplary zero-shot performance, they are not robust enough and rely heavily on heuristic rules. 3) For Type3 methods, the models with different thresholds also behave differently, i.e., these two-stage methods are not robust either. 4) In contrast, our proposed DualMIL achieves satisfactory performance with relatively consistent gains.

Visualizations
We illustrate two examples in Figure 6. For the first example, we only show the grounding results of one query sentence (from the article "Build Simple Floating Shelves") with multiple ground-truth segments. For the second example, we show the grounding results of all high-level sentences of the article ("Make French Toast"). From Figure 6, we observe that both the proposed single-sentence constraint and the cross-sentence constraint help to ground some segments missing from the top-K predictions. Meanwhile, the two constraints are complementary, i.e., the full model achieves the best results.

Conclusions
In this paper, we discussed the weaknesses of the default assumptions in existing video grounding work, and proposed a more challenging task: weakly-supervised article grounding (WSAG). To facilitate research in this direction, we collected the first WSAG dataset, YouwikiHow. Further, we proposed DualMIL for WSAG, including a two-level MIL loss and single-/cross-sentence constraint losses. This work paves the way for a number of exciting future directions: 1) designing more suitable backbones for multiple-sentence inputs by considering their semantic relations; and 2) extending to more general domains beyond instructional articles.

Limitations
The main limitations of this work concern the collected dataset YouwikiHow. Specifically, we discuss them from the two following aspects:

Dataset Creation. Since we focus on WSAG, the manner of creating the training set of YouwikiHow is acceptable. However, to save manual annotation effort for the test set, we only propagate the annotations from the existing CrossTask (Zhukov et al., 2019) dataset. Although this solution is much cheaper, it introduces two types of potential errors in the "ground-truth" annotations used for evaluation: 1) When manually mapping the "steps" in CrossTask to the "sentences" in the wikiHow articles, we found the mapping is not always a perfect one-to-one correspondence. In a few cases, multiple sentences may refer to a single step, or multiple steps may refer to a single sentence. Thus, the original ground-truth annotations for a CrossTask step may not be exactly accurate for its mapped sentence with respect to the same video. 2) Since each wikiHow article has many more sentence queries than the original step queries in CrossTask, many wikiHow sentences cannot be mapped to the predefined steps, i.e., these wikiHow sentences will not have any "ground-truth" annotations. However, these sentences may still be groundable in some specific videos.
Domain Coverage. Since we obtain explicit multi-scale sentences from the inherent hierarchical structure of wikiHow articles, these articles are mainly instructional. Thus, the main domain of our YouwikiHow dataset is instructional articles/videos, i.e., a model trained on our dataset may suffer performance drops when applied to multimodal assets from other daily domains.
For the first limitation, we mitigate its impact by using more relaxed metrics: R@K and RC@K. Of course, the most accurate solution would be to verify all annotations between every video-article pair.

Ethics Statement
The proposed dataset and method aim to improve the performance of temporal grounding models in more realistic settings. Advancements in visual grounding help the deployment of visual grounding (or article grounding) models in daily applications. Since we mainly focus on the two unrealistic assumptions in existing grounding models, our work does not introduce new ethical concerns. The only potential ethical concern is that any language-query-based application runs the risk of using biased or offensive words (or descriptions), and video grounding is no exception. In the future, we can try to incorporate a preprocessing step to avoid or correct biased or offensive content.

A More Details about Structure-NMS

Given all detected segments for each groundable sentence in the article, we hope these segments also obey the same semantic relations as their query sentences (temporal or hierarchical relations). Since we assume the scale prior of each sentence is unknown at the test stage, we currently only consider the temporal relations.
More specifically, consider two query sentences $s_i$ and $s_j$. If $s_i$ appears earlier than $s_j$ in the corresponding article, the grounded segment for $s_i$ should also be earlier than that for $s_j$. Following Soft-NMS (Bodla et al., 2017), we multiply a coefficient to decrease the matching scores of proposals that violate this temporal constraint, and the coefficient is proportional to their degree of violation. Let's take a concrete example. Suppose the predicted segments for $s_i$ and $s_j$ are $[l^i_s, l^i_e]$ and $[l^j_s, l^j_e]$, with matching scores $p_i$ and $p_j$ ($p_i < p_j$). After selecting the segment $[l^j_s, l^j_e]$ into the top-K predictions, we slightly decrease the matching score $p_i$ by:
$$p_i \leftarrow p_i \cdot \left(1 - \text{const} \cdot \text{IoU}_{bad}\right), \quad \text{IoU}_{bad} = \frac{\max(l^j_s - l^i_s, 0) + \max(l^i_e - l^j_e, 0)}{\max(l^i_e, l^j_e) - \min(l^i_s, l^j_s)},$$
where const is a constant number.
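As a minimal sketch of the Structure-NMS decay described above, the following Python snippet computes the violation ratio between two segments and suppresses the score of the lower-scored proposal. The linear decay form and the value of `const` are illustrative assumptions; function names are not from the paper.

```python
def iou_bad(seg_i, seg_j):
    """Violation ratio IoU_bad between two (start, end) segments,
    following the numerator/denominator of the paper's formula."""
    si_s, si_e = seg_i
    sj_s, sj_e = seg_j
    overhang = max(sj_s - si_s, 0.0) + max(si_e - sj_e, 0.0)
    union = max(si_e, sj_e) - min(si_s, sj_s)  # length of the temporal union
    return overhang / union if union > 0 else 0.0

def decay_score(p_i, seg_i, seg_j, const=0.5):
    """Soft-NMS-style linear decay (assumed form): the more seg_i
    violates the temporal order w.r.t. the already-selected seg_j,
    the more its matching score p_i is suppressed."""
    return p_i * max(0.0, 1.0 - const * iou_bad(seg_i, seg_j))
```

In practice this decay is applied each time a higher-scored segment is selected into the top-K set, analogously to the iterative rescoring loop of Soft-NMS.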

B More Experimental Details
More Details about RC@K. Since we hope the grounding segment of a low-level sentence lies inside that of its high-level summary, we calculate RC@K in the same way as the plain recall, with only one exception: if the low-level prediction is entirely inside its high-level ground-truth annotation, the prediction is regarded as a hit regardless of the IoU.

Table 7: The agreement between the order of ground-truth query sentences and the order of their corresponding ground-truth segments: 54.39%.

C More Ablation Studies
Impact of Proposal Settings. For proposal-based VG methods, a notorious weakness is that their performance is heavily affected by the proposal settings. To this end, we explored the impact of different proposal settings in our baseline framework, and the results are reported in Table 6. From Table 6, we observe that the model achieves the best performance on most metrics when N is 16. The performance gap between the "prediction" and "GT" settings also shows that the main bottleneck for current article grounding models is detecting the groundable sentences of a video-article pair.

D Statistics about the Agreement of GT Temporal Orders
The agreement between the order of all groundable sentences and the order of their corresponding ground-truth segments on the test set is reported in Table 7.
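The paper does not spell out how this agreement is computed; one natural choice is a pairwise order agreement, i.e., the fraction of sentence pairs whose ground-truth segments (compared by their centers) preserve the article order. The sketch below is an assumption of the editor, not the paper's exact measure.

```python
def order_agreement(segments):
    """Fraction of sentence pairs (i < j in article order) whose
    ground-truth segments also appear in that order, judged by
    segment centers. Segments are (start, end) in article order."""
    centers = [(s + e) / 2 for s, e in segments]
    n = len(centers)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if not pairs:  # zero or one sentence: trivially in agreement
        return 1.0
    agree = sum(1 for i, j in pairs if centers[i] <= centers[j])
    return agree / len(pairs)
```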

Figure 1: (a) Single-sentence grounding: the query is a single sentence. (b) Multi-sentence grounding: the queries are multiple sentences. (c) Article grounding: the query is an article, which consists of multiple sentences at different scales (e.g., How to Make Pancakes). High-level and low-level sentences are denoted with corresponding formats. ✓ and ✗ denote that a sentence can or cannot be grounded to the video, respectively.

Figure 3: Illustration of the multi-scale structures of A. (Figure content: a video alongside example sentences, e.g., the high-level sentence "Find a large pitcher." with its low-level sentence "The pitcher will need to be able to hold liquid.", and the high-level sentence "Squeeze some lemons to make lemon juice." with its low-level sentence "Cut the lemons in half, and use a citrus squeezer, a hand juicer, or a wooden reamer to squeeze the juice from the lemons.")

Figure 4: The overview of the article grounding architecture with the proposed DualMIL.

Figure 5: Performance gains (%) between models w/ and w/o Structure-NMS. Task IDs are ranked by the agreement between the order of groundable sentences and GT segments (cf. Appendix).

Table 1: Comparison between YouwikiHow and other prevalent video grounding or step segmentation benchmarks.

Table 5: Performance (%) comparison with SOTA baselines. All listed methods use the same proposal settings. "MM Pretrain" denotes that these models use large-scale multimodal pretraining features. * results are averaged over five different random seeds. Models a/b/c in Type 3 denote the model with different thresholds. † denotes reimplementation results using the official code. The best and second-best results are denoted with corresponding formats.

Table 6: Ablations on different proposal settings. "GT" denotes the results with only groundable sentences.