Retrieval-augmented Video Encoding for Instructional Captioning

Instructional videos make acquiring knowledge more efficient by providing a detailed multimodal context for each procedure in an instruction. A unique challenge posed by instructional videos is key-object degeneracy, where no single modality sufficiently captures the key objects referred to in a procedure. For machine systems, such degeneracy can degrade the performance of downstream tasks such as dense video captioning, leading to incorrect captions that omit key objects. To repair this degeneracy, we propose a retrieval-based framework that augments model representations in the presence of key-object degeneracy. We validate the effectiveness and generalizability of our proposed framework over baselines on modalities exhibiting key-object degeneracy.


Introduction
Instructions, which provide detailed information about the procedures required to achieve a desired goal, are a central part of how humans acquire procedural knowledge. Instructions decompose a sequence of complex procedures into key objects and their associated actions, expressed as verbs. As machine systems increasingly aim to provide real-world utility for humans, their ability to translate human goals into natural language instructions to follow becomes essential (Ahn et al., 2022). In this light, instructional captioning, summarizing instructional videos into a set of succinct instructions, is an important component of enabling the distillation of human-level procedural knowledge to machines.
For instructional captioning, we focus on the task of dense video captioning (DVC) (Krishna et al., 2017), which aims to produce a precise set of instructions from visual input (e.g., instructional videos). For example, to illustrate the procedure s_2 in Figure 1, the instructional video details the procedure while simultaneously showing how the action is performed. A DVC system can then summarize this video into a set of salient captions, forming a set of instructions that enhances the visual demonstration with informative text descriptions. While extracting a salient instruction from complex visual input can be effortless for humans, it presents a unique challenge for machine systems, which we denote as key-object degeneracy. That is, machine systems can often fail at the fundamental task of key-object recognition, which is core to instructions. This is because key objects frequently are not easily recognized from either the images (Shi et al., 2019a; Zhou et al., 2018a) or the transcripts of the frames (Huang* et al., 2018) during a demonstrative, conversational presentation. While humans can impute such missing information by flexibly aggregating across the available modalities, key-object degeneracy can cause critical failures in existing DVC systems.
To quantify the degeneracy in instructional videos, we first conduct a study measuring the number of recognizable key objects from the images X and transcripts T in one of our target instructional video corpora, YouCook2 (Zhou et al., 2018a). We define recognizability as the percentage of key objects that are recognizable in at least one modality, and present the statistics in Table 1.
From the results in Table 1, we observe that many key objects are not recognizable from the image alone. Though recognizability improves when the image is augmented with the temporally paired transcript, this does not entirely resolve key-object degeneracy, as nearly 40% of key objects remain unrecognized. For instance, in Figure 1, the key object of procedure s_3, chicken, is not recognizable from either the image or the transcript of Frame 3.

Figure 1: Overall illustration of our framework with a real-life example. The key object "Chicken" of procedure s_3 is hard to recognize from the images and transcripts of Frames 3 and 4 of the instructional video V_S (top right), which we call degeneracy. To repair degeneracy, we supervise the machine system to retrieve procedural sentences (middle left) aligned to each video frame, utilizing key object-aware inter-frame information (yellow line); without it, the retriever fails to distinguish Frames 2, 3, and 4 and retrieves the recipe sentence aligned to Frame 2 for Frames 3 and 4 (green line). We feed the frame representation, augmented with the retrieved procedural sentence, to the downstream DVC model, whose generated caption for s_3 (bottom left) becomes more detailed and contains the key object.
Each modality degenerates for different reasons, so each has a distinct remedy for making key objects recognizable: 1) reducing occlusion of key objects in images, or 2) reducing ambiguity by mentioning key objects with nouns in text. Based on the preliminary study, we pursue the latter and propose a disambiguation method based on retrieval from instructional scripts, such as recipes for cooking.
A sufficient condition for instructional scripts to support our method is that they contain disambiguated key objects and provide adequate coverage of valid (key-object, action) pairs. For the YouCook2 dataset, we quantitatively confirm the efficacy of instructional scripts in repairing degeneracy in Table 1, which shows that instructional scripts can successfully make the unrecognized key objects recognizable. For example, in Figure 1, the unrecognizable key object in the third and fourth frames, chicken, becomes recognizable after the procedural sentence r_3 ∈ R_S (middle left of Figure 1), which explicitly mentions "chicken", is paired with the image and transcript.
While such well-aligned procedural sentences can reduce key-object degeneracy, in most cases there exists no alignment supervision between video frames and procedural sentences, as the two are generated independently. Our solution is to generate such alignment with a machine retriever. However, key-object degeneracy in the video frame also hinders existing retrieval systems, e.g., image-text retrieval, from retrieving the aligned procedural sentence.
Inspired by the contextualized understanding of previous/following frames (Qi et al., 2022), our distinction is to guide the retriever toward key-object-aware alignment with procedural sentences by conducting retrieval based on inter-frame information aggregated in an object-centric manner. For this goal, we propose Key Object aware Frame Contrastive Learning (KOFCL) for improved differentiation of nearby frames belonging to distinct procedures, and for more robust contextualization of key objects beyond a single procedure.
Our major contributions are threefold: 1) we propose a temporal description retrieval task to find the procedural sentences procedurally aligned to each frame in instructional videos, 2) we propose a key object-aware frame contrastive learning objective (KOFCL) to improve temporal description retrieval, and 3) we show that the improved temporal description retrieval repairs degeneracy and significantly improves DVC.

Preliminaries and Related Work
We first introduce our target domain, instructions, their representations, and previous research on their characteristics (§2.1). Our goal is to improve the encoding G of video frames (§2.2).
Then we provide a concise overview of our downstream task, DVC ( §2.3).

Target Domain: Instruction and Video, Script
Instruction An instruction refers to structured knowledge explaining how to perform a wide variety of real-world tasks. An instruction S can be represented as a list of N procedures, S = {s_j}_{j=1}^N, where each procedure describes the action required for the task as a tuple of a verb a_j and a key object set O_j, i.e., s_j = (a_j, O_j). For example, the instruction for cooking chicken parmesan would be a list of tuples such as (coat, [chicken, mixture]), written as text or shown in a video for human consumption, as depicted in Figure 1.
Instructional Video An instructional video, denoted V_S, is a video explaining instruction S. It consists of a list of frames, V_S = {v_i^j | i ≤ |V_S| and j ≤ N}. Procedure s_j is represented in the key clip k_j, the subset of video frames starting at b_j and ending at e_j. The i-th frame v_i^j then represents the corresponding procedure s_j when it is included in the key clip k_j, or the null procedure s_0 if it is not covered by any key clip. For example, Frame 1 in Figure 1 explains its procedure by showing and narrating its key objects in its image x_i^j and transcript t_i^j. It is widely known that degeneracy is prevalent in each modality of instructional videos (Zhou et al., 2018a). Specifically, this manifests as a large difference between the key object set O_j and the set of key objects recognizable in frame v_i^j, Ô_i^j. Previous works discovered and addressed degeneracy in the single modality of images (Shi et al., 2019b) or transcripts (Huang* et al., 2018). In contrast, our approach aims to repair the degeneracy in both modalities by leveraging procedural sentences from instructional scripts.
Instructional Script An instructional script R_S = {r_j}_{j=1}^N consists of procedural sentences, where each procedural sentence r_j explicitly represents its corresponding procedure s_j as words describing the action a_j and the key objects O_j. Because instructional scripts represent procedures in disambiguated form, previous works construct an instruction S from its corresponding instructional script R_S (Lau et al., 2009; Maeta et al., 2015; Kiddon et al., 2015). We propose to adopt R_S to disambiguate unrecognizable key objects and thereby mitigate degeneracy.
Baseline: Representation g_i^j

A baseline for overcoming degeneracy is to encode the temporally paired image and transcript (x_i^j, t_i^j) into a joint multi-modal representation g_i^j. For this purpose, we leverage pretrained LXMERT (Tan and Bansal, 2019), as it is widely adopted to encode the paired image and transcript of a video frame (Kim et al., 2021; Zhang et al., 2021). Specifically, the transcript t_i^j and image x_i^j of the video frame v_i^j are fed together to pretrained LXMERT, and we use the representation at the special [CLS] token as the frame representation:

g_i^j = LXMERT(x_i^j, t_i^j). (1)

We use the resulting representations G = {g_i^j | i ≤ |V_S| and j ≤ N} as features of individual frames to be fed to DVC systems (Zhou et al., 2018a; Wang et al., 2021). The key clip extraction module of a DVC system also outputs the likelihood P_k(k̂), estimating whether the predicted clip k̂ is a key clip, which is further used to select the key clips for caption generation.

Caption Generation
The caption generation task aims to generate a caption ĉ describing the predicted key clip k̂. The predicted key clip k̂ is fed to the captioning module, which generates each word ŵ_i by estimating the probability distribution over the vocabulary W conditioned on the key clip k̂:

ŵ_i = argmax_{w ∈ W} P(w | ŵ_{≤i−1}, k̂). (2)

We adopt EMT and PDVC, DVC systems that are respectively widely adopted and state-of-the-art, as our DVC systems. We refer readers to (Zhou et al., 2018b; Wang et al., 2021) for further details, as our focus is not on improving downstream task models but on repairing the degeneracy of input instructional videos, which is applicable to any underlying model.
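The greedy decoding in Eq. (2) can be sketched as follows. This is a minimal toy illustration, not the paper's captioner: `step_probs` is a hypothetical stand-in for the captioning module's conditional distribution P(w | ŵ_{≤i−1}, k̂), and the vocabulary and target caption are invented for the example.

```python
# Toy sketch of greedy caption decoding (Eq. 2). `step_probs` is a
# hypothetical stand-in for the captioner's conditional distribution;
# a real DVC system conditions on the predicted key clip.

VOCAB = ["<eos>", "coat", "the", "chicken", "in", "mixture"]

def step_probs(prefix):
    # Hypothetical distribution: emit a fixed caption, then <eos>.
    target = ["coat", "the", "chicken", "in", "the", "mixture"]
    probs = [0.01] * len(VOCAB)
    nxt = target[len(prefix)] if len(prefix) < len(target) else "<eos>"
    probs[VOCAB.index(nxt)] = 0.9
    return probs

def greedy_decode(max_len=10):
    words = []
    for _ in range(max_len):
        probs = step_probs(words)
        # argmax over the vocabulary, as in Eq. (2)
        w = VOCAB[max(range(len(VOCAB)), key=lambda i: probs[i])]
        if w == "<eos>":
            break
        words.append(w)
    return words

print(greedy_decode())  # ['coat', 'the', 'chicken', 'in', 'the', 'mixture']
```

Beam search is a common drop-in replacement for the argmax step, but the greedy form above matches Eq. (2) directly.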

Our Approach
Building on these preliminaries, we now describe our retrieval-augmented encoding framework in detail.
First, we explain how instructional scripts can contribute to repairing degeneracy (§3.1). Our framework then combines a cross-modal TDR module (§3.2) with key-object aggregation across frames (§3.3) to build robust multi-modal representations that repair key-object degeneracy.

Representation Augmentation with Procedural Sentence
Our hypothesis for mitigating degeneracy is that a procedural sentence r_i^j in R_S represents a procedure similar to the procedure s_j of each frame v_i^j. Explaining a similar procedure, the key object set Õ_i^j of r_i^j shares enough key objects to repair degeneracy. Our first distinction is to augment the individual frame representation g_i^j with the representation d_i^j of such a procedural sentence r_i^j. Thus, when procedural sentence r_i^j is provided with video frame v_i^j, more key objects become recognizable, and the degeneracy in video frames is reduced.
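The augmentation itself can be sketched as a simple concatenation of the frame representation with the retrieved-sentence representation, matching the [g_i^j; d_i^j] notation used in the experiments. The dimensions below are illustrative, not taken from the paper.

```python
import numpy as np

# Sketch of representation augmentation: the frame representation g_i^j is
# concatenated with the embedding d_i^j of the retrieved procedural
# sentence before being fed to the DVC model. Dimensions are illustrative.

def augment(g, d):
    """Concatenate frame and retrieved-sentence representations: [g; d]."""
    return np.concatenate([g, d], axis=-1)

g = np.random.randn(768)    # joint image-transcript frame representation
d = np.random.randn(768)    # retrieved procedural sentence representation
print(augment(g, d).shape)  # (1536,)
```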

Temporal Description Retrieval (TDR)
Cross-modal Retrieval for Aligning Sentences with Frames The preliminary study in §3.1 establishes the potential of procedural sentences to repair key-object degeneracy. However, it assumes the ideal scenario where the procedure described by the procedural sentence r_j matches that of the frame v_i^j, which we call procedural alignment. Such procedural alignment between procedural sentences and frames is not available in practice, as data in the two modalities are generated completely independently.
We therefore propose a cross-modal retrieval task, Temporal Description Retrieval (TDR), as a solution to learn such procedural alignments. We train a frame-sentence retriever ϕ(v_i^j, R_S) that takes the query frame v_i^j from video V_S and the instructional script R_S as input, and predicts a relevance score for every procedural sentence r_l ∈ R_S. The goal of ϕ is to find the procedural sentence r̂_i that best explains the procedure s_j.
Here, it is important to note that the retrieval task itself is also susceptible to key-object degeneracy, making TDR more challenging. In the presence of key-object degeneracy, single-modality (image or text) encodings can exacerbate this problem due to a potential information imbalance between the two modalities. Therefore, we formulate cross-modal TDR as retrieving text encodings using a joint image-text query, namely the LXMERT joint image-text representation g_i^j. Finally, we augment the feature vector g_i^j of the frame with the vector representation d_i^j of the retrieved procedural sentence r̂_i, as depicted in Figure 1.
Dense Retrieval for Efficiency There are several options for implementing the frame-sentence retriever ϕ(v_i^j, R_S). Existing architectures fall into two categories, cross retrievers and dense retrievers (Humeau et al., 2020), which differ in how the interaction between the query frame v_i^j and the procedural sentence r_l is modeled.
As TDR conducts retrieval for each frame in V_S, efficiency should be prioritized, so we mainly consider the dense retrieval architecture. The first architecture, cross retrieval, requires exhaustive computation of O(|V_S| × |R_S|) forward passes, as v_i^j and r_l interact within a single neural network. In contrast, dense retrieval incurs little computational cost, at O(|V_S| + |R_S|), by reusing the encodings of v_i^j and r_l. Specifically, the dense retriever consists of two distinct encoders Ω_V and Ω_R, which encode the query frame v_i^j and the procedural sentence r_l independently. The interaction between v_i^j and r_l is then modeled as a simple dot product:

ϕ(v_i^j, r_l) = Ω_V(v_i^j) · Ω_R(r_l), (3)

r̂_i = argmax_{r_l ∈ R_S} ϕ(v_i^j, r_l). (4)

For training, we adopt a contrastive learning objective (Mnih and Kavukcuoglu, 2013), denoted L_TDR, which guides the retriever to assign larger relevance to the gold procedural sentence r+ than to the negative procedural sentences r−:

L_TDR = −log [ exp(ϕ(v_i^j, r+)) / (exp(ϕ(v_i^j, r+)) + Σ_{r−} exp(ϕ(v_i^j, r−))) ]. (5)

We utilize the caption c_j as the gold procedural sentence r+, as no gold procedural sentence is available, and this approach was reported to be effective in previous work (Gur et al., 2021). We also utilize in-batch negatives, treating all other gold procedural sentences, which represent different procedures from the same instructional video, as negative procedural sentences.
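The dense retriever and its contrastive objective can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the random linear projections stand in for the encoders Ω_V and Ω_R, and the dimensions are invented for the example; it shows the dot-product relevance of Eqs. (3)-(4) and the softmax-based contrastive loss of Eq. (5).

```python
import numpy as np

# Sketch of dense TDR: frames and procedural sentences are encoded
# independently, relevance is a dot product (Eqs. 3-4), and training uses
# a contrastive loss over in-batch candidates (Eq. 5). Random projections
# here are hypothetical stand-ins for Omega_V and Omega_R.

rng = np.random.default_rng(0)
D_IN, D_OUT = 32, 16  # illustrative dimensions

def encode(x, W):
    return W @ x  # stand-in encoder: a linear projection

def relevance(q, sents):
    return sents @ q  # dot-product relevance for every candidate sentence

def tdr_loss(q, sents, pos_idx):
    """Softmax contrastive loss: gold sentence vs. in-batch negatives."""
    scores = relevance(q, sents)
    scores = scores - scores.max()  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[pos_idx])

W_v = rng.normal(size=(D_OUT, D_IN))
W_r = rng.normal(size=(D_OUT, D_IN))
frame = encode(rng.normal(size=D_IN), W_v)  # query frame encoding
script = np.stack([encode(rng.normal(size=D_IN), W_r) for _ in range(5)])

best = int(np.argmax(relevance(frame, script)))  # retrieved sentence (Eq. 4)
loss = tdr_loss(frame, script, pos_idx=best)     # contrastive loss (Eq. 5)
```

Because only dot products are needed at inference time, the sentence encodings for an entire script can be precomputed once and reused across all frames, which is where the O(|V_S| + |R_S|) cost comes from.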

Key Object-aware Frame Contrastive Learning (KOFCL)

The key aspect separating instructional videos from standard image-text or textual retrieval is the additional temporal dimension. In order to repair key-object degeneracy, it is critical to aggregate inter-frame information across this temporal dimension.
To illustrate, consider the key object of Frames 3 and 4 in Figure 1, "chicken", which is not recognizable from either the transcripts or the images of Frames 3 and 4, but is clearly recognizable in both the image x_1^1 and transcript t_1^1 of Frame 1. We adopt an LSTM as a sequence encoder, similar to existing video works (Zhou et al., 2018a), and build LXMERT-I^2, which encodes the preceding/following frames g_{≺i}^{≤j} and g_{≻i}^{≥j} and outputs the contextualized query frame encoding:

←→g_i^j = LSTM(g_{≺i}^{≤j}, g_i^j, g_{≻i}^{≥j}). (6)

However, the locality of the frame-level procedure annotations biases such a model toward simply encoding temporally local inter-frame information (Wang et al., 2020), not the key objects. Specifically, procedures are represented as temporally local frames, and such local frames of identical procedures can contribute to repairing degeneracy. However, not all local frames belong to identical procedures, e.g., at the boundaries of key clips; encoding such frames cannot repair degeneracy, and instead confuses the model into treating them as the preceding/following procedures. For Frame 3 in Figure 1, the temporally local inter-frame information of Frames 2 and 3 is redundant with the given frame, adding little new information. Even worse, mistaking Frames 2 and 3 for the identical procedure, the model misaligns Frame 3 to the procedural sentence r_2 of a different procedure. On the other hand, identifying the key object that appears in Frame 1 and binding this information into the encoding of Frame 3 would successfully repair the key-object degeneracy of Frame 3.
A recent approach, frame contrastive learning (FCL) (Dave et al., 2022), partially addresses this temporal locality bias. It regards an arbitrary frame pair (v_i^j, v_n^m) as positive when the frames represent the identical procedure, and negative otherwise:

y(v_i^j, v_n^m) = 1 if j = m, and 0 otherwise. (7)

FCL addresses the temporal locality bias because it supervises the difference in procedures between local frames, so that local frames of different procedures, such as Frame 2 for the given Frame 3 in Figure 1, are aggregated less.
The frame encoder is then supervised to map frames of identical procedures close together in the representation space, while pushing apart those of different procedures, by the FCL loss L_aux(v_i^j, v_n^m), defined as:

L_aux(v_i^j, v_n^m) = −[ y log σ(←→g_i^j ⊤ W_aux ←→g_n^m) + (1 − y) log(1 − σ(←→g_i^j ⊤ W_aux ←→g_n^m)) ], (8)

where σ is the sigmoid function and W_aux is the parameter of a bilinear layer. Finally, the retriever is optimized to simultaneously minimize L_TDR and L_aux:

L = L_TDR + λ_aux L_aux, (10)

where λ_aux is a hyper-parameter weighing the contribution of L_aux during training.

However, FCL is limited to contextualizing local frames of the identical procedure as inter-frame information. To extend such contextualization beyond a single procedure, we propose key object-aware frame contrastive learning (KOFCL), which encourages contextualizing frames of different procedures when they share common key objects, based on a globally shared notion of key objects. The clear advantage of such contextualization is that it enables retrieving the correctly aligned procedural sentence, even when key objects are hardly recognizable in the query frame, by leveraging key-object information. For example, the missing key object "chicken" of Frames 3 and 4 in Figure 1 can be found in Frame 1 of procedure s_1, and Frames 1, 3, and 4 will be encouraged to share similar representations through KOFCL.

More concretely, we label the frame pair v_i^j and v_n^m as positive when they have common key objects. To measure how many key objects a frame pair shares, we compute the intersection over union (IoU) between the key object sets of the frame pair:

IoU_obj(v_i^j, v_n^m) = |Ô_i^j ∩ Ô_n^m| / |Ô_i^j ∪ Ô_n^m|. (11)

Using IoU_obj(v_i^j, v_n^m), we label the frame pair as positive when it shares key objects over a predefined threshold μ:

y(v_i^j, v_n^m) = 1 if IoU_obj(v_i^j, v_n^m) ≥ μ, and 0 otherwise. (12)

Converting the FCL label in Eq. (7) into our proposed label in Eq. (12), KOFCL supervises the model to map a frame pair v_i^j and v_n^m close together when they not only describe the identical procedure but also share key objects. Thus, the retriever can build a more robust understanding of the key objects in the query frame v_i^j with key object-aware inter-frame information.
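The KOFCL pair labeling can be sketched directly from the IoU criterion. This is a minimal illustration: the key-object sets below are invented for the example, and the threshold μ = 0.1 follows the value reported in the implementation details.

```python
# Sketch of KOFCL pair labeling: frame pairs are positives when the IoU of
# their key-object sets exceeds a threshold mu, even when the frames belong
# to different procedures. Key-object sets here are illustrative.

def iou_obj(objs_a, objs_b):
    """Intersection over union of two key-object sets."""
    a, b = set(objs_a), set(objs_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def kofcl_label(objs_a, objs_b, mu=0.1):
    """Positive pair iff the frames share enough key objects."""
    return 1 if iou_obj(objs_a, objs_b) >= mu else 0

frame1 = {"chicken", "mixture"}  # procedure s_1
frame3 = {"chicken", "pan"}      # procedure s_3, degenerate in the video
frame2 = {"flour", "bowl"}

print(kofcl_label(frame1, frame3))  # 1: shared key object "chicken"
print(kofcl_label(frame1, frame2))  # 0: no shared key objects
```

Under plain FCL, the (frame1, frame3) pair would be a negative because the frames belong to different procedures; the IoU-based label is what lets the shared "chicken" information flow across procedures.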
Experimental Setup

Dataset
We use two distinct instructional video datasets: YouCook2 (Zhou et al., 2018a), a dataset of instructional cooking videos, and IVD (Alayrac et al., 2017), a dataset of instructional videos with 5 distinct goals, such as CPR and "jump the car". As each video provides its goal, we collected instructional scripts by querying the goal against a web recipe archive for YouCook2, following previous work (Kiddon et al., 2015), and against the Google search engine for the IVD dataset. Our instructional script collection contains an average of 15.33 scripts with 10.15 sentences per goal in YouCook2, and 1 instructional script with an average of 7.4 sentences per goal in IVD. We use transcripts generated by the YouTube ASR engine, following previous works (Xu et al., 2020; Shi et al., 2019a, 2020).

Evaluation Settings

TDR We evaluate TDR in two distinct settings, utilizing both gold captions and our collected instructional scripts. First, we report the recall metric (R@K) of the gold captions, where all captions in the same video are considered candidates for retrieval. Second, we evaluate TDR performance on our collected instructional scripts using the NDCG_ROUGE-L metric (Messina et al., 2021a,b), which replaces the relevance annotation between the query frame and procedural sentences with a lexical similarity score, ROUGE-L, between gold captions and procedural sentences. We report each metric on the top-1/3/5 retrieval results. For the recall metrics, we mainly consider the top-1 retrieval result, as our priority is to address key-object degeneracy; retrieving sentences of different procedures containing the same key objects may result in slightly lower R@3 and R@5.
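The NDCG-style evaluation can be sketched as follows. This is a minimal illustration, assuming graded relevance values that would in practice be ROUGE-L similarities between each retrieved procedural sentence and the gold caption; the similarity scores below are invented for the example.

```python
import math

# Sketch of the NDCG-style metric used for TDR on collected scripts: the
# graded relevance of each retrieved sentence is its lexical similarity
# (e.g., ROUGE-L) to the gold caption. Similarity scores are illustrative.

def dcg(rels):
    """Discounted cumulative gain of a ranked list of relevances."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

def ndcg_at_k(relevances, k):
    """NDCG@k given graded relevances of the retrieved list, in rank order."""
    ideal = sorted(relevances, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(relevances[:k]) / denom if denom > 0 else 0.0

# Hypothetical ROUGE-L similarity of each retrieved sentence to the gold
# caption, in retrieval order:
retrieved_rels = [0.6, 0.9, 0.2, 0.4, 0.1]
print(round(ndcg_at_k(retrieved_rels, 3), 3))
```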
DVC For the caption generation of DVC, following convention (Krishna et al., 2017; Zhou et al., 2018b), we report the lexical similarity of generated captions with gold captions using BLEU@4 (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), CIDEr (Vedantam et al., 2015), and ROUGE-L (Lin, 2004), abbreviated as B-4, M, C, and R. For key clip extraction, we report the average recall of the predicted key clips, denoted AR, following convention (Escorcia et al., 2016; Zhou et al., 2018b). For every metric, we provide the average and standard deviation over 5 repeated runs.

Results
We now present our experimental results, addressing each of the following research questions. RQ1: Is our cross-modal retrieval using a joint image-text query more effective than standard retrieval approaches for TDR? RQ2: Does KOFCL address key-object degeneracy in TDR, and help the retriever build a robust understanding of key objects? RQ3: Does retrieval augmentation using procedural sentences improve DVC by repairing key-object degeneracy?

RQ1: Effectiveness of joint image-text query formulation for TDR
Table 2: Temporal description retrieval results (R@1/3/5) by query encoder input.

To verify the effectiveness of our joint image-transcript query formulation for TDR, we compare our approach with baselines consisting of existing textual and image-text retrieval systems:
• BM25 (Robertson, 2009) and BERT (Devlin et al., 2019) are widely used approaches in text retrieval. We adopt them as baselines using the transcript as a query.
• TERAN (Messina et al., 2021a) and NAAF (Zhang et al., 2022) are state-of-the-art image-text retrievers. We adopt them as baselines using the image x_i^j as a query.
Table 2 shows the TDR results of the baselines and our joint image-text query formulation, LXMERT, on the YouCook2 dataset. We observe that baselines using single-modality queries, i.e., BM25 or TERAN, are insufficient for finding the aligned procedural sentence, with R@1 scores lower than 40%. LXMERT achieves higher TDR results by large margins over the baselines on every metric, confirming the effectiveness of our proposed joint image-transcript query. For comparison, we also include the TDR result of our full model, which further improves significantly over LXMERT.
Additionally, we compare against a straightforward method of repairing degeneracy: disambiguating pronouns in transcripts. Following previous work (Huang* et al., 2018), we use a co-reference module (Gardner et al., 2017) to convert transcripts into their disambiguated versions, τ_i^j. Interestingly, we observe a degradation of TDR on every metric. We hypothesize that co-reference resolution introduces noise from several sources, including the module's own inaccuracy, as well as incorrect pronoun resolution using key objects belonging to other, adjacent procedures.
RQ2: KOFCL addresses key-object degeneracy in TDR

Next, we evaluate the effectiveness of inter-frame information, in conjunction with KOFCL, in improving the performance of TDR. In Table 3, we report the respective results of TDR on the YouCook2 and IVD datasets with varying inter-frame information supervision approaches. First, on both datasets, we observe a large improvement of LXMERT-I^2 over LXMERT, reflecting the importance of inter-frame information for TDR. Next, we focus on the effect of jointly supervising LXMERT-I^2 with FCL or KOFCL. When LXMERT-I^2 is supervised by FCL, the increase in R@1 is negligible. In contrast, when it is supervised with our proposed KOFCL, we observe a meaningful improvement in R@1 on both datasets. These results indicate that KOFCL improves TDR by capturing key object-aware inter-frame information in a generalizable manner.
To further verify that KOFCL contextualizes key objects and repairs key-object degeneracy, we collect an isolated subset of YouCook2 where nearby frames are prone to confuse frame-sentence retrievers that exhibit a temporal locality bias. Specifically, we collect the query frames v_i^j whose corresponding procedure s_j has key objects distinct from those of the neighboring procedures s_{j−1} and s_{j+1}.
We report the R@1 score on this isolated set in Table 4. Whereas FCL fails to improve over LXMERT-I^2, R@1 improves meaningfully when the frame-sentence retriever is supervised with our proposed KOFCL. These results indicate that KOFCL contributes to the contextualization of key objects and alleviates the temporal locality bias.

RQ3: Retrieved procedural sentences repair degeneracy and improve DVC
Next, we evaluate the impact of repairing degeneracy on the downstream task of dense video captioning, which is the main objective of this work. We compare our proposed approach, which uses a trained retriever to retrieve procedural sentences from instructional scripts to augment frame representations, with a baseline without any consideration of key-object degeneracy, as well as an advanced baseline, which augments frame representations using the disambiguated version of the transcript, τ_i^j, instead of procedural sentences.

We first report the DVC performance on YouCook2 in Table 5. The advanced baseline, which augments the baseline representation g_i^j with d_i^j using τ_i^j, improves performance on both captioning and key clip extraction, showing that DVC can be improved by augmenting frame representations with disambiguated key-object information. Notably, our proposed framework, which augments using procedural sentences retrieved by the LXMERT-I^2 + KOFCL retriever, significantly outperforms both baselines on all measured metrics for both tasks. These results indicate that, by repairing key-object degeneracy, our retrieved procedural sentences are a better source for augmenting frame representations for DVC. Moreover, our augmented representations improve results on both the EMT and PDVC downstream models, which confirms that our method can be easily applied to improve standard DVC systems without dramatic modification of the downstream task models.

Next, we conduct an ablation study on the contribution of each of our framework components. In Tables 6 and 7, we report the results of DVC on YouCook2 and IVD, respectively, using the EMT model with various frame-sentence retrievers. The results confirm that the improvement in retrieval outcomes translates to better downstream performance on DVC, with LXMERT-I^2 and KOFCL meaningfully improving DVC performance on both datasets. Our proposed retrieval augmentation method also shows more improvement on the IVD dataset than on YouCook2. The key difference between the two datasets is that IVD is composed of more distinctive instructions, such as "jump the car", "re-pot the plant", and "make coffee", whereas YouCook2 contains only cooking instructions. For such distinctive instructions, knowing the key objects can clarify the instruction itself and thus help generate more accurate captions.

Finally, to verify that the improvement in DVC performance is attributable to the repair of key-object degeneracy, we divide the test set into definite and degenerative sets and compare the results of the baseline representation g_i^j and our augmented representation [g_i^j; d_i^j] w/ LXMERT-I^2 + KOFCL. Specifically, a caption c_j is considered degenerative when the video frames corresponding to the ground-truth key clip k_j have lower than 60% recognizability from image and transcript, and definite when the recognizability is higher than 80%. In Table 8, in contrast to the baseline representation g_i^j, whose CIDEr score decreases on the degenerative set, our augmented representation [g_i^j; d_i^j] w/ LXMERT-I^2 + KOFCL increases the score on the degenerative set, showing that augmenting representations with retrieved procedural sentences is effective in resolving the key-object degeneracy in instructional videos.

Conclusion
We proposed retrieval-augmented encoding to complement video frames by repairing degeneracy and considering correlations between steps. Our evaluation results validate that our proposed framework significantly improves existing DVC systems.

Limitations
Our method overcomes degeneracy in instructional videos under the assumption that textual instructional scripts describing the exact instructions of the videos exist. Thus, our method is applicable to instructional videos for which such recipe-like documents exist. However, we note that similar documents exist for various types of instructions other than cooking, such as the topics in other datasets (Alayrac et al., 2017), e.g., how to jump start a car or change a tire.

Implementation Details

For temporal description retrieval, we followed the convention of (Krishna et al., 2017; Zhou et al., 2018b; Shi et al., 2019a) and obtained image frames from the video by down-sampling every 4.5s. The obtained image frames are then fed to a pretrained object detector (Anderson et al., 2018) to yield the sequence of object region features. For the image encoder Ω_V and the text encoder Ω_R, we used the image encoder of pretrained LXMERT and BERT-base-uncased (Devlin et al., 2019), respectively. For training temporal description retrieval, we used one video as a batch, so all the sampled frames and recipe sentences in a batch come from the same video. We adopt the Adam optimizer with a learning rate of 0.0001. We set the weighing contribution λ_aux in Eq. (10) to 0.05 and the threshold μ for KOFCL to 0.1, based on validation set results.

Computation of Recognizability
To compute the joint recognizability over the image, transcript, and instructional script, we first compute the recognizability in each modality. In the image, we consider a key object recognizable when it is labeled as being inside the image without occlusion in human annotation (Shen et al., 2017). In the textual modalities, transcript and instructional script, key objects are considered recognizable when they are lexically referred to in the transcripts or instructional scripts. A key object is then considered jointly recognizable when it is in the union of the recognizable key-object sets of the individual modalities.
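The union-based computation can be sketched as follows. This is a minimal illustration: the key objects and per-modality recognizable sets are invented for the example, not dataset statistics.

```python
# Sketch of the recognizability computation: a key object counts as
# recognizable when it is recognizable in at least one modality, i.e., it
# lies in the union of the per-modality recognizable sets. The example
# sets below are illustrative.

def recognizability(key_objects, recognizable_per_modality):
    """Fraction of key objects recognizable in at least one modality."""
    union = set().union(*recognizable_per_modality)
    return len(set(key_objects) & union) / len(key_objects)

key_objects = ["chicken", "mixture", "pan"]
from_image = {"pan"}                  # visible without occlusion
from_transcript = {"mixture"}         # lexically mentioned
from_script = {"chicken", "mixture"}  # mentioned in the instructional script

print(recognizability(key_objects, [from_image, from_transcript]))  # 2/3
print(recognizability(key_objects,
                      [from_image, from_transcript, from_script]))  # 1.0
```

Adding the instructional script to the union is exactly what closes the gap in the example, mirroring how scripts repair degeneracy in Table 1.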

Ablation on Sequence Encoder
Here, we show the results of TDR with distinct sequence encoders; the LSTM achieves the highest R@1 score. While we adopted the LSTM as our sequence encoder, our KOFCL is orthogonal to the choice of sequence encoder and can be adapted to any existing one.

Dataset
We conducted experiments on two distinct instructional video datasets: YouCook2 (Zhou et al., 2018a), a dataset of instructional cooking videos, and IVD (Alayrac et al., 2017), a dataset of instructional videos with 5 distinct topics. Though YouCook2 originally provides 2,000 videos, some are no longer available on YouTube; we collected the currently available videos, obtaining 1,356. For the dataset split, we follow the original split ratio from (Zhou et al., 2018a) for YouCook2: 910 videos for training, 312 for validation, and 135 for testing. For the IVD dataset, we used 104 videos for training, 17 for validation, and 32 for testing.
This split is used for both TDR and DVC. Each video is labeled with the starting and ending times of key clips and their textual descriptions. For transcripts, we use YouTube's ASR engine. We collected the instructional documents from a web recipe archive for YouCook2, following previous work (Kiddon et al., 2015), and from the top-1 result of the Google search engine for the IVD dataset.
Our instructional document collection contains an average of 15.33 documents with 10.15 sentences per goal for the YouCook2 dataset, and 1 instructional document with 20 sentences for the IVD dataset.

Qualitative Results
Here, we provide the results generated by EMT without/with our retrieved recipes in Figure 2. In all examples, there exist key objects hardly recognizable from the images, which EMT fails to mention in the generated caption. However, our retrieved recipes provide disambiguated references to such key objects and enable EMT to generate more accurate captions containing them.
Task: DVC Given an instructional video V_S describing instruction S, DVC consists of two subtasks: key clip extraction and caption generation. Key Clip Extraction Given a sequence of video frames, the key clip extraction module predicts a key clip k̂ = (b̂, ê) by regressing its starting/ending times b̂ and ê (Zhou et al., 2018a; Wang et al., 2021).

Figure 2: Example of the retrieved procedural sentence and captions generated without/with the retrieved procedural sentence. The top two figures are from the YouCook2 dataset and the bottom figure is from the IVD dataset.

Table 3: Temporal description retrieval results, ablating inter-frame information.

Table 8: CIDEr scores on the definite/degenerative sets.