O2NA: An Object-Oriented Non-Autoregressive Approach for Controllable Video Captioning

Video captioning combines video understanding and language generation. Different from image captioning, which describes a static image with details of almost every object, video captioning usually considers a sequence of frames and biases towards focused objects, e.g., the objects that stay in focus regardless of the changing background. Therefore, detecting and properly accommodating focused objects is critical in video captioning. To enforce the description of focused objects and achieve controllable video captioning, we propose an Object-Oriented Non-Autoregressive approach (O2NA), which performs caption generation in three steps: 1) identify the focused objects and predict their locations in the target caption; 2) generate the related attribute words and relation words of these focused objects to form a draft caption; and 3) combine video information to refine the draft caption into a fluent final caption. Since the focused objects are generated and located ahead of other words, it is difficult to apply the word-by-word autoregressive generation process; instead, we adopt a non-autoregressive approach. Experiments on two benchmark datasets, i.e., MSR-VTT and MSVD, demonstrate the effectiveness of O2NA, which achieves results competitive with the state of the art but with both higher diversity and higher inference speed.


Introduction
The task of video captioning, which aims to generate a descriptive sentence based on an input video, has a wide range of applications. In recent years, deep neural models, particularly those based on the encoder-decoder framework (Venugopalan et al., 2015; Pan et al., 2016b; Xu et al., 2017; Aafaq et al., 2019), have achieved great success

Conventional Video Captioning Model: "a man is watching people ride down a road."

Object-Oriented Non-Autoregressive Model — Objects: motorcycles, people, street, bikes, road. 1. "two motorcycles speed down a street." 2. "two people are speeding down a road on motorcycles." 3. "people on motorcycles racing down the street." 4. "some people are speeding on bikes." 5. "two people are racing bikes on the road."

Figure 1: Examples of the captions generated by a state-of-the-art conventional video captioning model (Zheng et al., 2020) and our model. Compared to the conventional model, whose generation process is hardly controllable, our model can be guided to mention the desired objects (i.e., the colored objects) and generate diverse, object-oriented captions for a video.
in advancing the state of the art (Pan et al., 2020; Zheng et al., 2020; Perez-Martin et al., 2021; Yang et al., 2021). These models usually entail the autoregressive property, i.e., conditioning each word on the previously generated words.
In video captioning, one critical step is to detect and include focused objects. As exemplified in Figure 1, when a dangerous situation occurs, a captioning-based blind-aid system should focus on the dangerous objects on the road to alert visually-impaired people, rather than over-describe the presence of nearby pedestrians or shops. In the above example, this means that speeding vehicles should be considered focused objects and should be mentioned in the generated caption. While people can easily identify focused objects in videos (Shinn-Cunningham, 2008; Corbetta and Shulman, 2002; Posner and Petersen, 1990), existing captioning systems can hardly be controlled to generate focused objects because of their word-by-word generation process. Motivated by these observations, we introduce the problem of controllable video captioning in the sense of controlling contents.
As shown in Figure 2, to solve the controllable video captioning problem, we propose the Object-Oriented Non-Autoregressive approach (O2NA). Different from conventional models that adopt a left-to-right, word-by-word decoding process, O2NA controls caption generation in a non-autoregressive manner. O2NA first detects all objects that appear in the video and then selects the focused objects for the final caption. For example, in the aforementioned blind-aid system, the system would select the dangerous objects, i.e., speeding vehicles, in case of an emergency. Next, the caption generation process consists of three main steps: 1) place all focused objects at their proper positions in the target caption; 2) generate the related attribute words and relation words to form a draft caption; and 3) adopt the iterative refinement approach (Ghazvininejad et al., 2019; Lee et al., 2018) to proofread and improve the draft caption.
In each step, as there is no dependency among the generated words, the words can be generated in parallel, yielding a fixed computing time regardless of caption length, whereas the computing time of the conventional autoregressive approach grows linearly with caption length. For long captions, conventional methods thus incur high inference latency, which limits their adoption in real-time applications, e.g., blind-aid systems (Voykinska et al., 2016) and human-robot interaction (Das et al., 2017). According to our experiments and analyses on two benchmark datasets, i.e., MSR-VTT (Xu et al., 2016) and MSVD (a.k.a. YouTube2Text) (Guadarrama et al., 2013), our O2NA is able to produce descriptive and fluent captions and outperforms several existing methods in terms of both accuracy and efficiency.
Overall, the main contributions of this paper are: • We introduce the problem of controllable video captioning in the sense of controlled contents, which has more practical value than the existing studies on syntactic variations.
• Specifically, we propose the Object-Oriented Non-Autoregressive approach (O2NA) to tackle the controllable video captioning problem by injecting strong control signals conditioned on selected objects, with the benefits of fast and fixed inference time, which are critical for real-time applications.
• We evaluate our approach on two datasets. In particular, our O2NA achieves competitive results with the state-of-the-art methods with higher diversity and higher inference speed.
The rest of this paper is organized as follows: Section 2 reviews the related work; Section 3 introduces the proposed Object-Oriented Non-Autoregressive approach (O2NA) in detail; Section 4 and Section 5 present the experimental results and analyses, respectively; and finally, Section 6 concludes the paper.

Related Work
In this section, we describe the related work from 1) Video Captioning, 2) Controllable Image Captioning and 3) Non-Autoregressive Decoding.

Video Captioning
Recently, a large number of encoder-decoder based neural models have been proposed for video captioning (Venugopalan et al., 2015; Yao et al., 2015; Pan et al., 2016b,a; Xu et al., 2017; Aafaq et al., 2019, 2020; Zheng et al., 2020; Yang et al., 2021; Perez-Martin et al., 2021). These methods mainly introduce a convolutional neural network (CNN) (Krizhevsky et al., 2012) to encode the video and employ an LSTM (Hochreiter and Schmidhuber, 1997) or a Transformer (Zhou et al., 2018) to generate coherent captions with the attention mechanism (Bahdanau et al., 2015; Pan et al., 2016b). However, these methods lack controllability, i.e., their behaviors can hardly be influenced. Our model provides an easy way to control the contents of video captions, rather than merely the syntactic variations explored in existing studies.

Controllable Image Captioning
Different from image captioning (Xu et al., 2015; Vinyals et al., 2015; Lu et al., 2017; Anderson et al., 2018), which processes a static image with details of almost every object that appears, video captioning considers a sequence of frames and biases towards focused objects. It is worth noting that controllable image captioning has been explored recently (Cornia et al., 2019; Chen et al., 2020; Zheng et al., 2019). However, all of these works are based on autoregressive decoding, i.e., conditioning each word on the previously generated outputs. Therefore, to control the generation of image captions, a major challenge is deciding when to attend to the region-of-interest (i.e., the object we care about). Zheng et al. (2019) first fix the object of interest and then generate the rest of the caption to its left and right, which only applies to the case of a single object of interest. To scale to multiple objects of interest, Cornia et al. (2019) implement a region pointer mechanism that predicts, at each timestep, whether the pointer should be incremented; Chen et al. (2020) introduce an abstract scene graph to control caption generation, with graph-based attention and graph-updating mechanisms that adaptively select the relevant nodes (which contain the concerned objects) when generating the next word.
In this work, we focus on controllable video captioning, which is a more challenging problem than controllable image captioning. For controllable video captioning, it is hard to construct the same regions-of-interest (RoIs) as in Cornia et al. (2019) or scene graphs as in Chen et al. (2020). To this end, based on the non-autoregressive decoding methods in neural machine translation (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Wang et al., 2019b; Shao et al., 2019), we propose the Object-Oriented Non-Autoregressive model, which needs neither the RoIs of Cornia et al. (2019) nor the scene graphs of Chen et al. (2020) to generate controllable video captions. Moreover, our approach can generate all the objects we care about in parallel, leading to fast generation.
It is worth noting that Wang et al. (2019a), among others, also introduced controllable video captioning. However, they employ Part-of-Speech (POS) information to guide caption generation, which mainly focuses on improving diversity and adjusting the syntactic structure of the captions, instead of constraining the model to generate captions containing the focused objects.

Non-Autoregressive Decoding
Most recently, non-autoregressive decoding has received growing attention in the neural machine translation (NMT) community (Gu et al., 2018; Ghazvininejad et al., 2019; Lee et al., 2018; Guo et al., 2019; Shao et al., 2019; Kasai et al., 2020; Ren et al., 2020; Haviv et al., 2021; Hao et al., 2021). Such models remove the sequential dependency and can generate all words of a sequence in one step, resulting in high inference efficiency. Inspired by the success of non-autoregressive decoding, we propose the Object-Oriented Non-Autoregressive model. Regarding the network structure, current non-autoregressive models usually employ a completely empty sequence as the decoder input to generate the whole sentence in the early stages, which carries a high risk of producing translation errors. Different from these works, we exploit the objects in the video and propose to first generate an object-oriented coarse-grained caption, and then refine each object word with rich contextual information to generate the whole caption, alleviating the description ambiguity problem.

Approach
We first briefly introduce the background of our approach and then describe the approach in detail.

Background

We introduce the background in terms of the video representations we use and the basic module.
Video Representations For video captioning, image and motion features have been widely used. Image features are good at illustrating the shapes, colors and relationships of the items in a frame, while motion features are important for capturing actions and temporal interactions. Following Pei et al. (2019), given a video, N = 8 key frames are uniformly sampled to extract image features I. Considering both the past and the future contexts, we take each key frame as the center to generate the corresponding motion features M. Specifically, for the image features, we adopt ResNet-101 (He et al., 2016) pre-trained on ImageNet (Deng et al., 2009) to extract 2048-D image features I ∈ R^{N×d_i} (d_i = 2048), which are the output of the last convolutional layer. The motion features are usually given by a 3D CNN (Tran et al., 2015); we adopt ResNeXt-101 (Hara et al., 2018) pre-trained on the Kinetics dataset (Kay et al., 2017) to extract 2048-D motion features M ∈ R^{N×d_m} (d_m = 2048). In this paper, both features are projected to d_h = 512. Then, we use the concatenation of the two projected features as the video representations V ∈ R^{2N×d_h} for our model.
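As a concrete illustration of the representation pipeline above, the following sketch projects toy image and motion features and concatenates them. The projection-matrix names (`W_i`, `W_m`) are our own, random arrays stand in for the real CNN features, and only the dimensions (N = 8, d_i = d_m = 2048, d_h = 512) come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_i, d_m, d_h = 8, 2048, 2048, 512

I = rng.standard_normal((N, d_i))   # stand-in for ResNet-101 image features
M = rng.standard_normal((N, d_m))   # stand-in for ResNeXt-101 motion features

W_i = rng.standard_normal((d_i, d_h)) * 0.01  # learned projection (image)
W_m = rng.standard_normal((d_m, d_h)) * 0.01  # learned projection (motion)

# Concatenate the two projected feature sequences along the frame axis:
# V in R^{2N x d_h}
V = np.concatenate([I @ W_i, M @ W_m], axis=0)
assert V.shape == (2 * N, d_h)
```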

Basic Module
Our approach is adapted from the non-autoregressive decoding models (Lee et al., 2018; Ghazvininejad et al., 2019), which are based on the Transformer decoder (TFM) (Vaswani et al., 2017). Specifically, the TFM consists of a self-attention, a source-attention and a feed-forward network (FF). The multi-head attention (MHA) is the basis of both the self-attention and the source-attention. Overall, given an input sequence X and the video representations V, the TFM is defined as follows:

TFM(X, V) = FF(MHA(MHA(X, X, X), V, V)),  (1)

where the inner MHA is the self-attention, the outer MHA is the source-attention, and residual connections and layer normalization are omitted for brevity. Please refer to Vaswani et al. (2017) for a detailed introduction of the Transformer decoder (TFM).

Figure 2: Illustration of our proposed O2NA, which consists of an object predictor (OP), a length predictor (LP), an object generator (OG) and a caption generator (CG). The object predictor and length predictor extract the objects appearing in the input video and estimate the length of the target caption, respectively; the object generator places all the focused objects we care about in the target caption; the caption generator generates the remaining words to link the focused objects into a fluent caption. It is worth noting that the focused objects could be the objects predicted by the object predictor, the preferred objects given by the user, or pre-defined concerned objects, e.g., the dangerous objects in a captioning-based blind-aid system.
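The data flow of one TFM layer can be sketched numerically as follows. This is a deliberately simplified, single-head version without residual connections or layer normalization, so it illustrates the self-attention → source-attention → feed-forward structure rather than the exact module:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention with a numerically stable softmax.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def tfm_layer(X, V_src, W_ff1, W_ff2):
    H = attention(X, X, X)          # self-attention (no causal mask: non-autoregressive)
    H = attention(H, V_src, V_src)  # source-attention over the video features
    return np.maximum(H @ W_ff1, 0.0) @ W_ff2  # feed-forward network (FF)

rng = np.random.default_rng(0)
l, d_h = 10, 512                     # caption length, model size
X = rng.standard_normal((l, d_h))    # decoder input states
V_src = rng.standard_normal((16, d_h))  # video representations (2N = 16)
W1 = rng.standard_normal((d_h, 2048)) * 0.01
W2 = rng.standard_normal((2048, d_h)) * 0.01
out = tfm_layer(X, V_src, W1, W2)
assert out.shape == (l, d_h)
```

Note that, unlike an autoregressive decoder, the self-attention here uses no causal mask, so every position sees the whole sequence.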

Object-Oriented Non-Autoregressive Approach (O2NA)
As stated above, we adopt the Transformer decoder (Vaswani et al., 2017) to implement our Object-Oriented Non-Autoregressive approach (O2NA). Specifically, as shown in Figure 2, O2NA consists of an object predictor, a length predictor and two Transformer decoders, where the first decoder focuses on generating all the objects we care about in parallel (i.e., object generator), and the second decoder pays attention to linking these objects to form a fluent caption (i.e., caption generator).
Object Predictor (OP) The OP is expected to predict the objects that appear in the given video. We first build an object vocabulary based on the training captions. Given this object vocabulary, we can associate each video with a set of objects according to its human-annotated captions. Specifically, we denote the ground truth objects as O* = (o*_1, o*_2, ..., o*_M) ∈ {0, 1}^M, where M represents the size of the object vocabulary; o*_i = 1 if the video is annotated with object i, and o*_i = 0 otherwise. During the training phase, we directly use the ground truth objects O*. At the inference stage, we adopt a two-layer non-linear network to predict the objects O ∈ R^M, defined as:

O = σ(ReLU(MP(V) W_1) W_2),  (2)

where MP denotes mean pooling, σ is the sigmoid function, and W_1 ∈ R^{d_h×d_h} and W_2 ∈ R^{d_h×M} are the parameters to be learned. Next, following Wu et al. (2016), we minimize the element-wise logistic loss L_OP to train the OP:

L_OP = -Σ_{i=1}^{M} [o*_i log(o_i) + (1 - o*_i) log(1 - o_i)].  (3)

During the inference procedure, to select the final predicted objects, we set a threshold γ: if o_i > γ, we reset o_i = 1, and we reset o_i = 0 otherwise. In particular, if we care about some specific objects, for example, the user-preferred objects or the pre-defined dangerous objects in the captioning-based blind-aid system, we can simply set the values of these concerned objects to 1 and the values of the other objects to 0.
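A minimal sketch of the OP's inference path (mean pooling, a two-layer non-linear network with a sigmoid, then thresholding at γ = 0.8). The weight names and sizes, and the toy object-vocabulary size M = 100, are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict_objects(V, W1, W2, gamma=0.8):
    v = V.mean(axis=0)                         # MP: mean pooling over frame features
    o = sigmoid(np.maximum(v @ W1, 0.0) @ W2)  # two-layer non-linear predictor
    return (o > gamma).astype(float)           # threshold: reset each o_i to 1 or 0

rng = np.random.default_rng(0)
V = rng.standard_normal((16, 512))             # video representations (2N = 16, d_h = 512)
W1 = rng.standard_normal((512, 512)) * 0.1
W2 = rng.standard_normal((512, 100)) * 0.1     # M = 100 (toy object-vocabulary size)

O = predict_objects(V, W1, W2)
# To force specific focused objects (e.g., pre-defined dangerous objects),
# simply overwrite the corresponding entries of O with 1 and all others with 0.
assert O.shape == (100,)
```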
Length Predictor (LP) In the generation process, the non-autoregressive decoding model needs to know the length of the target caption (Ghazvininejad et al., 2019). To this end, at training time, we use the sequence length l* of the ground truth caption. At the inference stage, given the video information V ∈ R^{2N×d_h} and the focused objects O ∈ R^M, we adopt an LP to predict the length l. In detail, we apply a two-layer network:

p_l = softmax(ReLU([MP(V); O] W_3) W_4),  (4)

where [·; ·] represents the concatenation operation, W_3 ∈ R^{(d_h+M)×d_h} and W_4 ∈ R^{d_h×l_max} are learned parameters, and l_max denotes the pre-defined maximum sequence length. Thus, p_l ∈ R^{l_max} is a probability distribution over lengths. We adopt the cross-entropy loss L_LP to train the LP:

L_LP = -log p_l(l*).  (5)

Object Generator (OG) The object generator is based on the non-autoregressive decoder and is dedicated to generating all the objects we care about at once. To achieve this, we adopt a single-layer Transformer decoder 1 , followed by a linear layer and a softmax function. In implementation, the object generator takes the fully masked sequence X_0 = (x_1, ..., x_l) as input, where x_i = e_[MASK] + e_i, and e_[MASK] and e_i denote the word embedding of the [MASK] token and the position embedding, respectively. Then the object information O is added to X_0, i.e., X_0 ⊕ O W_O. At last, the Transformer decoder in the object generator takes X_0 ⊕ O W_O as input (⊕ denotes matrix-vector addition) and generates all objects at their positions in the final caption, i.e., an object-oriented coarse-grained caption:

p_0 = softmax(TFM(X_0 ⊕ O W_O, V) W_OG),  (6)

where X_0 ∈ R^{l×d_h}, V ∈ R^{2N×d_h} and O ∈ R^M represent the input sequence, the video representations and the predicted objects, respectively; W_O ∈ R^{M×d_h} and W_OG ∈ R^{d_h×|D|} are the matrices for linear transformation; |D| is the size of vocabulary D. Each row of p_0 ∈ R^{l×|D|} is a probability distribution indicating how likely each word in D is the current output word. At training time, for each human-annotated caption, we mask all the non-object words based on the object vocabulary to acquire the ground truth object sequence Y * obj = (. . . , [MASK], . . .
, object_i, . . .). Our goal is to minimize the following standard cross-entropy loss:

L_OG = -Σ_{t=1}^{l} log p_0(y*_obj,t).  (7)

Caption Generator (CG) In implementation, the caption generator shares the same structure as the object generator. The main differences between the two generators are the generation objective and the input sequence. Specifically, the caption generator takes the object sequence X_1 as input, where X_1 equals Y*_obj at the training stage and Y_obj at the inference stage, and generates the related attribute words and relation words to form a draft caption Y_1:

p_1 = softmax(TFM(X_1, V) W_CG),  (8)

where p_1 ∈ R^{l×|D|} and W_CG ∈ R^{d_h×|D|}. Given the ground truth caption Y*_cap = (y*_cap,1, y*_cap,2, ..., y*_cap,l), we adopt the standard cross-entropy loss to train the CG:

L_CG = -Σ_{t=1}^{l} log p_1(y*_cap,t).  (9)

Since the non-autoregressive approach removes the sequential dependency, we may introduce the "multi-modality problem" (Gu et al., 2018), i.e., a word could appear at multiple positions, forming different captions. We therefore further adopt the iterative refinement approach (Lee et al., 2018) to proofread Y_1. In implementation, to acquire the input sequence X_2, we randomly mask n = ⌊l · r⌋ words in Y*_cap at training time, and mask out the top n words with the lowest confidence in Y_1 at inference time, where l and r represent the caption length and the masking ratio, respectively, and the confidence is taken to be the output probability. The final caption is obtained by:

p_2 = softmax(TFM(X_2, V) W_CG).  (10)

Finally, the corresponding cross-entropy loss is defined similarly to Eq. (9):

L'_CG = -Σ_{t=1}^{l} log p_2(y*_cap,t).  (11)

Overall, by combining L_OP in Eq. (3), L_LP in Eq. (5), L_OG in Eq. (7), L_CG in Eq. (9) and L'_CG in Eq. (11), the full training objective is:

L = λ_1 L_OP + λ_2 L_LP + λ_3 L_OG + λ_4 L_CG + λ_5 L'_CG,  (12)

where λ_1, λ_2, λ_3, λ_4 and λ_5 are hyperparameters that balance the loss terms.
For simplicity, we set λ_1 = λ_2 = λ_3 = λ_4 = λ_5 = 1; since we find that our approach achieves results competitive with the state-of-the-art models in major metrics under this setting (see Section 4.2), we do not attempt to explore other settings. Overall, through Eq. (12), we realize our Object-Oriented Non-Autoregressive approach (O2NA). The trained model is encouraged to describe the focused objects that a user cares about.
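The confidence-based masking used in the refinement step (mask the n = ⌊l·r⌋ least-confident words of the draft to build X_2) can be sketched as follows; the draft caption and confidence scores here are toy values:

```python
import numpy as np

def mask_lowest_confidence(tokens, confidences, r=0.5, mask_token="[MASK]"):
    # Mask the n = floor(l * r) words whose output probability is lowest.
    n = int(len(tokens) * r)
    worst = np.argsort(confidences)[:n]  # indices of the n least confident words
    out = list(tokens)
    for i in worst:
        out[i] = mask_token
    return out

draft = ["a", "man", "is", "riding", "a", "bike"]
conf = np.array([0.9, 0.4, 0.95, 0.3, 0.9, 0.5])

masked = mask_lowest_confidence(draft, conf)
assert masked == ["a", "[MASK]", "is", "[MASK]", "a", "[MASK]"]
```

The masked positions are then re-predicted by the caption generator, conditioned on the surviving high-confidence words.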
Experiments

In this section, we first describe the datasets, metrics and settings used for evaluation, followed by the experimental results of our approach.

Datasets
Our results are evaluated on the benchmark MSR-VTT (Microsoft Research Video to Text) (Xu et al., 2016) and MSVD (Microsoft Video Description, a.k.a. YouTube2Text) (Guadarrama et al., 2013) datasets. MSR-VTT contains 10,000 video clips, and each video is paired with 20 annotated sentences. Following common practice (Pei et al., 2019; Yang et al., 2021; Pan et al., 2020), we use the official splits to report our results; there are thus 6,513, 497 and 2,990 video clips in the training, validation and test sets, respectively. MSVD contains 1,970 video clips and roughly 80,000 English sentences. We follow the split settings of Pei et al. (2019), resulting in 1,200, 100 and 670 videos for the training, validation and test sets, respectively. Following previous works, we replace caption words that occur fewer than 3 times in the training set with the [UNK] token and add a [MASK] token, resulting in a vocabulary of 10,546 words for MSR-VTT and 9,467 words for MSVD.
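The vocabulary construction can be sketched as follows, assuming a simple whitespace tokenizer (the real pipeline's tokenization may differ):

```python
from collections import Counter

def build_vocab(train_captions, min_count=3):
    # Words occurring fewer than min_count times are dropped (mapped to [UNK]);
    # [MASK] is added for the non-autoregressive decoder.
    counts = Counter(w for c in train_captions for w in c.split())
    vocab = {"[UNK]", "[MASK]"}
    vocab |= {w for w, n in counts.items() if n >= min_count}
    return vocab

caps = ["a man rides a bike"] * 3 + ["a rare zebra"]
v = build_vocab(caps)
assert "zebra" not in v and "bike" in v and "[MASK]" in v
```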

Metrics
We test model performance with a standard captioning evaluation toolkit (Chen et al., 2015). It reports the widely-used automatic evaluation metrics CIDEr (Vedantam et al., 2015), ROUGE-L (Lin, 2004), METEOR (Lin and Hovy, 2003; Banerjee and Lavie, 2005) and BLEU (Papineni et al., 2002). Among them, CIDEr, which is based on n-gram matching and incorporates the consensus of a set of references per example, is specifically designed for evaluating captioning systems. BLEU and METEOR were originally designed for machine translation evaluation, while ROUGE-L was proposed for the automatic evaluation of extractive text summarization. Besides, we further adopt the evaluation metrics Novel, Unique and Vocab Usage, provided by Dai et al. (2018), to evaluate the diversity of the generated captions. Novel is the percentage of generated captions that have not been seen in the training data; Unique is the percentage of generated captions that are unique among all generated captions; Vocab Usage is the percentage of words in the vocabulary that are used to generate captions.
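Under our reading of these definitions, the three diversity metrics can be computed as follows (toy captions and vocabulary; the exact definitions in Dai et al. (2018) may differ in detail):

```python
from collections import Counter

def diversity_metrics(generated, train_captions, vocab):
    train_set = set(train_captions)
    # Novel: fraction of generated captions unseen in the training data.
    novel = sum(c not in train_set for c in generated) / len(generated)
    # Unique: fraction of generated captions occurring exactly once.
    counts = Counter(generated)
    unique = sum(counts[c] == 1 for c in generated) / len(generated)
    # Vocab Usage: fraction of the vocabulary used across all generations.
    used = {w for c in generated for w in c.split()}
    vocab_usage = len(used & set(vocab)) / len(vocab)
    return novel, unique, vocab_usage

gen = ["a man rides a bike", "a dog runs", "a dog runs"]
train = ["a man rides a bike"]
vocab = ["a", "man", "rides", "bike", "dog", "runs", "cat"]
novel, unique, usage = diversity_metrics(gen, train, vocab)
```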

Settings
As stated in Section 3.1, we set N = 8, d_i = d_m = 2048 and d_h = 512 for the video representations. We also use all the category tags (Xu et al., 2016) included in MSR-VTT. For the object predictor, to compare with existing methods, we set the threshold γ = 0.8 and directly select all the predicted objects to generate captions. For the length predictor, the maximum sequence length l_max is set to 30. For the object generator and caption generator, following the original setting of the Transformer (Vaswani et al., 2017), the model size is d_h = 512, the number of heads in multi-head attention is set to 8 and the feed-forward network dimension is set to 2048. The masking ratio is r = 0.5. To build the object vocabulary, we use the spaCy library 2 for noun tagging on the training data; the tagged noun words are taken as the object words, building up object vocabularies of 5,647 and 4,681 words for MSR-VTT and MSVD, respectively. Therefore, we do not use external data to build the object vocabulary, and the object predictor labels match the words used to name objects in the captions. We use the Adam optimizer (Kingma and Ba, 2014) with a batch size of 64 and a learning rate of 5e-4 for a maximum of 50 epochs.
As each video is annotated with multiple sentences, i.e., Video - {Caption_i}, where each sentence Caption_i includes a set of objects {Object_i}, we use all objects appearing in these sentences as the ground truth objects for each video to train the object predictor. However, we treat the different sentences as independent training samples, i.e., Video - Caption_i - {Object_i}, to train the length predictor, object generator and caption generator. In this manner, we ensure that the focused objects {Object_i} appear in the target sentence Caption_i during training and inference, which provides an easy way to control the contents of video captions.
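The two sample constructions described above can be sketched as follows; the toy object vocabulary and captions are illustrative:

```python
# The OP is trained on the union of objects over all captions of a video,
# while the LP/OG/CG treat each (video, caption, objects) triple independently.
object_vocab = {"man", "bike", "road", "dog"}

def objects_in(caption):
    # Objects of a caption: its words that belong to the object vocabulary.
    return {w for w in caption.split() if w in object_vocab}

captions = ["a man rides a bike", "a man on a road"]

# Object-predictor target: union of objects over all captions of the video.
op_target = set().union(*(objects_in(c) for c in captions))

# LP/OG/CG samples: one per caption, each paired with its own object set.
samples = [(c, objects_in(c)) for c in captions]

assert op_target == {"man", "bike", "road"}
assert samples[1][1] == {"man", "road"}
```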
Following the non-autoregressive decoding models of neural machine translation, we incorporate the knowledge distillation (Kim and Rush, 2016;Gu et al., 2018) and de-duplication (Wang et al., 2019b) techniques to improve the performance of our non-autoregressive model on MSR-VTT. Furthermore, following Gu et al. (2018); Wang et al.

Table 1: Performance of automatic evaluation on the test sets of MSVD (Guadarrama et al., 2013) and MSR-VTT (Xu et al., 2016). Higher is better in all columns. † denotes our own implementation. VPS stands for videos per second at the inference stage, measured on a single NVIDIA GeForce GTX 1080 Ti. In this paper, the red- and blue-colored numbers denote the best and the second-best results across all approaches, respectively. All existing video captioning systems follow the autoregressive approach to generate captions and cannot control the captioning process to ensure the inclusion of the focused objects. In comparison, O2NA can not only describe the focused objects, but also achieve performance competitive with the state of the art in major metrics, with both higher diversity and faster inference.
(2019b); Yang et al. (2021), we also adopt the teacher re-scoring (Kim and Rush, 2016) and noisy parallel decoding (Gu et al., 2018; Yang et al., 2021) techniques to generate the captions: these techniques produce a set of candidate sentences in parallel, and we then select the candidate sentence with the highest output probability as the final generated caption. For a detailed introduction of these techniques, please refer to the original papers (Kim and Rush, 2016; Gu et al., 2018; Wang et al., 2019b; Yang et al., 2021).
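The final selection step shared by these techniques (decode several candidates in parallel, keep the one with the highest output probability) can be sketched as follows; the candidate captions and log-probability scores are purely illustrative:

```python
# Each candidate is a (caption, log-probability) pair; scores are made up.
candidates = [
    ("a man is riding a bike", -3.2),
    ("a person rides a bicycle down a road", -2.8),
    ("a man rides", -4.1),
]

# Teacher re-scoring would replace these scores with an autoregressive
# teacher's log-probabilities; the selection rule is the same either way.
best_caption = max(candidates, key=lambda c: c[1])[0]
assert best_caption == "a person rides a bicycle down a road"
```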

Evaluation Results
In comparable settings, twelve representative methods, including the recently published state-of-the-art approaches STAT (Yan et al., 2020), STGN-OAKD (Pan et al., 2020), ORG-TRL (Zhang et al., 2020), SAAT (Zheng et al., 2020), SGN (Ryu et al., 2021) and SemSynAN (Perez-Martin et al., 2021), are selected for comparison. Unless specifically stated, we directly report the results from the original papers. The results on the test sets of MSVD and MSR-VTT are shown in Table 1. As we can see, our O2NA achieves results competitive with the state-of-the-art models on the two datasets in major metrics. These competitive performances verify the validity of O2NA for standard video captioning. More encouragingly, in terms of the metrics that evaluate the diversity of the generated captions, O2NA surpasses the previous state-of-the-art models by relative margins of 39%, 31% and 18% in Novel, Unique and Vocab Usage scores, respectively, which supports our arguments and corroborates the effectiveness of our approach.
Moreover, since our O2NA generates entire captions in three steps with a fixed generation time, it achieves the fastest inference speed (the highest VPS in Table 1) among existing methods. Overall, our O2NA achieves performance competitive with the state of the art in major metrics, with higher diversity scores and faster inference. The experimental results show that our approach is able to generate fluent and diverse video captions with fast inference. More importantly, our O2NA provides an easy way to control the contents of video captions, rather than merely the syntactic variations of existing studies. These advantages have the potential to promote the application of video captioning in real-time industrial settings, e.g., helping visually impaired people see (Voykinska et al., 2016) and human-robot interaction (Das et al., 2017).

Analysis
In this section, we conduct analysis on the benchmark MSR-VTT dataset from different perspectives to better understand our approach.

Quantitative Analysis
We first conduct the quantitative analysis to investigate the contribution of each component in our proposed O2NA.

Ablation Study
Compared to conventional non-autoregressive decoding models (Baseline) from neural machine translation (Lee et al., 2018; Gu et al., 2018; Ghazvininejad et al., 2019), our O2NA further introduces the object predictor and object generator for controllable video captioning. Therefore, we investigate the contribution of these two components; the results are shown in Table 2.

Table 2: Quantitative analysis of O2NA. Baseline denotes the conventional non-autoregressive decoding model in neural machine translation (Lee et al., 2018; Gu et al., 2018; Ghazvininejad et al., 2019). OP and OG denote the object predictor and object generator, respectively.

Effect of the Object Predictor (OP) As expected, since the OP can provide explicit visual clues (i.e., objects) of the input video, the model achieves improved results (c.f. Table 2(b)), especially in Novel and Unique scores, indicating that the OP helps to generate diverse captions. The improved results prove the effectiveness of our OP.
Effect of the Object Generator (OG) As shown in Table 2(O2NA), when further equipped with the OG, the model significantly outperforms the Baseline, which employs a completely empty sequence as input to generate the whole sentence. Intuitively, this practice in the Baseline carries a high risk of producing errors. Fortunately, the object-oriented coarse-grained captions generated by our OG provide rich contextual information for the subsequent non-autoregressive decoding model to generate accurate revised captions. This supports our arguments and verifies the effectiveness of generating captions in a coarse-grained to fine-grained manner. Overall, the proposed OP and OG boost performance from different perspectives, enabling our O2NA to generate diverse and accurate captions.

Effect of the Iteration Times
In O2NA, we adopt the iterative refinement technique (Lee et al., 2018) to proofread and improve the generated captions (see Eq. (10)). Conventional non-autoregressive decoding methods for neural machine translation (Gu et al., 2018; Lee et al., 2018; Ghazvininejad et al., 2019; Guo et al., 2019; Shao et al., 2019) usually adopt more iterations to obtain better results. For O2NA, however, Table 2(c-e) shows that performance stabilizes as the number of iterations increases but does not improve significantly as in Lee et al. (2018) and Ghazvininejad et al. (2019). The reason is that our object-oriented coarse-grained captions already provide solid guidance (i.e., rich contextual information) for the non-autoregressive video captioning model, which further demonstrates the effectiveness of our approach. The decreased diversity may be due to over-fitting brought on by more iterations, making the model prone to generating frequent captions from the training data. Thus, considering the trade-off between caption quality on the one hand and diversity and inference speed on the other, we proofread the generated captions only once.

Effect of the Number of Layers
When increasing the number of layers to 2 (c.f. Table 2(f)), the model achieves only a slightly improved BLEU-4 (41.6 → 41.8) but loses 31.5% of its inference speed. If the number of layers is increased further, performance decreases. We hypothesize that, when training on video captioning datasets that are relatively small compared to those for neural machine translation, larger depth adds to the difficulty of training, as is the case with deep RNNs. In brief, considering the trade-off between performance and inference speed, we adopt a single-layer Transformer decoder.

Case Study and Error Analysis
In this section, we list some correct and incorrect examples to intuitively show the controllability of our proposed O2NA. In the analysis, we manually select the predicted objects to encourage the model to generate a set of diverse captions.

Correct examples (Figure 3): (woman, kitchen) "a woman is in a kitchen." (woman, food, pan) "a woman is cooking some food with a pan."

Figure 3 shows that our approach is controllable and explainable. Specifically, it can generate multiple diverse captions for the same video, and it accurately follows the selected objects we care about. Besides, we find that errors mainly occur when objects are incorrectly predicted, e.g., "suitcase" and "shirt": O2NA mistakes the incorrect object for an appropriate one during its object sequence generation. A more powerful object predictor may help mitigate these problems, but they are unlikely to be completely avoided.

Conclusions
In this work, we introduce the problem of controllable video captioning in the sense of controlled contents. In contrast to the existing studies that consider syntactic variations, controlling contents is of more practical value. To tackle the problem, we propose the Object-Oriented Non-Autoregressive approach (O2NA), which encourages the model to describe the focused objects that a user cares about by generating captions conditioned on those objects non-autoregressively. The experiments and analyses verify the flexibility and demonstrate the effectiveness of O2NA, which achieves results competitive with existing state-of-the-art models on two benchmark datasets in major metrics, with higher diversity and faster inference. These advantages could promote the application of video captioning in real-world scenarios.

Acknowledgments
This work is supported in part by Beijing Academy of Artificial Intelligence (BAAI). We sincerely thank all the anonymous reviewers and chairs for their constructive comments and suggestions that substantially improved this paper. We also sincerely thank Bang Yang for providing the implementation code of the non-autoregressive framework for video captioning. 3 Xu Sun is the corresponding author of this paper.

Impact Statement
This paper introduces the problem of controllable video captioning in the sense of controlled contents, aiming to efficiently understand the visual content of a given video and generate corresponding descriptive sentences. As a result, our work can control the video captioning process and include focused objects, i.e., the video captions generated by our model are more likely to contain preferred objects given by a user or pre-defined objects that should be prioritized in generation. This improves the practicality of video captioning in real-world applications, such as visual retrieval, human-robot interaction and aiding visually-impaired people. However, training our proposed model relies on a large volume of video-caption pairs, which may not be easily obtained in the real world; this could be alleviated using techniques such as distillation from publicly available pre-trained models. Hence, deploying the approach requires specific and appropriate treatment by experienced practitioners.