VKIE: The Application of Key Information Extraction on Video Text

Extracting structured information from videos is critical for numerous downstream applications in the industry. In this paper, we define a significant task of extracting hierarchical key information from visual texts on videos. To fulfill this task, we decouple it into four subtasks and introduce two implementation solutions called PipVKIE and UniVKIE. PipVKIE sequentially completes the four subtasks in continuous stages, while UniVKIE is improved by unifying all the subtasks into one backbone. Both PipVKIE and UniVKIE leverage multimodal information from vision, text, and coordinates for feature representation. Extensive experiments on one well-defined dataset demonstrate that our solutions can achieve remarkable performance and efficient inference speed.


Introduction
Extracting information from video text is an essential task for many industrial video applications, e.g., video retrieval (Radha, 2016), video recommendation (Yang et al., 2007), and video indexing (Yang et al., 2011). Visual text embedded in videos usually carries rich semantic descriptions of the video content, and this information provides a high-level index for content-based video indexing and browsing.
Conventional methods utilize OCR (Liao et al., 2018; Tian et al., 2016; Zhou et al., 2017) to extract visual texts from video frames and employ text classification techniques (Le et al., 2018; Li et al., 2020) to categorize the extracted content. However, these methods suffer from two significant shortcomings: 1) The extracted visual texts are typically coarse-grained at the segment level, so fine-grained information at the entity level, which is critical for downstream tasks, is not captured. 2) Traditional methods do not fully exploit the fusion of features from different modalities.
* These authors contributed equally to this work.
Therefore, in our work, we introduce a novel industrial task for extracting key information from video text and exploring the relationships between entities, which we refer to as VKIE. The task aims to extract valuable hierarchical information from visual texts, explore their relationships, and organize them in structured forms. This approach enables effective management and organization of videos through rich hierarchical tags, which can be used to index, organize, and search videos at different levels. Figure 1 provides an example of the hierarchical key information extracted by VKIE, where subtitles are captured at the segment level, and personal information is organized with names and identities at the entity level.
To enhance clarity, we decompose VKIE into four subtasks: text detection and recognition (TDR), box text classification (BTC), entity recognition (ER), and entity linking (EL). While the first subtask, TDR, is typically accomplished using off-the-shelf OCR tools, our work concentrates on the remaining three subtasks of BTC, ER, and EL.
Since TDR outputs all boxes with their text content and coordinates, the output contains many useless texts, such as scrolling texts and blurred background texts, which can harm downstream tasks. BTC aims to eliminate these useless texts and identify valuable categories, such as title, subtitle, and personal information.
Although BTC can obtain segment-level information, the results are relatively coarse-grained, which limits its deployment in many downstream applications. For example, in video-text retrieval, the query usually takes different forms, such as keywords, phrases, or sentences. In video indexing, a video is required to be stored with hierarchical tags. To address these issues, we design ER to extract entities from text segments and EL to explore the relations among the entities. With this structured information, videos can be well managed with rich hierarchical information at the entity and segment levels.
In this paper, we present two solutions that have been deployed in our industrial system. The first approach, called PipVKIE, performs the tasks sequentially and serves as our baseline method. The second approach, called UniVKIE, achieves better performance and efficiency by integrating multimodal features more effectively.
In summary, our contributions are as follows:
(1) We define a new industrial task of extracting key information from video texts. In this way, structured information can be effectively extracted and well managed at hierarchical levels.
(2) We introduce and compare two deployed solutions based on a framework that includes TDR, BTC, ER, and EL. Experiments show that our solutions achieve remarkable performance and efficient inference speed.
(3) To compensate for the lack of datasets, we construct a well-defined dataset to provide comprehensive evaluations and promote this industrial task.

PipVKIE
The PipVKIE solution fulfills the three subtasks of BTC, ER, and EL in a sequential pipeline and processes a single visual box at a time. In this process, BTC acts as a filter, selecting only the valuable text segments. ER is then carried out only on the segments selected by BTC. Similarly, EL takes only the entities extracted by ER as input, while other irrelevant information is filtered out.
BTC. In our design, the objective of BTC is to categorize the text segments that appear in the OCR boxes into different classes, such as titles and subtitles. As illustrated in Fig. 2, in PipVKIE, BTC takes the visual and textual features as input and outputs the corresponding class label. Specifically, for the visual modality, in contrast to conventional approaches that usually use the classical VGG (Simonyan and Zisserman, 2014) or ResNet-based (He et al., 2016) networks, we construct a shallow neural network as the backbone. We observe that texts differ mainly in low-level features such as colors and fonts, so we abandon the deeper networks mentioned above, which are geared toward extracting high-level semantic information. For the textual modality, a transformer (Vaswani et al., 2017) is selected as the backbone.
The fusion of multimodal features is a critical step in obtaining the multimodal representation of one box. The visual and positional modalities are processed as follows. Let $h_{vf}$ denote the visual embedding of the frame, obtained directly by a CNN (Krizhevsky et al., 2012), and let $h_p$ denote the positional embedding of the box, obtained from its coordinates. First, ROIAlign (He et al., 2017) is used to extract the visual box embedding conditioned on $h_p$ and $h_{vf}$. Then, we use a transformer (Vaswani et al., 2017) to learn the implicit relation between a box and its corresponding frame; the result is denoted as $h_{vb}$. The visual box embedding $h_{vb}$ and the textual box embedding $h_{tb}$, which is obtained by applying the transformer encoder to the text, are simply concatenated to obtain the final multimodal vector representation $h_b$. Subsequently, we perform softmax classification by multiplying $h_b$ with trainable weight parameters.
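The last two steps of this paragraph (concatenation, then a linear map followed by softmax) can be illustrated with a minimal sketch. The embedding sizes, the random weights, and the helper names `softmax` and `classify_box` are assumptions made for illustration; only the four class names come from the dataset description, and the upstream CNN, ROIAlign, and transformer stages are not modeled.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify_box(h_vb, h_tb, weights):
    """Concatenate visual and textual box embeddings, then apply weights + softmax."""
    h_b = h_vb + h_tb  # list concatenation yields the multimodal vector h_b
    logits = [sum(w * x for w, x in zip(row, h_b)) for row in weights]
    return softmax(logits)

# Toy inputs: a 4-dim visual embedding and a 6-dim textual embedding (sizes are hypothetical).
rng = random.Random(0)
h_vb = [rng.uniform(-1, 1) for _ in range(4)]
h_tb = [rng.uniform(-1, 1) for _ in range(6)]

LABELS = ["Title", "Person Info", "Subtitle", "Misc"]
# One weight row per class; in practice these would be trained, here they are random.
weights = [[rng.uniform(-1, 1) for _ in range(len(h_vb) + len(h_tb))] for _ in LABELS]

probs = classify_box(h_vb, h_tb, weights)
predicted = LABELS[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

Plain concatenation keeps the two modalities independent until the classifier, which matches the "simply concatenated" wording above.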
ER. In contrast to the commonly known NER on flat text (Lample et al., 2016), the goal of ER in VKIE is to identify entities from a single video frame. In this context, factors such as the entity's position and background features can significantly influence the recognition process. In PipVKIE, this stage extracts entities from the valuable text segments selected in the previous step. We obtain the hidden representations of text tokens with a transformer encoder, and then predict their tags with the BIO2 tagging schema (Sang and Veenstra, 1999).
EL. EL aims to explore the relations between the extracted entities in each frame. Specifically, let $h^N_p$ denote the hidden representation of the p-th entity of the category Name, and let $h^I_q$ denote the hidden representation of the q-th entity of the category Identity. The representation of each entity is generated by average pooling over its text tokens. Subsequently, in each frame, we build the matrix D as the input to the classifier. Each element of D is given by Eq. 2, where D(p, q) is the vector obtained by concatenating the hidden representations of the entity pair:
$$D(p, q) = [h^N_p ; h^I_q] \tag{2}$$
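The two steps described for ER and EL (decoding BIO2 tags into entity spans, then pooling and pairing Name and Identity entities) can be sketched as follows. The tag strings such as `B-Name`, the toy token vectors, and the helper names are hypothetical, and the downstream pair classifier is omitted.

```python
def decode_bio(tags):
    """Turn BIO2 tags into (label, start, end) spans, with end exclusive.

    Stray I- tags that do not continue an open span are ignored.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current entity
        if label is not None:
            spans.append((label, start, i))  # close the open entity
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans

def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def build_pair_matrix(token_vectors, tags):
    """Return D where D[p][q] concatenates the p-th Name and q-th Identity vectors."""
    spans = decode_bio(tags)
    names = [mean_pool(token_vectors[s:e]) for lab, s, e in spans if lab == "Name"]
    identities = [mean_pool(token_vectors[s:e]) for lab, s, e in spans if lab == "Identity"]
    return [[h_n + h_i for h_i in identities] for h_n in names]

# Toy frame: six tokens with 2-dim hidden states (values are made up).
tags = ["B-Name", "I-Name", "O", "B-Identity", "I-Identity", "I-Identity"]
token_vectors = [[1, 0], [3, 0], [0, 0], [0, 2], [0, 4], [0, 6]]

spans = decode_bio(tags)
D = build_pair_matrix(token_vectors, tags)
print(spans)
print(D)
```

Each D[p][q] would then be passed to a binary classifier that decides whether the name and the identity belong together.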

UniVKIE
Although PipVKIE is effective in practice, we have identified several problems with it: 1) PipVKIE does not effectively utilize the layout relationships between different boxes within the same frame. 2) The three tasks (BTC, ER, and EL) are trained separately and cannot benefit from each other. 3) Processing only one box at a time during inference is not efficient enough. To address these problems, we propose UniVKIE, a unified model that processes all boxes of each frame in parallel. UniVKIE leverages a shared multimodal backbone and employs a multitask learning approach. Fig. 2 provides an overview of our model's architecture.

Multimodal Backbone
Similar to the model structures defined in (Li et al., 2021; Xu et al., 2020b,a; Hong et al., 2022), we utilize a shared multimodal backbone for the three tasks. Given a video frame, we first apply OCR to obtain the text recognition results, which can be described as a set of 2-tuples consisting of M text segments and their box coordinates. Then, we concatenate these M text segments from top left to bottom right into one text of length N. In this concatenated text, let $v_i \in \{v_1, v_2, \ldots, v_M\}$ denote the i-th visual token, corresponding to the i-th box, and let $t_j \in \{t_1, t_2, \ldots, t_N\}$ denote the j-th text token. We then add [CLS] and [SEP] and pad the sequence to a fixed length L. The input sequence follows the format in Eq. 3.
UniVKIE benefits from this structure in three aspects: 1) Visual tokens and text tokens can interact with each other, so the feature representation is reinforced by multimodal fusion. 2) The relations between boxes are explored to fully extract layout information. 3) All boxes in each frame are processed in parallel in this concatenated form.
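As a rough illustration of how such an input sequence could be assembled, the sketch below sorts the OCR boxes in reading order, emits one visual token per box, concatenates the text tokens, and pads to a fixed length. The placement of [CLS] and [SEP], the character-level tokenization, the `[V1]`-style placeholder tokens, and the naive truncation are all assumptions made for this example and are not taken from Eq. 3.

```python
def build_input_sequence(ocr_results, max_len):
    """Build one padded token sequence for a frame.

    ocr_results: list of (text, (x0, y0, x1, y1)) tuples, one per OCR box.
    """
    # Reading order: top to bottom, then left to right, by the top-left corner.
    ordered = sorted(ocr_results, key=lambda item: (item[1][1], item[1][0]))

    # One placeholder visual token per box (M tokens in total).
    visual_tokens = [f"[V{i + 1}]" for i in range(len(ordered))]

    # Character-level text tokens of all boxes, concatenated (N tokens in total).
    text_tokens = [ch for text, _ in ordered for ch in text]

    seq = ["[CLS]"] + visual_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]
    seq = seq[:max_len]  # naive truncation, for illustration only
    seq += ["[PAD]"] * (max_len - len(seq))
    return seq

# Toy frame with two boxes; coordinates are made up.
frame = [
    ("Anna", (40, 300, 200, 330)),
    ("News", (10, 20, 300, 60)),
]
seq = build_input_sequence(frame, max_len=16)
print(seq)
```

Because every box of the frame ends up in the same sequence, self-attention can relate any visual token to any text token, which is the interaction the three aspects above rely on.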

Multitask Learning
While the models corresponding to the three subtasks are trained separately in PipVKIE, UniVKIE unifies these subtasks and employs a multitask learning approach (Vandenhende et al., 2020) to jointly train the model. As illustrated in Fig. 2, UniVKIE takes the embeddings of the M text segments defined in Eq. 3 as input to the BTC branch, which outputs the categories of the M boxes. The ER branch takes the N tokens of the text concatenated from all box texts as input to identify the entities, which are then passed to the EL branch to explore their relationships.
The final loss is computed as a weighted sum of the losses of the three subtasks, where $L_{BTC}$, $L_{ER}$, and $L_{EL}$ are the losses of BTC, ER, and EL, respectively, and α and β are hyperparameters that trade off between them.
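One form consistent with this description is given below. Which of the two hyperparameters weights which term is an assumption here: α is attached to the ER loss and β to the EL loss, with the BTC loss left unweighted.

```latex
\mathcal{L} = \mathcal{L}_{BTC} + \alpha \, \mathcal{L}_{ER} + \beta \, \mathcal{L}_{EL}
```

Under this reading, setting α or β to zero corresponds to dropping the ER or EL objective from joint training.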

Experimental Setup
Dataset. To promote the new task, we have created a real-world dataset consisting of 115 hours of videos collected from 88 different sources. During preprocessing, we uniformly sampled 23,896 frames from these videos and obtained over 123k visual boxes with text segments and coordinates using an off-the-shelf OCR tool. Afterwards, the dataset was carefully annotated and strictly checked by 8 professional annotators. Further details about the dataset are shown in Table 5.

Metrics and Implementation Details
We evaluate the performance of BTC, ER, and EL by Precision (P), Recall (R), F1-score, and Accuracy (Acc). To ensure the reliability of our results, we conducted ten runs with distinct random seeds for each setting and report the average results over these runs. Details of the hyperparameter settings for PipVKIE and UniVKIE are presented in Table 6 and Table 7, respectively.
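For reference, precision, recall, and F1 can be computed from sets of gold and predicted items as in the sketch below. Treating an ER prediction as correct only when its label and boundaries match exactly is an assumption of this example, as are the toy spans; the text above does not state the matching criterion.

```python
def precision_recall_f1(gold, predicted):
    """Compute P, R, and F1 over two sets of hashable items.

    For ER, an item could be a (label, start, end) span; for EL, a (name, identity) pair.
    """
    tp = len(gold & predicted)  # items that appear in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one of two predicted spans matches one of three gold spans.
gold = {("Name", 0, 2), ("Identity", 3, 6), ("Name", 8, 10)}
predicted = {("Name", 0, 2), ("Identity", 3, 5)}

p, r, f1 = precision_recall_f1(gold, predicted)
print(round(p, 3), round(r, 3), round(f1, 3))
```

Averaging these values over the ten seeded runs would then give the reported numbers.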

BTC
The upper part of Table 1 presents the performance of BTC. To evaluate how each modality contributes to performance, we also include unimodal methods for comparison: two text backbones, BERT (Devlin et al., 2018) and XLM-RoBERTa (Conneau et al., 2019), and one visual backbone, ResNet-50 (He et al., 2016). Our results show that PipVKIE and UniVKIE outperform the unimodal methods, with UniVKIE performing better than PipVKIE. This demonstrates the benefit of both multimodal information and the unifying strategy.

ER
In PipVKIE, subtasks are completed in sequential stages, which means that errors from BTC can accumulate in the downstream task ER. To isolate the accumulated error, we evaluated the performance of ER by replacing the predictions of BTC with the ground truth. The performance of ER is shown in the bottom left of Table 1, where PipVKIE* represents the results obtained by using ground truth input instead of predicted input. PipVKIE* with ground truth input performs better than PipVKIE with predicted input, indicating that errors accumulate in downstream tasks. Furthermore, UniVKIE achieves better results than PipVKIE, demonstrating that unifying is a better strategy.

EL
The performance of EL is shown in the bottom right of Table 1. Similar to ER, we compared the performance of PipVKIE and UniVKIE when feeding them with either the ground truth entity boundaries or the predicted hidden representations. The performance is slightly lower when using the predictions of ER. In real-world applications where errors can accumulate, UniVKIE achieves better results than PipVKIE, which demonstrates its superiority.
As Table 1 shows, UniVKIE outperforms PipVKIE on the major metrics. We attribute this primarily to the efficient fusion of different modalities and the elimination of the error accumulation caused by the pipeline method. Another factor is that the subtasks within PipVKIE operate independently and cannot benefit from each other.

Ablation Study
We design a series of ablation experiments to verify the contribution of each component in our solutions. We evaluate the effectiveness of the modalities by removing one or more of them from UniVKIE, as shown in Table 2. While the text modality is necessary for ER and EL, we also observe a marked performance degradation in BTC after removing textual information, which confirms that the text modality plays a dominant role in our task. In addition, UniVKIE with full multimodal information achieves the best results in all comparisons. The reason is that, even for identical text, visual features such as its location and background in a frame can affect the identification of segment categories, entities, and relationships. For example, subtitles are often located at the bottom of the image and have a special background color. Similarly, related names and identities often appear in visually adjacent positions within a video frame.
Furthermore, we conduct additional experiments to explore how each task impacts the others, as shown in Table 3. Regarding the impact of BTC on ER and EL, we find that UniVKIE without the BTC loss achieves slightly worse results on ER but improves on EL. Moreover, after removing the ER loss and the EL loss, performance on BTC remains almost unchanged. These observations indicate that BTC is hardly influenced by the other two tasks. UniVKIE unifies the three tasks into one model and achieves balanced overall performance.

Modality
In the ablation study, we find that the text modality plays the leading role. Visual information also plays a crucial role in our task. For example, in BTC, boxes of a specific category often have a particular background color and location, which can serve as complementary features to the text. Since associated names and identities are usually located in adjacent positions within one frame, it is important to consider visual information when performing EL. The experimental results in Table 2 validate this point.

Efficiency
Table 4 compares the inference speed and resource cost of PipVKIE and UniVKIE. We deploy both models on a Tesla V100-SXM2-32GB GPU. By sharing the same multimodal backbone and unifying the three tasks into one model, UniVKIE achieves satisfactory inference speed and consumes fewer GPU resources. This is mainly because PipVKIE processes the features of only a single box per inference pass, whereas UniVKIE takes the features of all boxes in a whole frame as input at once, which increases parallelism and thus improves efficiency.

Deployment Cases
Both PipVKIE and UniVKIE have already been deployed on an AI platform for industrial media, which is a well-designed video understanding platform with comprehensive video processing services. We give three cases of real-world news videos, as shown in Fig. 3. The red boxes illustrate the hierarchical information extracted from the current video frame. In these cases, titles and subtitles are shown at the segment level, while personal information is organized at the entity level with name and identity. This valuable hierarchical information extracted by VKIE from the visual texts can therefore be used effectively to index, organize, and search videos in real applications. More details about how our application works on the AI platform can be found in supplementary material A.

Conclusion
This paper introduces a novel task in the industry, referred to as VKIE, which aims to extract crucial information from visual texts in videos.

Limitations
While VKIE could be easily extended to multilingual tasks, the dataset used in our practical application is centered on Chinese videos. For general use, we plan to extend the application to multilingual tasks in the future.

A Media AI Platform
The task of key information extraction from visual texts in videos has been deployed on a media AI platform, which is a well-designed video understanding platform with comprehensive video processing services. We uniformly sample key frames from the uploaded video. Then, an OCR engine is used to extract visual boxes and their corresponding coordinates. Afterwards, VKIE completes the three subtasks of BTC, ER, and EL, and obtains hierarchical information at the entity and segment levels. One result is presented in Fig. 4 for clear viewing.
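The per-frame flow described above (OCR boxes, then BTC as a filter, then ER, then EL) can be sketched as a small driver function. Everything below is hypothetical scaffolding: the function name `run_pipeline`, the rule-based `toy_*` stand-ins, and the choice to run ER only on Person Info boxes are assumptions for illustration, not the deployed models or the platform's API.

```python
def run_pipeline(boxes, btc, er, el, keep=("Title", "Person Info", "Subtitle")):
    """Process the OCR output of one frame.

    boxes: list of (text, coords) tuples; btc, er, el: callables for the three subtasks.
    """
    result = {"segments": [], "links": []}
    entities = []
    for text, coords in boxes:
        label = btc(text, coords)
        if label not in keep:  # BTC acts as a filter for useless text
            continue
        result["segments"].append((label, text))  # segment-level information
        if label == "Person Info":
            entities.extend(er(text))  # entity-level information
    result["links"] = el(entities)
    return result

# Rule-based stand-ins so the sketch runs end to end.
def toy_btc(text, coords):
    return "Person Info" if "," in text else "Misc"

def toy_er(text):
    name, identity = [part.strip() for part in text.split(",", 1)]
    return [("Name", name), ("Identity", identity)]

def toy_el(entities):
    names = [value for kind, value in entities if kind == "Name"]
    identities = [value for kind, value in entities if kind == "Identity"]
    return [(n, i) for n in names for i in identities]

boxes = [
    ("Anna Lee, Reporter", (40, 300, 260, 330)),
    ("LIVE", (10, 10, 60, 30)),
]
result = run_pipeline(boxes, toy_btc, toy_er, toy_el)
print(result)
```

Passing the three stages in as callables keeps the driver independent of whether they are separate models, as in PipVKIE, or branches of one backbone, as in UniVKIE.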

D Integration with LLMs
Recently, large language models (LLMs) have attracted widespread interest. We have noticed this and conducted experiments with LLMs within the VKIE scenario. However, we found these approaches are not sufficiently stable for practical industrial applications. Therefore, we have decided to defer the exploration of integration with LLMs as a future extension of our work, rather than incorporating it into this submission.

Figure 1: An example of hierarchical key information extracted by VKIE in a video frame (CGTN Sports Scene, 2023).

Figure 2: The overall architecture of two deployed solutions PipVKIE and UniVKIE.
To address the task, we decouple VKIE into four subtasks: text detection and recognition, text classification, entity recognition, and relation extraction. Furthermore, we propose two complete solutions utilizing multimodal information: PipVKIE and UniVKIE. PipVKIE performs the latter three subtasks in separate stages, while UniVKIE unifies all of them in one model with higher efficiency and lower resource cost. Experimental results on one well-defined dataset demonstrate that our solutions achieve remarkable performance at a satisfactory resource cost. With VKIE, structured information can be effectively extracted and well organized with rich semantic information. VKIE has been deployed on an industrial AI platform.

Figure 3: Real-world cases on our AI platform. The red boxes illustrate the extraction results for the current frame.

Figure 4: A real-world case on the AI platform for clear viewing.

Figure 5: Examples of ER on the annotation platform. The red box indicates the candidate labels of ER.

Table 1: Experimental results of BTC, ER, and EL. * indicates the results obtained by replacing the prediction of the upstream task with ground truth. - indicates meaningless results, since for UniVKIE, ER does not rely on BTC in the pipeline. † indicates that UniVKIE performs better with p-value < 0.05 based on a paired t-test.

Table 2: Modality ablation study of UniVKIE. - indicates meaningless results, as the text modality cannot be omitted in ER and EL.

Table 3: Loss ablation study of UniVKIE. - indicates meaningless results, as the task-specific loss is necessary for the corresponding subtask.

Table 5 illustrates the concrete categories contained in BTC, ER, and EL in our practice. We collected 88 sources, totaling 115 hours, from publicly available videos, including news programs, variety shows, and other sources. All 88 video sources are split for training, development, and testing with a ratio of 3:1:1. We then extract frames from these videos by sampling uniformly over time. To prevent data leakage, we ensure that frames from the same video are not present in different splits. In BTC, we assign the samples to 4 categories: Title, Person Info, Subtitle, and Misc. We further annotate mentions and labels on the samples of Person Info for ER, as shown in Fig. 5. Finally, EL is annotated on the pairs of entities extracted from each frame.
Table 5: The basic statistics of our dataset.
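A source-level split of this kind can be sketched as below. Splitting by source before any frames are extracted is what keeps frames of one video within a single split. The source identifiers, the fixed seed, and the rounding rule (floor for train and dev, remainder to test) are assumptions for illustration, so the resulting split sizes are not the paper's actual counts.

```python
import random

def split_sources(source_ids, ratios=(3, 1, 1), seed=0):
    """Split video sources into train/dev/test at the given ratio.

    Frames inherit the split of their source, so no source spans two splits.
    """
    ids = sorted(source_ids)
    random.Random(seed).shuffle(ids)  # seeded shuffle for reproducibility
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_dev = len(ids) * ratios[1] // total
    return {
        "train": ids[:n_train],
        "dev": ids[n_train:n_train + n_dev],
        "test": ids[n_train + n_dev:],  # remainder goes to test
    }

# 88 hypothetical source identifiers.
sources = [f"source_{i:02d}" for i in range(88)]
splits = split_sources(sources)
print({name: len(ids) for name, ids in splits.items()})
```

Assigning each sampled frame the split of its source then guarantees that frames from the same video never appear in two splits.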

Table 6 illustrates the hyperparameters of the three models corresponding to BTC, ER, and EL in PipVKIE. In UniVKIE, we use a shared multimodal backbone and build task-specific branches as in Table 7.
Table 6: Hyperparameters of PipVKIE.