Query-aware Multi-modal based Ranking Relevance in Video Search



Introduction
Video search has become a prevalent way for users to find relevant content in response to text queries on video streaming platforms. Relevance ranking is crucial in video search (Pang et al., 2017), as it determines the degree of relevance of a video with respect to a given query. Pointwise losses (e.g., binary cross-entropy), ranking losses (e.g., hinge loss), and the Combined-Pair loss (a linear combination of pointwise and pairwise losses) (Zou et al., 2021) are commonly used to optimize the relevance ranking task. However, these methods fail to balance calibration ability (globally stable predictions with good interpretability) and ranking ability (predictions that lead to a correct ranking) (Sheng et al., 2022). At the same time, the recent success of the transformer architecture (Vaswani et al., 2017) in computer vision and natural language processing has led to pre-trained language models achieving promising results in retrieval and ranking tasks (Zou et al., 2021; Nogueira et al., 2019; Liu et al., 2021). However, most existing approaches focus primarily on the text modality, and alternative methods that integrate large-scale Vision-and-Language Pre-training (VLP) models, such as CLIP (Radford et al., 2021) and ALBEF (Li et al., 2021), into video search engines face two key challenges: (1) Images are typically aligned with verbose and detailed video texts, which provides limited help for modeling the matching relationship between visual signals and concise queries in downstream relevance tasks. (2) Most VLP models are trained on single-frame images and texts, neglecting video information such as keyframes and tags, which renders them inadequate for video search engines.
To address these challenges, we propose a query-aware, multi-modal relevance ranking model for real-world video search systems within a two-step framework, as depicted in Fig. 1. Query-aware Pre-training Model with Multi-modality. We present a real-world query-aware pre-training model that simultaneously aligns image features with video text features and query features. In addition, we propose a hard query mining strategy to exploit query knowledge effectively. Inspired by CLIP4Clip (Luo et al., 2022) and TABLE (Chen et al., 2023), we introduce a local tag-guided attention network to extract features from sequential frames rather than a single image. To preserve pre-trained knowledge as much as possible and accelerate training, we employ an adapter-tuning strategy. Ranking Relevance. Following the approach in (Bo et al., 2021), we model relevance ranking under the pre-training and fine-tuning paradigm, utilizing various handcrafted features (e.g., BM25 (Robertson and Walker, 1994), click similarity (Yin et al., 2016), and term weights) together with pre-trained representations of the query and video within a wide and deep network architecture. We enhance ranking performance by incorporating multi-modal knowledge and by proposing an ordinal regression based approach for the joint optimization of ranking and calibration in relevance prediction.
In summary, this paper makes the following contributions:
• We introduce a novel query-aware pre-training model tailored for real-world applications, aligning images with both titles and queries. This approach effectively utilizes video modality information and adapts well to downstream tasks.
• We propose an innovative relevance ranking optimization method based on ordinal regression, balancing calibration and ranking abilities effectively.
• We present a novel approach for applying pre-trained VLP models to online relevance ranking tasks in real industrial video search scenarios. Comprehensive offline and online evaluations demonstrate that the proposed techniques significantly enhance relevance ranking performance.

Methodology
In this section, we describe the details of our multi-modal ranking relevance approach.

The overall architecture of our methodology is illustrated in Fig. 1, comprising a query-aware multi-modal pre-training model and a ranking relevance model that utilizes both visual and textual information.

Query-aware Pre-training Model with Multi-modality
As illustrated in Fig. 1(a), our QUery-Aware Pre-training Model with Multi-modaLITY (QUALITY) is composed of a query tower, a video visual tower, and a video text tower, extending the dual-tower structure of the image-level ALBEF model.

Model Input. Given an input video v and an input query q, we employ a 12-layer visual transformer, ViT-B/16 (Dosovitskiy et al., 2020), to encode N frames uniformly sampled from the video, and a shared 12-layer textual encoder, BERT-base (Devlin et al., 2018), to encode the title and tags of the video as well as the query. The frame-level visual encoder and the textual encoder are initially pre-trained with the CLIP approach on industrial video-search log data. To accelerate the training of QUALITY and prevent catastrophic forgetting (Sharkey and Sharkey, 1995) in the uni-modal pre-trained encoders, we follow the AdaptFormer (Chen et al., 2022a) method: a trainable, lightweight down-up bottleneck module is added to the feed-forward parts of the transformer blocks within our pre-trained encoders, while all other parameters of the pre-trained encoders are frozen, significantly reducing the number of trainable parameters and improving training efficiency.

Tag Guidance. Video tags are widely available on video-sharing platforms and are usually keywords or phrases that facilitate video content understanding. To understand video content beyond low-level visual features, we design a tag-guided cross-attention network that aligns semantic information with the visual signal. Specifically, given the visual representation {f_cls, f_1, f_2, ..., f_N} generated by the visual encoder and the tag representation {g_cls, g_1, ..., g_M} with M tokens generated by the textual encoder, a 3-layer transformer with 8 cross-attention heads (shown in purple in Fig. 1) aligns visual information with semantic tags; we then retain the tag-guided visual part {v_cls, v_1, ..., v_N} for the subsequent query-awareness computation.

Query-awareness. Previous VLP models such as ALBEF and CLIP4Clip focus on modeling the relationship between visual signals and their corresponding text descriptions. However, in real-world search scenarios, the way video content is described and the way users express their queries can differ significantly. Moreover, text descriptions often fail to summarize video content adequately, so the representations obtained by these methods may offer limited help for search tasks. To better adapt to our downstream video-search tasks, we explicitly model the matching relationship between the query tower and the vision tower (i.e., video frames) through a vision-query contrastive learning (VQC) task, a shared cross-modal cross-attention encoder (shown in cadet blue in Fig. 1), and a vision-query matching (VQM) task, while also maintaining the matching between the vision and title towers through a vision-text contrastive learning (VTC) task, the same shared encoder, and a vision-text matching (VTM) task. The shared cross-modal encoder is a 3-layer transformer with 8 cross-attention heads. Our query-awareness strategy alleviates the mismatch issue that arises when relying purely on text information in the downstream ranking relevance task.
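To make the adapter-tuning step concrete, the following is a minimal PyTorch sketch of an AdaptFormer-style down-up bottleneck placed alongside a frozen feed-forward block; the class names, dimensions, and parallel-residual placement are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Lightweight down-up bottleneck (AdaptFormer-style), the only trainable part."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.scale = nn.Parameter(torch.ones(1))  # learnable scaling of the adapter branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.act(self.down(x))) * self.scale

class AdaptedFFN(nn.Module):
    """Wraps a frozen feed-forward block; only the parallel adapter branch is updated."""
    def __init__(self, frozen_ffn: nn.Module, hidden_dim: int = 768):
        super().__init__()
        self.ffn = frozen_ffn
        for p in self.ffn.parameters():
            p.requires_grad = False  # keep pre-trained weights frozen
        self.adapter = BottleneckAdapter(hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ffn(x) + self.adapter(x)  # frozen path plus trainable residual branch

# Example: wrap a toy FFN; only the small adapter remains trainable.
ffn = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
block = AdaptedFFN(ffn, hidden_dim=768)
out = block(torch.randn(2, 16, 768))  # (batch, tokens, hidden)
```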

Pretraining Objectives
QUALITY is pre-trained using the following five objectives: Vision-Query Contrastive Learning (VQC) and Vision-Text Contrastive Learning (VTC), applied to the uni-modal encoders, as well as Vision-Query Matching (VQM), Vision-Text Matching (VTM), and Masked Language Modeling (MLM), applied to the multi-modal encoders. The performance of VQM and VTM is enhanced through online contrastive hard negative mining. Additionally, VQM is further improved by employing offline hard query mining.
Vision-Query Contrastive Learning aims to align the visual signal v_cls and the query q_cls prior to fusion. We define a similarity function s(v_cls, q_cls) = h_v(v_cls)^⊤ h_q(q_cls), where h_v(·) and h_q(·) are linear layers that project the [CLS] embeddings into a shared semantic space and normalize them. The vision-query contrastive loss is computed over in-batch vision-query pairs with a trainable temperature parameter τ and batch size B.

Vision-Text Contrastive Learning seeks to align the visual signal v_cls and the video title t_cls prior to fusion. Analogous to the vision-query task, the vision-text contrastive loss is defined with a trainable temperature parameter µ.

Vision-Query Matching aims to predict whether a pair of vision and query is matched. We concatenate the [CLS] embeddings of the multi-modal encoder, into which the vision and query signals are fed, and a fully connected layer then generates the two-class matching probability, denoted p_vqm. The vision-query matching loss is the cross-entropy between p_vqm and the ground-truth label y_vqm, averaged over the J vision-query pairs used for this task. In addition to the online hard negative mining strategy employed by ALBEF, we design an embedding-based offline strategy to mine both hard positive and hard negative queries. Specifically, we first derive query and video embeddings from a query-video click graph using lightweight graph embedding algorithms such as item2vec (Barkan and Koenigstein, 2016) and DeepWalk (Perozzi et al., 2014). For a given video, a query is chosen as a hard positive if the cosine similarity between the query and the video, computed on the graph embeddings, exceeds a predetermined threshold; conversely, if the cosine similarity falls below the threshold, the query is treated as a hard negative. The threshold is an empirical hyperparameter.
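As a rough illustration of the offline hard query mining step, the sketch below assumes that query and video graph embeddings (e.g., from DeepWalk or item2vec) are already available as NumPy arrays; the function name and the threshold value are hypothetical.

```python
import numpy as np

def mine_hard_queries(video_emb: np.ndarray,
                      query_embs: np.ndarray,
                      threshold: float = 0.6):
    """Split candidate queries into hard positives / negatives for one video,
    based on cosine similarity of graph embeddings (threshold is empirical)."""
    v = video_emb / np.linalg.norm(video_emb)
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    sims = q @ v  # cosine similarity of every candidate query to the video
    hard_positives = np.where(sims >= threshold)[0]
    hard_negatives = np.where(sims < threshold)[0]
    return hard_positives, hard_negatives

# Toy usage with random embeddings (in practice these come from the click graph).
rng = np.random.default_rng(0)
pos, neg = mine_hard_queries(rng.normal(size=64), rng.normal(size=(100, 64)))
```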
Vision-Text Matching aims to predict whether a pair of vision and title originates from the same video. Analogous to vision-query matching, the vision-text matching loss is the cross-entropy between the matching prediction p_vtm and the ground-truth label y_vtm, averaged over the O vision-text pairs used for this task.
Masked Language Modeling aims to predict masked video title tokens using both visual and textual signals. Video title tokens are randomly masked with a probability of 15% and replaced with the special token [MASK]. Let T denote the masked tokens and p_mlm(I, T) denote the predicted probability of a masked token given the visual input I and the masked text T. The masked language modeling loss is the cross-entropy between p_mlm(I, T) and the ground-truth label y_mlm, summed over the R masked tokens. The total pre-training loss combines the five objectives above.
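The total objective is not reproduced in this extraction; assuming the common practice of summing the individual objectives (any weighting coefficients are omitted here), it would take a form such as:

```latex
\mathcal{L} = \mathcal{L}_{vqc} + \mathcal{L}_{vtc} + \mathcal{L}_{vqm} + \mathcal{L}_{vtm} + \mathcal{L}_{mlm}
```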

Ranking Relevance Model
As illustrated in Fig. 1(b), the proposed multi-modal ranking relevance model comprises four major components: our pre-trained QUALITY model, which produces a query embedding E^(q), a video textual embedding E^(t), and a video visual embedding E^(v) from the multi-modal input; a discretization and embedding learning module (Guo et al., 2021) that extracts a representation E^(n) from handcrafted numerical features (e.g., BM25, click similarity, term weight); a pre-trained transformer-based cross-encoder that takes the query and video text as input, where the video text includes the title, actors, uploader name, and tags; and a multilayer perceptron (MLP) module that produces a relevance score between the query and the video. Given E^(q), E^(t), and E^(v) generated by QUALITY, we further compute the cosine similarity of the query embedding with the text or visual embedding to obtain a query-text similarity and a query-visual similarity, respectively. The cross-encoder is pre-trained following the multi-stage training paradigm of (Zou et al., 2021), and the representation of the [CLS] token, together with the mean and max pooling of the cross-encoder's final layer, is concatenated to obtain a representation of semantic relevance, E^(q,t). Finally, the concatenation of E^(q,t), E^(q), E^(t), E^(v), E^(n), and the derived query-text and query-visual similarities is fed into the MLP module to produce the relevance score between a query and a video.
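A condensed PyTorch sketch of this fusion step follows. The 64-dimensional embeddings and the 3 × 64 cross-encoder representation ([CLS] plus mean and max pooling) follow the description above, while the MLP widths, the numerical-feature dimension, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelevanceHead(nn.Module):
    """Fuses query/text/visual/numerical representations into a relevance score."""
    def __init__(self, emb_dim: int = 64, cross_dim: int = 192, num_dim: int = 32):
        super().__init__()
        in_dim = cross_dim + 3 * emb_dim + num_dim + 2  # +2 for the two cosine similarities
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, e_qt, e_q, e_t, e_v, e_n):
        sim_qt = F.cosine_similarity(e_q, e_t, dim=-1).unsqueeze(-1)  # query-text similarity
        sim_qv = F.cosine_similarity(e_q, e_v, dim=-1).unsqueeze(-1)  # query-visual similarity
        x = torch.cat([e_qt, e_q, e_t, e_v, e_n, sim_qt, sim_qv], dim=-1)
        return self.mlp(x).squeeze(-1)  # unnormalized relevance score f_theta(q, v)

# Toy usage with a batch of 8 query-video pairs.
head = RelevanceHead()
score = head(torch.randn(8, 192), torch.randn(8, 64), torch.randn(8, 64),
             torch.randn(8, 64), torch.randn(8, 32))
```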

Ranking Loss Function
The relevance ranker can be viewed as a scoring function f_θ(q, v) for a query q and a candidate video v, where θ denotes the trainable model parameters. To ensure both the calibration and ranking abilities of the predicted scores, we model the ranking problem as a K-grade ordinal regression problem with labeled grades y ∈ {1, 2, ..., K} and a set of thresholds ρ_1, ..., ρ_{K-1} satisfying ρ_1 < ρ_2 < ... < ρ_{K-1}. Specifically, the final output of the model, f_θ(q, v), is treated as an observed ordinal variable whose cumulative probability is given by the sigmoid function, denoted σ (Bürkner and Vuorre, 2019). The set of thresholds, which is optimized during training, divides f_θ(q, v) into K disjoint segments, and the probability P_r of relevance grade k is the difference between the cumulative probabilities at the adjacent thresholds. The corresponding ordinal regression loss (Equation 8) is the negative log-likelihood of the observed grade. In addition, a binary cross-entropy loss with binary label y_b ∈ {0, 1}, denoted L_binary (Equation 9), is employed to sharpen the distinction between relevant and irrelevant candidates: a grade k ≤ K/2 is considered irrelevant, while a grade k > K/2 is considered relevant. The final ranking loss (Equation 10) combines the two terms, where α is a hyper-parameter that balances their importance. Finally, to anchor the predicted probability to a meaningful range, the ranking score is computed from f_θ(q, v) and the learned thresholds.
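Since Equations 8-10 are not reproduced in this extraction, the following is a standard cumulative-link (ordinal regression) formulation consistent with the description above; the exact notation, the anchoring of the final score, and any per-term weighting in the paper may differ.

```latex
% Grade probability from adjacent cumulative (sigmoid) probabilities,
% with boundary thresholds \rho_0 = -\infty and \rho_K = +\infty:
P_r(y = k \mid q, v) = \sigma\big(\rho_k - f_\theta(q, v)\big) - \sigma\big(\rho_{k-1} - f_\theta(q, v)\big)

% Ordinal negative log-likelihood and binary cross-entropy terms, where
% \hat{p}_i is the predicted probability of relevance (e.g., P_r(y_i > K/2)):
\mathcal{L}_{ordinal} = -\frac{1}{N}\sum_{i=1}^{N} \log P_r(y_i \mid q_i, v_i), \qquad
\mathcal{L}_{binary} = -\frac{1}{N}\sum_{i=1}^{N} \big[ y^b_i \log \hat{p}_i + (1 - y^b_i) \log (1 - \hat{p}_i) \big]

% Combined ranking loss and one plausible anchored ranking score:
\mathcal{L}_{rank} = \mathcal{L}_{ordinal} + \alpha \, \mathcal{L}_{binary}, \qquad
score(q, v) = \sigma\big(f_\theta(q, v) - \rho_{K/2}\big)
```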

Experiments

Evaluation Metrics
We use AUC (Area Under the Curve) and PNR (Positive-Negative Ratio) as offline evaluation metrics. For the AUC metric, labels 1 and 2 are considered negative, while labels 3 and 4 are considered positive. The PNR metric considers the partial order between labels and measures the consistency between prediction results and the ground truth. For online evaluation, we employ Average Watch Time (AWT) to quantify user preference on video search results. The Good vs. Same vs. Bad (GSB) metric compares two systems side by side, and we use △GSB (Zou et al., 2021) to assess the satisfaction gain achieved by a new system.
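For reference, a small sketch of how PNR is commonly computed is given below; this definition, i.e., the ratio of concordant to discordant pairs under the label order, reflects common usage and is our assumption rather than a formula stated in the paper.

```python
from itertools import combinations

def pnr(labels, scores):
    """Positive-Negative Ratio: concordant vs. discordant pairs under the label order."""
    concordant = discordant = 0
    for (l1, s1), (l2, s2) in combinations(zip(labels, scores), 2):
        if l1 == l2:
            continue  # pairs with equal labels carry no ordering information
        if (l1 - l2) * (s1 - s2) > 0:
            concordant += 1
        elif (l1 - l2) * (s1 - s2) < 0:
            discordant += 1
    return concordant / discordant if discordant else float("inf")

# Example: 4-grade labels and model scores for one query's candidates.
print(pnr([4, 3, 1, 2], [0.9, 0.7, 0.2, 0.4]))  # perfectly ordered -> inf
```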

Offline Performance
We first evaluate the effectiveness of the QUALITY model. Since the original ALBEF is designed for images rather than videos, we extend it to a video version for a fair comparison, referred to as Video-ALBEF. The baseline Video-ALBEF model uses a transformer-style pooling strategy inspired by CLIP4Clip to aggregate keyframe embeddings into a video-level visual embedding; the remaining components are identical to the original ALBEF setup. As shown in Table 1, our method achieves an AUC of 0.683 and a PNR of 2.180, beating the baseline Video-ALBEF model with an absolute 10.5% AUC improvement and a relative 42.2% PNR improvement. Meanwhile, as shown in Table 2, compared to the baseline text-based relevance model, performance can be enhanced by introducing multi-modal embeddings via either QUALITY or Video-ALBEF, demonstrating that multi-modal information can alleviate the mismatch issues of purely text-based relevance modeling. Furthermore, our method outperforms Video-ALBEF by 0.3% in AUC and 3.2% in PNR, suggesting that explicitly modeling the matching relationship between the query tower and the vision tower helps the downstream relevance model.

Table 3 reports the performance of our ordinal regression based ranking loss. The proposed ranking loss outperforms the pointwise loss and the Combined-Pair ranking loss by relative PNR improvements of 28.9% and 4.1%, respectively. We also notice that the pointwise model achieves the highest AUC of 0.925 but the lowest PNR of 6.911, indicating that the pointwise loss focuses only on calibration ability and neglects ranking ability.

We also examine the preservation of pre-trained knowledge. To this end, we utilize an adapter-tuning strategy. As shown in Table 1, our QUALITY model outperforms the fully fine-tuned model, achieving an absolute 3.0% improvement on AUC and a relative 15.9% improvement on PNR. We suggest that freezing the primary parameters of the uni-modal encoders mitigates the issue of catastrophic interference. Moreover, adapter training is 3.4 times faster than full fine-tuning. Consequently, as shown in Table 2, our model attains improvements of 0.2% on AUC and 1.5% on PNR. Overall, AdaptFormer proves advantageous for both training effectiveness and efficiency.

Case Study
Apart from the quantitative analysis above, we conduct a qualitative analysis of cases from a real-world video search scenario, as shown in Fig. 3.
For instance, given the query "Cinderella2 Dreams Come True", we observe a video whose title includes the query keywords, but whose content is a Thai romantic comedy rather than Disney's Cinderella. This video was initially rated "Good" with a score of 0.52. After incorporating QUALITY, the prediction score drops to 0.25. From such cases, we empirically conclude that incorporating multi-modal features extracted by QUALITY can significantly enhance the discriminative power of the relevance ranking model.

Deployment & Online A/B Testing
To evaluate the effectiveness of our proposed method in a real-world video search engine, we deploy the model to our online system and compare it with the online baselines, which are mainly text-based models such as BERT and BM25. After a week-long observation, A/B test results show that the query-aware multi-modal ranking relevance model outperforms the online baselines, achieving a 2.1% improvement in AWT. Furthermore, we conduct a manual GSB evaluation of the final search results, where our proposed model contributes a 5.7% improvement in △GSB.

Conclusion & Limitations
In this study, we introduce QUALITY, a query-aware pre-training model that leverages multi-modal information, including queries, video frames, tags, and titles. QUALITY is integrated into our ordinal regression based ranking relevance model. Through extensive experiments conducted on real-world data, we demonstrate the effectiveness of our proposed method.
Our method relies on a graph mining strategy that uses search log data to identify previously unobserved query-video pairs, thereby alleviating the Matthew Effect in search engines. Nonetheless, the accuracy of our approach may be affected by noise introduced during graph construction. We therefore recommend investigating alternative hard mining strategies or visual debiasing strategies (Chen et al., 2022b) to further enhance performance.

A Implementation Details
Our QUALITY model comprises a BERT-base with 124M parameters, a ViT-B/16 with 86M parameters, a vision-tag multi-modal encoder with 2M parameters, and a vision-text multi-modal encoder with 2M parameters. BERT-base and ViT-B/16 are pre-trained as a CLIP model on 20M video cover-title pairs from our search log. We uniformly sample 5 keyframes for each video and resize them to a resolution of 224 × 224. For convenient online usage, the embedding size of the image, query, tag, and title modalities is reduced from 768 to 64 using projection layers. We train the models for 1 million steps on 4 NVIDIA A100 GPUs, with an initial learning rate of 1e-4 for the first 10,000 steps, which is then gradually decayed to 5e-5.
We use a hierarchical learning rate for the relevance ranking model, setting 1e-5 for the pre-trained cross-encoder layers and 5e-4 for the other layers. Notably, the pre-trained cross-encoder is a single-layer transformer network distilled from BERT-base, with an embedding size of 64 and a hidden size of 64. The thresholds of the ordinal regression are initialized to -5, 0, and 5, and the hyper-parameter α in the final ranking loss is set to 0.5.
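One possible way to keep the three learnable ordinal thresholds strictly increasing during training is to parameterize them as a free first threshold plus softplus-positive gaps; the initialization values follow the paper, while the parameterization itself is an illustrative choice, not the paper's stated implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OrdinalThresholds(nn.Module):
    """K-1 learnable thresholds kept in increasing order: a free first threshold
    plus strictly positive gaps parameterized through softplus."""
    def __init__(self, init=(-5.0, 0.0, 5.0)):
        super().__init__()
        init = torch.tensor(init)
        self.first = nn.Parameter(init[:1].clone())
        gaps = init[1:] - init[:-1]                                  # [5., 5.] for the default init
        self.raw_gaps = nn.Parameter(torch.log(torch.expm1(gaps)))  # inverse softplus

    def forward(self) -> torch.Tensor:
        gaps = F.softplus(self.raw_gaps)
        return torch.cat([self.first, self.first + torch.cumsum(gaps, dim=0)])

thresholds = OrdinalThresholds()
print(thresholds())  # approximately tensor([-5., 0., 5.]) before any training
```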

Figure 1: Model architecture. (a) QUALITY model. (b) Multi-modal-based ranking relevance model.

Figure 2: Randomly sampled video example with keyframes, title, and tags.

Figure 3: Cases from video search. "score (w/o QUALITY)" / "score (w/ QUALITY)" denotes the prediction score of the relevance ranking model without / with QUALITY, respectively.

Table 1: Offline comparison results of multi-modal pre-training models and an ablation study of QUALITY. QUALITY outperforms Video-ALBEF, and each technical component brings its own gain independently.

Table 2: Offline comparison results of ranking relevance models and an ablation study of the technical components of QUALITY.

Table 3: Offline comparison of ranking relevance model performance for different ranking loss functions.