Query Generation for Multimodal Documents

This paper studies the problem of generating likely queries for multimodal documents with images. Our application scenario is enabling efficient "first-stage retrieval" of relevant documents, by attaching generated queries to documents before indexing. We can then index this expanded text to efficiently narrow down to candidate matches using an inverted index, so that expensive reranking can follow. Our evaluation results show that our proposed multimodal representation meaningfully improves relevance ranking. More importantly, our framework can achieve the state of the art in first-stage retrieval scenarios.


Introduction
As more documents on the web are generated and consumed by mobile devices with cameras, documents are often multimodal, containing information in both text and image modalities. This poses a new challenge of finding relevant documents across modalities. More formally, the relevance of a document, consisting of text t and image i, to the given query keywords q should be modeled as a trimodal function f(q, t, i), rather than a simple lexical match between q and t (the BM25 baseline) that assumes the semantics of image i is fully represented by the surrounding text (the paired-text assumption).
Prior research observes that the paired-text assumption is often violated (Henning and Ewerth, 2017): for example, some semantics can be better captured in the image modality, and may not (or cannot) be described in text. Meanwhile, the BM25 baseline (Robertson et al., 1994) would fail to serve queries for such semantics. To overcome this limitation of BM25, one may instead model relevance as a trimodal function f(q, t, i) (Nian et al., 2017; Kordan and Kotov, 2018), but such models require runtime invocation of f for the given query q with all potential document matches. This incurs a prohibitive runtime overhead, unacceptable for search engines serving results online. A common practice is to use a cheap BM25 ranking as "first-stage retrieval", efficiently supported by an inverted index, to quickly narrow down to a few candidate documents, and only then evaluate f(q, t, i). However, due to the simple nature of BM25 using exact term matching, a document will be missed if the query term is absent in t, even when it semantically matches the image or another term in the text.
Our contribution is to keep first-stage retrieval as efficient as BM25, but enable multimodal semantic matching, using Query Generation (QG) before indexing. More specifically, we generate a likely query q from a joint modeling of t and i, to create an expanded text t′ = q ∪ t such that the (q, t′) pair has more lexical overlap, i.e., is better paired than (q, t), for first-stage retrieval. Specifically, we train a sequence-to-sequence model such that, given the representation of a multimodal document, the model generates possible queries that users may ask to retrieve that document. This is analogous to the doc2query approach (Nogueira et al., 2019) used for first-stage retrieval in textual relevance ranking, though that model, dealing with the text modality only, cannot apply to our problem of retrieving multimodal documents.
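A minimal sketch of this expand-then-index idea, with hard-coded stand-ins for the generated queries and a simple set-based term matcher standing in for BM25 scoring:

```python
# Sketch of first-stage retrieval with query-generation expansion.
# The "generated" queries are hard-coded stand-ins for a trained QG
# model's output; real systems would score candidates with BM25.

def build_index(docs):
    """Map each term to the set of doc ids containing it (inverted index)."""
    index = {}
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)
    return index

def retrieve(index, query):
    """Return doc ids matching any query term (candidate set for reranking)."""
    ids = set()
    for term in query.lower().split():
        ids |= index.get(term, set())
    return ids

docs = {"d1": "boil pasta and add cheese sauce"}
generated = {"d1": "fried macaroni recipe"}  # queries a QG model might emit

plain = build_index(docs)
expanded = build_index({d: docs[d] + " " + generated[d] for d in docs})

print(retrieve(plain, "macaroni"))     # set() -- missed without expansion
print(retrieve(expanded, "macaroni"))  # {'d1'} -- found after expansion
```

A document whose text never mentions "macaroni" becomes retrievable for that term once a generated query containing it is indexed alongside the text.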
For such a multimodal representation for QG, a naive baseline is the bimodal representation shown in Figure 1a: we may assume (t, i), even when lexical overlaps are low, is semantically paired in the embedding space. Note this is a relaxed version of the paired-text assumption. Given this relaxed assumption, a common architecture of bimodal representation for QG consists of encoders for image i and text t. Each generates a vector representation, which is later fused into a multimodal space, then decoded into a textual caption. Specifically, we consider two strong baselines: (a) cross-modal representation, pretrained from a large-scale paired corpus of images and captions, such as LXMERT (Tan and Bansal, 2019), ViLBERT (Lu et al., 2019), and VisualBERT (Li et al., 2019), finetuned for our task, and (b) a memory network structure (Park et al., 2017). Based on these baselines, we design Bimodal QG, combining the advantages of the two, as a strong baseline. Then, we extend it into Trimodal QG leveraging text, image, and query (q, t, i).

Figure 1: (a) bimodal and (b) trimodal representation for QG; the query is used for disentangling shared and private space and relaxing the paired-text assumption of the bimodal representation.
Alternatively, we may further relax the paired-text assumption and propose Trimodal QG in Figure 1b: (t, i) can be partially paired, where some semantics is conveyed in only one modality. To deal with that challenge, the query, given at training time, helps "disentangle" shared and private semantics via additional loss terms. Another role of the query is improving the image representation, de-emphasizing semantics not discussed in either text or query.
In summary, our contributions are as follows:

• We study QG for multimodal documents, as an enabler for efficient first-stage ranking.
• We build a multi-task model, for query generation and representation learning, to generate effective queries for offline indexing.
• We improve the QG model by considering query as third modality in order to work well without paired-text assumption.
• We validate that our model outperforms all baselines on both a public dataset and real-life web search query logs with quality annotations for multimodal documents.

Related Work
Our work is closely related to the following three areas of research.

Web search with images
The most efficient way to treat multimodal document ranking has been making the paired-text assumption (Coelho et al., 2004; Azilawati and Meriam, 2008), such that simply matching q with t is sufficient. Our work is equally efficient, incurring no additional runtime overhead, but does not make this assumption. Alternatively, Rodríguez-Vaamonde et al. (2015) add a reranking phase, checking whether the images are relevant to the query, supervised by whether the given image is correlated with clicks.
Our distinction: We do not build on paired-text assumption, and can be viewed as generating a better-paired document by adding likely queries.

Image captioning
Another closely related task is generating textual captions for a given image. As overviewed in Section 1, state-of-the-art models include bimodal joint representations of image i and text t (Kiros et al., 2014b; You et al., 2016; Park et al., 2017). Alternatively, such joint models can be transferred from pretrained models, such as LXMERT (Tan and Bansal, 2019), ViLBERT (Lu et al., 2019), and VisualBERT (Li et al., 2019). Section 3 will compare and contrast these two approaches, and discuss why these models are limited for our task setting. Jeon et al. (2003) propose a non-neural model trained to annotate images with textual descriptions, though it requires expensive supervision pairing segmented images with terms.
Our distinction: We propose and validate a trimodal joint representation for higher-quality captioning. Meanwhile, we do not require segment-level annotation, though our query-guided trimodal image representation naturally emphasizes important segments.

QG for first-stage retrieval
QG for first-stage text retrieval was pioneered by doc2query (Nogueira et al., 2019), generating likely queries for a document for indexing purposes in the text modality. Our work can be viewed as query generation for multimodal documents: for each document, the task is to predict a set of likely queries. We train a sequence-to-sequence transformer using a dataset of (query, relevant document) pairs. Alternatively, an inverted index can be built over latent terms (Zamani et al., 2018), though these cannot be human-interpreted or reweighted. In contrast, we focus on an inverted index over actual terms, as it is human-interpretable and combines more naturally with a legacy tf-idf ranker and reweighting module.
Our distinction: We validate the effectiveness of trimodal QG over bimodal state-of-the-art methods.

Bimodal Baselines: LXMERT and Memory-Based Generator
This section compares and contrasts two bimodal baselines: LXMERT and the Memory-Based Generator. Both encode text t and image i into vector representations (Section 3.1), then aggregate the two into a multimodal representation (Section 3.2), such that this vector can feed a decoder to generate a text sequence (Section 3.3). Specifically, we build the Bimodal QG baseline, combining LXMERT representation and Memory-Based Generator decoding. Figure 2 shows the overall architecture of our Bimodal QG baseline.

Text and image encoder
LXMERT and the Memory-Based Generator generate text and image vectors, using a transformer and a memory network structure, respectively. Both can be explained as key memory aggregating value memory representations with proper self-attention, denoted as key and val respectively, following the conventions of prior literature (Sukhbaatar et al., 2015).
Formally, we encode textual context words C = {w_1, w_2, ..., w_j}, obtained by concatenating the n-dimensional word embeddings (w ∈ R^n) of the top-j words with the highest TF-IDF weights:

T^key = W_1 C + b_1,  T^val = W_2 C + b_2

where W_1, W_2 ∈ R^{m×n} and b_1, b_2 ∈ R^m are trainable linear transformation parameters, and m is the dimension of the memory. When a parameter, such as T or I, is denoted without a superscript (key or val), it refers to both the key and val vectors.
Similarly, an image input U ∈ R^2048 is generated from the pool5 feature vector of a Resnet-101 CNN encoder, which is similarly embedded into:

I^key = W_3 U + b_3,  I^val = W_4 U + b_4

where W_3, W_4 ∈ R^{m×2048} and b_3, b_4 ∈ R^m are trainable matrices tuned on the given dataset. The final image embedding is generated with the key and value vectors, following the convention of attention networks (Kiros et al., 2014a). Note that only the representations of text and image are used at this point; using the query representation will be discussed in Section 4. Also, LXMERT extracts image features using Faster-RCNN (Ren et al., 2015) and a transformer structure, pre-trained on a large dataset combining MSCOCO (Lin et al., 2014) and Visual Genome (Krishna et al., 2016). Therefore, both methods are applicable to general images.
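The key/value embeddings can be sketched as plain linear maps; treating each of the top-j words as one memory slot is our assumption, and all values below are random stand-ins:

```python
import numpy as np

# Sketch of the key/value embeddings for the memory encoder.
# One memory slot per top-j TF-IDF word is an assumption, not the
# paper's exact implementation; weights are random stand-ins.
rng = np.random.default_rng(0)
n, m, j = 100, 256, 7        # word-embedding dim, memory dim, number of words
C = rng.normal(size=(j, n))  # embeddings of the top-j TF-IDF words
U = rng.normal(size=2048)    # ResNet-101 pool5 image feature

W1, W2 = rng.normal(size=(m, n)), rng.normal(size=(m, n))
b1, b2 = np.zeros(m), np.zeros(m)
W3, W4 = rng.normal(size=(m, 2048)), rng.normal(size=(m, 2048))
b3, b4 = np.zeros(m), np.zeros(m)

T_key, T_val = C @ W1.T + b1, C @ W2.T + b2   # one m-dim slot per word
I_key, I_val = W3 @ U + b3, W4 @ U + b4       # single m-dim image vector
print(T_key.shape, I_key.shape)  # (7, 256) (256,)
```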

Multimodal fusion
The goal of the multimodal fusion module in Figure 2 is to take text and image representations as input and create a joint representation as output. When the paired-text assumption holds (Figure 1a), this can be achieved by simply concatenating or adding the two input modalities, and the later layers will be tuned for a proper alignment of the two. However, such concatenation is less effective when the paired-text assumption does not hold, as in Figure 1b.
One solution is transfer learning from a joint representation model pretrained on large-scale paired resources, such as LXMERT (Tan and Bansal, 2019). LXMERT is a transformer-based architecture for cross-modal representation learning, predicting masked words in the text modality from the image modality, and vice versa. This auxiliary task, known as the masked cross-modality language model, helps build connections across modalities. In our problem setting, this option can be considered for the public English dataset, as LXMERT pre-trained on large-scale paired training resources is readily available to generate a joint representation M_lxm, replacing the simple concatenated fusion embedding M_fusion = [I; T].
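The concatenation baseline M_fusion = [I; T] amounts to one line; the dimensions follow the text, and the vectors are random stand-ins (a pretrained joint encoder would replace this with M_lxm):

```python
import numpy as np

# Minimal sketch of concatenation fusion, M_fusion = [I; T].
# Values are random stand-ins; a pretrained cross-modal encoder
# (e.g. LXMERT's M_lxm) would replace this simple baseline.
rng = np.random.default_rng(0)
m = 256
I = rng.normal(size=m)              # image embedding
T = rng.normal(size=m)              # text embedding
M_fusion = np.concatenate([I, T])   # joint representation, shape (2m,)
print(M_fusion.shape)  # (512,)
```

Later layers must then learn the alignment of the two halves, which is exactly what fails when the modalities are only partially paired.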

Bimodal QG: Joint text decoding for query generation
We now examine the two baselines. LXMERT focuses on the problem of joint representation, but does not provide a decoder for generating a query sequence from such a representation (Figure 2a). Meanwhile, the Memory-Based Generator has the advantage of tightly coupling the key-value encoder and CNN decoder, by concatenating the representations of all modalities, co-attended based on the query keywords generated thus far, as illustrated in Figure 2b. This multimodal vector is calculated at each time step and used to decode the next word, which is an effective decoder design adopted for our model: with this joint representation, query generation predicts the output probability of the next word over the vocabulary, by a convolutional neural network, denoted as CNN in Figure 2. To combine this with the strength of LXMERT, we can simply replace the cross-modal embedding vector in Figure 2 by M_lxm for Bimodal QG. Alternatively, for ablation purposes, M_lxm can be directly decoded without the memory-based decoder, which we denote as LXMERT QG. Our final loss of bimodal captioning is a seq2seq loss:
L_s2s = − Σ_{t=1}^{l} log P(y_t | y_1, ..., y_{t−1})

where t is the time step and l is the length of the caption.
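Numerically, the seq2seq loss sums per-step negative log-probabilities of the gold tokens; the vocabulary and per-step distributions below are toy values, not model output:

```python
import math

# Teacher-forced seq2seq loss: sum of per-step negative log-probabilities
# of the gold query tokens. The per-step distributions are illustrative
# stand-ins for a decoder's softmax output.
gold = ["fried", "macaroni", "<eos>"]
step_probs = [  # p(y_t | y_<t, M), one toy distribution per time step
    {"fried": 0.7, "macaroni": 0.1, "cheese": 0.1, "<eos>": 0.1},
    {"fried": 0.1, "macaroni": 0.6, "cheese": 0.2, "<eos>": 0.1},
    {"fried": 0.05, "macaroni": 0.05, "cheese": 0.1, "<eos>": 0.8},
]
loss = -sum(math.log(p[y]) for p, y in zip(step_probs, gold))
print(round(loss, 3))  # 1.091
```

Putting more probability mass on the gold tokens at each step drives the loss toward zero.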

Trimodal QG: Query-aware representation
Our proposed Bimodal QG partially contributes to relaxing the paired-text assumption, but neither the image nor the text representation is aware of queries. We argue that queries carry rich semantics and contribute significantly to relaxing the paired-text assumption, specified as two key contributions C1 and C2 below.
• C1: We use the query to improve the image and text representations, disentangling them into shared/private semantics, where q matches the shared semantics.
• C2: As query generation is better trained when t and i are paired, we revise the image representation to enhance its pairedness with the given query.
Motivated by these, we propose two new loss functions, L_1 and L_2, addressing C1 and C2 respectively.

Query-aware relevance
To address C1, we model the "private" parts of the image and text, denoted P_i and P_t, to relax the paired-text assumption. Our goal is to build a joint representation S, aligning only the shared part of image and text with query q:
where W_5, W_6, W_7 ∈ R^{m×2m} and b_5, b_6, b_7 ∈ R^m are trainable parameters, and I^key and T^key are the image and text input vectors, respectively. These inputs are concatenated into [I^key; T^key] and multiplied by W_5, so that the combined modality is projected into the same semantic space as the private vectors P_i and P_t.
To push this joint representation close to the representation of the query, a query embedding vector Q_v is generated by an LSTM over the generated query y_0, ..., y_{t−1}, with an objective loss that keeps the private vectors away from the query and the shared representation close to it:

L_i = max(0, r − sim(S, Q_v) + sim(P_i, Q_v)),  L_t = max(0, r − sim(S, Q_v) + sim(P_t, Q_v))

where r is the margin parameter. This margin lets our model relax the decision function in LXMERT, which predicts whether t and i are paired as a binary classification. Unlike such binary prediction, which computes a zero-or-one score even for a partially paired pair (Figure 1b), the above two losses compute a scalar score and make a soft decision based on similarity. The query-aware relevance loss L_1 is defined by combining the two losses, one per modality.
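One plausible hinge-style form of this per-modality objective, keeping the shared vector close to the query and the private vector away from it by a margin, can be sketched as follows (the similarity function and margin value are our assumptions):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def query_relevance_loss(S, P, Q, r=0.2):
    """Hinge loss: shared S should be closer to query Q than private P is,
    by at least margin r. One such term per modality (image and text)."""
    return max(0.0, r - cos(S, Q) + cos(P, Q))

rng = np.random.default_rng(1)
Q = rng.normal(size=8)            # query embedding
S = Q + 0.1 * rng.normal(size=8)  # shared part: near the query
P = -Q                            # private part: far from the query
print(query_relevance_loss(S, P, Q))  # 0.0 once the margin is satisfied
```

Unlike a binary paired/unpaired classifier, the loss is a smooth function of the similarities, so partially paired documents receive intermediate penalties.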

Query-aware alignment
For C2, we revise the image representation to highlight query-related semantics, making it pair better semantically with the text representation. To reflect a (possibly nonlinear) relation between the query and the image, a fully connected layer is applied to each modality before gating. Formally, the query-aware image embedding V_p is:

A_q = σ(w_8 Q_v + b_8),  V_p = (w_9 I + b_9) ⊙ A_q

where Q_v is the query embedding, A_q is the attention derived from the query with sigmoid σ, ⊙ denotes element-wise multiplication, and w_8, w_9 ∈ R^{m×m} and b_8, b_9 ∈ R^m are trainable parameters learned with the m-dimensional query embedding. We apply the fully connected layer (the first term before the element-wise multiplication) to project the image near the query embedding.
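This gating step can be sketched as a sigmoid attention derived from the query, element-wise scaling a projected image vector; weight scales and values are arbitrary stand-ins:

```python
import numpy as np

# Sketch of the query-gated image embedding: a sigmoid gate A_q derived
# from the query embedding element-wise scales a projected image vector.
# Shapes follow the text; weight values are random stand-ins.
rng = np.random.default_rng(2)
m = 256
Q_v = rng.normal(size=m)    # query embedding from the LSTM
I_vec = rng.normal(size=m)  # image embedding

w8, b8 = 0.01 * rng.normal(size=(m, m)), np.zeros(m)
w9, b9 = 0.01 * rng.normal(size=(m, m)), np.zeros(m)

A_q = 1.0 / (1.0 + np.exp(-(w8 @ Q_v + b8)))  # sigmoid gate in (0, 1)
V_p = (w9 @ I_vec + b9) * A_q                 # query-aware image embedding
print(V_p.shape)  # (256,)
```

Dimensions of the image embedding where the gate is near zero are suppressed, de-emphasizing image semantics unrelated to the query.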
We apply the query-aware image representation to multimodal space learning:

L_2 = max(0, r − sim(V_p, T^+) + sim(V_p, T^−))

where r is the margin, T^+ is a positive text belonging to the same document as the given image, and T^− is a negative text from a different document. Finally, we combine the two loss functions into a final query-aware loss L_q:

L_q = L_1 + L_2

Our final loss of trimodal captioning is the sum of the seq2seq loss and the query-aware alignment loss:

L = L_s2s + L_q

Experiment
The goal of our evaluations is to validate the effectiveness of our approach on a public dataset and on real-world Web search queries and settings. In particular, we have two research questions:

• RQ1: Would the QG task benefit from the LXMERT model? How does our approach compare with BM25 or Bimodal for first-stage retrieval?
• RQ2: Would our approach improve real-life Web search queries?
We use the public dataset for RQ1, and real-life queries with quality annotations for an ad-hoc web search task from NAVER for RQ2.

Dataset
As there is no public dataset with both query workloads and multimodal documents, we repurpose a public dataset of instructional videos (Kim et al., 2020) by transforming videos into multimodal documents. It consists of 2,000 query-video pairs, where each video is a recipe instruction from the YouCook2 dataset. We first sample an image corresponding to each sentence in the transcript, by capturing a center frame. As this may create too many (image, sentence) pairs, we cluster them into more natural boundaries using temporal and semantic cues: a pair of successive frames, each with a textual transcript and a set of detected objects, is merged if more than clip% of their objects overlap, where clip is empirically tuned for each experiment. Setting clip% to 100% recovers our initial setting without merging, and this can be tuned to better fit the target scenario. Figure 3 shows an example of video clipping, where associated frames are merged into multimodal paragraphs with images and transcripts. The lengths of the extracted multimodal paragraphs range from 1 to 20 depending on the video, so they can correspond to short through long actual documents.
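The clip% merging rule can be sketched as a greedy pass over successive frames; using Jaccard overlap as the definition of object overlap is our assumption, and the object sets are illustrative stand-ins for a detector's output:

```python
# Sketch of the clip% merging rule: successive frames are merged into one
# multimodal paragraph when detected-object overlap exceeds the threshold.
# Jaccard overlap is an assumed definition; object sets are illustrative.

def merge_frames(frames, clip=0.5):
    """Greedily merge successive frames whose object overlap exceeds clip."""
    paragraphs = [{"text": frames[0]["text"],
                   "objects": set(frames[0]["objects"])}]
    for frame in frames[1:]:
        prev = paragraphs[-1]
        overlap = (len(prev["objects"] & frame["objects"])
                   / len(prev["objects"] | frame["objects"]))
        if overlap > clip:
            prev["text"] += " " + frame["text"]
            prev["objects"] |= frame["objects"]
        else:
            paragraphs.append({"text": frame["text"],
                               "objects": set(frame["objects"])})
    return paragraphs

frames = [
    {"text": "boil the pasta", "objects": {"pot", "pasta"}},
    {"text": "stir occasionally", "objects": {"pot", "pasta", "spoon"}},
    {"text": "grate the cheese", "objects": {"cheese", "grater"}},
]
paras = merge_frames(frames, clip=0.5)
print(len(paras))  # 2: the first two frames merge; the third starts anew
```

With clip set to 1.0, no pair can exceed the threshold and every frame stays its own paragraph, matching the unmerged initial setting.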

Experiment settings
To preprocess the text and image inputs, we use the NLTK tokenizer and a Resnet-101 CNN, respectively. When learning the query generator, the dimensionality of the image and word embedding vectors is set to 2048 (following the size of the pool5 vector of ResNet) and 100, respectively. The dimensionality m of the memory and the query embedding is empirically set to 256. Mini-batch stochastic gradient descent is used to learn our query generator. Specifically, we use the Adam optimizer (Kingma and Ba, 2014) with the default settings. The initial learning rate is set to 0.001 and divided by 1.2 every five epochs, for 30 epochs. The number of generated queries is up to 8.
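The learning-rate schedule described above can be written as a one-line step decay:

```python
# Step-decay schedule described above: start at 0.001 and divide by 1.2
# every five epochs, over 30 epochs of training.
def learning_rate(epoch, base=0.001, decay=1.2, step=5):
    """Learning rate for a given (0-indexed) epoch."""
    return base / (decay ** (epoch // step))

for epoch in [0, 5, 10, 25]:
    print(epoch, learning_rate(epoch))
```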
In this study, we follow a standard two-stage document retrieval scenario. First, the top 30 candidate documents are ranked and selected from the index using a first-stage ranker, namely BM25, LXMERT, Bimodal, or Trimodal in Table 1. Then the second-stage ranker follows, which is generally more sophisticated and expensive, such as a BERT-based ranker (Nogueira et al., 2019). However, we stress that our work is orthogonal to the second-stage ranker, and we focus on first-stage results.

Results
First-stage ranking results on the public English dataset are shown in Table 1. On this dataset, the evaluation metric is limited to R@K, due to the binary nature of the relevance annotation: the ratio of ground-truth videos that appear in our top-K results, where K = 30 returns all results. In the real-life evaluation in the next section, graded relevance annotations are collected to evaluate ranking accuracy. BM25 in the table uses the raw BM25 score on the text t itself. The other QG models (LXMERT QG, Bimodal QG, and Trimodal QG) are implemented as BM25 scoring on the expanded text t′ containing the queries generated by each respective QG model. In all evaluations, the relevance scores on t and t′ are aggregated with linear weighting, empirically tuned to λ = 0.9 for t′ (and 1 − λ for t). On all metrics, Bimodal and Trimodal outperform BM25 ranking, validating our hypothesis that considering the image in a joint document representation is effective.
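The linear score aggregation amounts to a weighted sum; the BM25 values below are illustrative, and assigning λ to the expanded text follows our reading of the tuning described above:

```python
# Linear aggregation of relevance scores on the original text t and the
# query-expanded text t'. Scores are illustrative BM25 values; weighting
# the expanded text by lam is an assumption about the tuned setting.
def combined_score(score_t, score_t_expanded, lam=0.9):
    """Weighted combination of scores on t and t'."""
    return lam * score_t_expanded + (1 - lam) * score_t

print(combined_score(score_t=2.0, score_t_expanded=5.0))  # 4.7
```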
Based on this result, our evaluations from this point on focus on evaluating Trimodal, in real-life settings allowing multilinguality, graded relevance annotation, and a realistic ranker.

Figure 3: An example of video clipping to show how we transform a video into a multimodal document. Each multimodal document is clustered into paragraphs, with images and transcripts, shown as red boxes.

Qualitative results
Example images, contexts, and captions are presented in Figure 4, showing how generating queries for multimodal documents improves search. A case of correctly generated queries is shown in the first and second images of Figure 4. The images and contexts are highly related to the name of the dish, but the name does not appear in the context, whose words are selected by applying TF-IDF to the transcript. In this case, our model can contribute considerably to search performance by directly generating query words such as "fried" and "macaroni". A failure case is shown in the third image of Figure 4. The recipes for hummus and mashed potato both have a mashing step and similar-looking ingredients. When the cooking method and the appearance, color, and texture of the ingredients are similar, the model may generate queries for the wrong dish. Overall, our model performs well at generating query words to support first-stage retrieval.
RQ2: Real-world ad-hoc web search scenarios

Dataset
The source dataset used in our experiments is the evaluation set of the web search ranking task from a real-life commercial search engine. This dataset contains about 28,000 queries, for each of which 60 document URLs from search engine results are collected. In a real-world commercial dataset, annotating the relevance of all query-document pairs is impractical. Instead, we pooled the top 60 documents, a practice widely used in IR evaluation to reduce annotation effort, where only top-ranked documents from a small set of retrieval runs are manually assessed for relevance, to investigate the impact of first-stage retrieval. These documents are labeled by expert query annotators with one of five graded relevance scores, ranging from 1 (poor) to 5 (excellent), or left unlabelled. Since unlabelled documents were randomly sampled from low ranks of the search results, we treat all unlabelled documents as irrelevant (score 1). Additionally, each query is classified into a domain by NAVER. We evaluate the real-world dataset under this experimental setting.

Experiment settings
Out of all domain areas, we observe five main categories where the image information is expected to complement missing information from the text, namely Fashion, Place, Entertainment, Commerce, and Food/Recipe. We select query-document pairs annotated as described above for these categories.
More specifically, Table 2 shows the selected categories and the number of queries in each category. As a realistic ranker, we replace BM25 and train LambdaMART, as implemented in LightGBM (Ke et al., 2017; Meng et al., 2016), a gradient boosting framework developed by Microsoft that uses tree-based learning algorithms. The ranking model is trained and tested separately for each category.

Each document in the real-life search engine is represented by hundreds of pre-computed features. Among them, we select nine widely used features related to the textual similarity between a query and a document, shown in Table 3:

• BM25F score of query-document
• BM25 score of query-document title
• BM25 score of query-highlighted text
• Exact matching of query-document
• Query proximity score on document
• Query proximity score on document title
• Covered query term ratio of document title
• Covered query term ratio of front section
• Covered query term ratio of highlighted text

These features are used for the text baseline in Table 4. Table 4 reports accuracy gains in the five selected categories. For four of the five categories, our proposed approach achieves up to a 10.8% gain on NDCG@1.
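For concreteness, the NDCG metric used in these comparisons can be computed with the standard exponential-gain formulation; the ranked relevance labels below are illustrative:

```python
import math

# NDCG@k over graded relevance labels (1-5), the metric reported for the
# real-world evaluation. Standard exponential-gain formulation; the ranked
# label list is an illustrative example, not data from the paper.
def dcg(rels, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k):
    """DCG normalized by the ideal (sorted-descending) ordering."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

ranked = [5, 3, 4, 1, 2]          # relevance labels in ranked order
print(round(ndcg(ranked, 1), 3))  # 1.0: the top document is the best one
print(round(ndcg(ranked, 3), 3))  # 0.976
```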

Results
The category with the highest gain is Food/Recipe, where images can be informative and complement textual instructions, as consistently observed empirically.
On the other hand, the Commerce domain, where we expected that showing the image of the actual goods would complement the text, was the worst-performing category. Our analysis shows that expert annotation was biased toward highly rated official sites, while the same item can be sold on millions of sites with lower authority. Meanwhile, our models, focusing on document relevance only following the convention of ad-hoc retrieval scenarios, could not distinguish this difference. Table 5 shows the search performance of each category when a document is ranked using only the trimodal-aware image feature. The best-performing category is Food/Recipe, which also had the highest performance gain in Table 4, and the other categories show similar performance. The ranking model using only the image feature achieves about 62% of the performance of the model using all features, with respect to NDCG@5. Table 6 reports the accuracy gains of all categories over strong baselines. Only our trimodal query generation shows positive results in all domains. This demonstrates that our proposed query-aware trimodal loss contributes to capturing the query-relevant semantics of images.

Conclusion
We study the problem of representing a multimodal document so that it is indexable for efficient first-stage retrieval. Our contribution is posing the problem as trimodal QG to augment the given text, by proposing a trimodal joint representation of image, text, and query without the paired-text assumption. We validate our approach on both a public dataset and real-life web search data collected from commercial search engines.