Don’t Miss the Potential Customers! Retrieving Similar Ads to Improve User Targeting

User targeting is an essential task in the modern advertising industry: given a package of ads for a particular category of products (e.g., green tea), identify the online users to whom the ad package should be targeted. An (ad-package-specific) user targeting model is typically trained using historical clickthrough data: positive instances correspond to users who have clicked on an ad in the package before, whereas negative instances correspond to users who have not clicked on any ads in the package that were displayed to them. Collecting a sufficient amount of positive data for training an accurate user targeting model, however, is by no means trivial. This paper proposes a novel method for automatic augmentation of the set of positive training instances. Experimental results on two datasets, including a real-world company dataset, demonstrate the effectiveness of our proposed method.


Introduction
User targeting is an essential task in the ecommerce advertising industry. Informally, the goal of user targeting is to identify online users to whom a particular ad or ad package (i.e., a set of ads on a particular kind of products, such as green tea) should be targeted. Figure 1 shows a pipeline through which the user targeting task is typically tackled. Given an ad package that a company seeks to advertise, the company starts by randomly sampling a group of users from its customer database and displaying select ads in the package on the webpage(s) they visit. These users are then divided into two groups: clicking users and non-clicking users. Clicking users are those who clicked on the ads and therefore expressed interest in them, while their non-clicking counterparts are those who did not click on the ads and are presumably not interested in the ads. These two groups of users then serve as positive and negative examples for training a user targeting model, which can then be used to predict whether a new user should be targeted for the given ad package.
While this approach of using historical clickthrough data to automatically collect data for training a user targeting model is appealing at first glance, it has a key weakness: it may take time to collect enough data to train a reliable user targeting model, especially for long-tail ads (i.e., ads with few or no clicks). Worse still, even after waiting long enough, we still cannot guarantee that there will be enough clicks to generate positive training examples. Collecting sufficient positive training examples is critical to the success of this approach.
To address this challenge, we put forward the following hypothesis: users who clicked on an ad for a particular product category (e.g., green tea) in the past are more likely to click on an ad for the same product category in the future. Given this hypothesis, we can potentially expedite the collection of positive examples for training a user targeting model as follows. Given a package of ads for a particular product category, we first identify ads for the same product category and then use their clicking users to augment the training data for training the user targeting model.
The question, then, is: how can we automatically identify ads for the same product category as the one under consideration? One approach would be to train a classifier to classify an ad according to its product category. While this approach is straightforward, the resulting classifier will fail to classify an ad for a (new) product category that is not seen in the training data. In light of this weakness, we instead propose to learn how likely two ads are for the same product category. Not only will this address the aforementioned question of identifying ads for the same product category as the one under consideration, but the resulting model will be applicable to new product categories.
The next question is: how can we train a model to determine how likely two ads are for the same product category? Since ads are displayed in the form of creatives that are typically composed of both texts and images, a reasonable solution to this problem should involve matching the texts and the images in the two ads. While algorithms for text matching (Gong et al., 2018; Wang et al., 2017b), image matching (Schroff et al., 2015; Novotný et al., 2017), and text-image matching (Zheng et al., 2020; Wang et al., 2019) exist, none of them was developed specifically for ads. We therefore propose ION, a bimodal method that determines how likely two ads belong to the same product category, the key highlights of which include the design of (1) a semantics-enhanced image region extraction mechanism for identifying the region(s) of the image in an ad that are most relevant to the text, and (2) a dual-path fusion attention method for fusing the information extracted from the two modalities.
In sum, our contributions in this paper are threefold. First, we hypothesize that users who clicked on an ad belonging to a particular product category in the past are more likely to click on an ad belonging to the same category in the future, and exploit this hypothesis to augment the positive instances used to train a user targeting model. Second, we propose ION, a method for determining how likely two ads belong to the same product category, as a means to identify positive instances for user targeting. Finally, we evaluate ION in terms of (1) its effectiveness in retrieving ads with the same product category and (2) its ability to improve a user targeting model via augmenting the training set using the positive instances it identifies. Experiments on two datasets demonstrate its superiority to six baseline systems, providing suggestive evidence of its usefulness for the user targeting task.
The rest of this paper is structured as follows. Section 2 describes related work. In Section 3, we present ION, our model for determining how likely two ads belong to the same product category. Section 4 compares ION with state-of-the-art baselines on two datasets. Finally, we conclude in Section 5.

Related Work
Prior work on user targeting exists. Unlike ours, it primarily focuses on designing sophisticated models that are trained on a large amount of data (Covington et al., 2016; Wang et al., 2017a). In contrast, we aim to solve the insufficient training data problem, which to our knowledge is an unexplored area of research.
A crucial aspect of our work concerns the development of a method for determining how likely two ads belong to the same product category. Below we will discuss related work on text matching, image matching, and text-image matching, even though none of the existing matching algorithms are specifically developed for ad matching.
Many text matching methods use an encoder such as RNNs (Bowman et al., 2015), CNNs (Tan et al., 2016), recursive networks (Tai et al., 2015) and Transformer-based networks (Vaswani et al., 2017; Devlin et al., 2019) to embed input texts into vectors, possibly enhanced by attention (Parikh et al., 2016; Chen et al., 2017), and then build a binary classifier to determine whether the inputs are similar. An exception is Yang et al. (2019), whose matching method is based on rich alignment features. In general, however, the text in ads is often so ambiguous that it is difficult to determine which products are promoted.
As for image matching, existing geometric feature detectors and descriptors can compute the similarity between images (e.g., Lowe (2004)), and a matching mechanism based on CNNs has been proposed to retrieve face images (Schroff et al., 2015). However, a large portion of an ad image usually contains background objects, which make the extracted image features too noisy to accurately determine the underlying products being promoted.
To perform text-image matching, some methods embed different modalities (e.g., texts and images) of the input into the same space and compute similarity from feature vectors (Wang et al., 2016; Zheng et al., 2020; Collell et al., 2017), but they may be too coarse-grained to exploit local features, i.e., words and image regions. Recent work (Karpathy and Li, 2015; Huang et al., 2018; Hu et al., 2019) splits texts and images into fine-grained words and visual regions, and computes similarity by aligning the features of word semantics and those extracted from image regions, possibly with the help of attention (Lee et al., 2018) and external knowledge (Shi et al., 2019; Wang et al., 2019). Different from work on cross-modal matching, which measures the similarity between different modalities, our work focuses on fusing features from the texts and the image in an ad to create a multimodal representation. Note that there is also related work that aims to generate multimodal vectors containing both text and image features for pretraining or classification (Abavisani et al., 2020), in which vectors from different modalities are concatenated to form the multimodal representation. Rather than performing a simple concatenation, our work proposes an attention mechanism to fuse modalities in order to better identify the correspondence between words and image regions. In addition, while existing methods do not determine which words and image regions in an ad are relevant to the product under consideration and which ones are irrelevant/noisy, our method encodes words and extracts image regions selectively so that those that are related to the product are given larger weights.
Figure 2: The framework of the proposed model.

Method
In this section, we describe our two-step method for determining how likely two ads belong to the same product category. During training, we train a model for learning an ad representation such that two ads that belong to the same product category have similar representations. After training, we can apply the resulting model as follows. Given an ad in the test set, we retrieve the k ads that are most similar to it, where similarity is computed using a similarity metric applied to the representations of two ads. The rest of this section focuses on the first step, in which we train the model using multi-task learning to learn ad representations (the main task) simultaneously with keyword extraction from text (the auxiliary task).
The model architecture is shown in Figure 2. Given an ad composed of text and an image as input, the model first embeds the sequence of words using Transformer (Section 3.1). After that, a Keyword-guided Selective Gate (KSG) mechanism is adopted to mine the semantics from these text representations (Section 3.2), which are leveraged as clues for an attention module that reranks the generated image regions extracted by the YOLOv3 object detection module (Redmon and Farhadi, 2018) (Section 3.3). Finally, the model combines the re-ranked image regions and the distributed text representation through a Dual-path Fusion Attention (DFA) layer to obtain a multimodal representation of the ad (Section 3.4). Below we introduce each of these modules in detail.

Sentence Representation Learning
We encode each word in the text portion of the input ad using Transformer (Vaswani et al., 2017), as it has been shown effective in many NLP tasks (Devlin et al., 2019; Liu and Lapata, 2019). Given the text, we encode its word sequence and obtain its representations $H = \{h_1, h_2, \ldots, h_n\}$, where $n$ is the number of words and $h_i \in \mathbb{R}^{d_{model}}$.

Keywords-Guided Selective Encoding
Some words in the text portion of an ad contain information that can help us to determine which products are promoted by the ad, and thus are more useful than words that do not. As an example, the ad shown in Figure 2 contains the text "Spark Wrist with Brand XXX, Treasure Your Love Forever". Here, the words "Spark" and "Wrist" strongly suggest that it may be an ad for something that is sparkling and worn on the wrist. Furthermore, the brand may also indicate the product category, as a brand advertiser usually sells products of only a small number of categories. Based on word semantics, it is highly likely that this is an ad involving bracelets. Therefore, it is essential to extract information from keywords such as "Spark" and "Wrist", and at the same time ignore irrelevant words such as "Your" and "Forever". In the rest of this subsection, we seek to improve the encoding of an ad's text by guiding it with its keywords.
Keyword extraction. First, we perform supervised keyword extraction by training a binary softmax classifier to determine whether each word in the ad's text is a keyword or not based on its hidden representation h i . Each training instance therefore corresponds to a word. We set its class label to '1' if its POS tag corresponds to a noun, a verb, or an adjective, and '0' otherwise. Effectively, we consider a word to be a keyword if it belongs to one of these three broad syntactic categories. Nevertheless, it does not imply that the model will learn all and only the words belonging to these categories as keywords. Recall that keyword extraction is trained (as an auxiliary task) jointly with ad representation learning in a multi-task setting, so the model's decision on which words will be keywords is in part influenced by the ad representation task.
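The labeling rule described above can be sketched in a few lines. This is only an illustration: the tag names (`NOUN`, `VERB`, `ADJ`, and so on) and the hand-tagged input are assumptions, since the text does not specify a tagger or a tag set.

```python
# Coarse POS tags treated as keyword classes (hypothetical tag names).
KEYWORD_TAGS = {"NOUN", "VERB", "ADJ"}

def keyword_labels(tagged_words):
    """Return 1 for words tagged as noun, verb, or adjective, and 0 otherwise."""
    return [1 if tag in KEYWORD_TAGS else 0 for _, tag in tagged_words]

# Toy, hand-tagged example (tags are illustrative, not real tagger output).
tagged = [("Spark", "NOUN"), ("Wrist", "NOUN"), ("with", "ADP"),
          ("Your", "PRON"), ("Love", "NOUN"), ("Forever", "ADV")]
labels = keyword_labels(tagged)
```

These labels only supervise the auxiliary task; as noted above, the jointly trained model may end up selecting a different set of keywords.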
Text encoding. Next, we use the extracted keywords to create a representation of the text portion of the ad that retains its most important information via a Keyword-guided Selective Gate (KSG) mechanism. First, we combine the representations of the keywords into a vector $s$, using $\lambda_i$, the keyword extraction model's prediction of whether word $i$ is a keyword: $\lambda_i$ is 1 if word $i$ is a keyword and 0 otherwise. Then we utilize $s$ to generate a selective signal, $keyGate_i$, that measures how much the semantics of each word in the text should contribute to its context representation. Based on $keyGate_i$, we filter the information of $h_i$ via element-wise multiplication ($\odot$). From the filtered word representations, we then generate the selective context representation of the ad's text. With the keyword-guided selective gate mechanism, keywords contribute more semantics to the context representation. For example, the words "Spark" and "Wrist" are more valuable than "Your" and "Forever" in the text shown in Figure 2.
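A minimal NumPy sketch of a keyword-guided selective gate may help make the data flow concrete. It is not the exact formulation used by ION. It assumes that $s$ is the sum of the keyword vectors, that the gate is a sigmoid over a linear function of $h_i$ and $s$, and that the context representation is a mean over the gated vectors. The parameters `W`, `U`, and `b` are hypothetical.

```python
import numpy as np

def selective_gate(H, lam, W, U, b):
    """Keyword-guided gating sketch.

    H:   (n, d) word representations h_i
    lam: (n,)   0/1 keyword predictions lambda_i
    W, U (d, d) and b (d,) are hypothetical gate parameters.
    """
    s = (lam[:, None] * H).sum(axis=0)                  # assumed: sum of keyword vectors
    gate = 1.0 / (1.0 + np.exp(-(H @ W + s @ U + b)))   # keyGate_i, values in (0, 1)
    H_sel = H * gate                                    # element-wise filtering of h_i
    context = H_sel.mean(axis=0)                        # assumed: mean pooling
    return H_sel, context

rng = np.random.default_rng(0)
n, d = 5, 8
H = rng.normal(size=(n, d))
lam = np.array([1.0, 1.0, 0.0, 0.0, 0.0])               # first two words are keywords
W, U, b = rng.normal(size=(d, d)), rng.normal(size=(d, d)), np.zeros(d)
H_sel, context = selective_gate(H, lam, W, U, b)
```

Because the gate lies in (0, 1), each filtered vector is a damped copy of the original word representation, with the amount of damping conditioned on the keyword summary `s`.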

Semantic-enhanced Region Extraction
To extract image region features, existing works resort to pre-trained object detection models and keep the top k extracted region features based on the confidence scores that measure how likely the object belongs to a fixed set of categories. However, ad images usually contain a large portion of irrelevant objects that could mislead our ad similarity matching procedure. Without considering the internal context, it is highly likely that the bracelet in the image in Figure 2 will be ignored as it occupies only a small number of pixels.
In light of this weakness, we propose to improve image region extraction in this subsection by considering the interaction between ad texts and images. Specifically, we use the semantics extracted from the texts to re-weigh image regions so that the object regions related to the promoted products will be given larger weights. For example, the bracelet in the image in Figure 2 will have larger weights based on the semantics of "Spark" and "Wrist".
We implement this idea as follows. First, we extract the top $k_1$ candidate image regions with the highest confidence scores generated by YOLOv3, and feed the extracted region features to a single-layer feed-forward network (FFN) to obtain region representations $v_i \in \mathbb{R}^{d_{model}}$. To re-weigh regions, we propose a Semantic Clue Attention (SCA) mechanism, where we use the selective context representation derived in the previous subsection as a supervisory signal to give each region a semantic relevance score. Specifically, we attend to the top $k_1$ regions $\{v_1, \ldots, v_{k_1}\}$ with respect to the selective context representation, obtaining for each region a score $\alpha_j$, the "semantic relevance" between the $j$-th region and the key information of the text. Using the relevance value $\alpha_j$, we re-sort the initial top $k_1$ regions provided by YOLOv3 and take the top $k_2$ region features as the final fine-grained visual features to represent an image.
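The re-ranking step can be sketched as follows. The scoring function here (a softmax over dot products between each region and the text context vector) is an assumption chosen for simplicity; the function name `rerank_regions` and the toy inputs are hypothetical.

```python
import numpy as np

def rerank_regions(V, h_ctx, k2):
    """Re-rank region features by relevance to the text and keep the top k2.

    V:     (k1, d) region features v_j
    h_ctx: (d,)    selective context representation of the ad text
    """
    scores = V @ h_ctx
    scores = scores - scores.max()                  # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()   # assumed: softmax relevance alpha_j
    order = np.argsort(-alpha)[:k2]                 # indices of the k2 most relevant regions
    return V[order], alpha[order]

V = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]])
h_ctx = np.array([2.0, 0.0])
top_V, top_alpha = rerank_regions(V, h_ctx, k2=2)
```

The point of the design is that the detector's own confidence only decides which $k_1$ candidates are considered; the final ordering depends on the text.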

Dual-path Fusion Attention Layer
Next we fuse the information extracted from an ad's text and image. The input modalities may contain non-informative or misleading information.
To address this issue as well as fuse modalities, we propose a Dual-path Fusion Attention (DFA) layer to generate an ad's multimodal representation. First, we use each modality to refine the features of the other modality based on the confidence of its own inputs. Specifically, the features of one modality attend to those of the other, and the refined features are computed as attention-weighted sums: we write $\tilde{v}_j$ for the refined feature based on $v_j$ and $\tilde{h}_i$ for the refined feature based on $h_i$. Then the refined and original features are concatenated (denoted $[\,;\,]$) and fed to a fully-connected layer combined with max-pooling to decide which information should be passed, yielding the fine-grained fusion features $\hat{h}_i$ and $\hat{v}_j$. Finally, max-pooling is applied to retain globally useful information: $\bar{h} = \max(\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_n)$ and $\bar{v} = \max(\hat{v}_1, \hat{v}_2, \ldots, \hat{v}_{k_2})$, which are then concatenated to generate the multimodal representation $f$ of an ad. By construction, $f$ contains fine-grained multimodal information.
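The following NumPy sketch illustrates one way such a dual-path fusion could be wired together. It is a simplified reading, not ION's exact layer: the dot-product affinity, the `tanh` nonlinearity, and the weight matrices `W_h` and `W_v` are all assumptions introduced for illustration.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_path_fusion(H, V, W_h, W_v):
    """H: (n, d) word features; V: (k2, d) region features; W_h, W_v: (2d, d)."""
    A = H @ V.T                              # (n, k2) word-region affinities (assumed)
    H_ref = softmax(A, axis=1) @ V           # each word attends over regions -> (n, d)
    V_ref = softmax(A, axis=0).T @ H         # each region attends over words -> (k2, d)
    # Concatenate original and refined features, then apply a fully-connected layer.
    H_fused = np.tanh(np.concatenate([H, H_ref], axis=1) @ W_h)
    V_fused = np.tanh(np.concatenate([V, V_ref], axis=1) @ W_v)
    h_bar = H_fused.max(axis=0)              # max-pooling over words
    v_bar = V_fused.max(axis=0)              # max-pooling over regions
    return np.concatenate([h_bar, v_bar])    # multimodal representation f

rng = np.random.default_rng(0)
n, k2, d = 6, 3, 4
H, V = rng.normal(size=(n, d)), rng.normal(size=(k2, d))
W_h, W_v = rng.normal(size=(2 * d, d)), rng.normal(size=(2 * d, d))
f = dual_path_fusion(H, V, W_h, W_v)
```

In this sketch the resulting vector has dimension $2d$, one half per modality.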

Training
To learn ad representations, we utilize triplet loss (Schroff et al., 2015) as the loss function. Given an ad $t$ and its embedding $f_t$, we constrain it through $\|f_t\|_2 = 1$ and ensure that each ad embedding $f_t$ is closer to all other ads $f_g$ promoting the same product category (positive) than it is to any ad $f_u$ promoting a different product category (negative). The total loss is calculated over all triplets, where $\gamma$ is a margin hyperparameter and $T$ is the set of all possible triplets. Given all labeled ads, we would need to enumerate all possible triplets, which is computationally expensive. To ensure fast convergence, we choose to learn from the hardest triplets only. Specifically, we take an online strategy to generate triplets from a mini-batch. For each ad in a mini-batch, we obtain the hardest positive sample, $\hat{g} = \arg\max_{g \neq t} \|f_t - f_g\|_2^2$, and the hardest negative sample, $\hat{u} = \arg\min_{u} \|f_t - f_u\|_2^2$. The final loss is calculated over these hardest triplets, where $l$ is the total number of training samples. Recall that our model jointly learns keyword extraction and ad representations. To learn keyword extraction, we leverage the cross-entropy loss. The overall loss is the weighted sum of the losses of the two tasks.
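The online hardest-triplet strategy can be sketched in NumPy as below. The sketch assumes the standard hinge form of the triplet loss from Schroff et al. (2015), $\max(0, \|f_t - f_{\hat{g}}\|_2^2 - \|f_t - f_{\hat{u}}\|_2^2 + \gamma)$, and averages over the anchors in the batch; the averaging and the function name are assumptions, not details confirmed by the text.

```python
import numpy as np

def batch_hard_triplet_loss(F, labels, gamma=0.2):
    """F: (b, d) ad embeddings in a mini-batch; labels: (b,) product categories."""
    F = F / np.linalg.norm(F, axis=1, keepdims=True)     # enforce ||f_t||_2 = 1
    diff = F[:, None, :] - F[None, :, :]
    D = (diff ** 2).sum(axis=-1)                          # squared Euclidean distances
    same = labels[:, None] == labels[None, :]
    pos_mask = same & ~np.eye(len(labels), dtype=bool)    # same category, g != t
    neg_mask = ~same                                      # different category
    losses = []
    for t in range(len(labels)):
        if not pos_mask[t].any() or not neg_mask[t].any():
            continue                                      # no valid triplet for this anchor
        hardest_pos = D[t][pos_mask[t]].max()             # farthest positive
        hardest_neg = D[t][neg_mask[t]].min()             # closest negative
        losses.append(max(0.0, hardest_pos - hardest_neg + gamma))
    return float(np.mean(losses)) if losses else 0.0

F = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
loss_separated = batch_hard_triplet_loss(F, np.array([0, 0, 1, 1]))
loss_mixed = batch_hard_triplet_loss(F, np.array([0, 1, 0, 1]))
```

When the categories are already well separated by more than the margin, every hinge term vanishes; when the labels cut across the clusters, the hardest triplets dominate the loss.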

Evaluation
The goals of our evaluation are two-fold. First, we evaluate ION's effectiveness in retrieving ads. Second, we evaluate its ability to improve user targeting in real-world application scenarios.

Datasets
We employ two datasets for evaluation.
MP is a proprietary Chinese ad dataset owned by Tencent. Each ad comprises text and an image. The portion of the dataset that we use contains 14026 ads with 390 product categories. A portion of the test set is composed of ads belonging to 40 product categories that do not appear in the training or validation sets. This will allow us to evaluate our model's ability to generalize to new product categories.
MS-COCO (Chen et al., 2015) is a large public text-image matching dataset. Though it is not an ad dataset, we use it because (1) there is currently no public dataset for retrieving ads of the same product category, (2) each sample has text, an image and object categories, which is similar to ad samples, and (3) existing multimodal datasets collected for specific tasks, such as visual question answering and multimodal sarcasm, are not consistent with our experimental setup. We only retain samples that belong to one object category. There are 25211 samples with 80 labels. As in MP, a portion of our test set in MS-COCO is composed of samples belonging to 20 categories that do not appear in the training or validation sets. Statistics on these datasets are shown in Table 1.

Implementation Details
We use jieba to segment the Chinese ad text in MP. The input image size is 416×416×3 and $d_{model}$ is 128. Other parameters are tuned using grid search. The Transformer we use contains 4 multi-head layers with 4 heads in each layer. For region detection, we use a pretrained YOLOv3 and take the outputs of its last layer as region features. $k_1$ and $k_2$ are set to 20 and 10, respectively. For training, $\gamma$ in the loss function is 0.2 for MP and 0.3 for MS-COCO. In our model's loss function, we set the weight of the ad representation learning task to 1 and that of the keyword extraction task to 0.05. The Adam optimizer with a learning rate of 1e-3 is used. All models are trained on a Tesla V100 with 32GB memory for 30 epochs with batch size 32, and the best epoch based on the validation loss is selected for testing. We use each sample s in the test set to query all other samples in the test set to obtain the top k ads that are most similar to s, where the distance between two ads is the Euclidean distance between their ad representations as learned by our model. In real application scenarios, recalling all possible candidates is unnecessary, as the top results are more than adequate, so we employ Hit P@k (precision within the top k results) as our evaluation metric.
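The retrieval protocol and metric can be illustrated with a small NumPy sketch. It interprets Hit P@k as the fraction of the top k retrieved samples that share the query's category, averaged over all queries; that reading, and the function name `precision_at_k`, are assumptions.

```python
import numpy as np

def precision_at_k(F, labels, k):
    """F: (m, d) test-set representations; labels: (m,) categories."""
    diff = F[:, None, :] - F[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))       # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)                 # a query never retrieves itself
    topk = np.argsort(D, axis=1)[:, :k]         # k nearest other samples per query
    hits = labels[topk] == labels[:, None]      # same-category indicator
    return float(hits.mean())

F = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
p_at_1 = precision_at_k(F, labels, k=1)
p_at_2 = precision_at_k(F, labels, k=2)
```

In this toy example each sample has exactly one same-category neighbor, so precision drops once k exceeds 1.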

Baseline Systems
We compare ION with five baselines that include text-only, image-only and multimodal methods. The text baseline is BERT (Devlin et al., 2019), which has achieved prominent performance in many language processing tasks. We take a weighted sum of the last layer of BERT's output as the text representation. As the image baseline, we employ the commonly used Inceptionv3 (Szegedy et al., 2016), which is pretrained on the ImageNet dataset. As multimodal baselines, we employ D&R and MCAM (Abavisani et al., 2020), which are state-of-the-art multimodal networks for their respective tasks. Furthermore, we compare with the multimodal pretraining model ViLBERT, which has achieved impressive performance in numerous text-image tasks. The baselines' parameter settings are the same as those reported in their respective papers.
ION achieves the best results, obtaining a Hit P@1 of 74.8% on MP and 85.6% on MS-COCO. The image-only baseline performs worse than the text-only baseline because images contain a lot of background noise. Our multimodal baselines, D&R and MCAM, outperform the image-only baseline by a large margin. These results demonstrate the necessity of considering both texts and images. Nevertheless, the multimodal baselines extract coarse-grained features from images without considering the local correlations between modalities and fail to curb the bad influence of unrelated pixels. As a result, they perform worse than ION. It is worth noting that BERT and ViLBERT have benefitted from large pretraining corpora and thus outperform both multimodal baselines. Next, to verify ION's generalization capability, we compare ION with our baselines on the portion of the test sets in MP and MS-COCO where the product categories are not seen during training. As shown in Figure 3(2) and Figure 4(2), ION outperforms all baselines, which demonstrates the better generalization of our model.
Table 2: ION performance with different $k_1$, $k_2$ pairs.

Additional Experiments with ION
Ablation experiments. We perform three ablation experiments to verify the effectiveness of each component in ION. First, we ablate the Keyword-guided Selective Gate (KSG) (Section 3.2) simply by taking the representation from Transformer as the word representation. We denote this as w/o KSG. Next, to ablate Semantic Clue Attention (SCA) (Section 3.3), we retain the top $k_2$ regions based on YOLOv3 scores instead of reranking the detected regions. We denote this as w/o SCA. Finally, we ablate Dual-path Fusion Attention (DFA) (Section 3.4) by replacing it with global concatenation fusion. Specifically, we apply max-pooling over the text representations and the image representations, and then concatenate them to create the fusion representation. We denote this as w/o DFA. Moreover, we conduct an experiment where we ablate all three mechanisms. We denote this as w/o all. Ablation results are shown in Figure 5. As can be seen, removing any of them negatively impacts model performance.
Effect of encoder size. How will the results differ if a larger/smaller Transformer is used? As shown in Figure 6, ION exhibits a deterioration in performance with both a larger Transformer (8 layers and 8 heads) and a smaller Transformer (1 layer and 1 head). The reason is that the larger Transformer needs more data to learn well, while the smaller Transformer may not be able to encode everything needed to perform well. We also replace Transformer with BERT. While the pretrained BERT improves ION's performance, it considerably increases inference time and thus lowers efficiency. To achieve both high precision and efficiency, a smaller Transformer encoder therefore suffices. Our model can complete inference on 1 million ads in 3.8 hours using a single-machine system, which meets the requirements of real scenarios.
Impact of regions. We evaluate ION's performance on MP with different settings of the ($k_1$, $k_2$) pair used in detecting image regions. We vary $k_1$ within {10, 20, 30} while fixing $k_2$ to 10, and $k_2$ within {1, 5, 10, 15, 20} while fixing $k_1$ to 20. As shown in Table 2, ION works best with $k_1 = 20$ and $k_2 = 10$. A small $k_2$ results in insufficient visual features, whereas a large $k_2$ introduces background noise. $k_1$ has a similar impact on ION's performance.
Impact of multi-task learning. We analyze how the two tasks affect the learning process by varying the weight of keyword extraction in the total loss. We set the keyword extraction weight ρ to 0, 0.05, 0.1, and 0.2. As shown in Figure 7, ION performs worse as ρ increases, since a larger ρ biases training toward the auxiliary task and results in insufficient training for the main task. Considering the ρ=0 results, we can see that learning to extract keywords improves ION.
Impact of γ. We also analyze how the triplet loss margin parameter γ impacts ION. As shown in Figure 7, ION achieves the best result with γ=0.2.

Qualitative Analyses
An example. We begin this subsection by illustrating how ION works via an example. Specifically, we visualize the region detection performance with and without SCA in Figure 8, which shows the top 3 regions and their scores before (left) and after (right) semantic based re-weighting. It is clear that using text as clues gives larger scores to product related regions and decreases those corresponding to background noise.
Error analysis. To gain insights into why ION offers superior performance to ViLBERT, we randomly select 100 samples from the test set for which the similar ads are recalled incorrectly by ViLBERT and correctly by ION and analyze these samples. We found that ViLBERT typically extracts regions that have obvious object features, specifically objects that take up a major portion of an image, but the extracted regions are unrelated to the ad product. In contrast, ION was able to focus on product related regions. For example, ViLBERT incorrectly recalls a sofa ad for a jeans ad. The reason is that a model who is dressed in jeans is sitting on a sofa in the jeans ad, and ViLBERT treats the jeans ad as a sofa ad because the sofa has more obvious object features. In contrast, guided by textual information, ION successfully recognizes jeans. Another example involves a watch ad. ViLBERT incorrectly recalls a coat ad because the models wearing the coat/watch occupy a large portion of the images and are similar to each other, whereas ION avoids this problem by paying attention to the fine-grained region occupied by the product.
It is interesting to note that not all test samples that are correctly classified by ViLBERT are also correctly classified by ION. To better understand how ViLBERT is better than ION, we randomly selected another 500 samples in the test set for which ViLBERT was correct but ION was wrong. We found that ION has a bias towards image shape features, which means that ION prefers to recall ads with similar product shapes. As mentioned before, ION focuses on product-related regions. If the shapes of two products are similar, ION would assume the corresponding ads are similar. For example, ION incorrectly recalls pen ads for lipstick ads because the shape of pens and that of lipsticks are both cylindrical. In contrast, ViLBERT does not have this bias.

Experiments on User Targeting
To verify ION's ability to improve user targeting models (i.e., whether the idea of augmenting positive instances using users clicking on ads with the same category works in real scenarios), we conduct offline and simulation user targeting experiments.
The offline experiment. In this experiment, we assemble a dataset for evaluating ION as follows. We select from a database an initial ad package and collect the clicking users over a certain period of time $P$. These clicking users constitute the positive instances in the dataset. To get the negative instances, we randomly sample from $I$, the set of impression users (users who have seen the initial ad package). To avoid a skewed class distribution, we maintain a positive-to-negative ratio of 1:3, which is the standard in the ad industry. We then reserve 10% of the users in this dataset for testing (and call this test set $T$), and use the remaining 90% to train a user targeting model, which we call $M_{initial}$. Next, to evaluate how effective ION is, we use ION to find the 10 ads that most likely belong to the same category as the initial ad package and use the clicking users of these 10 ads to augment the positive training set used to train $M_{initial}$. Given this augmented set of positive training instances, we also augment the negative instances by randomly sampling from $I$ until the desired ratio of 1:3 is reached. Finally, we use this augmented set of positive and negative training instances to train a user targeting model, which we will denote as $M_{expanded}$. We evaluate $M_{initial}$ and $M_{expanded}$ on $T$. Figure 9 depicts this experimental procedure.
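The assembly of the initial and expanded training sets can be sketched in plain Python. The sketch assumes that negatives are impression users who are not among the positives, and it omits the 10% held-out test split; the function name and the toy user ids are hypothetical.

```python
import random

def build_training_set(positive_users, impression_users, neg_per_pos=3, seed=0):
    """Label positives 1 and sample negatives (label 0) at a 1:neg_per_pos ratio."""
    rng = random.Random(seed)
    positives = sorted(set(positive_users))
    # Assumption: negatives are impression users who are not positives.
    candidates = sorted(set(impression_users) - set(positives))
    negatives = rng.sample(candidates, neg_per_pos * len(positives))
    return [(u, 1) for u in positives] + [(u, 0) for u in negatives]

impression_users = range(100)                     # toy user ids
initial_clickers = [0, 1, 2, 3, 4]
similar_ad_clickers = [5, 6, 7, 8, 9, 200, 201]   # clickers of retrieved similar ads

initial_set = build_training_set(initial_clickers, impression_users)
expanded_set = build_training_set(
    initial_clickers + similar_ad_clickers, impression_users)
```

Note that `random.Random.sample` raises `ValueError` if there are fewer candidate negatives than requested, which mirrors the practical constraint that the impression pool must be large enough to sustain the 1:3 ratio after augmentation.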
Figure 9: Procedure for conducting the user targeting experiments.
The simulation experiment. In the simulation experiment, the test set is constructed by collecting user clicks in the real world. Specifically, we collect over a certain period of time $P'$, which would be sometime after time period $P$, the set of impression users $I$ in $P$ who click on the initial ad package during $P'$ and denote the resulting set of users as $U_{click}$. Note that $U_{click}$ is the set of ground-truth clicking users. We then use $M_{initial}$ to retrieve the targeted users of the initial ad package from $I$ and denote the resulting set of users as $U_{pred}$. We compute the Clickthrough Rate (CTR) as $|U_{pred} \cap U_{click}|$ divided by $|U_{pred}|$. Note that a larger CTR value implies that a user targeting model is better at recalling potential clicking users. We similarly use $M_{expanded}$ to retrieve the targeted users and compute its CTR.
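The CTR computation described above amounts to a set intersection. A minimal sketch, with a hypothetical function name and toy user ids:

```python
def clickthrough_rate(u_pred, u_click):
    """CTR = |U_pred ∩ U_click| / |U_pred|; defined as 0.0 when no user is targeted."""
    u_pred, u_click = set(u_pred), set(u_click)
    if not u_pred:
        return 0.0
    return len(u_pred & u_click) / len(u_pred)

targeted = {1, 2, 3, 4}      # users retrieved by a targeting model
clicked = {2, 4, 9}          # ground-truth clicking users
ctr = clickthrough_rate(targeted, clicked)
```

Returning 0.0 for an empty targeted set is a convention chosen here to avoid division by zero; the text does not discuss that case.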
User targeting model. We employ XGBoost (Chen and Guestrin, 2016) to train our user targeting model. XGBoost provides a regularizing gradient boosting framework that is commonly used to train models to predict click-through rates. The inputs are the features of a user and the output is a binary label indicating whether the user clicks on the ad package. In our experiments, we use 57 user features, such as property status, geographic location, and education level, which are encoded as one-hot vectors. We employ CART as the base classifier. The max depth of CART is set to 6, the learning rate is 0.1, and the number of gradient boosted trees is 550.
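The stated hyperparameters map naturally onto XGBoost's scikit-learn-style parameter names, as sketched below. The `objective` value and the example feature vocabulary are assumptions, and the one-hot helper is a hypothetical illustration of the feature encoding rather than the authors' pipeline.

```python
# Hyperparameters stated in the text, under XGBoost's scikit-learn-style names.
xgb_params = {
    "booster": "gbtree",              # tree booster (CART base learners)
    "max_depth": 6,
    "learning_rate": 0.1,
    "n_estimators": 550,              # number of gradient boosted trees
    "objective": "binary:logistic",   # assumption: binary click / non-click objective
}

def one_hot(value, vocabulary):
    """Encode a categorical feature value as a one-hot list over a fixed vocabulary."""
    return [1 if value == v else 0 for v in vocabulary]

education_levels = ["high_school", "bachelor", "graduate"]  # hypothetical values
x = one_hot("bachelor", education_levels)

# With xgboost installed, a model could be built from these settings as:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**xgb_params)
```

Concatenating the one-hot encodings of all 57 user features would give the input vector for each user.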
Dataset. As our dataset, we collect eight ad packages with low click-through rates (i.e., rates between 0.2% and 0.9%). Before augmentation by ION, there are on average 3305 positive users (i.e., users who clicked ad packages) per package. After augmentation, there are on average 59037 positive users per package.
Results. The results indicate that augmenting positive instances with user clicking data from the same category works in real scenarios. Importantly, ION achieves a greater degree of improvement on AUC and CTR than the baselines do, which should not be surprising as it is more accurate in determining which ads belong to the same category.

Conclusions
We proposed to alleviate the insufficient positive instance problem associated with the training of user targeting models by retrieving ads for the same product category as that of the ad package under consideration via a novel bimodal model, ION, and then using their clicking users for data augmentation. Results on two datasets showed that ION can effectively retrieve ads belonging to the same category and improve a user targeting model.